Nowadays, Microsoft, Google, Facebook, and OpenAI are sharing many state-of-the-art models in the field of Natural Language Processing. However, few materials explain how to use these models from R. In this post, we show how R users can access and benefit from these models as well.
The Transformers repository from “Hugging Face” contains a lot of ready-to-use, state-of-the-art models, which are straightforward to download and fine-tune with TensorFlow and Keras.
For this purpose, users usually need to get:
The model itself (e.g. BERT, ALBERT, RoBERTa, GPT-2, etc.)
The tokenizer object
The weights of the model
In this post, we will work on a classic binary classification task and train 3 models on our dataset: GPT-2, RoBERTa, and Electra.
We already mentioned that to train a specific model on our data, users should download the model, its tokenizer object, and its weights. For example, to get a RoBERTa model, one has to do the following:
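```r
# import the Python 'transformers' package through reticulate
# (assumes the package is installed in the active Python environment)
transformer = reticulate::import('transformers')

# get Tokenizer
transformer$RobertaTokenizer$from_pretrained('roberta-base', do_lower_case = TRUE)

# get Model with weights
transformer$TFRobertaModel$from_pretrained('roberta-base')
```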
Until now, we’ve just covered data import and the train-test split. To feed input to the network, we have to turn our raw text into indices via the imported tokenizer, and then adapt the model to do binary classification by adding a dense layer with a single unit at the end.
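To make the "text into indices" step concrete, here is a minimal base R sketch of how variable-length token id vectors can be brought to one fixed width before they are fed to a network. The helper `pad_to_matrix()`, the padding id `0L`, and the token ids are all hypothetical and chosen for illustration only; the right padding id depends on the tokenizer, and this is not the exact preprocessing used in this post.

```r
# Hypothetical helper: truncate or right-pad integer token-id vectors so that
# every row has exactly `max_len` entries.
# `pad_id = 0L` is an assumption; the correct padding id depends on the tokenizer.
pad_to_matrix <- function(ids, max_len, pad_id = 0L) {
  rows <- lapply(ids, function(x) {
    x <- as.integer(x)[seq_len(min(length(x), max_len))]  # truncate long inputs
    c(x, rep(pad_id, max_len - length(x)))                 # pad short inputs
  })
  do.call(rbind, rows)
}

# Made-up token ids standing in for real tokenizer output
ids <- list(
  c(101L, 7592L, 102L),
  c(101L, 2023L, 2003L, 1037L, 3231L, 102L)
)

x <- pad_to_matrix(ids, max_len = 5L)
dim(x)
```

The result is an integer matrix with one row per text and `max_len` columns, which is the shape an input layer declared with `shape = c(max_len)` expects.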
However, we want to train 3 models on our data: GPT-2, RoBERTa, and Electra. We need to write a loop for that.
```r
# Assumes keras, tensorflow, tfdatasets, and dplyr are attached, and that
# `transformer`, `train`, and `test` were created in the earlier steps.

# list of 3 models
ai_m = list(
  c('TFGPT2Model',    'GPT2Tokenizer',    'gpt2'),
  c('TFRobertaModel', 'RobertaTokenizer', 'roberta-base'),
  c('TFElectraModel', 'ElectraTokenizer', 'google/electra-small-generator')
)

# parameters
max_len = 50L
epochs = 2
batch_size = 10

# create a list for model results
gather_history = list()

for (i in seq_along(ai_m)) {

  # tokenizer
  tokenizer = glue::glue("transformer${ai_m[[i]][2]}$from_pretrained('{ai_m[[i]][3]}',
                         do_lower_case=TRUE)") %>%
    rlang::parse_expr() %>% eval()

  # model
  model_ = glue::glue("transformer${ai_m[[i]][1]}$from_pretrained('{ai_m[[i]][3]}')") %>%
    rlang::parse_expr() %>% eval()

  # inputs
  text = list()
  # outputs
  label = list()

  # encode every comment and collect the token ids and labels into matrices
  data_prep = function(data) {
    for (i in 1:nrow(data)) {

      txt = tokenizer$encode(data[['comment_text']][i], max_length = max_len,
                             truncation = TRUE) %>%
        t() %>%
        as.matrix() %>% list()
      lbl = data[['target']][i] %>% t()

      text = text %>% append(txt)
      label = label %>% append(lbl)
    }
    list(do.call(plyr::rbind.fill.matrix, text),
         do.call(plyr::rbind.fill.matrix, label))
  }

  train_ = data_prep(train)
  test_ = data_prep(test)

  # slice dataset
  tf_train = tensor_slices_dataset(list(train_[[1]], train_[[2]])) %>%
    dataset_batch(batch_size = batch_size, drop_remainder = TRUE) %>%
    dataset_shuffle(128) %>% dataset_repeat(epochs) %>%
    dataset_prefetch(tf$data$experimental$AUTOTUNE)

  tf_test = tensor_slices_dataset(list(test_[[1]], test_[[2]])) %>%
    dataset_batch(batch_size = batch_size)

  # create an input layer
  input = layer_input(shape = c(max_len), dtype = 'int32')
  hidden_mean = tf$reduce_mean(model_(input)[[1]], axis = 1L) %>%
    layer_dense(64, activation = 'relu')
  # create an output layer for binary classification
  output = hidden_mean %>% layer_dense(units = 1, activation = 'sigmoid')
  model = keras_model(inputs = input, outputs = output)

  # compile with AUC score
  model %>% compile(
    optimizer = tf$keras$optimizers$Adam(learning_rate = 3e-5, epsilon = 1e-08, clipnorm = 1.0),
    loss = tf$losses$BinaryCrossentropy(from_logits = FALSE),
    metrics = tf$metrics$AUC()
  )

  print(glue::glue('{ai_m[[i]][1]}'))
  # train the model
  history = model %>% keras::fit(tf_train, epochs = epochs, #steps_per_epoch=len/batch_size,
                                 validation_data = tf_test)
  gather_history[[i]] <- history
  names(gather_history)[i] = ai_m[[i]][1]
}
```
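The loop picks the tokenizer and model classes by name: it builds the call as a string, parses it, and evaluates it. The sketch below illustrates that pattern with base R only (`sprintf()` and `parse()`/`eval()` stand in for `glue::glue()` and `rlang::parse_expr()`), next to a lookup with `[[` that avoids evaluating strings. The `transformer` object here is a mock R list, an assumption made so the example runs without Python; whether `[[` behaves the same way on a real reticulate module object should be verified before relying on it.

```r
# Mock stand-in for the Python module: a plain R list with the same shape.
# This is an assumption so the sketch runs without Python or reticulate.
transformer <- list(
  RobertaTokenizer = list(from_pretrained = function(name) paste("tokenizer for", name)),
  TFRobertaModel   = list(from_pretrained = function(name) paste("model for", name))
)

# One entry of the model list: model class, tokenizer class, checkpoint name
spec <- c('TFRobertaModel', 'RobertaTokenizer', 'roberta-base')

# 1. Build the call as a string, then parse and evaluate it
call_string <- sprintf("transformer$%s$from_pretrained('%s')", spec[2], spec[3])
tok_a <- eval(parse(text = call_string))

# 2. Look the class up by name with `[[`, without evaluating a string
tok_b <- transformer[[spec[2]]]$from_pretrained(spec[3])

identical(tok_a, tok_b)
```

Both routes resolve to the same element of the mock, so either can drive a loop over model names; the name-based lookup is easier to debug because nothing is hidden inside a string.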
Both the RoBERTa and Electra models show some additional improvement after 2 epochs of training, which cannot be said of GPT-2. This suggests that training a state-of-the-art model for even a single epoch can be enough.
In this post, we showed how to use state-of-the-art NLP models from R.
To understand how to apply them to more complex tasks, it is highly recommended to review the transformers tutorial.
We encourage readers to try out these models and share their results below in the comments section!