State-of-the-art NLP models from R

Introduction#

The Transformers repository from “Hugging Face” contains a lot of ready to use, state-of-the-art models, which are straightforward to download and fine-tune with Tensorflow & Keras.

For this purpose the users usually need to get:

The model itself (e.g. Bert, Albert, RoBerta, GPT-2 and etc.)
The tokenizer object
The weights of the model

In this post, we will work on a classic binary classification task and train our dataset on 3 models:

GPT-2 from Open AI
RoBERTa from Facebook
Electra from Google Research/Stanford University

However, readers should know that one can work with transformers on a variety of down-stream tasks, such as:

feature extraction
sentiment analysis
text classification
question answering
summarization
translation and many more .

Prerequisites#

Our first job is to install the transformers package via reticulate.

1

reticulate::py_install('transformers', pip = TRUE)

Then, as usual, load standard ‘Keras’, ‘TensorFlow’ >= 2.0 and some classic libraries from R.

1
2
3
4
5
6


library(keras)
library(tensorflow)
library(dplyr)
library(tfdatasets)

transformer = reticulate::import('transformers')

Note that if running TensorFlow on GPU one could specify the following parameters in order to avoid memory issues.

1
2
3
4


physical_devices = tf$config$list_physical_devices('GPU')
tf$config$experimental$set_memory_growth(physical_devices[[1]],TRUE)

tf$keras$backend$set_floatx('float32')

Template#

We already mentioned that to train a data on the specific model, users should download the model, its tokenizer object and weights. For example, to get a RoBERTa model one has to do the following:

1
2
3
4
5


# get Tokenizer
transformer$RobertaTokenizer$from_pretrained('roberta-base', do_lower_case=TRUE)

# get Model with weights
transformer$TFRobertaModel$from_pretrained('roberta-base')

Data preparation#

A dataset for binary classification is provided in text2vec package. Let’s load the dataset and take a sample for fast model training.

1
2
3
4
5


library(text2vec)
data("movie_review")
df = movie_review %>% rename(target = sentiment, comment_text = review) %>% 
  sample_n(2000) %>% 
  data.table::as.data.table()

Split our data into 2 parts:

1
2
3
4


idx_train = sample.int(nrow(df)*0.8)

train = df[idx_train,]
test = df[!idx_train,]

Data input for Keras#

Until now, we’ve just covered data import and train-test split. To feed input to the network we have to turn our raw text into indices via the imported tokenizer. And then adapt the model to do binary classification by adding a dense layer with a single unit at the end.

However, we want to train our data for 3 models GPT-2, RoBERTa, and Electra. We need to write a loop for that.

Note: one model in general requires 500-700 MB

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78


# list of 3 models
ai_m = list(
  c('TFGPT2Model',       'GPT2Tokenizer',       'gpt2'),
   c('TFRobertaModel',    'RobertaTokenizer',    'roberta-base'),
   c('TFElectraModel',    'ElectraTokenizer',    'google/electra-small-generator')
)

# parameters
max_len = 50L
epochs = 2
batch_size = 10

# create a list for model results
gather_history = list()

for (i in 1:length(ai_m)) {
  
  # tokenizer
  tokenizer = glue::glue("transformer${ai_m[[i]][2]}$from_pretrained('{ai_m[[i]][3]}',
                         do_lower_case=TRUE)") %>% 
    rlang::parse_expr() %>% eval()
  
  # model
  model_ = glue::glue("transformer${ai_m[[i]][1]}$from_pretrained('{ai_m[[i]][3]}')") %>% 
    rlang::parse_expr() %>% eval()
  
  # inputs
  text = list()
  # outputs
  label = list()
  
  data_prep = function(data) {
    for (i in 1:nrow(data)) {
      
      txt = tokenizer$encode(data[['comment_text']][i],max_length = max_len, 
                             truncation=T) %>% 
        t() %>% 
        as.matrix() %>% list()
      lbl = data[['target']][i] %>% t()
      
      text = text %>% append(txt)
      label = label %>% append(lbl)
    }
    list(do.call(plyr::rbind.fill.matrix,text), do.call(plyr::rbind.fill.matrix,label))
  }
  
  train_ = data_prep(train)
  test_ = data_prep(test)
  
  # slice dataset
  tf_train = tensor_slices_dataset(list(train_[[1]],train_[[2]])) %>% 
    dataset_batch(batch_size = batch_size, drop_remainder = TRUE) %>% 
    dataset_shuffle(128) %>% dataset_repeat(epochs) %>% 
    dataset_prefetch(tf$data$experimental$AUTOTUNE)
  
  tf_test = tensor_slices_dataset(list(test_[[1]],test_[[2]])) %>% 
    dataset_batch(batch_size = batch_size)
  
  # create an input layer
  input = layer_input(shape=c(max_len), dtype='int32')
  hidden_mean = tf$reduce_mean(model_(input)[[1]], axis=1L) %>% 
    layer_dense(64,activation = 'relu')
  # create an output layer for binary classification
  output = hidden_mean %>% layer_dense(units=1, activation='sigmoid')
  model = keras_model(inputs=input, outputs = output)
  
  # compile with AUC score
  model %>% compile(optimizer= tf$keras$optimizers$Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0),
                    loss = tf$losses$BinaryCrossentropy(from_logits=F),
                    metrics = tf$metrics$AUC())
  
  print(glue::glue('{ai_m[[i]][1]}'))
  # train the model
  history = model %>% keras::fit(tf_train, epochs=epochs, #steps_per_epoch=len/batch_size,
                validation_data=tf_test)
  gather_history[[i]]<- history
  names(gather_history)[i] = ai_m[[i]][1]
}

Reproduce in a Notebook

Extract results to see the benchmarks:

1
2
3
4
5
6
7
8


res = sapply(1:3, function(x) {
  do.call(rbind,gather_history[[x]][["metrics"]]) %>% 
    as.data.frame() %>% 
    tibble::rownames_to_column() %>% 
    mutate(model_names = names(gather_history[x])) 
}, simplify = F) %>% do.call(plyr::rbind.fill,.) %>% 
  mutate(rowname = stringr::str_extract(rowname, 'loss|val_loss|auc|val_auc')) %>% 
  rename(epoch_1 = V1, epoch_2 = V2)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Both the RoBERTa and Electra models show some additional improvements after 2 epochs of training, which cannot be said of GPT-2. In this case, it is clear that it can be enough to train a state-of-the-art model even for a single epoch.

Conclusion#

In this post, we showed how to use state-of-the-art NLP models from R. To understand how to apply them to more complex tasks, it is highly recommended to review the transformers tutorial .

We encourage readers to try out these models and share their results below in the comments section!