This article translates Daniel Falbel’s ‘Simple Audio Classification’ article from tensorflow/keras to torch/torchaudio. The main goal is to introduce torchaudio and illustrate its contributions to the torch ecosystem. Here, we focus on a popular dataset, the audio loader and the spectrogram transformer. An interesting side product is the parallel between torch and tensorflow, showing sometimes the differences, sometimes the similarities between them.
Downloading and Importing#
torchaudio has the speechcommand_dataset built in. It filters out background_noise by default and lets us choose between versions v0.01 and v0.02.
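As a rough sketch of what the download step could look like, the helper below wraps speechcommand_dataset() in a function so that nothing is fetched until it is called. The helper name load_speech_commands(), the ~/datasets path in the usage comment and the exact argument values are assumptions for illustration; check the torchaudio documentation for the signature in your installed version.

```r
# Hypothetical helper: defers the (large) download until explicitly called.
# Assumes torchaudio's speechcommand_dataset() accepts `root`, `url` and
# `download`, with `url` selecting the dataset version.
load_speech_commands <- function(root, version = c("v0.01", "v0.02")) {
  version <- match.arg(version)
  torchaudio::speechcommand_dataset(
    root = root,
    url = paste0("speech_commands_", version),
    download = TRUE
  )
}

# Usage (not run here, the archive is large):
# df <- load_speech_commands("~/datasets")
# sample <- df[1]
# sample$waveform[, 1:10]  # first ten values of the first waveform
```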
torch_tensor
0.0001 *
0.9155 0.3052 1.8311 1.8311 -0.3052 0.3052 2.4414 0.9155 -0.9155 -0.6104
[ CPUFloatType{1,10} ]
Classes#
[1] "bed" "bird" "cat" "dog" "down" "eight" "five"
[8] "four" "go" "happy" "house" "left" "marvin" "nine"
[15] "no" "off" "on" "one" "right" "seven" "sheila"
[22] "six" "stop" "three" "tree" "two" "up" "wow"
[29] "yes" "zero"
Generator Dataloader#
torch::dataloader has the same task as data_generator defined in the original article. It is responsible for preparing batches - including shuffling, padding, one-hot encoding, etc. - and for taking care of parallelism / device I/O orchestration.
In torch we do this by passing the train/test subset to torch::dataloader and encapsulating all the batch setup logic inside a collate_fn() function.
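The train/test split itself only needs index vectors. The sketch below shows one way to build them in base R; the 70/30 proportion, the seed and the helper name split_indices() are assumptions for illustration, not values taken from this article.

```r
# Reproducible split of item indices into train and test sets.
# `train_frac` and `seed` are illustrative defaults (assumptions).
split_indices <- function(n, train_frac = 0.7, seed = 6) {
  set.seed(seed)
  id_train <- sample(n, size = floor(train_frac * n))
  id_test  <- setdiff(seq_len(n), id_train)
  list(train = id_train, test = id_test)
}

ids <- split_indices(n = 1000)

# With the real dataset `df`, the subsets could then be built as:
# train_subset <- torch::dataset_subset(df, ids$train)
# test_subset  <- torch::dataset_subset(df, ids$test)
```

The two index vectors are disjoint and together cover every item, so no sample ends up in both subsets.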
At this point, dataloader(train_subset) would not work because the samples are not padded. So we need to build our own collate_fn() with the padding strategy.
I suggest using the following approach when implementing the collate_fn():
- begin with collate_fn <- function(batch) browser().
- instantiate dataloader with the collate_fn().
- create an environment by calling enumerate(dataloader) so you can ask to retrieve a batch from dataloader.
- run environment[[1]][[1]]. Now you should be sent inside collate_fn() with access to the batch input object.
- build the logic.
The final collate_fn() pads the waveform to length 16001 and then stacks everything up together. At this point there are no spectrograms yet. We are going to make the spectrogram transformation a part of the model architecture.
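A minimal sketch of such a padding collate_fn() could look like the following. It assumes each dataset item is a list with a waveform tensor of shape (1, n_samples) and an integer label_index; those field names, the helper pad_waveform() and the zero right-padding are assumptions for illustration and may differ from the code actually used for the results below.

```r
library(torch)

TARGET_LENGTH <- 16001  # padded length stated in the text

# Right-pad a (channels, time) waveform with zeros up to `target`;
# longer waveforms are truncated so that stacking never fails.
pad_waveform <- function(waveform, target = TARGET_LENGTH) {
  n <- waveform$size(2)
  if (n >= target) return(waveform[, 1:target])
  nnf_pad(waveform, pad = c(0, target - n), value = 0)
}

# Assumed item structure: list(waveform = <tensor>, label_index = <integer>)
collate_fn <- function(batch) {
  waveforms <- lapply(batch, function(item) pad_waveform(item$waveform))
  targets <- lapply(batch, function(item) {
    torch_tensor(item$label_index, dtype = torch_long())
  })
  list(
    torch_stack(waveforms, dim = 1),  # (batch, 1, 16001)
    torch_stack(targets, dim = 1)     # (batch, 1)
  )
}

# Two fake items of different lengths stand in for real samples.
fake_batch <- list(
  list(waveform = torch_randn(1, 16000), label_index = 3L),
  list(waveform = torch_randn(1, 12000), label_index = 7L)
)
out <- collate_fn(fake_batch)
```

With a real subset, the function would be handed to the loader, e.g. dataloader(train_subset, batch_size = 32, shuffle = TRUE, collate_fn = collate_fn).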
Batch structure is:
- batch[[1]]: waveforms - tensor with dimension (32, 1, 16001)
- batch[[2]]: targets - tensor with dimension (32, 1)
Also, torchaudio comes with 3 loaders: av_loader, tuner_loader, and audiofile_loader, with more to come. set_audio_backend() is used to set one of them as the audio loader. Their performance differs depending on the audio format (mp3 or wav). There is no perfect world yet: tuner_loader is best for mp3, audiofile_loader is best for wav, but neither of them has the option of partially loading a sample from an audio file without bringing all the data into memory first.
For a given audio backend, we need to pass it to each worker through the worker_init_fn() argument.
Model definition#
Instead of keras::keras_model_sequential(), we are going to define a torch::nn_module(). As referenced by the original article, the model is based on this architecture for MNIST from this tutorial, and I’ll call it ‘DanielNN’.
An `nn_module` containing 2,226,846 parameters.
── Modules ──────────────────────────────────────────────────────
● spectrogram: <Spectrogram> #0 parameters
● conv1: <nn_conv2d> #320 parameters
● conv2: <nn_conv2d> #18,496 parameters
● conv3: <nn_conv2d> #73,856 parameters
● conv4: <nn_conv2d> #295,168 parameters
● dense1: <nn_linear> #1,835,136 parameters
● dense2: <nn_linear> #3,870 parameters
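The parameter counts printed above pin down the layer shapes: four 3x3 convolutions with 32, 64, 128 and 256 output channels, a dense layer from 14336 to 128 features, and a 30-class output. The sketch below rebuilds only that convolutional and dense stack. It leaves out the spectrogram layer and instead assumes a precomputed spectrogram input of shape (batch, 1, 257, 101), a shape chosen here because it flattens to 14336 after four rounds of convolution and 2x2 max pooling. The activations, the pooling and the module name are likewise assumptions.

```r
library(torch)

conv_stack_sketch <- nn_module(
  "ConvStackSketch",
  initialize = function(n_classes = 30) {
    self$conv1 <- nn_conv2d(1, 32, kernel_size = c(3, 3))     # 320 parameters
    self$conv2 <- nn_conv2d(32, 64, kernel_size = c(3, 3))    # 18,496
    self$conv3 <- nn_conv2d(64, 128, kernel_size = c(3, 3))   # 73,856
    self$conv4 <- nn_conv2d(128, 256, kernel_size = c(3, 3))  # 295,168
    self$dense1 <- nn_linear(14336, 128)                      # 1,835,136
    self$dense2 <- nn_linear(128, n_classes)                  # 3,870
  },
  forward = function(x) {
    # x: (batch, 1, 257, 101), an assumed spectrogram shape
    for (conv in list(self$conv1, self$conv2, self$conv3, self$conv4)) {
      x <- nnf_max_pool2d(nnf_relu(conv(x)), kernel_size = c(2, 2))
    }
    x <- torch_flatten(x, start_dim = 2)  # (batch, 256 * 14 * 4)
    x <- nnf_relu(self$dense1(x))
    self$dense2(x)                        # raw class scores (logits)
  }
)

model <- conv_stack_sketch()
n_params <- sum(sapply(model$parameters, function(p) p$numel()))
logits <- model(torch_randn(2, 1, 257, 101))
```

Summing the per-layer counts gives the same 2,226,846 total as the printed summary, which is a quick sanity check that the layer shapes were read correctly.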
Model fitting#
Unlike in tensorflow, there is no model %>% compile(...) step in torch, so we are going to set the loss criterion, the optimizer strategy and the evaluation metrics explicitly in the training loop.
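To make "explicitly" concrete, here is a minimal sketch of a single training step in torch. The tiny nn_linear() model, the random data and the choice of optim_adam() are stand-ins chosen for illustration (assumptions), not the setup used for the results reported here.

```r
library(torch)

model <- nn_linear(10, 3)                 # stand-in for the real network
criterion <- nn_cross_entropy_loss()      # loss criterion
optimizer <- optim_adam(model$parameters, lr = 0.01)  # optimizer (assumed)

# One optimization step; returns the loss and accuracy for the batch.
train_step <- function(x, y) {
  optimizer$zero_grad()                   # clear gradients from the last step
  logits <- model(x)
  loss <- criterion(logits, y)
  loss$backward()                         # backpropagate
  optimizer$step()                        # update the weights
  # Evaluation metric: share of correct predictions (class indices are 1-based)
  correct <- (torch_argmax(logits, dim = 2) == y)$to(dtype = torch_float())
  list(loss = loss$item(), acc = correct$mean()$item())
}

# Synthetic batch: 8 samples, 10 features, 3 classes.
x <- torch_randn(8, 10)
y <- torch_tensor(sample(1:3, 8, replace = TRUE), dtype = torch_long())
res <- train_step(x, y)
```

A full training loop repeats this step for every batch yielded by the dataloader and aggregates loss and accuracy per epoch.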
Training loop#
Epoch 1/20
[W SpectralOps.cpp:590] Warning: The function torch.rfft is deprecated and will be removed in a future PyTorch release. Use the new torch.fft module functions, instead, by importing torch.fft and calling torch.fft.fft or torch.fft.rfft. (function operator())
354/354 [=========================] - 1m - loss: 2.6102 - acc: 0.2333
Epoch 2/20
354/354 [=========================] - 1m - loss: 1.9779 - acc: 0.4138
Epoch 3/20
354/354 [============================] - 1m - loss: 1.62 - acc: 0.519
Epoch 4/20
354/354 [=========================] - 1m - loss: 1.3926 - acc: 0.5859
Epoch 5/20
354/354 [==========================] - 1m - loss: 1.2334 - acc: 0.633
Epoch 6/20
354/354 [=========================] - 1m - loss: 1.1135 - acc: 0.6685
Epoch 7/20
354/354 [=========================] - 1m - loss: 1.0199 - acc: 0.6961
Epoch 8/20
354/354 [=========================] - 1m - loss: 0.9444 - acc: 0.7181
Epoch 9/20
354/354 [=========================] - 1m - loss: 0.8816 - acc: 0.7365
Epoch 10/20
354/354 [=========================] - 1m - loss: 0.8278 - acc: 0.7524
Epoch 11/20
354/354 [=========================] - 1m - loss: 0.7818 - acc: 0.7659
Epoch 12/20
354/354 [=========================] - 1m - loss: 0.7413 - acc: 0.7778
Epoch 13/20
354/354 [=========================] - 1m - loss: 0.7064 - acc: 0.7881
Epoch 14/20
354/354 [=========================] - 1m - loss: 0.6751 - acc: 0.7974
Epoch 15/20
354/354 [=========================] - 1m - loss: 0.6469 - acc: 0.8058
Epoch 16/20
354/354 [=========================] - 1m - loss: 0.6216 - acc: 0.8133
Epoch 17/20
354/354 [=========================] - 1m - loss: 0.5985 - acc: 0.8202
Epoch 18/20
354/354 [=========================] - 1m - loss: 0.5774 - acc: 0.8263
Epoch 19/20
354/354 [==========================] - 1m - loss: 0.5582 - acc: 0.832
Epoch 20/20
354/354 [=========================] - 1m - loss: 0.5403 - acc: 0.8374
val_acc: 0.876705979296493
Making predictions#
We already have all predictions calculated for test_subset, so let’s recreate the alluvial plot from the original article.
Model accuracy is 87.7%, somewhat worse than the tensorflow version from the original post. Nevertheless, all conclusions from the original post still hold.