Audio Classification using Deep Learning for Image Classification
13 Nov 2018 by dzlab
Audio Classification using Image Classification
The following tutorial walks you through how to create a classifier for audio files using transfer learning from a deep learning network that was trained on ImageNet.
YES, we will use image classification to classify audio. Deal with it.
In this dataset, there is a set of 9473 wav files for training in the audio_train folder and a set of 9400 wav files that constitutes the test set.
Sounds in this dataset are unequally distributed across the following 41 categories of Google’s AudioSet Ontology:
"Acoustic_guitar", "Applause", "Bark", "Bass_drum", "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet", "Computer_keyboard", "Cough", "Cowbell", "Double_bass", "Drawer_open_or_close", "Electric_piano", "Fart", "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong", "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling", "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe", "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak", "Tambourine", "Tearing", "Telephone", "Trumpet", "Violin_or_fiddle", "Writing"
Once you have downloaded this audio dataset, we can start playing with it.
These audio files are uncompressed PCM 16 bit, 44.1 kHz, mono audio files, which makes them just perfect for spectrogram-based classification. We will be using the very handy Python library librosa to generate the spectrogram images from these audio files. Another option would be to use matplotlib's specgram().
The following snippet converts an audio file into a spectrogram image:
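As a rough illustration of what this conversion computes, here is a minimal sketch of a log-magnitude spectrogram written with plain NumPy rather than librosa. The function name log_spectrogram, the window size n_fft and the hop length are assumptions chosen for the example, and a synthetic 1 kHz tone stands in for a real wav file.

```python
import numpy as np

def log_spectrogram(signal, n_fft=1024, hop=512):
    """Return a (n_fft // 2 + 1, n_frames) log-magnitude spectrogram in dB."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # One FFT per frame; rows become frequency bins after the transpose.
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T
    return 20 * np.log10(magnitude + 1e-10)  # small offset avoids log(0)

sr = 44100  # sampling rate of the dataset's wav files
t = np.arange(sr) / sr  # one second of audio
tone = np.sin(2 * np.pi * 1000 * t)  # synthetic 1 kHz tone (assumption)

spec = log_spectrogram(tone)
peak_bin = spec.mean(axis=1).argmax()
peak_hz = peak_bin * sr / 1024  # convert the bin index back to Hz
```

The strongest row of the resulting array sits at the bin closest to 1 kHz, which is the horizontal line you would see in the corresponding image.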
For instance, the sound of a drawer opening or closing looks like:
In our case, we need to store those images. Unfortunately, we have to plot them and then store the plot. This is going to be very slow considering that we have a few thousand images. The following snippet stores the images:
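The plot-then-save step can be sketched as follows, assuming matplotlib's specgram() as the plotting routine instead of librosa. The helper name save_spectrogram, the figure size and the output file name are made up for the example, and a noisy synthetic tone replaces a real recording.

```python
import os
import tempfile

import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no window needed
import matplotlib.pyplot as plt

def save_spectrogram(signal, sr, out_path):
    """Plot a spectrogram without axes or margins and write it to out_path."""
    fig = plt.figure(figsize=(3, 3))
    ax = fig.add_axes([0, 0, 1, 1])  # axes fill the whole figure
    ax.axis("off")
    ax.specgram(signal, NFFT=1024, Fs=sr, noverlap=512)
    fig.savefig(out_path, dpi=100)
    plt.close(fig)  # release the figure; matters when looping over thousands of files

sr = 44100
rng = np.random.default_rng(0)
# Synthetic stand-in for a wav file: a 440 Hz tone with a little noise.
signal = np.sin(2 * np.pi * 440 * np.arange(sr) / sr) + 0.01 * rng.standard_normal(sr)

out_path = os.path.join(tempfile.mkdtemp(), "example.png")
save_spectrogram(signal, sr, out_path)
```

Closing each figure after saving keeps memory flat, but the plotting itself remains the slow part.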
Once the spectrogram files are generated for both training and test sets, we can have a look at them.
- load the labels from the csv file and have a look at the first 5
Now we can have a look at the data which will be piped into the DL model. Note: there is no need to apply any transformation (cropping, flipping, rotating, lighting, etc.) to the images we will be classifying. In fact, they are spectrograms and will always be generated the same way, unlike images taken with a camera, where conditions can change drastically.
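Loading the labels could look roughly like this, using only the standard library. The column names fname and label, as well as the file names, are hypothetical; the rows are a made-up in-memory sample so the sketch runs without the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the labels csv file.
sample_csv = io.StringIO(
    "fname,label\n"
    "clip_0001.wav,Bark\n"
    "clip_0002.wav,Meow\n"
    "clip_0003.wav,Bark\n"
    "clip_0004.wav,Cough\n"
    "clip_0005.wav,Flute\n"
    "clip_0006.wav,Gong\n"
)

rows = list(csv.DictReader(sample_csv))
first_five = rows[:5]  # the "first 5" look at the data

# Counting labels shows how unevenly the categories are distributed.
label_counts = Counter(row["label"] for row in rows)

for row in first_five:
    print(row["fname"], row["label"])
```

With the real file, you would pass an opened file handle to csv.DictReader instead of the in-memory sample.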
Following is an example of spectrograms with their corresponding labels:
Now the DL part can finally start
First, create a pre-trained ResNet-34 based model, and look for the best learning rate, which we will use later when training the final layers of this network.
Plotting the recorded losses against the learning rate will give us something like this:
Now we can train the feed-forward last layers with the learning rate slice chosen from the previous plot. Choose the values that bound a steeply decreasing part of the curve.
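The idea behind the learning rate search and the "steep decrease" rule can be sketched on a toy problem. This is not the ResNet-34 setup: a one-parameter quadratic loss stands in for the network, and the names lr_range_test and steepest_lr, the learning rate bounds and the stopping threshold are all assumptions made for the illustration.

```python
def lr_range_test(loss_fn, grad_fn, w0, lr_min=1e-4, lr_max=10.0, n_steps=100):
    """Grow the learning rate exponentially each step and record (lr, loss) pairs."""
    w, history, best = w0, [], float("inf")
    for i in range(n_steps):
        lr = lr_min * (lr_max / lr_min) ** (i / (n_steps - 1))
        loss = loss_fn(w)
        if loss > 4 * best:  # stop once the loss clearly blows up
            break
        best = min(best, loss)
        history.append((lr, loss))
        w -= lr * grad_fn(w)  # one gradient descent step at this learning rate
    return history

def steepest_lr(history):
    """Return the learning rate at which the loss dropped most in a single step."""
    drops = [
        (history[i][1] - history[i + 1][1], history[i][0])
        for i in range(len(history) - 1)
    ]
    return max(drops)[1]

# Toy loss 0.5 * w^2 with gradient w, starting from w = 1.
history = lr_range_test(lambda w: 0.5 * w * w, lambda w: w, w0=1.0)
suggested_lr = steepest_lr(history)
print(f"suggested learning rate: {suggested_lr:.4f}")
```

On a real model the recorded curve is noisy, so the choice is usually made by eye from the plot rather than by taking a single maximum.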
We can keep training the entire net after unfreezing for more epochs as follows:
The training technique is based on the one cycle policy. Here is the original ResNet paper.
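The shape of a one cycle learning rate schedule can be sketched in a few lines. This is a simplified piecewise-linear version, not the exact schedule of any particular library; the function name one_cycle_lr and the parameters div_factor and pct_start are assumptions for the example.

```python
def one_cycle_lr(step, total_steps, max_lr, div_factor=25.0, pct_start=0.3):
    """Piecewise-linear one cycle schedule: warm up to max_lr, then decay to zero."""
    start_lr = max_lr / div_factor
    peak_step = int(total_steps * pct_start)
    if step <= peak_step:
        # Warm-up phase: rise linearly from start_lr to max_lr.
        frac = step / peak_step
        return start_lr + frac * (max_lr - start_lr)
    # Decay phase: fall linearly from max_lr to zero at the last step.
    frac = (step - peak_step) / (total_steps - 1 - peak_step)
    return max_lr * (1.0 - frac)

total_steps = 100
lrs = [one_cycle_lr(s, total_steps, max_lr=0.01) for s in range(total_steps)]
print(lrs[0], max(lrs), lrs[-1])
```

The learning rate starts low, peaks about a third of the way through training, and ends close to zero.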
Plot the top losses, i.e. the cases with the highest loss, which are typically those where the model confidently predicted the wrong label:
Plot the confusion matrix, i.e. for each original label, how many times the model predicted each of the classes for images with that label. The ideal matrix has zeros everywhere except on the diagonal.
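What "top losses" means can be shown with a small standard-library sketch: compute the cross-entropy loss of each sample from the predicted probabilities and sort in descending order. The helper name top_losses and the probabilities below are made up for the illustration.

```python
import math

def top_losses(probs, targets, k):
    """Return (index, loss) pairs for the k samples with the highest cross-entropy loss."""
    # Loss of a sample is -log of the probability assigned to its true class.
    losses = [-math.log(p[t]) for p, t in zip(probs, targets)]
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return [(i, losses[i]) for i in order[:k]]

# Made-up predicted probabilities for 4 samples over 3 classes.
probs = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.30, 0.30, 0.40],
    [0.05, 0.05, 0.90],
]
targets = [0, 1, 0, 1]  # true class index of each sample

worst = top_losses(probs, targets, k=2)
for index, loss in worst:
    print(f"sample {index}: loss {loss:.3f}")
```

The last sample comes out on top because the model put 90% of its confidence on the wrong class.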
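The counting behind a confusion matrix is simple enough to sketch by hand. The helper name confusion_matrix and the predictions below are made up; rows are true labels and columns are predicted labels.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count how often each true label (row) was predicted as each label (column)."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

labels = ["Bark", "Meow", "Cough"]
# Made-up ground truth and predictions for six clips.
y_true = ["Bark", "Bark", "Meow", "Meow", "Cough", "Cough"]
y_pred = ["Bark", "Meow", "Meow", "Meow", "Cough", "Bark"]

cm = confusion_matrix(y_true, y_pred, labels)
# Correct predictions sit on the diagonal.
accuracy = sum(cm[i][i] for i in range(len(labels))) / len(y_true)

for label, row in zip(labels, cm):
    print(label, row)
```

Off-diagonal entries point directly at the pairs of classes the model mixes up.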
We can perform t-SNE on our model’s output vectors. As these vectors are from the final classification, we would expect them to cluster well.
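A minimal sketch of this projection, assuming scikit-learn is available: random vectors grouped around three centers stand in for the model's output vectors, and the perplexity value is an arbitrary choice for this small sample.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n_classes, per_class = 3, 20

# Stand-in for model outputs: noisy points around one center per class.
centers = np.eye(n_classes) * 5.0
outputs = np.concatenate(
    [center + rng.standard_normal((per_class, n_classes)) for center in centers]
)
class_ids = np.repeat(np.arange(n_classes), per_class)  # for coloring a scatter plot

# Project the output vectors down to 2 dimensions.
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(outputs)
```

Each row of embedding is a 2D point that can be scattered and colored by class_ids to inspect the clusters.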
An alternative to using spectrogram images is generating Mel-frequency cepstral coefficients (MFCCs). Here is an example of training on MFCC for audio classification - link.
It is also possible to explore other techniques for coding sound; here is a nice lecture about this topic - youtube.
Full jupyter notebooks: