Do you know why you can’t hear ugly ahem sounds on the podcast Data Science at Home?
Because we remove them. Well, not us exactly: a neural network does.
A nice example of deep learning for audio
Let me introduce the ahem detector, a deep convolutional neural network trained on transformed audio signals to recognize “ahem” sounds. It was trained on episodes of Data Science at Home, the podcast about data science at datascienceathome.com/episodes/podcast/
You can find slides and technical details here. But first, a few concepts.
The model requires two sets of audio files, much like a cohort study:
- a negative sample with clean voice/sound and
- a positive one with “ahem” sounds concatenated
The detector can work with any other audio input, provided enough data are available: a minimum of ~10 seconds for the positive samples and ~3 minutes for the negative cohort. It adapts to the training data and can perform detection on different spoken voices.
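To make the two-cohort idea concrete, here is a minimal sketch of how one might turn the two audio cohorts into a labeled training set. The real project transforms audio in its `make_data_*.py` scripts; this stand-in uses plain NumPy to slice each signal into fixed windows and take a crude log-magnitude spectrum per window, labeling clean windows 0 and “ahem” windows 1. The function names and parameters here are illustrative, not the project’s actual API.

```python
import numpy as np

def audio_to_windows(signal, win_len=1024, hop=512):
    """Slice a mono signal into fixed-length frames and take the
    log-magnitude spectrum of each one (a crude spectrogram)."""
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = signal[start:start + win_len]
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(win_len)))
        frames.append(np.log1p(spectrum))
    return np.array(frames)

def make_dataset(negative_signal, positive_signal):
    """Stack windows from both cohorts: label 0 = clean, label 1 = ahem."""
    neg = audio_to_windows(negative_signal)
    pos = audio_to_windows(positive_signal)
    X = np.vstack([neg, pos])
    y = np.concatenate([np.zeros(len(neg)), np.ones(len(pos))])
    return X, y

# Synthetic stand-ins for the two cohorts:
# ~3 minutes of "clean" audio and ~10 seconds of "ahem" audio at 8 kHz.
sr = 8000
clean = np.random.randn(sr * 180)
ahem = np.random.randn(sr * 10)
X, y = make_dataset(clean, ahem)
```

Note how the class imbalance of the two cohorts (many clean windows, few positive ones) carries straight into the labels, which is something to keep in mind when training.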
How do I get set up?
Before running the scripts, build the training and test sets. Then load the training set and train the network with the code in the IPython notebook. Make sure to create the local folder hardcoded in the script files below. Execute first
% python make_data_class_0.py
% python make_data_class_1.py
I recommend using a GPU. That’s because, in this example, it takes at least 5 epochs to obtain ~81% accuracy. By the way, an epoch is machine learning language for “one full pass through the training set”. Use it to sound smarter.
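The actual model in the notebook is a deep convolutional network; as a minimal stand-in to show what an epoch-based training loop looks like, here is a logistic classifier trained with full-batch gradient descent on toy data. Everything below (data, learning rate, number of epochs) is illustrative, not the project’s configuration.

```python
import numpy as np

def train_logistic(X, y, epochs=5, lr=0.1):
    """Minimal training loop: one 'epoch' is one full pass over the
    training set. The real project trains a deep CNN instead."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])
    b = 0.0
    for epoch in range(epochs):
        z = X @ w + b
        p = 1.0 / (1.0 + np.exp(-z))        # sigmoid
        grad_w = X.T @ (p - y) / len(y)     # gradient of the log-loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy, well-separated data: class 0 around -2, class 1 around +2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (200, 8)), rng.normal(2, 1, (200, 8))])
y = np.concatenate([np.zeros(200), np.ones(200)])
w, b = train_logistic(X, y, epochs=5)
accuracy = ((X @ w + b > 0).astype(float) == y).mean()
```

On a deep CNN each epoch is far more expensive than this, which is exactly why a GPU pays off.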
How do I clean a new dirty audio file?
You have to transform the audio file in the same way as the training files. You can do that with
% python make_data_newsample.py
Then follow the script in the IPython notebook. The script has enough comments to proceed without particular issues. The whole project is on GitHub.
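The notebook drives the actual cleaning; conceptually, the step reduces to classifying each window of the dirty file and keeping only the clean ones. Here is a hypothetical sketch of that idea, where `is_ahem` stands in for the trained network (any callable that maps a window to True/False works):

```python
import numpy as np

def clean_audio(signal, is_ahem, win_len=1024):
    """Drop every non-overlapping window the detector flags as 'ahem'
    and concatenate what remains. `is_ahem` stands in for the trained
    network's prediction on one window."""
    kept = []
    for start in range(0, len(signal) - win_len + 1, win_len):
        window = signal[start:start + win_len]
        if not is_ahem(window):
            kept.append(window)
    return np.concatenate(kept) if kept else np.array([])

# Toy example: a silent signal with a loud burst in the middle,
# and a trivial "detector" that flags high-energy windows.
signal = np.concatenate([np.zeros(4096), np.ones(2048) * 5.0, np.zeros(4096)])
cleaned = clean_audio(signal, lambda w: np.mean(w ** 2) > 1.0)
```

The real pipeline predicts on spectrogram windows rather than raw samples, and smarter stitching (crossfades, overlap) avoids audible clicks at the cut points, but the keep-or-drop logic is the same.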