Deep feature extraction and transfer learning

It is hardly news that feature extraction [1] is a fundamental step in any machine learning pipeline. Whenever a data scientist works with a dataset made of a large number of variables, there is a good chance that some of them are redundant or noisy, that is, they do not carry any signal that could be useful for prediction. Noise, as always, affects the overall accuracy of any predictive model.

In such scenarios, feature extraction allows data scientists to construct a new representation of the original data, facilitating the detection of potentially interesting patterns. Principal component analysis (PCA) is one possible solution to the aforementioned problem of noisy variables. Applying PCA to the original data is equivalent to creating an embedding with specific characteristics: in this new space, usually of lower dimension than the original one, a few features called principal components capture most of the information present in the original data. That is why PCA is mainly used as a dimensionality reduction technique.

PCA finds an embedding of lower dimension that captures most of the variance in the original data.
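As a quick illustration, here is a minimal PCA sketch using scikit-learn (an assumption, since the post names no library); the synthetic dataset and the number of components kept are arbitrary choices for the example.

```python
# Minimal PCA sketch: project noisy, partly redundant variables onto a few
# principal components. Shapes and component count are illustrative only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))              # 500 samples, 50 original variables

pca = PCA(n_components=10)                  # keep the first 10 principal components
X_embedded = pca.fit_transform(X)           # lower-dimensional representation

print(X_embedded.shape)                     # (500, 10)
print(pca.explained_variance_ratio_.sum())  # fraction of variance captured
```

On real data, the cumulative explained variance ratio is a handy way to decide how many components are worth keeping.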

Until a decade ago, the performance of a machine learning model relied heavily on the ability of experts to craft hand-engineered features. Such a process required a thorough understanding of the domain the data originated from, and a substantial amount of time. Everything started to change with the recent advances in deep learning, mainly in computer vision and natural language processing. As explained in a previous post, since the early 2000s deep learning algorithms have achieved remarkable results in several domains, from image recognition to language translation.
These successes were mainly due to three essential phenomena that occurred almost at the same time:

  • availability of more data
  • better CPUs and GPUs, and
  • new optimization algorithms.

Researchers were suddenly able to train deep neural networks by feeding them millions of images or text documents, achieving state-of-the-art results. Since then, an impressive number of pre-trained models has been produced and made available to the community.

One interesting property of pre-trained models is their capability to function as feature extractors. A deep neural network trained to recognize people from a large set of images will learn a hierarchy of features in its layers. Moving from the first layers to the deeper ones, such features become more and more complex, starting from pixels, blobs, eyes, noses and faces, up to clothes and entire scenes. Specific neurons will activate for each of these increasingly abstract concepts.
What is amazing is that another classifier – say of animals, flowers, or vehicles – can reuse the same embedding, especially in its first layers. After all, blobs of pixels, segments and other low-level features are usually conserved across domains.

Internal representation of how the GoogLeNet deep learning system builds its understanding of images, from edges and textures in the first layers to patterns, parts and objects in deeper layers (https://distill.pub/2017/feature-visualization)
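To make this hierarchy concrete, here is a small sketch of how one could peek at the activations of an early and a deep layer of a pre-trained network using forward hooks. PyTorch/torchvision and the ResNet-18 backbone are assumptions made for the example, and the random tensor stands in for a real image.

```python
# Inspect early vs. deep feature maps of a pre-trained network via forward hooks.
# The ResNet-18 backbone and the random "image" are placeholders for illustration.
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
activations = {}

def save(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

model.layer1.register_forward_hook(save("early"))  # low-level features: edges, blobs
model.layer4.register_forward_hook(save("deep"))   # high-level features: parts, objects

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))             # placeholder input image

print(activations["early"].shape)                  # torch.Size([1, 64, 56, 56])
print(activations["deep"].shape)                   # torch.Size([1, 512, 7, 7])
```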

Because features are conserved across domains in this way, researchers noticed that switching to apparently different domains did not require retraining their models from scratch. Transfer learning [2] is in fact the capability of neural networks to generalize across domains. As mentioned before, a good number of layers of a people classifier – especially the ones close to the input – can be perfectly fine for a cats-and-dogs classifier.
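As a hedged sketch of what that reuse looks like in practice, the snippet below freezes the pre-trained layers of a torchvision ResNet-18 and retrains only a new two-class head; the framework, the backbone and the cats-vs-dogs head are assumptions made for illustration.

```python
# Transfer learning sketch: keep the pre-trained layers frozen and train only
# a fresh classification head for the new, smaller task.
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained layers: their low-level features are reused as-is.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a trainable two-class head
# (e.g. cats vs. dogs); only these parameters will be updated during training.
model.fc = nn.Linear(model.fc.in_features, 2)
```

From here, a standard training loop over the new dataset updates only model.fc, which is exactly why training finishes so quickly.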

It is not surprising at all to reach high accuracy with, for example, a linear model on top of a pre-trained network used as a feature extractor [3]. As has been shown many times already, simpler models can beat fancy ones.
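Here is a minimal sketch of that setup, assuming a torchvision ResNet-18 as the frozen feature extractor and scikit-learn for the linear model; the random batch and labels are placeholders for real, preprocessed images.

```python
# Feature extraction sketch: embeddings from a truncated pre-trained ResNet-18
# are fed to a plain logistic regression. Inputs and labels are placeholders.
import numpy as np
import torch
import torch.nn as nn
from torchvision import models
from sklearn.linear_model import LogisticRegression

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
feature_extractor = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier
feature_extractor.eval()

with torch.no_grad():
    images = torch.randn(32, 3, 224, 224)            # stand-in for a real image batch
    features = feature_extractor(images).flatten(1)  # (32, 512) embeddings

labels = np.random.randint(0, 2, size=32)            # stand-in binary labels
clf = LogisticRegression(max_iter=1000).fit(features.numpy(), labels)
```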

Here are some of the benefits of considering pre-trained models for your next project:

  • fast training. Training a simple model can take minutes instead of hours or days, which allows one to quickly test hypotheses and increases productivity
  • good performance. Thanks to transfer learning, one can build accurate models on relatively small datasets
  • adaptability to existing pipelines. A company that has spent time and resources building the infrastructure around a specific modelling framework can rapidly improve its model performance without changing its pipelines entirely: all it needs is a preprocessing step performed with the pre-trained model.

As researchers and practitioners demand higher and higher classification standards from deep learning, their models need more and more data to be successfully trained and deployed in production environments. However, it should be clear by now that not having a large amount of data does not prevent an organisation or a public institution from building powerful AI tools. Even young startups can rely on transfer learning and adapt existing models, trained on the massive datasets of big corporations, to their specific use cases.
Let innovation begin!

References

[1] S. Ding et al., “A survey on feature extraction for pattern recognition”, Artificial Intelligence Review, vol. 37, no. 3, pp. 169–180, 2012.

[2] K. Weiss et al., “A survey of transfer learning”, Journal of Big Data, vol. 3, pp. 1–40, 2016.

[3] https://www.basilica.ai/blog/the-unreasonable-effectiveness-of-deep-feature-extraction
