From early childhood, humans develop new skills every day by interacting with the environment they live in. In a never-ending process that begins with crawling and walking, such improvements occur at all levels: speaking new words, tasting new foods, having new experiences. When this happens, new brain connections, referred to as synapses, are formed. Synapses change continuously as new experiences accumulate over the life of an individual or organism. In essence, synapses store the insights gained from every interaction with the environment.
When Frank Rosenblatt introduced the perceptron, an artificial neuron whose weights can be learned from data, back in 1958, he was inspired by the way the human brain learns new abilities (details about his seminal work can be found in ).
As depicted in the figure below, Rosenblatt’s perceptron takes a weighted sum of its inputs, applies a transformation called an activation function, and outputs 1 if the weighted sum is greater than a certain threshold and 0 otherwise. The weights are a mathematical trick that mimics the behaviour of the synapses in a real brain, just without any chemistry involved. Initially, they are set to small random values; later they are modified so that the output produced by the network matches the desired output (i.e. the target). This process, referred to as training, is crucial: the weights are changed in such a way that the network learns a task it could not perform with the initial random values.
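To make the description above concrete, here is a minimal sketch of a perceptron in plain Python. It is not Rosenblatt's original formulation: the threshold is folded into a bias term, the update is the classic perceptron learning rule, and the function names, the learning rate, the number of epochs and the AND task are all assumptions chosen for illustration.

```python
import random


def predict(weights, bias, inputs):
    """Output 1 if the weighted sum exceeds the threshold, 0 otherwise.

    The threshold is expressed as a bias (bias = -threshold), so the
    comparison is always against zero.
    """
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum > 0 else 0


def train_perceptron(samples, epochs=100, learning_rate=0.1, seed=0):
    """Start from small random weights and nudge them toward the targets."""
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    weights = [rng.uniform(-0.05, 0.05) for _ in range(n_inputs)]
    bias = rng.uniform(-0.05, 0.05)
    for _ in range(epochs):
        for inputs, target in samples:
            # error is +1, 0 or -1: the weights only change on a wrong prediction
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Illustrative task (an assumption, not from the article): the logical AND.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_SAMPLES)
predictions = [predict(weights, bias, inputs) for inputs, _ in AND_SAMPLES]
print(predictions)
```

AND is linearly separable, so a single perceptron can learn it; with the initial random weights alone the predictions would generally not match the targets.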
Learning suitable weights is also the mechanism behind modern deep learning models, which are essentially layers of artificial neurons connected to each other. Indeed, as explained in a previous post, one of the reasons for the success of deep learning is the possibility of training the weights of large neural networks on big datasets by means of innovative gradient descent-based techniques.
What if hours and hours of training neural networks on powerful GPUs were not necessary? In other words, could one just use some random values and have a model that is capable of providing meaningful predictions?
As crazy as it sounds, some researchers have shown that this is possible. After all, it’s time for God to play dice…
In their paper titled “Weight agnostic neural networks” and referenced in , Adam Gaier and David Ha have shown that some neural network architectures can reach state-of-the-art accuracy without learning any weights.
The core idea is a change of perspective: training a neural network no longer means finding the optimal weights via gradient descent. Instead, training consists of searching for an architecture whose performance is not sensitive to the values of its weights. In other words, what gets optimized is the network topology alone. The search starts by generating an initial population of neural networks with few connections and no hidden layers. Each network uses a single shared weight: all of its connections take the same value, which is chosen at random.
After the initial population of networks is established and ranked by performance, the process continues with further iterations, in each of which a new population is generated by modifying the best performing architecture found so far. The process stops when performance no longer improves. More specifically, at each iteration the networks are modified by randomly inserting new nodes, adding new connections, changing the activation function of a node, or a combination of the three. As more and more iterations are performed, only the best architectures survive, just as with genetic algorithms (for the interested reader, an introduction to genetic algorithms is referenced in ).
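The search loop described above can be sketched in a few dozen lines of plain Python. This is a toy illustration under several assumptions, not the authors' implementation: the dictionary-based network encoding, the activation set, the list of shared weight values, the XOR task, the fixed number of generations (instead of stopping when performance plateaus) and the choice of applying exactly one mutation per child are all hypothetical.

```python
import copy
import math
import random

# Hypothetical activation set, shared-weight values and task, for illustration.
ACTIVATIONS = {
    "linear": lambda z: z,
    "step": lambda z: 1.0 if z > 0 else 0.0,
    "gaussian": lambda z: math.exp(-z * z / 2),
    "abs": abs,
    "sin": math.sin,
}
SHARED_WEIGHTS = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
XOR_SAMPLES = [((-1, -1), 0), ((-1, 1), 1), ((1, -1), 1), ((1, 1), 0)]


def minimal_network(n_inputs, rng):
    """One input wired straight to the output: few connections, no hidden nodes."""
    output = n_inputs  # inputs are nodes 0..n_inputs-1, the output is node n_inputs
    order = {node: 0.0 for node in range(n_inputs)}
    order[output] = 1.0  # connections only go from lower to higher order (no cycles)
    return {"n_inputs": n_inputs, "order": order,
            "act": {output: "linear"},
            "conns": {(rng.randrange(n_inputs), output)}}


def forward(net, weight, inputs):
    """Evaluate the network using the same weight on every connection."""
    values = dict(enumerate(inputs))
    for node in sorted(net["act"], key=net["order"].get):
        z = sum(weight * values[src] for src, dst in net["conns"] if dst == node)
        values[node] = ACTIVATIONS[net["act"][node]](z)
    return values[net["n_inputs"]]


def score(net, samples):
    """Mean accuracy over all shared weights; fewer connections break ties."""
    correct = sum((forward(net, w, x) > 0.5) == bool(t)
                  for w in SHARED_WEIGHTS for x, t in samples)
    return correct / (len(SHARED_WEIGHTS) * len(samples)), -len(net["conns"])


def mutate(net, rng):
    """Apply one random change: insert a node, add a connection, or swap an activation."""
    child = copy.deepcopy(net)
    choice = rng.choice(["insert_node", "add_connection", "change_activation"])
    if choice == "insert_node":
        # Split an existing connection src -> dst into src -> new -> dst.
        src, dst = rng.choice(sorted(child["conns"]))
        new = max(child["order"]) + 1
        child["order"][new] = (child["order"][src] + child["order"][dst]) / 2
        child["act"][new] = rng.choice(sorted(ACTIVATIONS))
        child["conns"].remove((src, dst))
        child["conns"] |= {(src, new), (new, dst)}
    elif choice == "add_connection":
        candidates = [(s, d) for s in child["order"] for d in child["act"]
                      if child["order"][s] < child["order"][d]
                      and (s, d) not in child["conns"]]
        if candidates:
            child["conns"].add(rng.choice(candidates))
    else:
        node = rng.choice(sorted(child["act"]))
        child["act"][node] = rng.choice(sorted(ACTIVATIONS))
    return child


def search(samples, generations=30, population_size=20, seed=0):
    """Repeatedly mutate the best architecture so far and keep the winner."""
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    population = [minimal_network(n_inputs, rng) for _ in range(population_size)]
    best = max(population, key=lambda net: score(net, samples))
    history = [score(best, samples)]
    for _ in range(generations):
        # The current best is kept in the population, so the score never drops.
        population = [best] + [mutate(best, rng) for _ in range(population_size - 1)]
        best = max(population, key=lambda net: score(net, samples))
        history.append(score(best, samples))
    return best, history


best_net, history = search(XOR_SAMPLES)
print(history[0], history[-1])
```

Note that no weight is ever adjusted in this loop: the only things that change from one generation to the next are the nodes, the connections and the activation functions.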
The best network architecture is the one that performs consistently well across a wide range of random weights. This, in turn, means that the weights are not really critical in determining the accuracy of the model. Between two network architectures with similar performance, the simpler one is preferred. What is surprising is that, in contrast to conventional deep learning models, which only make correct predictions after their weights have been extensively tuned, weight agnostic neural networks (WANNs) tend to perform well with just one random weight shared by all the connections. Moreover, they perform even better and reach state-of-the-art accuracy if the weights are trained as in a standard setting.
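As a small illustration of what "performing consistently well across a range of weights" means, the sketch below scores two single-neuron architectures on XOR, using the same shared weight on both connections and averaging the accuracy over several weight values. The task, the input encoding (-1 and 1), the list of weight values and the 0.8 decision threshold are assumptions made for this example only.

```python
import math

# Hypothetical set of shared-weight values used to rank an architecture.
SHARED_WEIGHTS = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]

# XOR with inputs encoded as -1 and 1.
XOR_SAMPLES = [((-1, -1), 0), ((-1, 1), 1), ((1, -1), 1), ((1, 1), 0)]


def gaussian(z):
    return math.exp(-z * z / 2)


def step(z):
    return 1.0 if z > 0 else 0.0


def forward(activation, shared_weight, inputs):
    """A single neuron whose connections all carry the same weight."""
    return activation(sum(shared_weight * x for x in inputs))


def accuracy(activation, shared_weight, samples, threshold=0.8):
    correct = 0
    for inputs, target in samples:
        prediction = 1 if forward(activation, shared_weight, inputs) > threshold else 0
        correct += prediction == target
    return correct / len(samples)


def mean_accuracy(activation, samples):
    """Average the accuracy over every shared-weight value."""
    scores = [accuracy(activation, w, samples) for w in SHARED_WEIGHTS]
    return sum(scores) / len(scores)


gaussian_score = mean_accuracy(gaussian, XOR_SAMPLES)
step_score = mean_accuracy(step, XOR_SAMPLES)
print(gaussian_score, step_score)  # 1.0 0.25
```

The Gaussian neuron classifies every sample correctly for all six weight values, because its output depends only on whether the two inputs cancel out, not on the exact weight. The step neuron gets only one sample in four right regardless of the weight. In this toy setting, the first architecture is the weight agnostic one.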
The researchers tested this approach on the MNIST database of handwritten digits, the de facto benchmark for comparing machine learning methods on multi-class classification tasks, and showed that WANNs can classify MNIST digits as well as a single-layer neural network with thousands of weights trained by gradient descent. The fact that no training of the weights was involved makes the entire experiment unique.
In the last 20 years, many powerful deep learning architectures have been proposed. Long Short-Term Memory networks (LSTMs) enabled breakthrough performance in sequence modeling tasks such as machine translation, speech recognition and time-series forecasting, while Convolutional Neural Networks (CNNs) and Residual Networks represent the state of the art in computer vision problems like object detection and image classification. What all these networks have in common is that they require substantial training effort. In contrast, it seems that WANNs do not require such massive training.
To conclude, this new approach has what it takes to facilitate the discovery of new architectures that can help solve more challenging problems in several business and scientific domains, just as LSTMs and CNNs have done so far.