Neural networks with infinite layers


Today’s story starts with residual networks, also known as Resnets.

Resnets were proposed in a paper titled “Deep Residual Learning for Image Recognition” [1] as a mechanism to overcome the vanishing and exploding gradient problems that affect very deep neural networks.
Let’s delve a little deeper.

A neural network with many layers is difficult to train not only because of the sheer number of layers, but also because, whenever the weights are even slightly larger or smaller than one, the activations (the outputs of the neurons in each layer) can grow or shrink exponentially as they propagate through the network. Training a neural network amounts to minimizing the error the network makes when predicting some output from some input, and minimizing that error requires computing the gradient of the loss with respect to the weights. When the activations explode or vanish, the gradients explode or vanish with them, and the numerical optimization algorithm struggles to make progress towards a minimum. In simple words, the network does not learn much from the data.
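To get a feeling for the numbers: over 100 layers, a factor of 1.1 per layer compounds to \inline 1.1^{100} \approx 1.4 \times 10^{4}, while a factor of 0.9 shrinks to \inline 0.9^{100} \approx 2.7 \times 10^{-5}.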

In light of what has been said about the exploding and vanishing gradient problems, let’s see how Resnets help mitigate them. The core idea behind Resnets is the use of so-called skip connections, or shortcuts: the activations of one layer are fed directly to another layer deeper in the network. This architectural choice, depicted in the figure below, is called a residual block, and residual networks are obtained by stacking many residual blocks on top of each other.

 

Schematic representation of a residual block.

 

How can this help tackle the gradients problems mentioned above?

One key property is that a residual block can easily learn the identity function.
Suppose the activations \inline \mathbf{a}(l) inside a block are close to zero. Without skip connections, \inline \mathbf{a}(l+1) would also be close to zero, and so would all the activations deeper in the network. With a skip connection, instead, the block simply passes its input through: \inline \mathbf{a}(l+1) = \mathbf{a}(l-1). This has been shown not to hurt performance; in fact it improves it, because gradients can flow through the shortcuts and make gradient descent more effective during training.
One possible interpretation of this improvement is that the skip connections allow the network to “remember” the information stored in the earlier layers. The idea of carrying forward information from earlier layers is also the main principle behind Long Short-Term Memory (LSTM) networks, a widely used deep learning model for time-series analysis and language modelling [2].
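To make the skip connection concrete, here is a minimal residual block in Python with PyTorch. This is only a sketch: the fully connected residual branch and the layer sizes are arbitrary choices, whereas the blocks in [1] are convolutional and apply a nonlinearity after the addition.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """One residual block: output = input + f(input), i.e. a skip connection."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        # The residual branch f: a small two-layer network (sizes chosen arbitrarily)
        self.f = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, a):
        # Skip connection: the block's input is added back to the branch output
        return a + self.f(a)


# A toy residual network: many residual blocks stacked on top of each other
resnet = nn.Sequential(*[ResidualBlock(dim=32) for _ in range(10)])
x = torch.randn(8, 32)   # a batch of 8 inputs
out = resnet(x)          # activations after 10 residual blocks
```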

But wait, there’s more! Resnets have another intriguing characteristic. In a paper titled “Bridging the gaps between residual learning, recurrent neural networks and visual cortex” [3], researchers have shown that the activations of a residual block at each layer l can be reformulated as:

\inline \mathbf{a}(l+1) = \mathbf{a}(l) + f(\mathbf{a}(l),\mathbf{\theta}(l)) \quad (1), where \inline \mathbf{\theta}(l) are the parameters of the network (weights and biases) at layer l and \inline f is a generic nonlinear function associated with the network architecture.
Does this sound familiar?
As stated in “Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations” [4], the previous formula looks like the Euler discretization of the following ordinary differential equation (ODE) involving the activations:

\frac{d\mathbf{a}(l)}{dl} = f(\mathbf{a}(l),l,\mathbf{\theta}(l)) (2)

Before proceeding, let’s quickly review the concept of differential equations and the Euler method. Readers already familiar with ODEs can skip the next section.

Differential Equations

A differential equation relates a quantity to its derivatives, i.e., to its rate of change with respect to another quantity (typically time or space). A familiar example is Newton’s second law of motion, which states that if an object of mass m moves with acceleration a under a force F, its dynamics are described by the equation F = m\cdot a.

In terms of the velocity of the object, Newton’s law can be written as

m\frac{dv}{dt} = F(v,t), which is an ordinary differential equation (ODE).

Given the force F (which is known) and an initial condition, e.g., the initial velocity \inline v_0 = v(t_0), the solution of the differential equation above gives the velocity at every later time, for example \inline v_1 = v(t_1), v_2 = v(t_2),\dots
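For instance, if the force is a constant F, the equation can be integrated directly and gives \inline v(t) = v_0 + \frac{F}{m}(t-t_0): the velocity grows linearly in time. This is one of the few cases in which the solution can be written down in closed form.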

How does one solve such an equation? Often it is not possible to find a solution analytically. In these cases numerical methods come to the rescue and provide approximate solutions; such methods go under the name of ODE solvers. One of the simplest is the Euler method. Suppose one needs to find the solution to the following differential equation

\begin{align*} \frac{dy}{dt} & = f(y,t) \\ y(t_0) & = y_0 \end{align*}

The Euler method is an iterative procedure: starting from the initial point \inline (t_0,y_0), it moves along the tangent line at that point, whose slope is \inline f(y_0,t_0). The value of the tangent line at time t_1, denoted \inline \hat{y}_1, is taken as an approximation of the true solution at that time, i.e., of \inline y_1 = y(t_1).

 

At each step of the Euler method, the solution is approximated by the value of the tangent line evaluated at the previous point.

 

At each subsequent time step the same procedure is repeated, which leads to the following recursive formula:

{\hat{y}_{n + 1}} = {\hat{y}_n} + f\left( {{t_n},{\hat{y}_n}} \right) \cdot \left( {{t_{n + 1}} - {t_n}} \right)
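As a minimal sketch of this recursion in code (the test equation \inline \frac{dy}{dt} = -y and the time grid are arbitrary choices for illustration):

```python
import numpy as np


def euler_solve(f, y0, t):
    """Approximate the solution of dy/dt = f(y, t) on the time grid t with Euler's method."""
    y_hat = np.empty(len(t))
    y_hat[0] = y0
    for n in range(len(t) - 1):
        # y_{n+1} = y_n + f(y_n, t_n) * (t_{n+1} - t_n)
        y_hat[n + 1] = y_hat[n] + f(y_hat[n], t[n]) * (t[n + 1] - t[n])
    return y_hat


# Test equation dy/dt = -y with y(0) = 1, whose exact solution is exp(-t)
t = np.linspace(0.0, 5.0, 51)
y_hat = euler_solve(lambda y, t: -y, y0=1.0, t=t)
print(np.max(np.abs(y_hat - np.exp(-t))))  # the error shrinks as the step size decreases
```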

This recursion, known as the Euler discretization, gives an approximation of the solution of the differential equation at each time step. That’s it for this quick recap of ODEs and the Euler method; let’s now go back to neural networks.

 

ODEs and neural networks

At this point, it should be clear that the equation for the activations of a residual block is very similar to the recursive Euler formula just described, and that there is a differential equation associated with it. Indeed, by adding more layers to the neural network and taking correspondingly smaller steps, in the limit where the step size tends to zero one obtains the ODE given by Equation (2). This equation describes the evolution of the activations with respect to the depth of the network, which has now become a continuous quantity.
Hence, it’s like having a network with an infinite number of layers. The hidden state at any depth \inline L can be evaluated by integrating the ODE:

\mathbf{a}(L) = \mathbf{a}(0) + \int_{0}^{L} f(\mathbf{a}(l),l,\theta(l))\, dl

If one considers the input data \inline \mathbf{x} as the initial value of the ODE, i.e., \inline \mathbf{a}(0) = \mathbf{x}, and reads the network’s prediction off the activations at some depth \inline L, i.e., \inline \hat{\mathbf{y}} = \mathbf{a}(L), then the ODE has a unique solution (under mild conditions on \inline f) that can be found by means of any ODE solver of choice:

\hat{\mathbf{y}} = \mathbf{a}(L) = ODESolve(\mathbf{a}(0),L,\theta(l))
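As an illustration, this forward pass can be mimicked with an off-the-shelf ODE solver. The sketch below uses scipy.integrate.solve_ivp with a hand-picked linear dynamics function and hard-coded parameters, both assumptions made purely for illustration; in a neural ODE, \inline f would be a small neural network and \inline \theta(l) its trainable weights.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy dynamics f(a, l, theta): a fixed linear map (chosen arbitrarily for illustration).
# In a neural ODE, f would be a small neural network and theta its trainable weights.
theta = np.array([[-0.5, 1.0],
                  [-1.0, -0.5]])


def f(l, a):
    # solve_ivp expects f(t, y); here the depth l plays the role of time
    return theta @ a


x = np.array([1.0, 0.0])   # input data used as the initial value a(0)
L = 1.0                    # depth at which the activations are read out

sol = solve_ivp(f, t_span=(0.0, L), y0=x, rtol=1e-6)
y_hat = sol.y[:, -1]       # a(L) = ODESolve(a(0), L, theta)
print(y_hat)
```

Any other solver would do (solve_ivp defaults to an adaptive Runge-Kutta scheme); the choice of solver is decoupled from the model itself.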

This solution allows one to capture the complex relationship between \inline \mathbf{x} and \inline \mathbf{y}. One problem remains: how to choose the depth L and the parameters \inline \theta(l)? Luckily, backpropagation comes to the rescue again, in the specific form described in [5] (the adjoint sensitivity method). Just like with any standard deep learning model, the predictions produced by the network are compared with the true target values, epoch by epoch, and the resulting error is used to optimize the free parameters.
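A minimal training sketch is shown below. It assumes the torchdiffeq package released alongside [5]; the exact interface is an assumption to check against the library’s documentation, and the regression task and network sizes are made up for illustration.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint as odeint  # library released alongside [5]


class ODEFunc(nn.Module):
    """The dynamics f(a, l, theta), parameterized by a small network (sizes chosen arbitrarily)."""

    def __init__(self, dim: int = 2, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, l, a):
        # The solver calls f(depth, activations)
        return self.net(a)


func = ODEFunc()
optimizer = torch.optim.Adam(func.parameters(), lr=1e-3)
depths = torch.tensor([0.0, 1.0])   # integrate the activations from depth 0 to depth L = 1

x = torch.randn(64, 2)   # made-up inputs for a toy regression task
y = torch.randn(64, 2)   # made-up targets

for epoch in range(100):
    y_hat = odeint(func, x, depths)[-1]   # a(L) = ODESolve(a(0), L, theta)
    loss = ((y_hat - y) ** 2).mean()      # compare predictions with the targets
    optimizer.zero_grad()
    loss.backward()                        # gradients computed via the adjoint method of [5]
    optimizer.step()
```

Because gradients are obtained by solving a second ODE backwards in depth, the intermediate activations do not need to be stored, which is where the constant memory cost mentioned below comes from.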

What are the benefits of using ODEs over the more intuitive residual networks? They are summarized below:

  • ODE networks are memory efficient. Unlike standard deep learning models, neural ODE models can be trained with constant memory cost as a function of depth.
  • ODE networks require fewer parameters. Neural ODEs may need fewer parameters to achieve comparable or better accuracy than classical deep neural networks on supervised learning tasks, which also suggests they can be trained with less data.
  • ODE networks are more flexible time-series models. Unlike recurrent neural networks, neural ODEs can naturally incorporate data arriving at arbitrary times, i.e., unequally spaced data points. This allows one to build more generic time-series models.

Moreover, neural ODEs have been shown to learn differential equations directly from data. Since many phenomena in physics, biology and economics can be described by differential equations, researchers are looking at ODE networks as a natural tool for such problems, one that may accelerate the pace of scientific discovery.

 

References

[1] K. He, et al., “Deep Residual Learning for Image Recognition”, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016

[2] S. Hochreiter, et al., “Long short-term memory”, Neural Computation 9(8), pages 1735-1780, 1997.

[3] Q. Liao, et al., “Bridging the gaps between residual learning, recurrent neural networks and visual cortex”, arXiv preprint arXiv:1604.03640, 2016.

[4] Y. Lu, et al., “Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations”, Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 2018.

[5] T. Q. Chen, et al., “Neural Ordinary Differential Equations”, Advances in Neural Information Processing Systems 31, pages 6571-6583, 2018.
