# Neural networks with infinite layers

Today’s story starts with residual networks, also known as Resnets. He et al. proposed Resnets in [1] as a mechanism to overcome the vanishing and exploding gradient problems affecting very deep neural networks.

A neural network with many layers is difficult to train not only because of the sheer number of layers, but also because whenever the weights are just slightly greater or slightly smaller than one, the activations (more specifically, the outputs of the neurons in each layer) can grow or shrink exponentially with depth. Training a neural network amounts to minimizing the error that the network makes while predicting some output from some input. Minimizing this error requires computing the gradient of the loss with respect to the weights. Because the gradient is propagated through the same layers, it can explode or vanish in the same way, and this affects gradient descent as a whole. As a result, the numerical optimization algorithm has a hard time moving towards the minimum. In simple words, the network is not going to learn much from the data.
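
The exponential effect described above can be reproduced with a toy example. The sketch below assumes a deliberately simplified network in which every layer just multiplies its input by the same scalar weight; the function name `forward_scale` and the depth of 100 are illustrative choices, not something taken from [1].

```python
def forward_scale(weight, depth, x=1.0):
    """Pass x through `depth` linear layers that each multiply by `weight`."""
    a = x
    for _ in range(depth):
        a = weight * a
    return a

depth = 100
grown = forward_scale(1.1, depth)   # weight slightly above one
shrunk = forward_scale(0.9, depth)  # weight slightly below one
print(f"weight 1.1 -> activation {grown:.3e}")
print(f"weight 0.9 -> activation {shrunk:.3e}")
```

After 100 layers the first activation is on the order of ten thousand and the second is on the order of one hundred-thousandth, even though both weights differ from one by only 0.1.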

In light of what has been said about the exploding and vanishing gradient problems, let’s see how Resnets can help mitigate these issues. The core idea behind Resnets is the use of so-called skip connections, or shortcuts. This means taking the activations from one layer and feeding them to another layer deeper in the network architecture. This architectural choice, depicted in the figure below, is called a residual block. Residual networks are obtained by stacking many residual blocks on top of each other.

How can this help tackle the gradient problems mentioned above?

A residual block easily learns the identity function.
Suppose $\mathbf{a}(l)$ is close to zero. Without skip connections, $\mathbf{a}(l+1)$ would also be close to zero, and so would all the activations deeper in the network. With a skip connection from layer $l-1$ to layer $l+1$, instead, one would have $\mathbf{a}(l+1) \approx \mathbf{a}(l-1)$, i.e., the block falls back to the identity. This has been shown not to hurt performance. In fact it improves it, because it allows a better descent along the gradient during training.
One possible interpretation of this improvement is that skip connections allow the network to “remember” the information stored in earlier layers. The idea of carrying information over from earlier steps is also the main principle behind Long Short-Term Memory (LSTM) networks, a deep learning model widely used for time-series analysis and language modeling [2].
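
The identity behaviour can be checked numerically. The following is a minimal NumPy sketch, assuming a two-layer block with ReLU activations and all weights set to zero; the helper names `plain_block` and `residual_block` are hypothetical, and `a_in` plays the role of $\mathbf{a}(l-1)$ while the return value plays the role of $\mathbf{a}(l+1)$.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def plain_block(a, W1, b1, W2, b2):
    """Two stacked layers without a skip connection."""
    return relu(W2 @ relu(W1 @ a + b1) + b2)

def residual_block(a, W1, b1, W2, b2):
    """Same two layers, with the input added back before the last activation."""
    return relu(W2 @ relu(W1 @ a + b1) + b2 + a)

a_in = np.array([0.5, 1.0, 2.0])   # non-negative, so ReLU leaves it unchanged
W = np.zeros((3, 3))               # assumption: the layers have learned nothing yet
b = np.zeros(3)

out_plain = plain_block(a_in, W, b, W, b)
out_res = residual_block(a_in, W, b, W, b)
print("plain:   ", out_plain)
print("residual:", out_res)
```

With zero weights the plain block wipes out its input, whereas the residual block passes it through unchanged, so the layers only need to learn a correction on top of the identity.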

But wait, there’s more! Resnets have another intriguing characteristic. In [3], researchers have shown that it’s possible to write the activations of a residual block for each layer $l$ as:

$\mathbf{a}(l+1) = \mathbf{a}(l) + f(\mathbf{a}(l), \mathbf{\theta}(l))$ (1)

where $\mathbf{\theta}(l)$ are the parameters of the network (weights and biases) for layer $l$ and $f$ is a generic nonlinear function associated with the network architecture.
Does this sound familiar?
As stated in “Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equation” [4], the previous formula looks like the Euler discretization of the following ordinary differential equation (ODE) involving the activations:

$\frac{d\mathbf{a}(l)}{dl} = f(\mathbf{a}(l), l, \mathbf{\theta}(l))$ (2)

Before proceeding, let’s quickly review the concept of differential equations and the Euler method. Those who are familiar with ODEs can skip the next section.

## Differential Equations

A differential equation contains derivatives, i.e., the rate of variation of one quantity with respect to another (typically time or space). One familiar differential equation is Newton’s second law of motion. The law states that the acceleration of an object depends on two variables: the net force acting upon the object and the object’s mass.

$F = m\cdot a$

In terms of velocity of the object, you can write Newton’s law as:

$m\frac{dv}{dt} = F(v,t)$, which is an ordinary differential equation (ODE).

Given a known force $F$ and some initial condition, e.g., the initial velocity $v_0 = v(t=t_0)$, the solution to the differential equation above gives the velocity at each moment, for example $v_1 = v(t_1), v_2 = v(t_2), \dots$

How does one solve such an equation? Sometimes it is not possible to find solutions analytically. In these cases numerical methods come to the rescue and provide approximate solutions. Such methods go under the name of ODE solvers. One such tool is the Euler method. Suppose one needs to find the solution to the following differential equation with a given initial condition:

\begin{align*} \frac{dy}{dt} &= f(y,t)\\ y(t_0) &= y_0 \end{align*}

The Euler method is an iterative procedure that, starting from an initial point $(t_0,y_0)$, moves along the tangent line, i.e., in the direction given by the slope $f(y_0,t_0)$ evaluated at the initial point. The value reached on the tangent line at time $t_1$, denoted $\hat{y}_1$, is an approximation of the true solution at that time, $y_1 = y(t_1)$.

At each subsequent time step, the same procedure is repeated and described by the following recursive formula:

$\hat{y}_{n+1} = \hat{y}_n + f\left(\hat{y}_n, t_n\right) \cdot \left(t_{n+1} - t_n\right)$

This formula, named Euler discretization, gives an approximation of the solution of the differential equation at each time step. That’s it for this quick recap of ODEs and the Euler method. Let’s now go back to neural networks.
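
The recursive formula translates almost line by line into code. The sketch below assumes equally spaced time steps and uses the test problem $dy/dt = -y$ with $y(0) = 1$, whose exact solution $y(t) = e^{-t}$ is known; both the function name `euler_solve` and the test problem are choices made here for illustration.

```python
import math

def euler_solve(f, y0, t0, t1, n_steps):
    """Integrate dy/dt = f(y, t) from t0 to t1 with n_steps equal Euler steps."""
    h = (t1 - t0) / n_steps
    y, t = y0, t0
    for _ in range(n_steps):
        y = y + h * f(y, t)   # follow the tangent line for one step
        t = t + h
    return y

def decay(y, t):
    """Right-hand side of dy/dt = -y."""
    return -y

exact = math.exp(-1.0)                           # true value of y(1)
coarse = euler_solve(decay, 1.0, 0.0, 1.0, 10)
fine = euler_solve(decay, 1.0, 0.0, 1.0, 1000)
print("error with   10 steps:", abs(coarse - exact))
print("error with 1000 steps:", abs(fine - exact))
```

Taking more, smaller steps brings the approximation closer to the exact solution, at the price of more evaluations of $f$.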

## ODEs and neural networks

At this point, it should be clear that the equation for the activations of a residual block, Equation (1), is very similar to the recursive Euler formula just described, with a step size equal to one. It should also be clear that there is a differential equation associated with it. Indeed, by adding more layers to the network and taking smaller and smaller steps, in the limit where the step size tends to zero one obtains the ODE given by Equation (2). This equation describes the evolution of the activations with respect to the depth of the network, which is now a continuous quantity.
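
One way to see this limit, assuming an explicit step size $h$ that does not appear in Equation (1) (where it is implicitly equal to one), is to rewrite the residual update as

$\mathbf{a}(l+h) = \mathbf{a}(l) + h\, f(\mathbf{a}(l), l, \mathbf{\theta}(l))$

and rearrange it into a difference quotient:

$\frac{\mathbf{a}(l+h) - \mathbf{a}(l)}{h} = f(\mathbf{a}(l), l, \mathbf{\theta}(l))$

Letting $h \to 0$ turns the left-hand side into the derivative $\frac{d\mathbf{a}(l)}{dl}$, which is the left-hand side of Equation (2).
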
Hence, it’s like having a network with an infinite number of layers. The hidden state at any depth can be evaluated by solving the integral

$\mathbf{a}(l) = \mathbf{a}(0) + \int_0^l f(\mathbf{a}(s), s, \theta(s))\,ds$

If one takes the input data $\mathbf{x}$ as the initial value of the ODE, i.e., $\mathbf{a}(l=0) = \mathbf{x}$, and lets the target vector $\mathbf{y}$ be the value of the hidden activation at some depth $L$, i.e., $\mathbf{y} = \mathbf{a}(L)$, then the ODE has a unique solution (under mild regularity conditions on $f$), which can be found with any ODE solver of choice:

$\hat{\mathbf{y}} = \mathbf{a}(L) = \mathrm{ODESolve}(\mathbf{a}(0), L, \theta(l))$
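
As a rough illustration of what ODESolve could look like, the sketch below uses the Euler method as the solver, so that each solver step is a residual update of the form of Equation (1) with an explicit step size. The dynamics `f` (a single tanh layer with weights shared across depth), the fixed weight matrix and the function name `ode_solve` are all assumptions made for this example; a practical implementation would typically rely on an adaptive solver.

```python
import numpy as np

def f(a, l, theta):
    """Hypothetical layer dynamics: one tanh layer, same weights at every depth."""
    W, b = theta
    return np.tanh(W @ a + b)

def ode_solve(a0, L, theta, n_layers):
    """Euler-based ODESolve: n_layers residual updates of size h = L / n_layers."""
    h = L / n_layers
    a = np.array(a0, dtype=float)
    for k in range(n_layers):
        a = a + h * f(a, k * h, theta)   # residual update with explicit step size
    return a

theta = (np.array([[0.0, -1.0], [1.0, 0.0]]), np.zeros(2))  # illustrative weights
x = np.array([1.0, -1.0])                                   # input, i.e. a(0)

y_10 = ode_solve(x, 1.0, theta, 10)
y_100 = ode_solve(x, 1.0, theta, 100)
y_200 = ode_solve(x, 1.0, theta, 200)
print("10 layers :", y_10)
print("100 layers:", y_100)
print("200 layers:", y_200)
```

As the number of layers grows, the outputs settle towards a common value: the solution of the continuous-depth ODE at depth $L$.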

This solution allows one to discover the relationship between $\mathbf{x}$ and $\mathbf{y}$. But a problem remains: choosing the parameters $L$ and $\theta(l)$. Luckily, backpropagation comes to the rescue (see [5]). Just like in standard deep learning models, one compares the predictions produced by the network with the true target values, epoch by epoch, and uses a specific form of backpropagation (the adjoint sensitivity method of [5]) to turn the resulting error into gradients that optimize the free parameters.
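
The training loop can be sketched on a toy problem. The example below is not the adjoint method of [5]: as a simpler stand-in, it propagates the derivative of the state with respect to the parameter through the Euler steps (forward-mode sensitivity). It assumes scalar dynamics $da/dl = \theta a$, a fixed depth $L = 1$, a single training pair and a squared-error loss, all chosen here for illustration.

```python
import math

def forward(x, theta, L=1.0, n_steps=100):
    """Euler solve of da/dl = theta * a on [0, L]; also returns d a(L) / d theta."""
    h = L / n_steps
    a, da_dtheta = x, 0.0
    for _ in range(n_steps):
        # derivative of the update a + h * theta * a with respect to theta
        da_dtheta = da_dtheta + h * (a + theta * da_dtheta)
        a = a + h * theta * a
    return a, da_dtheta

x, y_true = 1.0, math.exp(0.5)   # target generated by the exact flow with theta = 0.5
theta, lr = 0.0, 0.1             # initial guess and learning rate (assumptions)

for epoch in range(200):
    y_hat, dy_dtheta = forward(x, theta)
    grad = 2.0 * (y_hat - y_true) * dy_dtheta   # gradient of the squared error
    theta -= lr * grad

print("learned theta:", theta)
```

The learned parameter ends up close to, but not exactly at, 0.5, because the Euler solver with 100 steps only approximates the exact flow that generated the target.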

Neural ODEs are less intuitive than residual networks. However, using them brings many benefits:

• ODE networks are memory efficient. Unlike standard deep learning models, the memory cost of training neural ODE models does not grow as a function of depth.
• ODE networks require fewer parameters. Neural ODEs may require fewer parameters to achieve comparable or better accuracy than classical deep neural networks in supervised learning tasks. One possible consequence is that training may require less data.
• ODE networks are more flexible time-series models. Unlike recurrent neural networks, neural ODEs can naturally incorporate data arriving at arbitrary times, i.e., unequally spaced data points. This allows one to build more generic time-series models.
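
The last point can be illustrated with a small sketch: an ODE solver only needs the time stamps at which the state is requested, and these do not have to be equally spaced. The example assumes the Euler method, the test dynamics $dy/dt = -y$ and a made-up list of observation times; the function name `euler_on_grid` is hypothetical.

```python
def euler_on_grid(f, y0, times):
    """Euler steps between consecutive, possibly unequally spaced, time stamps."""
    ys = [y0]
    for t_prev, t_next in zip(times[:-1], times[1:]):
        ys.append(ys[-1] + (t_next - t_prev) * f(ys[-1], t_prev))
    return ys

times = [0.0, 0.05, 0.3, 0.35, 0.9, 1.0]   # hypothetical irregular observation times
states = euler_on_grid(lambda y, t: -y, 1.0, times)

for t, y in zip(times, states):
    print(f"t = {t:.2f}  y = {y:.4f}")
```

Each step simply uses the actual gap between two consecutive time stamps. The large gaps make this crude solver inaccurate, which is one reason adaptive solvers are usually preferred in practice.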

Moreover, neural ODEs have been shown to learn differential equations directly from data. Since the standard models for many phenomena in physics, biology and economics use differential equations, researchers are looking at ODE networks as a natural solution to such problems.

## References

[1] K. He, et al., “Deep Residual Learning for Image Recognition”, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016.

[2] S. Hochreiter, et al., “Long short-term memory”, Neural Computation 9(8), pages 1735-1780, 1997.

[3] Q. Liao, et al., “Bridging the gaps between residual learning, recurrent neural networks and visual cortex”, arXiv preprint, arXiv:1604.03640, 2016.

[4] Y. Lu, et al., “Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equation”, Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 2018.

[5] T. Q. Chen, et al., “Neural Ordinary Differential Equations”, Advances in Neural Information Processing Systems 31, pages 6571-6583, 2018.