The physics of optimization algorithms

In a previous post, we mentioned that one reason for the successes of deep learning methods in the last decade has certainly been the research effort devoted to novel optimization algorithms. Indeed, popular techniques like the Adam optimizer [1] offer several advantages, such as computational efficiency, low memory requirements and a faster rate of convergence to the optimum, to mention just a few. This makes training reliable even for networks with many layers, which in turn leads to more powerful deep learning models.

In a recent paper titled The Physical Systems Behind Optimization Algorithms, the authors highlight some connections between gradient descent based optimizers and the dynamics of damped harmonic oscillators [2]. Thanks to this contribution, we now have a better theoretical understanding of how these optimization algorithms behave.
Here is how all this works.

In physics, the simple harmonic oscillator is a mechanical system consisting of a particle of mass m connected to a fixed point by means of a spring. The system always tends to return to its equilibrium position, namely the position of zero potential energy. Whenever the spring is stretched or compressed, a force acts on the object that tends to bring the mass back towards the resting position. The dynamics of the system are described by Newton's second law of motion $F = m \cdot a$.
In this case, the force F is the elastic or restoring force given by $F_e = -k \cdot x$, where x is the displacement of the object relative to the equilibrium and k is Hooke's constant, a measure of the spring's stiffness.

Newton’s law can also be rewritten as a differential equation:

$m \cdot \ddot{x}(t) + k \cdot x(t) = 0$, where $\ddot{x}(t) = \frac{d^2x(t)}{dt^2}$ is the acceleration of the particle.

Those who are not very familiar with ordinary differential equations (ODEs) can read Neural networks with infinite layers, in which ODEs are briefly summarized.
If the spring is compressed and then released at point $x_0$, the system starts oscillating. The solution of the above differential equation gives the position of the particle over time, $x(t) = x_0 \cdot \cos(\omega \cdot t)$, where $\omega = \sqrt{\frac{k}{m}}$ is the oscillation frequency. In this system there are two forms of energy, namely the elastic potential energy $V(x(t)) = \frac{1}{2} \cdot k \cdot x(t)^2$ and the kinetic energy $K(x(t)) = \frac{1}{2} \cdot m \cdot \dot{x}(t)^2$, $\dot{x}(t) = \frac{dx(t)}{dt}$ being the velocity of the object.
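As a quick numerical sanity check, the analytic solution can be plugged into the two energy formulas to verify that the total energy is constant over time (the values of m, k and $x_0$ below are arbitrary illustrative choices):

```python
import math

# Simple harmonic oscillator: x(t) = x0 * cos(omega * t), omega = sqrt(k/m).
# m, k and x0 are hypothetical values chosen for illustration.
m, k, x0 = 1.0, 4.0, 0.5
omega = math.sqrt(k / m)

def energy(t):
    x = x0 * math.cos(omega * t)            # position
    v = -x0 * omega * math.sin(omega * t)   # velocity dx/dt
    return 0.5 * k * x**2 + 0.5 * m * v**2  # V(x) + K(x)

# The total energy equals the initial potential energy V(x0) at every instant.
for t in (0.0, 0.5, 1.3, 2.7):
    assert abs(energy(t) - 0.5 * k * x0**2) < 1e-12
```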

According to the well-known law of energy conservation, this system continues to oscillate indefinitely: the initial potential energy $V(x_0)$ is converted into kinetic energy, which is in turn converted back into potential energy, and so on, just like in a roller-coaster. In the real world, however, such ideal systems do not exist; real systems behave more like a damped harmonic oscillator, in which a frictional force, proportional to the negative of the particle's velocity, is also present.

Unlike the simple harmonic oscillator, such a system would reach an equilibrium where the particle eventually stops with zero potential energy, due to the dissipation of the kinetic energy caused by the friction. Such a dynamic is described by the following differential equation:

$m \cdot \ddot{x}(t) + c \cdot \dot{x}(t) + k \cdot x(t) = 0$, where c is the viscous damping coefficient and $F_f = -c \cdot \dot{x}(t)$ denotes the frictional force. The behavior of the system is determined by the damping ratio $\zeta = \frac{c}{2 \cdot \sqrt{m \cdot k}}$.

Depending on the value of $\zeta$, the system can be categorized as:

• overdamped (ζ > 1) The system returns to steady state without oscillating. More precisely, the total energy (kinetic + potential) decays exponentially as $E(t) \propto \exp(-\frac{c \cdot t}{2 \cdot m})$.
• critically damped (ζ = 1) The system returns to steady state as quickly as possible without oscillating; with a nonzero initial velocity there may be a transitory period in which it seems to move away from the equilibrium (overshooting).
• underdamped (ζ < 1) The system oscillates with a slightly lower frequency than in the undamped case, with an amplitude that gradually decreases to zero.
• extremely overdamped (ζ >> 1) The decay no longer depends on the particle mass: the system behaves as though the particle had no mass at all.
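These regimes are easy to reproduce with a short numerical integration of the damped oscillator equation. The sketch below uses a semi-implicit Euler scheme and illustrative parameters (m = k = 1), not values taken from the paper:

```python
import math

def simulate(zeta, m=1.0, k=1.0, x0=1.0, dt=1e-3, steps=20000):
    """Integrate m*x'' + c*x' + k*x = 0 with semi-implicit Euler."""
    c = 2.0 * zeta * math.sqrt(m * k)  # invert zeta = c / (2*sqrt(m*k))
    x, v = x0, 0.0
    xs = []
    for _ in range(steps):
        a = -(c * v + k * x) / m       # Newton's law: a = F / m
        v += a * dt
        x += v * dt
        xs.append(x)
    return xs

under = simulate(0.25)  # underdamped: oscillates through the equilibrium
over = simulate(4.0)    # overdamped: creeps towards zero without crossing it

# The underdamped trajectory changes sign; the overdamped one never does.
assert any(a * b < 0 for a, b in zip(under, under[1:]))
assert all(x > 0 for x in over)
```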

Optimization for neural networks

How is all this related to gradient-based algorithms?
There is indeed an interesting connection between the damped harmonic oscillator and gradient-based optimization algorithms: we can interpret an optimization problem as a system in which a particle falls inside a given potential.
One can think of it as a ball bouncing in a landscape with hills and valleys. When it starts bouncing, the ball has high potential energy. Then, this energy gets converted into kinetic energy. In the long term, the ball will settle in a low elevation area and will have small potential energy (which will be zero at sea level).

The solution to the optimization problem corresponds to the point of zero energy attained when the particle is standing still. In order to show the connection between damped harmonic oscillators and optimization algorithms, here is an example of the gradient descent iteration step.

The value at the next iteration is calculated from the current one by moving in the direction of the negative gradient, with a step size given by the learning rate $\eta$:

$x_{n+1} = x_n - \eta \cdot \nabla f(x_n)$

This equation can be rewritten as:

$x_{n+1} - x_n + \eta \cdot \nabla f(x_n) = 0$

Interpreting the iterates as samples $x_n = x(n \cdot h)$ of a continuous trajectory with time step h, the Taylor expansion of $x(t + h)$ around t gives

$x_{n+1} - x_n = \dot{x}(t) \cdot h + \frac{1}{2}\ddot{x}(t) \cdot h^2 + \dots$

By plugging this expansion into the previous equation and dividing by $\eta$, in the limit $h \to 0$ we obtain the following differential equation:

$\frac{h^2}{2 \cdot \eta} \ddot{x}(t) + \frac{h}{\eta} \dot{x}(t) + \nabla f(x(t)) = 0$

Does this look familiar? It should, because this is just the damped oscillator system we have seen before.
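One well-known consequence of this continuous-time view (not derived in the post, but easy to check numerically) is that for a small learning rate, gradient descent closely tracks the first-order gradient flow $\dot{x}(t) = -\nabla f(x(t))$. Here is a minimal sketch on the hypothetical objective $f(x) = \frac{1}{2} x^2$, whose gradient flow has the exact solution $x(t) = x_0 \cdot e^{-t}$:

```python
import math

# Gradient descent on f(x) = 0.5 * x**2 (so grad f(x) = x) is forward Euler
# applied to the gradient flow x'(t) = -x(t), with time step eta.
eta, x0, n_steps = 0.01, 1.0, 500

x = x0
for _ in range(n_steps):
    x -= eta * x                      # x_{n+1} = x_n - eta * grad f(x_n)

flow = x0 * math.exp(-eta * n_steps)  # exact flow solution at t = eta * n
assert abs(x - flow) < 1e-3           # discrete iterates track the ODE
```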
This reasoning can be extended to other variants of gradient descent, such as the Adam method mentioned at the beginning of this article.
In this new perspective, the various gradient descent optimization methods correspond to damped oscillators with different particle mass and damping coefficients, characterized by different dissipation mechanisms of their mechanical energy. Moreover, the decay rate of the mechanical energy is connected to the convergence rate of the related optimization algorithm. Yes, there is physics in deep learning.
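To make this correspondence concrete, here is a hypothetical one-dimensional experiment (all parameters below are illustrative). On the quadratic $f(x) = \frac{1}{2} x^2$, plain gradient descent decays monotonically towards the minimum, like an overdamped particle, while the classical heavy-ball (momentum) variant adds inertia and overshoots, oscillating around the minimum like an underdamped one:

```python
# Hypothetical quadratic objective f(x) = 0.5 * x**2, with grad f(x) = x.
def grad(x):
    return x

eta = 0.1  # learning rate

# Plain gradient descent: creeps to the minimum without ever crossing it,
# like a heavily damped (overdamped) particle.
x = 1.0
gd_path = []
for _ in range(50):
    x -= eta * grad(x)
    gd_path.append(x)

# Heavy-ball (momentum) gradient descent: the extra inertia lets the iterate
# overshoot the minimum and oscillate around it (underdamped behavior).
x, v, beta = 1.0, 0.0, 0.9
hb_path = []
for _ in range(50):
    v = beta * v - eta * grad(x)
    x += v
    hb_path.append(x)

assert all(xi > 0 for xi in gd_path)                         # no sign change
assert any(a * b < 0 for a, b in zip(hb_path, hb_path[1:]))  # sign changes
```

The learning rate and the momentum coefficient control the balance between inertia and damping, which is one way to read the paper's claim that different optimizers correspond to oscillators with different masses and damping coefficients.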

Let’s now try to understand the implications of explaining optimization algorithms in terms of physical systems. One advantage is that by taking inspiration from other systems beyond the damped harmonic oscillator, one may come up with more powerful optimization techniques and even novel cost functions and regularization schemes.
Furthermore, if one were able to explain human intelligence from a physical point of view, transferring such knowledge into new machine learning models would be straightforward.
As an example, let’s take the quantum mind or quantum consciousness group of hypotheses [3] into consideration.
This group of hypotheses states that quantum mechanical phenomena, such as entanglement and superposition, may play an important role in the cognitive functions of the brain, and could even help form the basis for one possible explanation of consciousness. If these hypotheses were correct, it would be possible to use their equations as the basis for new quantum machine learning models.
This outcome would allow scientists to build machines far more intelligent than the ones we have today. In fact, such hypothetical machines would be equipped with the same intelligence mechanisms that characterize humans, finally paving the way for the much-hyped Artificial General Intelligence (AGI).

References

[1] D.P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization”, International Conference on Learning Representations (ICLR), San Diego, 2015.

[2] Lin F. Yang, et al., “The Physical Systems Behind Optimization Algorithms”, 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

[3] R. Penrose, et al., “Consciousness in the Universe: Neuroscience, Quantum Space-Time Geometry and Orch OR Theory”, Journal of Cosmology 14, 2011.