The physics of optimization algorithms


In a previous post, we mentioned that one reason for the success of deep learning methods over the last decade has certainly been the research effort devoted to novel optimization algorithms. Indeed, popular techniques like the Adam optimizer [1] offer several advantages, such as computational efficiency, low memory requirements and a faster rate of convergence to the optimum, just to mention a few. This makes it possible to train reliable neural networks even in scenarios with many layers, which in turn leads to more powerful deep learning models.

In a recent paper titled The Physical Systems Behind Optimization Algorithms, the authors highlight some connections between gradient-descent-based optimizers and the dynamics of damped harmonic oscillators [2]. Thanks to this contribution, we now have a better theoretical picture of how these optimization algorithms behave.

Background: the simple harmonic oscillator

In physics, the simple harmonic oscillator is a mechanical system consisting of a particle of mass m connected to a fixed point by means of a spring. The system always tends to stay at the equilibrium position, namely at zero potential energy. Whenever the spring is stretched or compressed, there will be a force acting on the object. That force will tend to bring the mass towards the resting position again. Newton’s second law of motion \inline F = m \cdot a describes the dynamics of the system.
In this case, force F is the elastic or restoring force given by \inline F_e = -k \cdot x, where x is the displacement of the object relative to the equilibrium and k is Hooke’s constant, which is a measure of the spring’s elasticity.

Schematic representation of a simple harmonic oscillator

Of course, you can write Newton’s law as a differential equation:

m \cdot \ddot{x}(t) + k \cdot x(t) = 0

where \inline \ddot{x}(t) = \frac{d^2x(t)}{dt^2} is the acceleration of the particle.

Those who are not very familiar with ordinary differential equations (ODEs) can read Neural networks with infinite layers, in which ODEs are briefly summarized.
If the spring is compressed and then released at point \inline x_0, the system starts oscillating. The solution of the above differential equation gives the position of the particle over time, namely \inline x(t) = x_0\cdot\cos(\omega\cdot t), where \inline \omega = \sqrt{\frac{k}{m}} is the angular frequency of the oscillation. In this system there are two forms of energy, namely the elastic potential energy \inline V(x(t)) = \frac{1}{2} \cdot k \cdot x(t)^2 and the kinetic energy \inline K(x(t)) = \frac{1}{2} \cdot m \cdot \dot{x}(t)^2, \inline \dot{x}(t) = \frac{dx(t)}{dt} being the velocity of the object.

According to the well-known law of energy conservation, this system continues to oscillate indefinitely. In fact, the initial potential energy \inline V(x_0) turns into kinetic energy, which then turns back into potential energy and so on, just like in a roller coaster.
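To make this concrete, here is a minimal NumPy sketch (the mass, spring constant and initial displacement are arbitrary example values) that evaluates the analytic solution and verifies that the total mechanical energy stays constant:

```python
import numpy as np

# Example values for mass, spring constant and initial displacement (arbitrary choices).
m, k, x0 = 1.0, 4.0, 0.5
omega = np.sqrt(k / m)                    # oscillation frequency omega = sqrt(k/m)

t = np.linspace(0.0, 10.0, 1000)
x = x0 * np.cos(omega * t)                # analytic solution x(t) = x0 * cos(omega * t)
v = -x0 * omega * np.sin(omega * t)       # velocity dx/dt

V = 0.5 * k * x**2                        # elastic potential energy
K = 0.5 * m * v**2                        # kinetic energy

# Energy conservation: V + K equals the initial potential energy V(x0) at every instant.
print(np.allclose(V + K, 0.5 * k * x0**2))  # True
```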

A more realistic oscillator

In the real world, however, simple harmonic oscillators do not really exist. Real oscillators behave more like a damped harmonic oscillator, in which a frictional force, proportional to the velocity of the particle and opposing its motion, is also present.

Schematic representation of a damped harmonic oscillator.

Unlike the simple harmonic oscillator, such a system eventually reaches an equilibrium in which the particle stops with zero potential energy, because friction dissipates its kinetic energy. The following differential equation describes the dynamics of our system:

m \cdot \ddot{x}(t) + c \cdot \dot{x}(t) + k \cdot x(t) = 0

where c is the viscous damping coefficient and \inline F_f = -c \cdot \dot{x}(t) denotes the frictional force. The behavior of the system is determined by the damping ratio \inline \zeta = \frac{c}{2 \cdot \sqrt{m \cdot k}}.

Depending on the value of \inline \zeta, the system belongs to one of the following categories (a short numerical sketch follows the list):

  • overdamped (ζ > 1) The system returns to the steady state without oscillating, but more slowly than in the critically damped case.
  • critically damped (ζ = 1) The system returns to the steady state as quickly as possible without oscillating; it may overshoot the equilibrium at most once if it starts with a large enough initial velocity.
  • underdamped (ζ < 1) The system oscillates at a slightly lower frequency than the undamped case, with an amplitude that gradually decreases to zero within an envelope proportional to \inline \exp(-\frac{c \cdot t}{2 \cdot m}), so the total energy (kinetic + potential) also decays exponentially.
  • extremely overdamped (ζ >> 1) The inertia term becomes negligible and the motion reduces to \inline c \cdot \dot{x}(t) + k \cdot x(t) \approx 0, whose decay rate \inline \frac{k}{c} does not depend on the particle mass. The system behaves as though the particle had no mass at all.
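Here is the numerical sketch of these regimes mentioned above (all parameter values are arbitrary examples): it integrates the damped oscillator with a simple semi-implicit Euler scheme and counts how many times the particle crosses the equilibrium position.

```python
import numpy as np

def simulate(zeta, m=1.0, k=1.0, x0=1.0, dt=1e-3, steps=20000):
    """Integrate m*x'' + c*x' + k*x = 0 for a given damping ratio zeta."""
    c = 2.0 * zeta * np.sqrt(m * k)       # damping coefficient from the damping ratio
    x, v = x0, 0.0                        # particle released at rest from x0
    xs = np.empty(steps)
    for i in range(steps):                # semi-implicit Euler integration
        v += (-c * v - k * x) / m * dt
        x += v * dt
        xs[i] = x
    return xs

for zeta in (0.2, 1.0, 5.0):              # underdamped, critically damped, overdamped
    xs = simulate(zeta)
    crossings = int(np.sum(np.diff(np.sign(xs)) != 0))   # zero crossings = oscillations
    print(f"zeta={zeta}: zero crossings={crossings}, |x| after 20s={abs(xs[-1]):.4f}")
```

The underdamped run keeps crossing the equilibrium, the critically damped run reaches it fastest without crossing it, and the strongly overdamped run creeps back much more slowly.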

Optimization for neural networks

How is all this related to gradient-based algorithms?
As a matter of fact, there is an interesting connection between the damped harmonic oscillator and gradient-based optimization algorithms: we can interpret an optimization problem as a physical system in which a particle moves inside a given potential.
One can think of it as a ball bouncing around a landscape of hills and valleys. When it starts bouncing, the ball has high potential energy, which then turns into kinetic energy. In the long run, the ball settles in a low-lying area, where its potential energy is small (and zero if we take sea level as the reference).

The solution of the optimization problem corresponds to the minimum-energy state attained when the particle finally stands still.

Some mathematical machinery

In order to show the connection between damped harmonic oscillators and optimization algorithms, let’s see how gradient descent works.

We start from an initial guess \inline x_n of the solution. Using this guess as a starting point, we compute a new estimate from the current one by following the direction of the negative gradient, scaled by the learning rate \inline \eta:

\inline x_{n+1} = x_n - \eta \cdot \nabla f(x_n)
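As a reference, this is what the update rule looks like in code; the quadratic objective and the learning rate below are just illustrative choices:

```python
def gradient_descent(grad_f, x0, eta=0.1, steps=100):
    """Plain gradient descent: x_{n+1} = x_n - eta * grad_f(x_n)."""
    x = x0
    for _ in range(steps):
        x = x - eta * grad_f(x)
    return x

# Example: minimize f(x) = 0.5 * k * x^2 (the elastic potential), whose gradient is k * x.
k = 4.0
print(gradient_descent(lambda x: k * x, x0=0.5))  # converges towards 0, the minimum
```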

Rearranging this update rule, we get:

\inline x_{n+1} - x_n + \eta \cdot \nabla f(x_n) = 0

Then, identifying \inline x_n with \inline x(t) and \inline x_{n+1} with \inline x(t+h) for a small time step h, the Taylor expansion of the trajectory \inline x(t) lets us write:

\inline x_{n+1} - x_n = \dot{x}(t) \cdot h + \frac{1}{2}\ddot{x}(t) \cdot h^2 + \dots

By plugging this expansion into the previous equation, dividing by \inline \eta and keeping terms up to second order in h, we obtain, for small h, the following differential equation:

\frac{h^2}{2 \cdot \eta} \ddot{x}(t) + \frac{h}{\eta}\dot{x}(t) + \nabla f(x(t)) = 0

Does this look familiar? It should, because this is just the damped oscillator system we have seen before, with mass \inline m = \frac{h^2}{2 \cdot \eta}, damping coefficient \inline c = \frac{h}{\eta}, and the gradient \inline \nabla f(x(t)) playing the role of the restoring term \inline k \cdot x(t).
This reasoning also holds for other variants of gradient descent, such as the Adam method [1] mentioned at the beginning of this article.
In this new perspective, the various gradient descent optimization methods correspond to damped oscillators with different particle mass and damping coefficients, characterized by different dissipation mechanisms of their mechanical energy. Moreover, the decay rate of the mechanical energy relates to the convergence rate of the related optimization algorithm. Yes, there is physics in deep learning.
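As an illustration of this correspondence, here is a toy comparison on a one-dimensional quadratic objective (all parameter values are arbitrary examples): plain gradient descent behaves like the extremely overdamped, massless oscillator and decays monotonically towards the minimum, while heavy-ball (momentum) gradient descent reintroduces inertia and can overshoot it, just like an underdamped oscillator.

```python
k, x0, eta, steps = 4.0, 1.0, 0.05, 60    # quadratic f(x) = 0.5*k*x^2, grad f(x) = k*x

# Plain gradient descent: analogous to the extremely overdamped (massless) regime.
x, gd = x0, []
for _ in range(steps):
    x = x - eta * k * x
    gd.append(x)

# Heavy-ball (momentum) gradient descent: the momentum term acts like a particle mass,
# so with enough momentum the iterates overshoot the minimum, as in the underdamped regime.
x, v, beta, hb = x0, 0.0, 0.9, []
for _ in range(steps):
    v = beta * v - eta * k * x
    x = x + v
    hb.append(x)

print("plain GD overshoots the minimum:", any(val < 0 for val in gd))    # False
print("heavy ball overshoots the minimum:", any(val < 0 for val in hb))  # True
```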

Motivations and perspectives

Let’s now try to understand the implications of describing optimization algorithms in terms of physical systems. By taking inspiration from other physical systems, one may come up with more powerful optimization techniques. Furthermore, if human intelligence could be explained from a physical point of view, transferring such knowledge into machine learning models would become much more straightforward.
As an example, let’s consider the “quantum mind” or “quantum consciousness” hypotheses [3].
These hypotheses state that quantum-mechanical phenomena play a key role in the cognitive functions of the brain. If they were correct, the equations of quantum mechanics could be used in machine learning models.
This outcome would allow scientists to build machines far more intelligent than the ones we use today. Such machines would rely on the same intelligence mechanisms humans have, which would mean reaching Artificial General Intelligence (AGI).

References

[1] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization”, International Conference on Learning Representations (ICLR), San Diego, 2015. https://arxiv.org/abs/1412.6980

[2] Lin F. Yang, et al., “The Physical Systems Behind Optimization Algorithms”, 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.  https://arxiv.org/abs/1612.02803

[3] R. Penrose, et al., “Consciousness in the Universe: Neuroscience, Quantum Space-Time Geometry and Orch OR Theory”, Journal of Cosmology 14, 2011.
