In the first part of this post, we talked about some of the limitations of current deep learning architectures such as vulnerability to adversarial attacks, lack of interpretability and the need of a large amount of training data. We have also briefly mentioned some methods to tackle these challenges like meta-learning techniques, the generative query network, and the graph network.
In this second part, we will focus our attention on the graph network approach, which encompasses deep learning systems that have an innate bias towards representing things as objects and relations.
Before continuing, let’s have a quick flashback. Since the beginning of AI in the 1950s and until the 1980s, symbolic AI approaches have dominated the field and were extensively used to help doctors in making diagnoses based on patient’s symptoms, to support banks in managing loans applications, to monitor critical assets in industrial plants, etc. These approaches, also known as expert systems, used mathematical symbols to represent objects and the relationship between them, in order to depict the extensive knowledge bases built by humans.
The opposite of the symbolic AI paradigm is named connectionism, which is behind modern machine learning approaches, including deep neural networks. In fact, with connectionism, knowledge is built from data instead of being hard coded via rules or symbols. Since the 1980s, connectionist AI systems have turned out to be much better than symbolic AI at dealing with noisy or ambiguous input, processing large datasets or handling unstructured data like images and text.
One of the typical questions researchers have asked themselves has been “why not combining both symbolic AI and deep learning, instead of choosing one method exclusively?”  It comes without saying that as humans, we acquire new skills by interacting with the world and interpreting the collected data in terms of our existing structured representations made of entities, their interactions and the rules that compose them. If needed, we adjust the structure itself to better fit past and current knowledge. Let’s explain these concepts better.
An entity is an element with attributes, such as a physical object with size and mass.
A relation is a property between entities, such as same size as, heavier than, and distance from.
A rule is a function that maps entities and relations to other entities and relations, such as is entity X large? or Is entity X heavier than entity Y?
Somehow, nature provided us with a relational inductive bias, that is a mechanism that allows us to develop an intelligent behavior by reasoning about entities and relations.
In general, an inductive bias allows a learning algorithm to prioritize one solution over another, independently of the observed data, when searching a space of solutions during the learning process. Some examples of inductive biases in machine learning are
- the choice and parameterization of the prior distribution in a Bayesian setting
- a regularization term added to avoid overfitting
- the assumption of a linear relationship between predictors and response corrupted by additive Gaussian noise in case of the ordinary least square algorithm
A relational inductive bias refers to inductive biases which impose constraints on relationships and interactions among entities in a learning process. As an example, hidden Markov models (HMM) constrain latent states to be conditionally independent of others given the state at the previous time step, and observations to be conditionally independent given the latent state at the current time step.
Deep learning systems as well are endowed with implicit relational inductive bias. The fact that a deep neural network is composed of many layers stacked on top of each other provides a particular type of relational inductive bias referred to as hierarchical processing. Furthermore, each layer also carries various forms of relational inductive bias.
Let’s think about the fully connected layer, the basic building block of a multi-layer perceptron (MLP).
If we forget dropout for a moment, the entities are the units in the network, the relations are all-to-all (all units in one layer are connected to all units in the next layer), and the rules are specified by the weights and biases. In this case, the implicit relational inductive bias is very weak, because all input units can interact to determine any output unit’s value, independently across outputs. Stronger inductive biases are present in convolutional and recurrent layers in convolutional neural networks (CNN) and recurrent neural networks (RNN), respectively. The convolutional layer is implemented by convolving an input vector or tensor representing an image with a kernel acting as a feature detector, adding a bias term, and applying a point-wise non-linearity. The entities here are still individual units associated with each pixel in an image, but the relations are biased toward enforcing locality and translation invariance. Locality reflects the fact that the arguments to the relational rule are those entities in close proximity with one another, i.e. the kernel is local. Translation invariance means that the same local kernel function is reused multiple times across the input image. These biases are very effective for processing natural image data: usually, pixel values are very similar within a local neighborhood and their distribution is mostly stationary across an image, meaning that the same feature detector like a vertical edge detector can be useful for both the upper left corner and the lower right corner on the image. Analogously, in a recurrent layer the same weights are reused across different time steps.
All these relational inductive biases are implicit, i.e. they are determined by the fixed architecture.
The idea behind graph networks is to provide a building block which (instead of processing vectors or tensors like in case of fully connected, convolutional or recurrent layers) explicitly handles entities and relations by operating over directed graphs, where entities are represented by the graph’s nodes, relations by the edges, and system-level properties by global attributes.
This building block, referred as graph network (GN) block, takes a graph as input, performs computations over the structure, and returns a graph as output.
What do we exactly mean by a graph?
Graphs are defined as a set of nodes connected by edges. Nodes, edges and the entire graph can have attributes, i.e. properties associated with them that can be encoded as a vector, a set, or even another graph. For instance, in a social network, nodes can have properties like the age and gender of a person, the edges can reflect the number of times two people meet every month etc. Graphs are suitable to describe objects and relations because the set of nodes in a graph do not have a natural ordering, and because they allow for pairwise interactions between entities (when an edge exists between two nodes). Examples of systems that can be naturally represented as graph are molecules, in which each atom is associated with a node and edges correspond to bonds, prey-predator networks, the internet, where two web pages are connected if there is a link from one to the other etc.
Let’s now continue with our discussion on graph networks.
A GN block contains three update functions and three aggregation functions (which are invariant to any permutation of their inputs). The update functions compute new attribute values and the aggregate functions each take a set as input and reduce it to a single element which represents the aggregated information. For example, suppose we have a system of planets orbiting around a star and we want to predict both position and velocity of each planet over time. In this case, the update functions would compute the forces between each pair of planets at each time instant, the updated position, velocity, and kinetic energy of each planet etc. The aggregate functions may sum all the forces or potential energies acting on each planet and compute the total energy of the whole physical system and so on. Typical aggregate functions are summation, averages, minimum, maximum. On the other hand, any method can be used to compute update functions (obviously when neural networks are used for this purpose, the GN block is called a graph neural network block). For example, in case of vector attributes, an MLP or RNN is often used while for tensors such as image feature maps, CNNs may be more appropriate. Each block can be unshared, like in the different layers of a CNN or MLP, or shared, where aggregate and update functions and their parameters are reused in every layer, analogous to an unrolled RNN.
At this point, a legit question arises: how do we define the input graph that will be processed by a GN block?
For such a task, there are two possibilities: either the input explicitly specifies the relational structure or the graph itself must be inferred or assumed (e.g. a fully connected graph structure is hypothesized).
Examples of data with more explicitly specified entities and relations include knowledge graphs (like the ones used in the expert systems that dominated the early times of AI), social networks, physical systems with known interactions etc.
Converting raw sensory data, images and text into more structured representations like graphs, and properly modifying graph structures during computation to reflect novel information, is definitely an active area of research.
Despite these open issues, graph networks seem a very promising approach to tackle the main shortcomings of deep learning methods. Each GN block can be passed as input to another, similar to the tensor-to-tensor interface of the standard deep learning architectures. For each GN block within a broader architecture, the edge and node outputs typically correspond to lists of vectors or tensors, one per edge or node, and the global outputs correspond to a single vector or tensor. This allows a GN’s output to be passed to other deep learning building blocks such as MLPs, CNNs, and RNNs. In turn, this may result in more powerful tools able to learn with much less labeled data. More interpretable models can be created because predictions may be explained by looking at the relations discovered by the graph blocks.
Finally, methods that are more robust to adversarial attacks may be designed, because a system that represents things as objects, as opposed to patterns of pixels, isn’t going to be so easily fooled by an extraneous sticker in the corner of an image. In fact, such an advanced system would learn that sometimes, it is not impossible that an elephant may sit in a sofa!