In the first part we discussed some limitations of current *deep learning* architectures, such as vulnerability to adversarial attacks, lack of interpretability, and the need for large amounts of training data.[1] We looked at two methods to tackle these challenges: *meta-learning* techniques[2] and *generative query networks*.[3] We also introduced the idea of a *built-in inductive bias* and mentioned *graph networks*.[4] Now we’re going to look at graph networks in more detail. These networks have an innate bias towards representing things as objects and relations.

##### A bit of history

From the beginning of AI in the 1950s up to the 1980s, symbolic AI approaches dominated the field. These approaches, which include *expert systems*, use mathematical symbols to represent objects and the relationships between them. This is a practical way to encode knowledge bases built by humans.

*Connectionism*, which is behind machine learning, is the opposite of the symbolic AI paradigm. In *connectionism* you build knowledge from data instead of relying on hard-coded rules. Connectionist AI systems generally cope better than symbolic AI with noisy or ambiguous input. This comes in handy when processing large datasets or handling unstructured data like images and text.

A typical question then is: *“why not combine symbolic AI and deep learning, instead of choosing one method exclusively?”*[4] After all, as humans we acquire new skills by interacting with the world and interpreting the collected data in terms of our existing structured representations. If needed, we adjust those structures to better fit past and current knowledge. Let’s explain these concepts better.

##### Some basic concepts

**An entity** is an element with attributes, such as a physical object with size and mass.

**A relation** is a property between entities, such as *same size as*, *heavier than*, and *distance from*.

**A rule** is a function that maps entities and relations to other entities and relations, such as *is entity X large?* or *is entity X heavier than entity Y?*

An **inductive bias** allows a learning algorithm to prioritize one solution over another, independently of the observed data, when searching a space of solutions during the learning process. Some examples of inductive biases in machine learning are:

- the choice and parameterization of the prior distribution in a Bayesian setting
- a regularization term added to avoid overfitting
- the assumption of a linear relationship between predictors and response, corrupted by additive Gaussian noise, in the case of ordinary least squares
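
To make the regularization example concrete, here is a minimal sketch in plain Python. It assumes a one-parameter model `y ≈ w * x` with no intercept; the function name `fit_slope`, the penalty value, and the toy data are made up for illustration. The penalty expresses a preference for small slopes that does not depend on the data.

```python
def fit_slope(xs, ys, l2_penalty=0.0):
    """Slope w minimizing sum((y - w*x)**2) + l2_penalty * w**2.

    Closed form: w = sum(x*y) / (sum(x*x) + l2_penalty).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + l2_penalty)


# Toy data, roughly y = 2x (made up for this sketch).
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

w_plain = fit_slope(xs, ys)                  # only the linear-model bias
w_ridge = fit_slope(xs, ys, l2_penalty=5.0)  # extra bias: prefer slopes near zero

print(w_plain, w_ridge)
```

Both fits already share one inductive bias (the model is a straight line through the origin). The penalized fit adds a second one, which pulls the slope towards zero whatever the data look like.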

##### Relational inductive bias

A *relational inductive bias* is an inductive bias that imposes constraints on relationships and interactions among entities in a learning process. As an example, hidden Markov models (HMM) constrain latent states to be conditionally independent of others given the state at the previous time step, and observations to be conditionally independent given the latent state at the current time step.

Somehow, nature provided us with a relational inductive bias, that is, a mechanism that allows us to develop intelligent behavior by reasoning about entities and relations.

For example, let’s consider a home LAN. The router and the devices connected to it are the entities. The amount of time each device spends exchanging data may be an attribute. “Connected longer than” is then a relation between two devices, and a rule can test it, e.g. does the tablet stay connected longer than the printer?
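
The LAN example can be written down in a few lines of Python. This is only a sketch: the class name `Device`, the attribute `hours_connected`, and the numbers are assumptions invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """An entity: a device on the LAN with one attribute."""
    name: str
    hours_connected: float  # hypothetical attribute, in hours per day


def connected_longer_than(a: Device, b: Device) -> bool:
    """Compare two entities on their 'hours_connected' attribute."""
    return a.hours_connected > b.hours_connected


# Made-up values for the sketch.
router = Device("router", 24.0)
tablet = Device("tablet", 6.5)
printer = Device("printer", 0.5)
devices = [router, tablet, printer]

# Evaluate the comparison on every ordered pair of distinct devices.
facts = {
    (a.name, b.name)
    for a in devices
    for b in devices
    if a is not b and connected_longer_than(a, b)
}
print(sorted(facts))
```

The point is that the comparison is defined once and then applied to any pair of entities, rather than being stored separately for each pair.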

Deep learning systems also have their own implicit relational inductive biases. A deep neural network has many layers stacked on top of each other. This stacking provides a particular type of relational inductive bias called *hierarchical processing*. Furthermore, each layer also carries various forms of relational inductive bias.

##### An example with the multi-layer perceptron (MLP)

Let’s think about the fully connected layer, the basic building block of a multi-layer perceptron (MLP).

Forgetting dropout for a moment, the entities are the units in the network, the relations are all-to-all (all units in one layer connect to all units in the next layer), and the rules are specified by the weights and biases. In this case, the implicit relational inductive bias is very weak, because all input units can interact to determine any output unit’s value, independently across outputs.
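
A minimal sketch of such a layer in plain Python makes the all-to-all structure visible. The function name `dense_layer` and the weights are assumptions for illustration, and the non-linearity is left out to keep the arithmetic easy to follow.

```python
def dense_layer(inputs, weights, biases):
    """Fully connected layer: outputs[j] = sum_i weights[j][i] * inputs[i] + biases[j]."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


# Hypothetical parameters: 3 input units, 2 output units.
weights = [[1.0, 2.0, 3.0],
           [-1.0, 0.5, 2.0]]
biases = [0.0, 1.0]

base = dense_layer([1.0, 1.0, 1.0], weights, biases)
perturbed = dense_layer([2.0, 1.0, 1.0], weights, biases)  # change input 0 only

print(base, perturbed)
```

Changing a single input unit changes every output here, because every output has its own weight for every input. The layer has `len(inputs) * len(outputs)` independent weights and places no restriction on which units may interact.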

Stronger inductive biases are present in *convolutional* and *recurrent* layers in **convolutional neural networks (CNN)** and **recurrent neural networks (RNN)**, respectively. You can implement the convolutional layer by convolving an input vector or tensor representing an image with a kernel acting as a feature detector, adding a bias term, and applying a point-wise non-linearity. The entities here are still individual units associated with each pixel in an image, but the relations are biased toward enforcing *locality* and *translation invariance*.

**Locality** reflects the fact that the arguments to the relational rule are those entities in close proximity with one another, i.e. the kernel is local. **Translation invariance** means that the same local kernel function is reused multiple times across the input image.

These biases are very effective for processing natural image data. Pixel values are usually very similar within a local neighborhood, and their distribution is mostly stationary across an image. This means that the same feature detector, such as a vertical edge detector, can be useful for both the upper left corner and the lower right corner of the image. Analogously, in a recurrent layer different time steps reuse the same weights.
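
Locality and weight reuse can be illustrated with a one-dimensional sketch in plain Python. The function name `conv1d` and the toy signal are assumptions for illustration; a real convolutional layer would also add a bias term and a non-linearity, and would usually work on 2D images.

```python
def conv1d(signal, kernel):
    """Slide one kernel over every local window of the signal (no padding)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


edge_detector = [-1.0, 1.0]  # fires +1 on a step up, -1 on a step down

signal = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
shifted = [0.0] + signal[:-1]  # the same pattern moved one position right

response = conv1d(signal, edge_detector)
shifted_response = conv1d(shifted, edge_detector)

print(response)
print(shifted_response)
```

Each output depends only on two neighboring inputs (locality), and the same two weights are used at every position (reuse). As a result, the detector finds the step up wherever it occurs: when the pattern moves one position to the right, so does the response. The number of parameters is `len(kernel)`, regardless of how long the signal is.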

All these relational inductive biases are implicit, i.e. determined in advance by the fixed architecture.

##### Graphs and graph networks

The idea behind *graph networks* is to provide a building block that explicitly handles entities and relations by operating over directed graphs. This is often more practical than processing vectors or tensors. A *graph network* (GN) block takes a graph as input, performs computations over the structure, and returns a graph as output.

A graph is a set of nodes connected by edges. In a graph network, nodes represent entities, edges represent relations, and global attributes represent system-level properties. Nodes, edges, and the entire graph can have attributes (i.e. properties associated with them). For instance, in a social network, nodes can have properties like the age and gender of a person, and edges can reflect the number of times two people meet every month. Graphs are suitable for describing objects and relations because the set of nodes in a graph does not have a natural ordering, and because they allow for pairwise interactions between entities (when an edge exists between two nodes). Examples of systems easily representable as a graph are:

- Molecules. Nodes represent atoms and edges correspond to bonds.
- Prey-predator networks.
- The internet, where two web pages are connected if there is a link from one to the other etc.
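
One possible way to store such a graph in Python is sketched below, using the social network example. The dictionary layout, the key names (`senders`, `receivers`, and so on), and the people are assumptions for illustration, not a standard format.

```python
# A directed graph with attributes on nodes, edges, and the graph as a whole.
graph = {
    "globals": {"network": "friends"},            # system-level attribute
    "nodes": [
        {"name": "Ann", "age": 34},               # node 0
        {"name": "Bob", "age": 29},               # node 1
        {"name": "Cho", "age": 41},               # node 2
    ],
    "edges": [
        {"meetings_per_month": 4},                # edge 0: Ann -> Bob
        {"meetings_per_month": 1},                # edge 1: Bob -> Cho
    ],
    "senders": [0, 1],                            # source node of each edge
    "receivers": [1, 2],                          # target node of each edge
}


def incoming_edges(graph, node_index):
    """Return the attributes of all edges that point at the given node."""
    return [
        edge
        for edge, receiver in zip(graph["edges"], graph["receivers"])
        if receiver == node_index
    ]


print(incoming_edges(graph, 1))
```

The node list needs indices so that edges can refer to nodes, but nothing in the structure depends on the order: renumbering the nodes and updating `senders` and `receivers` accordingly describes the same graph.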

##### Working on graph networks: update functions and aggregation functions

A *GN block* contains three update functions and three aggregation functions. The update functions compute new attribute values. Each aggregation function takes a set as input and reduces it to a single element that represents the aggregated information. Typical aggregation functions are sum, mean, minimum, and maximum. The update functions, on the other hand, can be computed with any method.

For example, suppose we have a system of planets orbiting around a star and we want to predict the position and velocity of each planet over time. In this case, the update functions would compute the forces between each pair of planets at each instant, as well as the updated position, velocity, and kinetic energy of each planet. The aggregation functions may sum all the forces or potential energies acting on each planet and compute the total energy of the whole physical system.
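
The sketch below shows how the three update functions and three aggregation steps might fit together in one pass over a graph. It is a simplified illustration, not a reference implementation: attributes are single numbers, the same aggregator is used for all three aggregations, and the names `gn_block`, `update_edge`, `update_node`, and `update_global` are assumptions. The toy update functions are deliberately simpler than the planetary physics above.

```python
def gn_block(graph, update_edge, update_node, update_global, aggregate=sum):
    """One pass of a graph network block over scalar attributes."""
    nodes, edges, u = graph["nodes"], graph["edges"], graph["globals"]
    senders, receivers = graph["senders"], graph["receivers"]

    # 1. Edge update: each edge sees its own value, both endpoints, and the global.
    new_edges = [
        update_edge(e, nodes[s], nodes[r], u)
        for e, s, r in zip(edges, senders, receivers)
    ]

    # 2. Per-node aggregation of incoming edges, then node update.
    new_nodes = []
    for i, v in enumerate(nodes):
        incoming = [e for e, r in zip(new_edges, receivers) if r == i]
        new_nodes.append(update_node(aggregate(incoming), v, u))

    # 3. Aggregate all edges and all nodes, then global update.
    new_u = update_global(aggregate(new_edges), aggregate(new_nodes), u)

    return {"nodes": new_nodes, "edges": new_edges, "globals": new_u,
            "senders": senders, "receivers": receivers}


# Toy update functions (made up): an edge carries weight * sender value,
# a node adds what it receives, the global stores the total node value.
def update_edge(edge, sender, receiver, u):
    return edge * sender

def update_node(incoming_total, node, u):
    return node + incoming_total

def update_global(edge_total, node_total, u):
    return node_total


graph = {"nodes": [1.0, 2.0, 3.0], "edges": [0.5, 2.0],
         "senders": [0, 1], "receivers": [1, 2], "globals": 0.0}

out = gn_block(graph, update_edge, update_node, update_global)
print(out["edges"], out["nodes"], out["globals"])
```

The output is again a graph with the same connectivity, so it can be fed into another block. Note that `sum` handles a node with no incoming edges, whereas `min` or `max` would need a default value for that case.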

A *GN block* used in a neural network is called a *graph neural network block*. The edge and node outputs of each GN block are usually lists of vectors or tensors. The global output corresponds to a single vector or tensor. For vector attributes, MLPs or RNNs are the most common choice for the update functions. CNNs may be more appropriate for tensors, such as image feature maps. Blocks can be *unshared*, like the different layers of a CNN or MLP, or *shared*, where the aggregation and update functions and their parameters are reused in every layer, analogous to an unrolled RNN.

##### Choosing the starting graph for a graph network block

To define the starting input graph for a GN block there are two possibilities: either the input explicitly specifies the relational structure of the graph or you must make some assumptions (e.g. you start from a fully connected graph). Examples of data with explicitly specified entities and relations include:

- *Knowledge graphs* (like the ones used in the expert systems that dominated the early times of AI)
- Social networks
- Physical systems with known interactions
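
When the relational structure is not given, the fully connected assumption mentioned above can be sketched as follows. The function name `fully_connected` and the sender/receiver index lists are assumptions for illustration; self-loops are left out here, which is a design choice rather than a requirement.

```python
from itertools import permutations


def fully_connected(num_nodes):
    """Return (senders, receivers) with one directed edge per ordered pair of distinct nodes."""
    pairs = list(permutations(range(num_nodes), 2))
    senders = [s for s, _ in pairs]
    receivers = [r for _, r in pairs]
    return senders, receivers


senders, receivers = fully_connected(3)
print(list(zip(senders, receivers)))
```

This starting point makes no commitment about which entities interact, but it costs `n * (n - 1)` edges for `n` nodes, so it grows quadratically with the number of entities.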

Converting raw sensory data, images and text into more structured representations like graphs, and properly modifying graph structures during computation to reflect novel information, is definitely an active area of research.

### Conclusions

Despite these open problems, graph networks seem a promising approach to tackle the shortcomings of deep learning methods. Because a GN block takes a graph as input and returns a graph as output, the output of one block can be the input to another. A GN’s output can also be passed to other deep learning building blocks such as MLPs, CNNs, and RNNs. In turn, this may result in powerful tools able to learn with less labeled data. Using graphs should also make explainable models easier to create, because the entities and relations in the graphs tell us how the model works.

GNs should also help increase robustness to adversarial attacks. A system where objects have properties and aren’t just patterns of pixels should be hard to deceive with just an extraneous sticker. Such an advanced system would learn that sometimes it is not impossible for an elephant to sit on a sofa!