# What graph neural networks can and cannot learn

A fundamental question in machine learning is to determine what a model can and cannot learn. In deep learning, there has been significant research effort in establishing positive results for feed-forward neural networks, RNN, and transformers.

My recent paper entitled:

What graph neural networks cannot learn: depth vs width” (to appear in ICLR 2020),

in turn, focused on the expressive power of message-passing graph neural networks (MPNN). The latter encompass several state-of-the-art neural networks, including GCN, ChebyNet, gated graph sequence networks, molecular fingerprints, interaction networks—among many others. The paper contains two main results.

First, the good news:

1. MPNN are Turing universal (read more in Section 1)

In other words, MPNN can be used as general purpose graph learning machines—given enough data and the right training procedure, a universal MPNN can solve any task that it is presented with.

As expected, universality hinges on some strong conditions on the network’s operation. A key condition is that the depth and width of the network has to be very large. Unfortunately, the latter condition is also necessary:

2. MPNN lose a significant portion of their power when their capacity –which I define as the product of depth and width– is restricted (read more in Section 2).

The proposed impossibility statements stem from a new technique that leads to lower bounds for an array of decision, optimization, verification, and estimation problems involving graphs (e.g., does my graph contain a cycle, find a minimum cut, or estimate the diameter). Strikingly, several of these problems are deemed impossible whenever the product of a MPNN’s depth and width is sub-linear on the number of nodes. The dependence remains significant even for tasks that appear simple or when considering approximation.

I discuss these results and their intriguing implications in depth below.

# 1. Turing universality

In the paper it is shown that, if parametrized correctly, a MPNN can compute any function computable by a Turing machine that has the same input.

Of course, to obtain such powerful networks three strong conditions should hold:

1. The network should have powerful layers.
2. The network should have sufficient depth and width.
3. Each node should have access to discriminative attributes that uniquely identify it.

In the following, I discuss the three sufficient conditions in more detail.

## 1.1 Powerful layers

Recall that, in each layer of a MPNN, each node $v_i$ updates its state $x_i$ by the following operations (roughly):

1. $v_i$ receives message $m_{ij} \leftarrow \textsf{MSG}(x_i, x_j)$ from each of its neighbors $v_j \in N_i$.
2. $v_i$ sums all received messages $z_i \leftarrow \sum_{v_j \in N_i} m_{ij}$;
3. $v_i$ updates its state as $x_i \leftarrow \textsf{UP}(z_i)$, where the update function is also a MLP.

The first Turing universality condition asserts that $\textsf{MSG}$ and $\textsf{UP}$ are general functions that map vectors onto vectors. In practice, these functions are instantiated by feed-forward neural networks. Thus, by the universal approximation theorem and its variants, given sufficient depth and/or width, these networks can approximate any continuous function to arbitrary precision.

## 1.2 Sufficient depth and width

Let us recall two basic definitions:

• The depth $d$ of a MPNN equals the number of layers it contains.
• The width $w$ of a MPNN equals the size of its node representations (i.e., the maximum length of any vector $x_i$ across all layers).

It is easy to understand why the depth should be large: the representation $x_i$ of any node in the output of the MPNN is a function of the input within a $d$-radius neighborhood of the node (the receptive field). It directly follows that a network whose depth is strictly smaller than the graph diameter cannot compute quantities that are inherently global. For instance, without a readout function, the network cannot learn to count the total number of nodes/edges of a graph.

The role of width is somewhat more subtle. Intuitively, in narrow networks information propagates more slowly. Thus, the nodes of a very narrow but deep MPNN may be unable to gather required information about the rest of the graph if the network.

The relation between depth and width becomes more clear when considering what MPNN cannot learn (more on that in Section 2).

## 1.3 Discriminative attributes (or beyond the Weisfeiler-Lehman test)

There is a striking difference between the power of anonymous MPNN and those in which nodes have the capacity to uniquely identify each other (e.g., based on unique ids or discriminative node attributes).

Previous analyses of MPNN expressivity focused on the anonymous case. It is perhaps of little surprise that in this setting MPNN are not particularly powerful. In fact, it was shown independently by Xu et al. (2018) and Morris et al. (2019) that (anonymous) MPNN are as powerful as the Weisfeiler-Lehman (WL) graph isomorphism test. This is disheartening as it implies that anonymous networks are blind to the many graph properties that WL-test cannot see: e.g., any two regular graphs with the same number of nodes are identical from the perspective of the WL test.

### Why? A thought experiment

To illustrate the problem with anonymity, I consider a thought experiment where a red node is tasked with reconstructing the graph topology by communicating with its neighbors.

The figure above depicts the red node’s understanding of the world after the first and second message exchange round.

• Left: each node possesses a unique identifier (color). At the end of the first round, each node is aware of its neighbors and after two rounds the entire graph has been successfully reconstructed in red’s memory.
• Right: nodes do not possess ids and cannot distinguish which of their neighbors are themselves adjacent. As such, the red node cannot tell whether the graph contains cycles: after two rounds there are at least two plausible topologies that could explain its observations.

### Empirically demonstrating the effect of anonymity

The detrimental effect of anonymity becomes apparent when testing how well a MPNN solves the 4-cycle detection task: classifying a graph as having a 4-cycle or not. To demonstrate this effect, I built a simple dataset by starting from a template graph and changing a subset of its edges stochastically. This resulted in 1200 graphs (1000 for training and 200 for testing) out of which exactly half contained a 4-cycle and the rest didn’t (read more in Appendix C of paper). I then constructed a (sufficiently deep and wide) MPNN and tasked it with learning to classify the former from the latter.

Figure 2 depicts the training and test curves for four MPNN variants.

As clearly seen, more discriminative node attributes yield significantly better performance. When the nodes were given non- or partially-discriminative attributes, the network could not detect cycles—even in the training set. The cycle detection problem was solved exactly when nodes were given unique attributes. Though, when there was no consistent way of assigning ids across graphs, the network could not learn to generalize.

# 2. What graph neural networks cannot learn

## 2.1 The importance of negative results in deep learning

Universality statements allow us to grasp the capacity of models in the limit. In theory, given enough data and the right training, a universal network can solve any task that it is presented with. Nevertheless, a universality result does not reveal much about how neural networks should be designed in practice. It also certainly cannot guarantee that said network will be able to solve a given task given a particular optimization procedure, such as stochastic gradient descent.

On the other hand, it may be easier to obtain insights about models by studying their limitations. After all, the knowledge of what cannot be learned by a network of specific characteristics applies independently of the training procedure. Further, impossibility results helps us comprehend the difficulty of a task in relation to a network and can yield practical advice on how to select model hyperparameters.

Take, for instance, the problem of graph classification. Training a graph classifier entails identifying what constitutes a class, i.e., finding properties shared by graphs in one class but not the other, and then deciding whether new graphs abide to said learned properties. However, if the aforementioned decision problem is shown to be impossible by a graph neural network of certain depth then we can be certain that the same network will not learn how to classify a sufficiently diverse test set correctly, independently of the training procedure employed. We should, therefore, focus on networks deeper that the lower bound when performing experiments.

## 2.2 Depth vs width bounds

The second contribution of the paper has to do with establishing what MPNN cannot compute (and thus learn) in terms of their capacity $dw$. The proposed impossibility statements concern an array of decision, verification, optimization, and estimation problems involving graphs:

A particularly intriguing result is proven:

No MPNN can solve the problems above if its capacity is $d w \geq c n^{p},$ where $n$ is the number of nodes of the input graph, $c$ is a constant, and $p \in [0.5, 2]$.

The following table presents a summary of the main results:

Let me highlight some implications these bounds give rise to:

• The bounds indicate a trade-off between depth and width. Of course, one should not deduce that depth and width are completely exchangeable—such a statement could be made if the bounds were not only necessary but also sufficient. Nevertheless, as we will see empirically in the following, depth and width can be exchangeable in some cases.
• None of the above problems can be solved by a network with linear memory footprint. Storing the entire state of the forward pass of a MPNN requires at least $\tilde{\Omega}(dwn)$ bits, even if the feed-forward networks $\textsf{MSG}$ l and $\textsf{UP}$ consist of a single layer. Hence, to avoid a super linear on the number of nodes, one should choose the depth and width satisfying $dw = O(1)$, which according to the bounds is impossible.
• The dependence of the necessary capacity on n can be significant even if the problem appears local in nature or one only looks for approximate solutions. For example, detecting whether $G$ contains a short cycle of odd length cannot be done unless $d w = \tilde{\Omega}(n)$. Approximation also does not help much; computing the graph diameter requires $d w = \tilde{\Omega}(n)$ and this reduces to $d w = \tilde{\Omega}(\sqrt{n})$ for any $\frac{3}{2}$-factor approximation. Further, it is impossible to approximate within any constant factor the shortest path, the minimum cut, and the minimum spanning tree, all three of which have polynomial time solutions, unless $d \sqrt{w} = \tilde{\Omega}(\sqrt{n})$.
• Finally, for truly hard problems, the network capacity may even need to be super-linear on n. Specifically, it is shown that, even if the layers of the MPNN are allowed to take exponential time, solving certain NP-hard problems, such as the minimum independent set, the minimum vertex cover, and the perfect coloring, necessitates $d = \tilde{\Omega}(n^2)$ depth for any constant-width network.

For a detailed discussion on the limitations of the bounds and on their relation to previous work, please see Section 1.1 in the paper.

## 2.3 Empirically verifying the lower bounds

Empirically evaluating lower bounds for neural networks is a very challenging task, as it involves searching over the space of all possible networks in order to find the ones that perform the best. For this reason, an experiment cannot be used to confirm the tightness of the bounds: we can never be certain whether the results obtained are the best possible or whether the optimization resulted in a local minimum (though, one may still invalidate the theoretical results empirically).

#### Task: 4-cycle classification

In the experiment, the neural network was trained to classify whether a graph contained a 4-cycle or not (as in “effect of anonymity” above). Following the lower bound construction described in the paper, I generated distributions over graphs with increasing size $n$ =(8, 16, 24, 32, 44) and an average diameter of (4,6,8,9,11), respectively. Four example graphs are shown below.

The experiment tested how well MPNN of different capacity could classify the test set. I performed grid search over the hyperparameters $w$ = (2, 10, 20) and $d$ = (5, 10, 20, 15). To reduce the dependence on the initial conditions and training length, I trained 4 networks independently for each hyperparameter combination.

#### Results

The question that I ask is: does the ability of a network to solve a task depend on the relation between its capacity $dw$ and the size of the input graph?

Figure 3 depicts the accuracy of the best performing networks. As seen, even small neural networks could consistently classify all test examples perfectly when $n\leq 16$. Moreover, there is a strong correlation between test accuracy, $d w$ and $n$ (recall that the theory predicts $d w= \tilde{\Omega}(\sqrt{n})$).

Networks of the same capacity were consistently less accurate on the test set as $n$ increased. It is also striking to observe that even the most powerful networks considered could not achieve a test accuracy above 95% for $n>16$ and for $n=40$ the best accuracy was below 80%.

Figure 4 examines further the relationship between depth, width, and test accuracy. This time networks are separated depending on their depth and width normalized by the square root of the critical capacity. For each $n$, the critical capacity is the minimum $dw$ of a network that was able to solve the task on a graph of $n$ nodes (here, solving amounts to a test accuracy above 95%). In this way, a network of depth $d$ and width $w$ tested on $n$ nodes corresponds to a point positioned at $x=d/\sqrt{\text{critical}}, y=w/\sqrt{\text{critical}}$ and no network positioned at $x y<1$ can solve the task (non-highlighted region).

It can be seen that there is a crisp phase transition between the regime of under- and super-critical capacity—almost every network meeting the condition $d * w \geq \text{critical}$ was able to solve the task, independently of whether the depth or width was larger.

# 3. Discussion

## 3.1 Further empirical results

Beyond 4-cycle classification, I have also considered two additional tasks: st-connectivity and spanning subgraph verification:

The results exhibit similar trends to cycle classification:

It can be seen that critical capacity increases with $n$, but more mildly (theory predicts a dependency of $\tilde{\Omega}(\sqrt{n})$). In particular, while even modestly sized networks could solve all tasks when $n$ was in the hundreds, no network was able to do the same for a graph of 1741 nodes.

If there is interest, I will consider expanding upon the above and open sourcing the datasets to the community.

## 3.2 Limitations of current results

No study is without its limitations. Here are three for this work:

• First, all lower bounds are of a worst-case nature: a problem is deemed impossible if there exists a graph for which it cannot be solved. The discovery of non worst-case depth-vs-width bounds remains an open problem.
• Second, rather than taking into account specific parametric functions, each layer is assumed to be sufficiently powerful to compute any function of its input. This strong assumption does not significantly limit the applicability of the results, simply because all lower bounds that hold with layers of unbounded capacity also apply to those limited computationally. What is perhaps surprising is that, even with unbounded capacity layers, many seemingly simple problems are shown to be unsolvable by networks of small depth and width.
• Lastly, it is assumed that nodes can uniquely identify each other. Node identification is compatible with permutation invariance/equivariance as long as the network output is asked to be invariant to the particular way the ids have been assigned. In the literature, one-hot encoded node ids have found use when working on a single graph or across graphs with same node sets but different edges. When attempting to learn functions across multiple arbitrary graphs, ids should be ideally substituted by sufficiently discriminative node attributes (attributes that uniquely identify each node within each receptive field it belongs to can serve as ids). Nevertheless, similar to the unbounded computation assumption, if a problem cannot be solved by a graph neural network in the studied setting, it also cannot be solved without identifiers and discriminative attributes. Thus, the presented bounds also apply to partially anonymous networks.

## 3.3 What’s next

I have become quite intrigued by the study of model expressivity in deep learning and plan to continue in this direction in the future.

(You might have seen our work with Jean-Baptiste Cordonnier and Martin Jaggi on the relation between self-attention and convolution.)

If you have ideas and wish to collaborate, do not hesitate to drop me an email. You can also find me at ICLR 2020.

Andreas

## Acknowledgements

My research was kindly supported by the Swiss National Science Foundation in the context of the grant “Deep Learning for Graph Structured Data” (grant number PZ00P2 179981).