What graph neural networks can and cannot learn

A fundamental question in machine learning is to determine what a model can and cannot learn. In deep learning, there has been significant research effort in establishing positive results for feed-forward neural networks, RNN, and transformers.

My recent paper entitled:

What graph neural networks cannot learn: depth vs width” (to appear in ICLR 2020),

in turn, focused on the expressive power of message-passing graph neural networks (MPNN). The latter encompass several state-of-the-art neural networks, including GCN, ChebyNet, gated graph sequence networks, molecular fingerprints, interaction networks—among many others. The paper contains two main results.

First, the good news:

1. MPNN are Turing universal (read more in Section 1)

In other words, MPNN can be used as general purpose graph learning machines—given enough data and the right training procedure, a universal MPNN can solve any task that it is presented with.

As expected, universality hinges on some strong conditions on the network’s operation. A key condition is that the depth and width of the network has to be very large. Unfortunately, the latter condition is also necessary:

2. MPNN lose a significant portion of their power when their capacity –which I define as the product of depth and width– is restricted (read more in Section 2).

The proposed impossibility statements stem from a new technique that leads to lower bounds for an array of decision, optimization, verification, and estimation problems involving graphs (e.g., does my graph contain a cycle, find a minimum cut, or estimate the diameter). Strikingly, several of these problems are deemed impossible whenever the product of a MPNN’s depth and width is sub-linear on the number of nodes. The dependence remains significant even for tasks that appear simple or when considering approximation.

I discuss these results and their intriguing implications in depth below.

1. Turing universality

In the paper it is shown that, if parametrized correctly, a MPNN can compute any function computable by a Turing machine that has the same input.

Of course, to obtain such powerful networks three strong conditions should hold:

  1. The network should have powerful layers.
  2. The network should have sufficient depth and width.
  3. Each node should have access to discriminative attributes that uniquely identify it.

In the following, I discuss the three sufficient conditions in more detail.

1.1 Powerful layers

Recall that, in each layer of a MPNN, each node v_i updates its state x_i by the following operations (roughly):

  1. v_i receives message m_{ij} \leftarrow \textsf{MSG}(x_i, x_j) from each of its neighbors v_j \in N_i.
  2. v_i sums all received messages z_i \leftarrow \sum_{v_j \in N_i} m_{ij};
  3. v_i updates its state as x_i \leftarrow \textsf{UP}(z_i), where the update function is also a MLP.

The first Turing universality condition asserts that \textsf{MSG} and \textsf{UP} are general functions that map vectors onto vectors. In practice, these functions are instantiated by feed-forward neural networks. Thus, by the universal approximation theorem and its variants, given sufficient depth and/or width, these networks can approximate any continuous function to arbitrary precision.

1.2 Sufficient depth and width

Let us recall two basic definitions:

  • The depth d of a MPNN equals the number of layers it contains.
  • The width w of a MPNN equals the size of its node representations (i.e., the maximum length of any vector x_i across all layers).

It is easy to understand why the depth should be large: the representation x_i of any node in the output of the MPNN is a function of the input within a d-radius neighborhood of the node (the receptive field). It directly follows that a network whose depth is strictly smaller than the graph diameter cannot compute quantities that are inherently global. For instance, without a readout function, the network cannot learn to count the total number of nodes/edges of a graph.

The role of width is somewhat more subtle. Intuitively, in narrow networks information propagates more slowly. Thus, the nodes of a very narrow but deep MPNN may be unable to gather required information about the rest of the graph if the network.

The relation between depth and width becomes more clear when considering what MPNN cannot learn (more on that in Section 2).

1.3 Discriminative attributes (or beyond the Weisfeiler-Lehman test)

There is a striking difference between the power of anonymous MPNN and those in which nodes have the capacity to uniquely identify each other (e.g., based on unique ids or discriminative node attributes).

Previous analyses of MPNN expressivity focused on the anonymous case. It is perhaps of little surprise that in this setting MPNN are not particularly powerful. In fact, it was shown independently by Xu et al. (2018) and Morris et al. (2019) that (anonymous) MPNN are as powerful as the Weisfeiler-Lehman (WL) graph isomorphism test. This is disheartening as it implies that anonymous networks are blind to the many graph properties that WL-test cannot see: e.g., any two regular graphs with the same number of nodes are identical from the perspective of the WL test.

Why? A thought experiment

To illustrate the problem with anonymity, I consider a thought experiment where a red node is tasked with reconstructing the graph topology by communicating with its neighbors.

Figure 1. MPNN with discriminative node attributes are significantly more powerful. Toy example of message exchange from the perspective of the red node. The arrows show where each received message comes from and the message content is shown in light gray boxes. Red’s knowledge of the graph topology is depicted at the bottom.

The figure above depicts the red node’s understanding of the world after the first and second message exchange round.

  • Left: each node possesses a unique identifier (color). At the end of the first round, each node is aware of its neighbors and after two rounds the entire graph has been successfully reconstructed in red’s memory.
  • Right: nodes do not possess ids and cannot distinguish which of their neighbors are themselves adjacent. As such, the red node cannot tell whether the graph contains cycles: after two rounds there are at least two plausible topologies that could explain its observations.

Empirically demonstrating the effect of anonymity

The detrimental effect of anonymity becomes apparent when testing how well a MPNN solves the 4-cycle detection task: classifying a graph as having a 4-cycle or not. To demonstrate this effect, I built a simple dataset by starting from a template graph and changing a subset of its edges stochastically. This resulted in 1200 graphs (1000 for training and 200 for testing) out of which exactly half contained a 4-cycle and the rest didn’t (read more in Appendix C of paper). I then constructed a (sufficiently deep and wide) MPNN and tasked it with learning to classify the former from the latter.

Figure 2 depicts the training and test curves for four MPNN variants.

Figure 2. GNNs are significantly more powerful when given discriminative node attributes. The four different curves correspond to different input node attributes: (a) All nodes had the same (non-discriminative) attribute (anonymous). (b) Each node started by having access to a one-hot encoding of its degree. This is a partially discriminative node attribute (degree). (c) Each node was given a one-hot encoding of its id. This is a fully discriminating node attribute and thus condition 3 is satisfied (unique id). (d) Similar to the unique id setting, but not the same node was given a different at each graph (random unique id).

As clearly seen, more discriminative node attributes yield significantly better performance. When the nodes were given non- or partially-discriminative attributes, the network could not detect cycles—even in the training set. The cycle detection problem was solved exactly when nodes were given unique attributes. Though, when there was no consistent way of assigning ids across graphs, the network could not learn to generalize.

2. What graph neural networks cannot learn

2.1 The importance of negative results in deep learning

Universality statements allow us to grasp the capacity of models in the limit. In theory, given enough data and the right training, a universal network can solve any task that it is presented with. Nevertheless, a universality result does not reveal much about how neural networks should be designed in practice. It also certainly cannot guarantee that said network will be able to solve a given task given a particular optimization procedure, such as stochastic gradient descent.

On the other hand, it may be easier to obtain insights about models by studying their limitations. After all, the knowledge of what cannot be learned by a network of specific characteristics applies independently of the training procedure. Further, impossibility results helps us comprehend the difficulty of a task in relation to a network and can yield practical advice on how to select model hyperparameters.

Take, for instance, the problem of graph classification. Training a graph classifier entails identifying what constitutes a class, i.e., finding properties shared by graphs in one class but not the other, and then deciding whether new graphs abide to said learned properties. However, if the aforementioned decision problem is shown to be impossible by a graph neural network of certain depth then we can be certain that the same network will not learn how to classify a sufficiently diverse test set correctly, independently of the training procedure employed. We should, therefore, focus on networks deeper that the lower bound when performing experiments.

2.2 Depth vs width bounds

The second contribution of the paper has to do with establishing what MPNN cannot compute (and thus learn) in terms of their capacity dw. The proposed impossibility statements concern an array of decision, verification, optimization, and estimation problems involving graphs:

A particularly intriguing result is proven:

No MPNN can solve the problems above if its capacity is d w \geq c n^{p}, where n is the number of nodes of the input graph, c is a constant, and p \in [0.5, 2].

The following table presents a summary of the main results:

Let me highlight some implications these bounds give rise to:

  • The bounds indicate a trade-off between depth and width. Of course, one should not deduce that depth and width are completely exchangeable—such a statement could be made if the bounds were not only necessary but also sufficient. Nevertheless, as we will see empirically in the following, depth and width can be exchangeable in some cases.
  • None of the above problems can be solved by a network with linear memory footprint. Storing the entire state of the forward pass of a MPNN requires at least \tilde{\Omega}(dwn) bits, even if the feed-forward networks \textsf{MSG} l and \textsf{UP} consist of a single layer. Hence, to avoid a super linear on the number of nodes, one should choose the depth and width satisfying dw = O(1), which according to the bounds is impossible.
  • The dependence of the necessary capacity on n can be significant even if the problem appears local in nature or one only looks for approximate solutions. For example, detecting whether G contains a short cycle of odd length cannot be done unless d w = \tilde{\Omega}(n). Approximation also does not help much; computing the graph diameter requires d w = \tilde{\Omega}(n) and this reduces to d w = \tilde{\Omega}(\sqrt{n}) for any \frac{3}{2}-factor approximation. Further, it is impossible to approximate within any constant factor the shortest path, the minimum cut, and the minimum spanning tree, all three of which have polynomial time solutions, unless d \sqrt{w} = \tilde{\Omega}(\sqrt{n}).
  • Finally, for truly hard problems, the network capacity may even need to be super-linear on n. Specifically, it is shown that, even if the layers of the MPNN are allowed to take exponential time, solving certain NP-hard problems, such as the minimum independent set, the minimum vertex cover, and the perfect coloring, necessitates d = \tilde{\Omega}(n^2) depth for any constant-width network.

For a detailed discussion on the limitations of the bounds and on their relation to previous work, please see Section 1.1 in the paper.

2.3 Empirically verifying the lower bounds

Empirically evaluating lower bounds for neural networks is a very challenging task, as it involves searching over the space of all possible networks in order to find the ones that perform the best. For this reason, an experiment cannot be used to confirm the tightness of the bounds: we can never be certain whether the results obtained are the best possible or whether the optimization resulted in a local minimum (though, one may still invalidate the theoretical results empirically).

Task: 4-cycle classification

In the experiment, the neural network was trained to classify whether a graph contained a 4-cycle or not (as in “effect of anonymity” above). Following the lower bound construction described in the paper, I generated distributions over graphs with increasing size n =(8, 16, 24, 32, 44) and an average diameter of (4,6,8,9,11), respectively. Four example graphs are shown below.

The experiment tested how well MPNN of different capacity could classify the test set. I performed grid search over the hyperparameters w = (2, 10, 20) and d = (5, 10, 20, 15). To reduce the dependence on the initial conditions and training length, I trained 4 networks independently for each hyperparameter combination.

Results

The question that I ask is: does the ability of a network to solve a task depend on the relation between its capacity dw and the size of the input graph?

Figure 3 depicts the accuracy of the best performing networks. As seen, even small neural networks could consistently classify all test examples perfectly when n\leq 16. Moreover, there is a strong correlation between test accuracy, d w and n (recall that the theory predicts d w= \tilde{\Omega}(\sqrt{n})).

Figure 3. Training and test accuracy as a function of the product dw for all the 240 networks trained: 5 training sets \times 3 widths \times 4 depths \times 4 iterations.

Networks of the same capacity were consistently less accurate on the test set as n increased. It is also striking to observe that even the most powerful networks considered could not achieve a test accuracy above 95% for n>16 and for n=40 the best accuracy was below 80%.

Figure 4 examines further the relationship between depth, width, and test accuracy. This time networks are separated depending on their depth and width normalized by the square root of the critical capacity. For each n, the critical capacity is the minimum dw of a network that was able to solve the task on a graph of n nodes (here, solving amounts to a test accuracy above 95%). In this way, a network of depth d and width w tested on n nodes corresponds to a point positioned at x=d/\sqrt{\text{critical}}, y=w/\sqrt{\text{critical}} and no network positioned at x y<1 can solve the task (non-highlighted region).

Figure 4. Test accuracy (indicated by color) as a function of normalized depth and width. Points in highlighted areas correspond to networks with super-critical capacity, whereas the diagonal line separates networks that are more deep than wide.

It can be seen that there is a crisp phase transition between the regime of under- and super-critical capacity—almost every network meeting the condition d * w \geq \text{critical} was able to solve the task, independently of whether the depth or width was larger.

3. Discussion

3.1 Further empirical results

Beyond 4-cycle classification, I have also considered two additional tasks: st-connectivity and spanning subgraph verification:

Figure 5. Toy examples of connectivity verification. In spanning subgraph verification (a-b), the neural network should verify whether a given subgraph (in green) of a graph (in black) is connected and spans all nodes: the subgraph in (a) is not spanning due to being disconnected (see edge highlighted in orange), whereas (b) is spanning. In st-connectivity verification, the network should verify whether nodes s and t are connected on the green subgraph: in (c) no such path exists, whereas in (b) the path is highlighted in orange.

The results exhibit similar trends to cycle classification:

Figure 6. Test accuracy as a function of network capacity and number of nodes for spanning subgraph and st-connectivity verification.

It can be seen that critical capacity increases with n, but more mildly (theory predicts a dependency of \tilde{\Omega}(\sqrt{n})). In particular, while even modestly sized networks could solve all tasks when n was in the hundreds, no network was able to do the same for a graph of 1741 nodes.

If there is interest, I will consider expanding upon the above and open sourcing the datasets to the community.

3.2 Limitations of current results

No study is without its limitations. Here are three for this work:

  • First, all lower bounds are of a worst-case nature: a problem is deemed impossible if there exists a graph for which it cannot be solved. The discovery of non worst-case depth-vs-width bounds remains an open problem.
  • Second, rather than taking into account specific parametric functions, each layer is assumed to be sufficiently powerful to compute any function of its input. This strong assumption does not significantly limit the applicability of the results, simply because all lower bounds that hold with layers of unbounded capacity also apply to those limited computationally. What is perhaps surprising is that, even with unbounded capacity layers, many seemingly simple problems are shown to be unsolvable by networks of small depth and width.
  • Lastly, it is assumed that nodes can uniquely identify each other. Node identification is compatible with permutation invariance/equivariance as long as the network output is asked to be invariant to the particular way the ids have been assigned. In the literature, one-hot encoded node ids have found use when working on a single graph or across graphs with same node sets but different edges. When attempting to learn functions across multiple arbitrary graphs, ids should be ideally substituted by sufficiently discriminative node attributes (attributes that uniquely identify each node within each receptive field it belongs to can serve as ids). Nevertheless, similar to the unbounded computation assumption, if a problem cannot be solved by a graph neural network in the studied setting, it also cannot be solved without identifiers and discriminative attributes. Thus, the presented bounds also apply to partially anonymous networks.

3.3 What’s next

I have become quite intrigued by the study of model expressivity in deep learning and plan to continue in this direction in the future.

(You might have seen our work with Jean-Baptiste Cordonnier and Martin Jaggi on the relation between self-attention and convolution.)

If you have ideas and wish to collaborate, do not hesitate to drop me an email. You can also find me at ICLR 2020.

Andreas

Acknowledgements

My research was kindly supported by the Swiss National Science Foundation in the context of the grant “Deep Learning for Graph Structured Data” (grant number PZ00P2 179981).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s