A fundamental question in machine learning is to determine what a model can and cannot learn. In deep learning, there has been significant research effort in establishing positive results for feed-forward neural networks, RNN, and transformers.
The recent paper entitled:
“What graph neural networks cannot learn: depth vs width” (short video, bibtex),
in turn, focused on the expressive power of (message-passing) graph neural networks (GNNs). The latter encompass several state-of-the-art neural networks, including GCN, ChebyNet, gated graph sequence networks, molecular fingerprints, interaction networks—among many others. The paper contains two main results.
First, the good news:
1. GNNs are Turing universal (read more in Section 1)
In other words, GNNs can be used as general purpose graph learning machines—given enough data and the right training procedure, a universal GNN can solve any task that it is presented with.
As expected, universality hinges on some strong conditions on the network’s operation. A key condition is that the depth and width of the network has to be very large. Unfortunately, the latter condition is also necessary:
2. GNNs lose a significant portion of their power when their capacity –which I define as the product of depth and width– is restricted (read more in Section 2).
The proposed impossibility statements stem from a new technique that leads to lower bounds for an array of decision, optimization, verification, and estimation problems involving graphs (e.g., does my graph contain a cycle, find a minimum cut, or estimate the diameter). Strikingly, several of these problems are deemed impossible whenever the product of a GNN’s depth and width is sub-linear on the number of nodes. The dependence remains significant even for tasks that appear simple or when considering approximation.
I discuss these results and their intriguing implications in depth below.
1. Turing universality
In the paper it is shown that, if parametrized correctly, a GNN can compute any function computable by a Turing machine that has the same input. To my knowledge, this is the first time that GNN were shown to be universal.
Of course, to obtain such powerful networks three strong conditions should hold:
- The network should have powerful layers.
- The network should have sufficient depth and width.
- Each node should have access to discriminative attributes that uniquely identify it.
In the following, I discuss the three sufficient conditions in more detail.
1.1 Powerful layers
Recall that, in each layer of a GNN, each node updates its state by the following operations (roughly):
- receives message from each of its neighbors .
- sums all received messages ;
- updates its state as , where the update function is also a MLP.
The first Turing universality condition asserts that and are general functions that map vectors onto vectors. In practice, these functions are instantiated by feed-forward neural networks. Thus, by the universal approximation theorem and its variants, given sufficient depth and/or width, these networks can approximate any continuous function to arbitrary precision. MLPs with appropriate activations are also Turing universal.
1.2 Sufficient depth and width
Let us recall two basic definitions:
- The depth of a GNN equals the number of layers it contains.
- The width of a GNN equals the size of its node representations (i.e., the maximum length of any vector across all layers).
It is easy to understand why the depth should be large: the representation of any node in the output of the GNN is a function of the input within a -radius neighborhood of the node (the receptive field). It directly follows that a network whose depth is strictly smaller than the graph diameter cannot compute quantities that are inherently global. For instance, without a readout function, the network cannot learn to count the total number of nodes/edges of a graph.
The role of width is somewhat more subtle. Intuitively, in narrow networks information propagates more slowly. Thus, the nodes of a very narrow but deep GNN may be unable to gather required information about the rest of the graph if the network.
The relation between depth and width becomes more clear when considering what GNNs cannot learn (more on that in Section 2).
1.3 Discriminative attributes (or beyond the Weisfeiler-Lehman test)
There is a striking difference between the power of anonymous GNNs and those in which nodes have the capacity to uniquely identify each other (e.g., based on unique ids, random features, or discriminative node attributes).
Previous analyses of GNN expressivity focused on the anonymous case. It is perhaps of little surprise that in this setting GNN are not particularly powerful. In fact, it was shown independently by Xu et al. (2018) and Morris et al. (2019) that (anonymous) GNN are as powerful as the Weisfeiler-Lehman (WL) graph isomorphism test. This is disheartening as it implies that anonymous networks are blind to the many graph properties that WL-test cannot see: e.g., any two regular graphs with the same number of nodes are identical from the perspective of the WL test.
Why? A thought experiment
To illustrate the problem with anonymity, I consider a thought experiment where a red node is tasked with reconstructing the graph topology by communicating with its neighbors.
The figure above depicts the red node’s understanding of the world after the first and second message exchange round.
- Left: each node possesses a unique identifier (color). At the end of the first round, each node is aware of its neighbors and after two rounds the entire graph has been successfully reconstructed in red’s memory.
- Right: nodes do not possess ids and cannot distinguish which of their neighbors are themselves adjacent. As such, the red node cannot tell whether the graph contains cycles: after two rounds there are at least two plausible topologies that could explain its observations.
Empirically demonstrating the effect of anonymity
The effect of anonymity becomes apparent when testing how well a GNN solves the 4-cycle detection task: classifying a directed graph as having a 4-cycle or not. To demonstrate this effect, I built an adversarial dataset by starting from a template graph and changing a subset of its edges stochastically (The specifics are quite intricate and follow from a lower bound construction. For more information see the Appendix C of paper). This resulted in 1200 graphs (1000 for training and 200 for testing) out of which exactly half contained a 4-cycle and the rest didn’t. I then constructed a (sufficiently deep and wide) GNN and tasked it with learning to classify the former from the latter.
Figure 2 depicts the training and test curves for four GNN variants.
As seen, more discriminative node attributes yield significantly better performance. When the nodes were given non- or partially-discriminative attributes, the network could not detect cycles—even in the training set. The cycle detection problem was solved exactly when nodes were given unique attributes. Though, when there was no consistent way of assigning ids across graphs, the network could not learn to generalize.
2. What graph neural networks cannot learn
2.1 The importance of negative results in deep learning
Universality statements allow us to grasp the capacity of models in the limit. In theory, given enough data and the right training, a universal network can solve any task that it is presented with. Nevertheless, a universality result does not reveal much about how neural networks should be designed in practice. It also certainly cannot guarantee that said network will be able to solve a given task given a particular optimization procedure, such as stochastic gradient descent.
On the other hand, it may be easier to obtain insights about models by studying their limitations. After all, the knowledge of what cannot be learned by a network of specific characteristics applies independently of the training procedure. Further, impossibility results helps us comprehend the difficulty of a task in relation to a network and can yield practical advice on how to select model hyperparameters.
Take, for instance, the problem of graph classification. Training a graph classifier entails identifying what constitutes a class, i.e., finding properties shared by graphs in one class but not the other, and then deciding whether new graphs abide to said learned properties. However, if the aforementioned decision problem is shown to be impossible by a graph neural network of certain depth then we can be certain that the same network will not learn how to classify a sufficiently diverse test set correctly, independently of the training procedure employed. We should, therefore, focus on networks deeper that the lower bound when performing experiments.
2.2 Depth vs width bounds
The second contribution of the paper has to do with establishing what GNNs cannot compute (and thus learn) in terms of their capacity . The proposed impossibility statements concern an array of decision, verification, optimization, and estimation problems involving graphs:
- detecting whether a graph contains a k-cycle;
- verifying whether a given subgraph is connected, contains a cycle, is a spanning tree, is bipartite, is a simple path, corresponds to a cut or Hamiltonian cycle;
- approximating the shortest path between two vertices, the minimum cut, and the minimum spanning tree;
- finding a maximum independent set, a minimum vertex cover, or a perfect coloring;
- estimating the graph diameter and girth.
A particularly intriguing result is proven:
No GNN can solve the problems above if its capacity is where is the number of nodes of the input graph, is a constant, and .
The following table presents a summary of the main results:
Let me highlight some implications these bounds give rise to:
- The bounds indicate a trade-off between depth and width. Of course, one should not deduce that depth and width are completely exchangeable—such a statement could be made if the bounds were not only necessary but also sufficient. Nevertheless, as we will see empirically in the following, depth and width can be exchangeable in some cases.
- None of the above problems can be solved by a network with linear memory footprint. Storing the entire state of the forward pass of a GNN requires at least bits, even if the feed-forward networks l and consist of a single layer. Hence, to avoid a super linear on the number of nodes, one should choose the depth and width satisfying , which according to the bounds is impossible.
- The dependence of the necessary capacity on n can be significant even if the problem appears local in nature or one only looks for approximate solutions. For example, detecting whether contains a short cycle of odd length cannot be done unless . Approximation also does not help much; computing the graph diameter requires and this reduces to for any -factor approximation. Further, it is impossible to approximate within any constant factor the shortest path, the minimum cut, and the minimum spanning tree, all three of which have polynomial time solutions, unless .
- Finally, for truly hard problems, the network capacity may even need to be super-linear on n. Specifically, it is shown that, even if the layers of the GNN are allowed to take exponential time, solving certain NP-hard problems, such as the minimum independent set, the minimum vertex cover, and the perfect coloring, necessitates depth for any constant-width network.
For a detailed discussion on the limitations of the bounds and on their relation to previous work, please see Section 1.1 in the paper.
2.3 Empirically verifying the lower bounds
Empirically evaluating lower bounds for neural networks is a very challenging task, as it involves searching over the space of all possible networks in order to find the ones that perform the best. For this reason, an experiment cannot be used to confirm the tightness of the bounds: we can never be certain whether the results obtained are the best possible or whether the optimization resulted in a local minimum.
Task: 4-cycle classification
In the experiment, the neural network was trained to classify whether a graph contained a 4-cycle or not (as in “effect of anonymity” above). Following the lower bound construction described in the paper, I generated distributions over graphs with increasing size =(8, 16, 24, 32, 44) and an average diameter of (4,6,8,9,11), respectively. Four example graphs are shown below. It should be noted that, due to the construction, the resulting graphs are often disconnected and have directed edges.
The experiment tested how well GNNs of different capacity could classify the test set. I performed grid search over the hyperparameters = (2, 10, 20) and = (5, 10, 20, 15). To reduce the dependence on the initial conditions and training length, I trained 4 networks independently for each hyperparameter combination.
The question that I ask is: does the ability of a network to solve a task depend on the relation between its capacity and the size of the input graph?
Figure 3 depicts the accuracy of the best performing networks. As seen, even small neural networks could consistently classify all test examples perfectly when . Moreover, there is a strong correlation between test accuracy, and (recall that the theory predicts ).
Networks of the same capacity were consistently less accurate on the test set as increased. It can be observed that even the most powerful networks considered could not achieve a test accuracy above 95% for and for the best accuracy was below 80%.
Figure 4 examines further the relationship between depth, width, and test accuracy. This time networks are separated depending on their depth and width normalized by the square root of the critical capacity. For each , the critical capacity is the minimum of a network that was able to solve the task on a graph of nodes (here, solving amounts to a test accuracy above 95%). In this way, a network of depth and width tested on nodes corresponds to a point positioned at and no network positioned at can solve the task (non-highlighted region).
It can be seen that there is a crisp phase transition between the regime of under- and super-critical capacity—almost every network meeting the condition was able to solve the task, independently of whether the depth or width was larger.
3.1 Further empirical results
Beyond 4-cycle classification, I have also considered two additional tasks: st-connectivity and spanning subgraph verification:
The results exhibit similar trends to cycle classification:
It can be seen that critical capacity increases with , but more mildly (theory predicts a dependency of ). In particular, while even modestly sized networks could solve all tasks when was in the hundreds, no network was able to do the same for a graph of 1741 nodes.
3.2 Limitations of current theoretical results
No study is without its limitations. Here are three for this work:
- First, all lower bounds are of a worst-case nature and concern specific adversarial inputs: a problem is deemed impossible if there exists a graph for which it cannot be solved. The discovery of non worst-case depth-vs-width bounds remains an open problem.
- Second, rather than taking into account specific parametric functions, each layer is assumed to be sufficiently powerful to compute any function of its input. This strong assumption does not significantly limit the applicability of the results, simply because all lower bounds that hold with layers of unbounded capacity also apply to those limited computationally. What is perhaps surprising is that, even with unbounded capacity layers, many seemingly simple problems are shown to be unsolvable by networks of small depth and width.
- Lastly, it is assumed that nodes can uniquely identify each other. Node identification is compatible with permutation invariance/equivariance as long as the network output is asked to be invariant to the particular way the ids have been assigned. In the literature, one-hot encoded node ids have found use when working on a single graph or across graphs with same node sets but different edges. When attempting to learn functions across multiple arbitrary graphs, ids should be ideally substituted by sufficiently discriminative node attributes (attributes that uniquely identify each node within each receptive field it belongs to can serve as ids). Nevertheless, similar to the unbounded computation assumption, if a problem cannot be solved by a graph neural network in the studied setting, it also cannot be solved without identifiers and discriminative attributes. Thus, the presented bounds also apply to partially anonymous networks.
3.3 Limitations of experiments
- Even though the theoretical constructions concern the worst-case behavior over an exponential number of inputs, in practice we must content with experiments that involve finite-size training and test sets. My approach was to generate different datasets by sampling the instances randomly (internally, the lower bound constructions translate a bit-string into a graph–thus, one may sample instances by sampling bit-strings). However, when doing this one of the difficulties one faces is that naive sampling can yield imbalanced class distributions, with the graphs in one of the two classes being significantly denser. To avoid situations were the neural network simply learned to classify the instances by looking at the graph density, I had to change the sampling procedure (by carefully select the sampling probabilities and by using rejection sampling) in order to match the distribution statistics of the two classes.
- Though neural networks operate with fixed precision arithmetic (and the lower bounds take this into account), it’s not easy to directly control the number of bits that they use to store weights (rounding is a non-differentiable operation). One can argue that, in theory, the network can exploit its full precision, but in practice this is very far from reality: it’s unlikely that we can train networks that are so sensitive to tiny changes of their input. When calculating the bounds, I assumed that each neuron can be in two states: active or not. From a theory point of view, this is not a big issue since the precision will only change the constant in front of the bound. Nevertheless, I acknowledge that in practice trained neurons can learn to be more sensitive at some regions of the input space, which would yield different non-asymptotics for the bound.
- In the experiments, the neural networks were given access to directed edges (this is an artifact of the lower bound construction procedure). Changing the graph to undirected renders the problem significantly simpler. For instance, the cycle problem becomes much simpler to solve by anonymous GNNs, even though adding node attributes can yield a slight advantage (see comparison here).
3.4 What’s next
I have become quite intrigued by the study of model expressivity in deep learning and plan to continue in this direction in the future.
- How hard is to distinguish graphs with graph neural networks.
- On the relation between self-attention and convolution.
- Attention is not all you need.
My research was kindly supported by the Swiss National Science Foundation in the context of the grant “Deep Learning for Graph Structured Data” (grant number PZ00P2 179981).