# How hard is to distinguish graphs with GNNs

A hallmark of the capacity of graph neural networks (GNNs) is their ability to recognize different graphs. So which graphs can they effectively distinguish?

The following post summarizes the recent findings from the paper

that will be presented at NeurIPS 2020.

The paper provides new insights on the power of message-passing neural networks (MPNNs) –arguably the most famous GNNs first introduced by Scarselli et al. and later expanded by numerous works (e.g., see works by Gilmer et al. and Battaglia et al.)

In particular, the work examines how MPNN hyperparameters (e.g., depth, width, message size, global state) need to grow to for the network to output the isomorphism class as a function of the number of nodes $n$. Recall that depth d is the number of layers, width w is the width of the MLP at each node, message size m is the dimension of the vectors used to store for messages, and global state size g is the dimension of the vector storing the global information.

The main result has as follows:

For any MPNN that is able to distinguish between connected graphs and between trees $d \min(w, m) + d g$ needs to grow with $\Omega(n^2)$ and $\Omega(n)$, respectively.

The rest of this post aims to provide context for this result, as well as to illustrate some key ideas in how it is derived.

## 1. Background

### 1.1 The power of MPNN in the limit

The simplest type of analysis can be done by considering neural networks of unbounded size.

The answer then is known to depend on what the node attributes are:

• Without attributes, some graphs are indistinguishable. For instance, it is well known (by equivalence to the 1-WL test) that MPNN cannot tell apart $k$-regular graphs with the same number of nodes. There are many works that have studied this setting. A good starting point are the papers by Xu et al., Morris et al., and Chen et al.
• If each node attribute acts as a unique node identifier, MPNN is universal. I first argued about the sufficiency of unique identifiers for Turing universality in “What graph neural networks cannot learn: depth vs width” (see also the accompanying blogpost). Since then, a number of works have expanded on this idea by testing various mechanisms for creating identifiers (usually noise).

### 1.2 What about MPNN of bounded size?

While valuable from a theoretical point of view, knowing that in the limit a given neural network is universal or not carries little insight about the power of real networks. It is therefore important to characterize how the capacity changes as a function of MPNN’s hyperparameters, such as its depth, width and message-size.

In fact, there is a well-known trick for obtaining depth lower bounds:

Folklore theorem. Suppose that the graphs are sampled from some distribution. MPNN needs $d \geq max_{G} \ \text{diam}(G)$ to distinguish between graphs in the worst-case, and $d \geq E[\text{diam}(G)]$ on average.

The proof follows by realizing that, since the receptive field of a node (i.e., the part of the graph a node sees at he end of the forward pass) is equal to the MPNN’s depth, $d$ has to be at least as large as the graph diameter for each node to see the entire graph.

## 2. New insights

### 2.1 Understanding how MPNN communicate

Aiming to provide further insight on the power of MPNN of practical size to distinguish graphs, the paper defines and characterizes communication capacity—measuring of the amount of information that the nodes of the network can exchange during the forward pass.

It is shown that the capacity of MPNN depends linearly on its depth d, width w, message size m, and global state size m.

Communication capacity is an effective generalization of the previously considered product between depth and width, being able to consolidate more involved properties, as well as to characterize MPNN with global state and adaptive architecture (i.e., input dependent hyperparameters).

### 2.2 From communication capacity to telling graphs apart

So we we know that the nodes of an MPNN can exchange a limited amount of information during the forward pass. To gain any further insights one must understand how much information exchange is needed for a given task.

This is exactly where the theory of communication complexity comes in. Introduced by Andrew Yao in 1979, communication complexity studies the amount of information required to compute a given function when the input to the problem is split among multiple parties. By the way things are defined, if the communication capacity of an MPNN is smaller than the complexity of the task, then the former cannot provide the correct answer!

The paper exploits this connection to derive hardness results:

Main theorem (informal). The communication capacity of MPNN needs to grow at least linearly with the number of nodes so that the network can correctly output the isomorphism class of trees, and quadratically for connected graphs.

The above result directly yields lower bounds for the depth, width, message size, and global state size of MPNN. Crucially, when the graph is sampled from a distribution, these aforementioned results hold w.r.t. the average case, implying that the failure of the MPNN will not be a rare event but something that will happen frequently!

### 2.3 Implications

To understand the implications of these insights, let’s compare the obtained lower bounds with those obtained by the folklore theorem:

• For general graphs, these results are one order of magnitude tighter than arguments that
compare the receptive field of a neural network with the graph diameter.
Specifically, connected
graphs have diameter at most n and thus a diameter analysis yields $d = \Omega(n)$ independently of all other hyperparameters for MPNN with unique attributes. On the other hand, the proposed bounds imply $d \min(w, m) = \Omega(n^2)$ which is much tighter.
• For trees, the new analysis also improves upon the folklore theorem: it asserts that $d \min(w, m) = \Omega(n)$ on average, even though the average tree $O(\sqrt{n})$ (implying that \$latex d = \Omega(\sqrt{n})). The proposed result is also relevant to MPNN without unique node attributes, as the latter are known to be able to tell trees apart in the limit (by equivalence to the 1-WL test).
• Finally, for networks with a global state (championed by Deep Mind among others) the folklore theorem requires only that MPNN has depth at least 2, whereas with the improved analysis the hyperparameters still need to depend (linearly or quadratically) with $n$.

## 3. Empirical evidence

The predictions of the developed theory were tested on 12 graph and tree isomorphism tasks of varying $n$.

At first sight, it might appear that these experiment are small scale, as the number of nodes in the graphs considered is generally small. My GPU stands to differ: recall that we are interested in distinguishing between (almost) all graphs/trees of a given size. Even with small n, this number grows in the thousands! (This is also why a classification accuracy of 50% is quite excellent in most cases). In practice, the experiments took many weeks of GPU time to complete.

Main observations:

• Notice first that networks of sufficient size could solve nearly every task up to 100% test accuracy. The latter corroborates previous theoretical findings that with unique node attributes MPNN are universal and can thus solve (small instances) of the graph isomorphism problem.
• It can also be seen that the achieved accuracy strongly correlated with communication capacity with larger-capacity networks performing consistently better.
• Crucially, there is good agreement between theory and practice. The analysis asserts that an MPNN with capacity below the gray dashed lines should not be able to correctly distinguish input graphs for a significant fraction of all inputs. Indeed, networks in the gray region consistently performed poorly.

## 4. Final remarks

From a practical perspective, the results herein provide evidence that, if the amount of training data is not an issue, determining the isomorphism class of graphs is hard but not impossible for MPNN.

Specifically, in the most general case it was argued that the number of parameters needs to increase quadratically with the number of nodes. The implication is that, in the most general case, networks of practical size should be able to solve the problem for graphs with at most a few dozen nodes, but will encounter issues otherwise.

### Acknowledgements

This research was kindly supported by the Swiss National Science Foundation in the context of the grant “Deep Learning for Graph Structured Data” (grant number PZ00P2 179981).