# Erdős goes neural: unsupervised learning of combinatorial algorithms with neural networks

This post summarizes our recent work

presented as an oral contribution at NeurIPS 2020.

The oral presentation has been recorded on video (starting at 36:54). Detailed slides and code are also freely available.

— Nikos Karalias and Andreas Loukas

## 1. Combinatorial algorithms over graphs

In many classical problems in computer science one starts from a graph $G = (V,E)$ and aims to find a ”special” set $S \subseteq V$ of nodes that abide to some property. Such problems can be formalized as combinatorial optimization (CO) problems of the following form:

$\min_{S \subseteq V} f(S; G) \ \text{subject to} \ S \in \Omega$

Above, $f$ is a cost function that depends on the graph $G$, whereas $\Omega$ can be used to encode constraints on the solution.

For instance, in the maximum clique problem the objective is to find the largest clique $S$ within a given graph. This can be expressed in CO form by using as cost function $f(S;G) = - |S|$ the negative cardinality of $S$ and defining $\Omega$ to be the set of all cliques of $G$. .

A clique is a set of nodes that are mutually pair-wise connected. Determining whether graph contains a k-clique was one of Richard Karp’s original 21 problems shown to be NP-complete in his seminal 1972 paper “Reducibility Among Combinatorial Problems“.

Graph CO problems permeate computer science, they include covering and packing, graph partitioning, and routing problems, among others.

## 2. Learning CO algorithms with neural networks

### 2.1 Motivation

The use of machine learning for CO was first put forth by Hopfield and Tank in 1985. They provided evidence that, rather than being designed by hand, discrete algorithms could be learned by a neural network (or Hopfield network more precisely).

It is important to note that we cannot expect learned algorithms to be optimal w.r.t. worst-case instances—after all, most CO problems are NP-hard whereas neural networks take polynomial time. However, we can aim to achieve good average performance when the graphs are sampled from some distribution.

Approximately 35 years after Hopfield’s and Tank’s original paper, innovations in deep learning have spurred a resurgence, bringing with them promise for more efficient solutions.

### 2.2 Unsupervised learning for CO: blessings and curses

Unsupervised learning (UL) presents an ideal paradigm for CO: there is no need to obtain labeled data—which would involve knowing how to solve the given problem instances. We simply train our model to search for sets of small cost. In contrast to supervised learning, since the amount of data is not an issue overfitting is not a major concern.

Despite of its apparent benefits, UL is one of the least explored options in the literature. To quote a recent review article:

“Unsupervised Learning has received little attention in conjunction with CO and its immediate use seems difficult (..)” — Bengio et al. 2018.

The major difficulty one faces has to do with reasoning about discrete objects in a differentiable manner.

To be (easily) trainable with back-propagation, a neural network must remain differentiable. At the same time, unless the network’s output is discrete one may not properly evaluate the cost function nor ensure that the solution is feasible (i.e., that it respects constraints).

The literature suggests two main ways of overcoming the above challenges:

• Loss engineering: one may try to craft a loss function that will encourage integrality and constraint satisfaction. A typical example could be to add an entropy regularization term to the loss in order to encourage solutions that are far from 0.5. This approach can be successful in certain cases, but this depends highly on the problem. We also do not have any guarantees that the constraints will be met.
• Smart decoding: At inference time, the discretized output can be post-processed based on some not-learned procedure (heuristic or otherwise) to ensure feasibility. The main issue here is that the neural network is oblivious to the way the decoding operates, which can lead to inefficiencies.

## 3. The “Erdős goes neural” approach

Here is where our paper comes in. Our idea is to utilize the probabilistic method pioneered by Paul Erdős in order to bridge the gap between the discrete and continuous worlds.

### 3.1 The probabilistic penalty loss

In the probabilistic method, to prove the existence of a set $S$ with a specific property $\pi$ (e.g., being a k-clique) one invents a distribution over possible sets and analyzes the probability that a sampled set will abide to $\pi$. Showing that $P(S \text{ has } \pi) > 0$ is a probabilistic certificate for the existence of $S$.

“Erdős goes neural” utilizes the same procedure, but now the probabilities are learned by a graph neural network (GNN). Suppose that we model the event that $v_i$ is included in the set $S$ as a Bernoulli r.v. with probability $p_i$. The GNN is tasked with finding $p_1, \ldots, p_n$ such that the probability of a sampled set $S$ to be feasible and to have small cost $f(S;G)$ is non-zero.

A quite simple and generic way to achieve the latter involves minimizing the following probabilistic penalty loss function:

$\ell(p_1, \ldots, p_n; G) := E[f(S;G)] + P(S \notin \Omega) \, \beta,$

where $\beta$ is a sufficiently large number.

Critically, as we illustrate in the paper, the above quantities can be computed (or bounded) in closed-form yielding a differentiable expression! This amounts to an important distinction with approaches (e.g., policy gradient) that aim to obtain estimates of the gradient by some form of sampling.

Optimizing the probabilistic penalty loss comes with the following guarantee:

Theorem 1 (informal). Fix any $\beta > \max_S f(S;G)$ and suppose that $\ell(p_1, \ldots, p_n;G) = \epsilon$. If $S$ is constructed by sampling every node $v_i$ independently with probability $p_i$, then $f(S;G) < \epsilon/ (1-t)$ and $S \in \Omega$ with probability at least $t$.

The GNN thus yields a probabilistic certificate for the existence of a feasible set with small cost!

### 3.2 Decoding discrete and feasible solutions

What remains is to identify a good set from the learned distribution. There are two ways to achieve this:

• The simplest is by sampling—by Theorem 1 after $k$ sets have been sampled, the desired object will be identified with probability at least $1 - (1-t)^k$.
• Alternatively, a set of guaranteed low cost can be obtained deterministically by the classical method of conditional expectation. This is essentially a sequential decoding procedure, with each node being added to the set if a condition is met (the conditional expectation decreases). Interesting, despite being greedy, the determinist algorithm will always recover the set in question (see paper).

## 4. Case study

### 4.1 The loss function

To give you a taste of our results, let’s consider again the maximum clique problem. The probabilistic penalty loss in this case turns out to be:

$\ell_{\text{clique}}(p_1, \ldots, p_n, G) = - (\beta+1) \sum_{(v_i,v_j) \in E} p_{i} p_{j} +\frac{\beta}{2} \sum_{v_i\neq v_j } p_i p_j,$

which can be computed in time proportional to the number of edges (by refactoring).

### 4.2 Results

We used the above loss function to train a GNN and compared the results with various baselines, involving neural networks trained with other losses (RUN-CSP, Bomze GNN, MS GNN), discrete algorithms (the NX MIS-based approximation algorithm and two greedy algorithms), and state-of-the-art mixed-integer programing solvers (CBC and Gurobi 9.0).

It can be seen that Erdős’ GNN outperforms neural baselines, the CBC solver, and the various heuristics in most cases. Gurobi 9.0 achieves an excellent performance/speed tradeoff for small instances, but scales poorly for more than a few hundred nodes.

### 4.3 Constraint satisfaction

It might be also interesting to test whether the GNN learned to satisfy the clique constraint.

Simply thresholding the GNN outputs (as done by Bomze and MS GNN) can be seen to frequently result in constraint violation. The issue is particularly noticeable in the TWITTER dataset, which generally contains harder instances.

As predicted by our analysis, Erdős’ GNN did not violate any constrains.

### Disclaimer

In the benefit of simplicity, the above exposition does not present our method and its evaluation in full generality. The literature is also very superficially discussed. For a more complete presentation, please consult our paper.

## Acknowledgements

Our research was kindly supported by the Swiss National Science Foundation in the context of the grant “Deep Learning for Graph Structured Data” (grant number PZ00P2 179981).