# What is the most efficient graph filter? Chebyshev vs ARMA

Graph filters are arguably one of the most useful primitives in graph signal processing (GSP). Their applications range from regression with smoothness and stationarity priors, to robust PCA on graphs and spectral clustering. In fact, I urge you to read the related paper on compressive spectral clustering [Tremblay et al. 2016] for an interesting way of approximating eigenspaces by graph filtering random signals.

This post attempts to bring some clarity into the issue of efficient graph filter design and implementation. I assume basic familiarity with graph signal processing and graph filters (see the classic overview paper [Shuman et al. 2013]).

I include code reproducing the experiments. I would also like to thank David Shuman for contributing [code] and Nathanael Peraudin et al. for their wonderful work with the [GSP toolbox].

### The Problem

The objective will be to find the most accurate and computationally efficient method for computing $f(L)\, x$, where $f(L)$ is a matrix function of the graph normalized Laplacian (in fact it can be any sparse Hermitian matrix, such as the combinatorial Laplacian, the adjacency matrix, or the random walk matrix) and $x$ is a vector, referred to as the graph signal.

Recall that a matrix function returns a matrix $M = f(L)$ such that $M$ and $L$ share the same eigenvectors and, if $\lambda$ is an eigenvalue of $L$, then $f(\lambda)$ is the corresponding eigenvalue of $M$.
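To make this concrete, here is a toy sanity check (a hypothetical 4-node path graph; the dense eigendecomposition is used only for illustration, since avoiding it on large graphs is the entire point of graph filters):

```python
import numpy as np

# Combinatorial Laplacian of a 4-node path graph (toy example).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

f = lambda lam: np.exp(-lam)       # any scalar function of the eigenvalues

# Build M = f(L) from the eigendecomposition L = U diag(lam) U^T.
lam, U = np.linalg.eigh(L)
M = U @ np.diag(f(lam)) @ U.T

# M shares L's eigenvectors, and its eigenvalues are f(lam):
# M U = U diag(f(lam)).
assert np.allclose(M @ U, U * f(lam))
```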

### The methods

We will focus on the centralized setting (all computations are carried out in a single machine) and compare two state-of-the-art approaches:

1. Chebyshev (also written as Tchebyshev) polynomial approximation [Shuman et al. 2011]: this is the de facto standard for efficient graph filter design. The idea is to approximate the desired function $f(\lambda)$ with a Chebyshev polynomial $g(\lambda) = \sum_{k = 0}^K c_k \lambda^k$

of order $K$. The computation is then performed recursively at a cost of $K$ matrix-vector multiplications with the sparse matrix $L$.

2. Auto-Regressive Moving-Average (ARMA) graph filters [Isufi et al. 2016]: Instead of using a polynomial, now the approximating function $g(\lambda)$ is a rational function with numerator and denominator polynomials of order $Q$ and $P$, respectively. $g(\lambda) = \dfrac{\sum_{q = 0}^Q b_q \lambda^q }{ 1 + \sum_{p= 1}^P a_p \lambda^p}$
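Method 1 can be sketched in a few lines of numpy/scipy (a minimal sketch, assuming the eigenvalues of $L$ lie in $[0, \lambda_{\max}]$ with $\lambda_{\max} = 2$ for the normalized Laplacian; the coefficients are obtained by cosine quadrature, which is one standard choice):

```python
import numpy as np
import scipy.sparse as sp

def cheby_filter(L, x, f, K, lmax=2.0):
    """Approximate f(L) x with a degree-K (K >= 1) Chebyshev expansion.

    Assumes the eigenvalues of L lie in [0, lmax] (lmax = 2 for the
    normalized Laplacian). Cost: K sparse matrix-vector products.
    """
    # Chebyshev coefficients of f on [0, lmax] via cosine quadrature.
    N = K + 1
    theta = np.pi * (np.arange(N) + 0.5) / N
    fvals = f(lmax / 2.0 * (np.cos(theta) + 1.0))
    c = 2.0 / N * np.array([np.sum(fvals * np.cos(k * theta))
                            for k in range(N)])

    # Shifted operator with spectrum in [-1, 1].
    Lt = (2.0 / lmax) * L - sp.identity(L.shape[0])

    # Three-term recurrence: T_{k+1} = 2 Lt T_k - T_{k-1}, applied to x.
    T_prev, T_curr = x, Lt @ x
    y = 0.5 * c[0] * T_prev + c[1] * T_curr
    for k in range(2, N):
        T_prev, T_curr = T_curr, 2.0 * (Lt @ T_curr) - T_prev
        y = y + c[k] * T_curr
    return y
```

Note that only matrix-vector products with the (shifted) sparse $L$ appear; the full matrix $f(L)$ is never formed.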

There are many implementations of ARMA graph filters, but we focus on two in particular:

• In the parallel ARMA implementation (ARMA PL), one decomposes the rational function $g(\lambda)$ as a sum of $P$ first order rational functions $g_p(\lambda)$ for $p = 1, 2, \ldots, P$ and then computes each one in parallel using the following recursion: $y_p[k+1] = \psi_p L y_p [k] + \phi_p x$

Above, for simplicity, we have set $P=Q$. The filter output at iteration $k$ is then $\sum_{p = 1}^P y_p [k]$. The computational overhead amounts to $P$ matrix-vector multiplications per iteration. For more details on how to compute the coefficients $\psi_p$ and $\phi_p$ from $a_p$ and $b_q$ (and an implementation), I urge you to consult the code reproducing the experiments of this post.

• In the conjugate gradient ARMA implementation (ARMA CG), we solve the linear system $\sum_{p= 0}^P a_p L^p y = \sum_{q = 0}^Q b_q L^q x$

using the conjugate gradient method. Taking care to exploit the sparse structure of $L$ in each iteration, this method has a per-iteration complexity of $P$ matrix-vector multiplications, in addition to a one-time cost of $Q$ matrix-vector multiplications.

Clearly, a rational approximation is more accurate than a polynomial one of the same order. Yet, computationally, Chebyshev might be more efficient, as its computation does not require the convergence of an iterative scheme, as ARMA does. In addition, for the parallel implementation, one also needs to take care of stability issues: to ensure convergence, we need $\|\psi_p L \| < 1$ for all $p$.

### The verdict

• For convenience, use Chebyshev polynomial approximation. The method is fast, it gives consistently very good results, and it is simple to use (only $K$ has to be set).
• When dealing with a rational filtering objective or when computation is a big concern, consider using ARMA CG. This method gives a performance boost for certain filtering objectives, but selecting the right parameters can be tricky (besides $P$ and $Q$ one needs to select carefully the number of iterations of the conjugate gradient method).

Some tips on how to select the parameters of ARMA CG:

• Select a small denominator order, typically $P=2$ or $P=3$. This ensures that the per-iteration complexity is small and makes the filter design problem easier to solve (by avoiding poor conditioning problems).
• Set the conjugate gradient tolerance to be at most one order of magnitude smaller than the anticipated rational approximation error. (There is no point in iterating beyond the precision of the approximation.) This means that, for all practical purposes, the number of conjugate gradient iterations is very small (approximately 4 to 12).
• The numerator polynomial order $Q$ is typically much larger than $P$, but should not exceed 25 to avoid poor conditioning issues.

### CASE STUDY 1: graph Tikhonov denoising

One of the standard ways of smoothing graph signals $x$ corrupted with noise is by solving the graph-regularized optimization problem $\min_{y} \| y - x \|_2^2 + \tau\, y^\top L y,$

where $\tau$ is the regularization weight. It is a well-known fact that the solution of this problem corresponds to filtering $x$ with the filter $f(\lambda) = \dfrac{1}{1 + \tau \lambda}$: setting the gradient of the objective to zero gives $(I + \tau L)\, y = x$.
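This closed form is easy to verify numerically: minimizing the objective gives $(I + \tau L)\, y = x$, so solving the sparse system must agree with applying the filter $1/(1 + \tau\lambda)$ in the spectral domain (a self-contained check on a hypothetical path graph):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

n, tau = 50, 2.0

# Combinatorial Laplacian of a path graph: degree 2 inside, 1 at the ends.
main = 2.0 * np.ones(n)
main[0] = main[-1] = 1.0
off = -np.ones(n - 1)
L = sp.diags([off, main, off], [-1, 0, 1], format="csc")

rng = np.random.default_rng(0)
x = rng.standard_normal(n)

# Tikhonov solution: solve (I + tau L) y = x.
y_solve = spsolve(sp.identity(n, format="csc") + tau * L, x)

# Same thing as a spectral filter f(lambda) = 1 / (1 + tau * lambda).
lam, U = np.linalg.eigh(L.toarray())
y_filter = U @ ((1.0 / (1.0 + tau * lam)) * (U.T @ x))

assert np.allclose(y_solve, y_filter)
```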

Since this is a rational function, we expect that ARMA does very well here. Indeed, the numerical results confirm our expectations. Notice that each method can be used to obtain multiple solutions, by varying the polynomial order or the number of iterations of ARMA methods. Each solution attains a different trade-off between computation time (measured in seconds) and approximation error (here we use the normalized root mean squared error for the denoising). The next figure demonstrates exactly this trade-off by having one point for each solution. We see that ARMA CG is consistently the most efficient and accurate method (all blue points fall in the Pareto front). On the other hand, the parallel implementation is generally inefficient due to slow convergence. It might be surprising that Chebyshev polynomial approximation does so well, matching closely the exact solution given by ARMA CG. The reason is that $f(\lambda)$ is a very smooth function, and therefore it can be very well approximated by a polynomial.

### CASE STUDY 2: Low-pass filters

The second problem we are going to focus on corresponds to low-pass filtering and is useful for eigenspace approximation [Tremblay et al. 2016, Paratte et al. 2016]. As shown below, the function $f(\lambda)$ is now a step function that lets all graph spectral components with eigenvalue smaller than 0.5 pass and suppresses everything else. The figure also shows the approximation quality when the ARMA PL (top), ARMA CG (middle), and Chebyshev (bottom) approximations are used. A few observations:

• ARMA PL does not approximate the step function very well. During its design phase (i.e., when the coefficients of the rational function $g(\lambda)$ are computed), one needs to additionally enforce a stability constraint that ensures the ARMA iteration converges. As seen above, this extra constraint harms the approximation accuracy by limiting the space of admissible rational functions.
• Both ARMA CG (shown here for $P=2$ and $Q=18$) and the Chebyshev method (with $K=150$) closely approximate the step function. The two main differences are (i) that Chebyshev polynomials tend to oscillate around discontinuities, and (ii) that ARMA CG obtains good approximations even when $P$ and $Q$ are relatively small, whereas with Chebyshev we need to increase $K$ significantly to get similar results.
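The oscillation mentioned in (i) is easy to reproduce in isolation, without any graph at all, by evaluating a Chebyshev approximation of the ideal low-pass response on a dense grid of eigenvalues (a sketch; the cutoff 0.5 and the order $K=150$ mirror the setup above, and the coefficients come from cosine quadrature):

```python
import numpy as np

lmax, K = 2.0, 150
f = lambda lam: (lam < 0.5).astype(float)   # ideal low-pass response

# Chebyshev coefficients of f on [0, lmax] via cosine quadrature.
N = K + 1
theta = np.pi * (np.arange(N) + 0.5) / N
fvals = f(lmax / 2.0 * (np.cos(theta) + 1.0))
c = 2.0 / N * np.array([np.sum(fvals * np.cos(k * theta)) for k in range(N)])
c[0] /= 2.0

# Evaluate the approximation on a dense grid of eigenvalues.
lam = np.linspace(0.0, lmax, 2001)
x = 2.0 * lam / lmax - 1.0                   # map [0, lmax] -> [-1, 1]
g = np.polynomial.chebyshev.chebval(x, c)

err = np.abs(g - f(lam))
# The error is small away from the cutoff but spikes at the discontinuity.
print(err[np.abs(lam - 0.5) > 0.2].max(), err.max())
```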

Let us now revisit the trade-off figure. As in the first case study, here too we have one point for each solution. However, since the filtering objective is now not rational, for ARMA CG we also test all denominator and numerator order combinations, $P = 1, 2, 3$ and $Q = 1, 2, \ldots, 24$, respectively. We do not show ARMA PL since its performance is inferior. It becomes immediately apparent that this filtering problem is much harder than Tikhonov denoising: the smallest error achieved is larger than $10^{-2}$, whereas in the first case study we could reach errors smaller than $10^{-15}$. The variation present in the data is, at least in part, due to the variance inherent in timing measurements.

In terms of pure performance, there is no clear winner.

• ARMA CG is faster but cannot achieve a very good approximation. This is an artifact of the rational approximation problem which is non-convex and quite hard to solve accurately. We suspect that larger orders could yield an improvement, but unfortunately the implementation we used for rational approximation [Borges 2009] suffered from conditioning issues for large orders.
• Chebyshev polynomials achieve slightly smaller error at the expense of an increase in complexity. The largest order we tested was $K = 250$. Increasing $K$ further did not yield accuracy improvements.

Yet, undoubtedly, Chebyshev polynomials are much simpler to use. Rather than having to set three parameters as in ARMA CG (orders $P$ and $Q$, as well as the number of conjugate gradient iterations), with polynomial approximation one only needs to choose $K$. Furthermore, generally speaking, $K$ affects the trade-off consistently, with larger orders giving smaller errors at the expense of complexity.

### How LARGE can I go?

Since the computational complexity is linear in the number of edges (and the constant is actually small), when the graph is sparse the main limiting factor is memory. In other words, as long as the sparse matrix fits into your RAM, filtering should be doable. In practice, this means that you most likely can process graphs with $10^7$ edges on your laptop.
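As a back-of-the-envelope check (assuming CSR storage with 8-byte float values and 8-byte indices; actual figures depend on the library and index width):

```python
def csr_bytes(n, m, value=8, index=8):
    """Rough memory footprint of a CSR Laplacian with n nodes and m
    undirected edges: 2m off-diagonal entries plus n diagonal entries,
    each storing a value and a column index, plus n + 1 row pointers."""
    nnz = 2 * m + n
    return nnz * (value + index) + (n + 1) * index

# A graph with 10^6 nodes and 10^7 edges needs ~0.34 GB: fine for a laptop.
print(csr_bytes(10**6, 10**7) / 1e9)
```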