The spectral leaking phenomenon in principal component estimation

This post (jupyter notebook) complements my recent paper How close are the eigenvectors and eigenvalues of the sample and actual covariance matrices? presented at the International Conference on Machine Learning (ICML), Sydney, August 8-11, 2017 (presentation video).

Andreas Loukas,

August 1st 2017

Principal component analysis is one of the most basic tools in data analysis. However, to use it properly we must make sure that the principal components, i.e., the eigenvectors $u_i$ of the covariance matrix $C$ have been estimated correctly.

So how many samples do we need to estimate principal components?

Unfortunately, estimating principal components is, in general, as hard as estimating the covariance matrix. Roughly, this means that we need as many samples $m$ as the data dimensionality $n$ ($C$ is an $n \times n$ matrix).

But do not despair! If we relax our requirements slightly we can do much better.

The reason this holds has to do with spectral leaking, a phenomenon occurring in the eigenvector estimation. What spectral leaking says is that estimated eigenvectors tend to be strongly localized along the eigenvalue axis. In other words, the eigenvector $\tilde{u}_i$ of the sample covariance is far less likely to lie in the span of an eigenvector $u_j$ of the actual covariance when the eigenvalue distance $|\lambda_i - \lambda_j|$ is large. This agrees with the intuition that principal components are easier to estimate, exactly because they are more likely to appear in the samples of the distribution.

What is the point of this post?

To get a better intuition of the spectral leaking phenomenon by seeing it in practice!

So let us begin by loading the MNIST data corresponding to digit ‘9’. We will focus on the distribution of all possible ‘9’s, that has dimensionality $n = 28 \times 28 = 784$.

import numpy as np
from scipy import sparse
import scipy.linalg as linalg
import matplotlib.pyplot as plt
from mnist import MNIST
%matplotlib notebook

mndata = MNIST('../datasets/mnist')

# training data
data = [mndata.train_images[i] for i,x in enumerate(mndata.train_labels) if x == 9]
X = np.array(data).T

# test data
data = [mndata.test_images[i] for i,x in enumerate(mndata.test_labels) if x == 9]
Y = np.array(data).T

# number of images (n), training samples (Mx) and test samples (My)
n, Mx = X.shape[:]
My = Y.shape[1]

The covariance matrix and its spectrum

We can now compute the actual covariance matrix

$C = \mathbf{E}[(x - \mathbf{E}[x]) (x - \mathbf{E}[x])^\top],$

that is the covariance from all samples, and plot its eigenvalues.

Note that, we will assume that the actual covariance matrix (the ground truth) is computed from the test data and use the training data for our estimates.

# We will center our data
mu = np.mean(Y, 1); Y = Y - np.tile(mu, (My,1)).T

# and compute the actual covariance
C = Y.dot(Y.T) / My

def eig(A, order='descend'):

# eigenvalue decomposition
[e,U] = linalg.eig(A)

# reordering indices
idx = e.argsort()
if order == 'descend':
idx = idx[::-1]

# reordering
e = np.real(e[idx])
U = np.real(U[:, idx])
return (U, e)

# The eigenvectors U and eigenvalues e of C sorted in decreasing magnitude
[U,e] = eig(C)

plt.figure(figsize=(8,3.5))
plt.title('The eigenvalues $\lambda_k$ of $C$ decrease very fast.')
plt.plot(e, 'r-x')
ax = plt.gca()
ax.set_yscale('log')
plt.xlabel('k', fontsize=14)


Why is the spectral leaking phenomenon relevant?

Because it implies that we can estimate eigenspaces very easily!

In particular, instead of expecting to identify principal components exactly, we can relax our objective somewhat, asking that the estimated eigenvectors lie in the span of surrounding subspaces:

$\sum_{i = k-\ell}^{k+\ell} (\tilde{u}_k^\top u_i )^2 \approx 1 \quad \text{for some small} \quad \ell > 0$

Then, the estimation problem becomes significantly simpler!

For dimensionality reduction, this relaxation is not restricting, as one commonly projects the data into the span of the first $k$ eigenvectors and not in any single eigenvector (see Section 5).

plt.figure(figsize=(8,3.5))

for ell in range(0, 6, 2):

relaxed_error = np.zeros(m_array.shape[0])

idx = range(max(0, k-ell), min(n, k+ell+1))

for i in range(0, m_array.shape[0]):
relaxed_error[i] = 1 - np.sum( inner_products[idx,i] )

if ell == 0:
str = 'standard error';
else:
str = 'relaxed error, sum goes from {0} to {1}'.format(idx[0], idx[-1])
plt.plot(m_array, relaxed_error, 'o-', label=str)

ax = plt.gca()
plt.xlabel('number of samples (m)', fontsize=14)
plt.ylabel('relaxed error', fontsize=14)
plt.legend(loc='upper right', fontsize=12)
plt.tight_layout()
plt.show()