This post (jupyter notebook) complements my recent paper How close are the eigenvectors and eigenvalues of the sample and actual covariance matrices? presented at the International Conference on Machine Learning (ICML), Sydney, August 8-11, 2017 (presentation video).

Andreas Loukas,

August 1st 2017

Principal component analysis is one of the most basic tools in data analysis. However, to use it properly we must make sure that the principal components, i.e., the eigenvectors of the covariance matrix have been estimated correctly.

## So how many samples do we need to estimate principal components?

Unfortunately, estimating principal components is, in general, as hard as estimating the covariance matrix. Roughly, this means that we need as many samples as the data dimensionality ( is an matrix).

*But do not despair!* If we relax our requirements slightly we can do much better.

The reason this holds has to do with **spectral leaking**, a phenomenon occurring in the eigenvector estimation. What spectral leaking says is that estimated eigenvectors tend to be strongly localized along the eigenvalue axis. In other words, the eigenvector of the sample covariance is far less likely to lie in the span of an eigenvector of the actual covariance when the eigenvalue distance is large. This agrees with the intuition that principal components are easier to estimate, exactly because they are more likely to appear in the samples of the distribution.

## What is the point of this post?

To get a better intuition of the spectral leaking phenomenon by seeing it in practice!

So let us begin by loading the MNIST data corresponding to digit ‘9’. We will focus on the distribution of all possible ‘9’s, that has dimensionality .

```
import numpy as np
from scipy import sparse
import scipy.linalg as linalg
import matplotlib.pyplot as plt
from mnist import MNIST
%matplotlib notebook
# Load MNIST data
mndata = MNIST('../datasets/mnist')
mndata.load_training()
mndata.load_testing()
# training data
data = [mndata.train_images[i] for i,x in enumerate(mndata.train_labels) if x == 9]
X = np.array(data).T
# test data
data = [mndata.test_images[i] for i,x in enumerate(mndata.test_labels) if x == 9]
Y = np.array(data).T
# number of images (n), training samples (Mx) and test samples (My)
n, Mx = X.shape[:]
My = Y.shape[1]
```

## The covariance matrix and its spectrum

We can now compute the actual covariance matrix

that is the covariance from all samples, and plot its eigenvalues.

Note that, we will assume that the actual covariance matrix (the ground truth) is computed from the test data and use the training data for our estimates.

```
# We will center our data
mu = np.mean(Y, 1); Y = Y - np.tile(mu, (My,1)).T
# and compute the actual covariance
C = Y.dot(Y.T) / My
def eig(A, order='descend'):
# eigenvalue decomposition
[e,U] = linalg.eig(A)
# reordering indices
idx = e.argsort()
if order == 'descend':
idx = idx[::-1]
# reordering
e = np.real(e[idx])
U = np.real(U[:, idx])
return (U, e)
# The eigenvectors U and eigenvalues e of C sorted in decreasing magnitude
[U,e] = eig(C)
plt.figure(figsize=(8,3.5))
plt.title('The eigenvalues $\lambda_k$ of $C$ decrease very fast.')
plt.plot(e, 'r-x')
ax = plt.gca()
ax.set_yscale('log')
plt.xlabel('k', fontsize=14)
plt.ylabel('$\lambda_k
```

## The covariance estimation problem

Before we look at principal components, we diverge momentarily so that we can consider the very fundamental question:

*How close is the sample covariance, i.e., the matrix estimated from a finite number of samples , to the actual covariance matrix (with sufficient probability)*?

In his paper [Vershynin 2012], Vershynin showed that a sample size of is up to iterated logarithmic factors sufficient for all distributions with finite fourth moment supported in a centered Euclidean ball of radius . Similar results hold for sub-exponential [Adamczak et al. 2010] and finite second moment distributions [Rudelson 1999].

We can confirm this experimentally in the following code segment.

```
mu = np.mean(X, 1); X = X - np.tile(mu, (Mx,1)).T
# how many samples we are going to use?
m_array = np.logspace(np.log10(10), np.log10(Mx), 10).astype(int)
# for statistical significance, let us also repeat this experiment a few times
iterations = 3
errors = np.zeros((m_array.shape[0], iterations))
for i,m in enumerate(m_array):
for iteration in range(0, iterations):
# sample covariance
idx = np.random.permutation(Mx)
Xs = X[:, idx[0:m]]
Cs = Xs.dot(Xs.T) / m
errors[i,iteration] = linalg.norm(C - Cs)
errors /= linalg.norm(C)
errors = np.mean(errors,1)
plt.figure(figsize=(8.0, 3.5))
plt.plot(m_array, errors, 'or-')
plt.plot([n,n], [0,errors[0]], 'k:')
rank = np.sum(e/e[0] > 1e-5)
plt.plot([rank,rank], [0,errors[0]], 'g:')
ax = plt.gca()
ax.set_xscale('log')
plt.xlabel('number of samples', fontsize=14)
plt.ylabel('covariance est. error', fontsize=14)
plt.tight_layout()
plt.show()
```

## How about the principal components?

As discussed in the introduction, estimating an eigenvector is as hard as estimating the entire covariance, in the sense that also here we need roughly samples such that the inner product

with high probability.

In the context of Gaussians, this was shown by Koltchinskii et al. 2016. See also my paper Loukas 2017 for a wider class of distributions.

Below, we can observe this for the 3rd principal component (the eigenvector which corresponds to the 3rd largest eigenvalue).

```
# which eigenvector we care about?
k = 3
u = U[:,k]
# how many samples we are going to use?
m_array = np.logspace(np.log10(10), np.log10(Mx), 10).astype(int)
# for statistical significance, let us also repeat this experiment a few times
iterations = 10
inner_products = np.zeros((m_array.shape[0], iterations))
for i,m in enumerate(m_array):
for iteration in range(0, iterations):
# sample covariance
idx = np.random.permutation(Mx)
Xs = X[:, idx[0:m]]
Cs = Xs.dot(Xs.T) / m
[Us,es] = eig(Cs)
inner_products[i,iteration] = (Us[:,k].dot(u))**2
inner_products = np.mean(inner_products, 1)
plt.figure(figsize=(8,3.5))
plt.plot(m_array, inner_products, 'or-')
plt.plot([n,n], [0,1], 'k:')
plt.plot([rank,rank], [0,1], 'g:')
ax = plt.gca() ax.set_xscale('log')
plt.xlabel('number of samples (m)', fontsize=14)
plt.ylabel('inner product', fontsize=14)
plt.tight_layout()
plt.show()
```

## Spectral leaking from up close

Let us take a closer look. Instead of , we will examine the inner products

```
# which eigenvector we care about?
k = 3
# how many samples we are going to use?
m_array = np.logspace(np.log10(5), np.log10(Mx), 20).astype(int)
# for statistical significance, let us also repeat this experiment a few times
iterations = 10
inner_products = np.zeros((n, m_array.shape[0], iterations))
for i,m in enumerate(m_array):
for iteration in range(0, iterations):
# sample covariance
idx = np.random.permutation(Mx)
Xs = X[:, idx[0:m]]
Cs = Xs.dot(Xs.T) / m
[Us,es] = eig(Cs)
inner_products[:,i,iteration] = ((U.T).dot(Us[:,k]))**2
inner_products = np.mean(inner_products, 2)
```

Since there are such inner products for each sample size, we visualize inner products with a little animation (thanks to Louis Tao for his code).

Looking at the animation one sees that

- Once more, the inner product is much smaller than 1.
- However, there is a strong localization phenomenon along the x-axis.
- What’s more, this localization becomes apparent even when !

```
from matplotlib import animation, rc
from IPython.display import HTML
%matplotlib inline
# First set up the figure, the axis, and the plot element we want to animate
fig, ax = plt.subplots(figsize=(6,2.5))
ax.set_xlim((1, e[0]))
ax.set_ylim((0, 1))
ax.set_xscale('log')
ax.set_xlabel('$\lambda_k
```

## Why is the spectral leaking phenomenon relevant?

Because it implies that we can estimate *eigenspaces* very easily!

In particular, instead of expecting to identify principal components exactly, we can relax our objective somewhat, asking that the estimated eigenvectors lie in the span of surrounding subspaces:

Then, the estimation problem becomes significantly simpler!

For dimensionality reduction, this relaxation is not restricting, as one commonly projects the data into the span of the first eigenvectors and not in any single eigenvector (see Section 5).

```
plt.figure(figsize=(8,3.5))
for ell in range(0, 6, 2):
relaxed_error = np.zeros(m_array.shape[0])
idx = range(max(0, k-ell), min(n, k+ell+1))
for i in range(0, m_array.shape[0]):
relaxed_error[i] = 1 - np.sum( inner_products[idx,i] )
if ell == 0:
str = 'standard error';
else:
str = 'relaxed error, sum goes from {0} to {1}'.format(idx[0], idx[-1])
plt.plot(m_array, relaxed_error, 'o-', label=str)
ax = plt.gca()
plt.xlabel('number of samples (m)', fontsize=14)
plt.ylabel('relaxed error', fontsize=14)
plt.legend(loc='upper right', fontsize=12)
plt.tight_layout()
plt.show()
```