# Principal Component Analysis (PCA): A Practical Example

## Let’s first define what PCA is

Principal Component Analysis or PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

In other words, Principal Component Analysis (PCA) is a technique for detecting the main components of a data set in order to reduce it to fewer dimensions while retaining the relevant information.

As an example, let $latex X \in \mathbb{R}^{m \times n} $ be a data set with zero mean, that is, the matrix formed by $latex n $ observations of $latex m $ variables. The elements of $latex X $ are denoted as usual by $latex x_{ij} $, meaning that $latex x_{ij} $ contains the value of the $latex i $-th variable for the $latex j $-th observation.

A principal component is a linear combination of the variables that maximizes the variance.
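A quick numerical check of this property (a minimal sketch with synthetic data, independent of the example below): the projection onto the leading eigenvector of the sample covariance matrix has larger variance than the projection onto another direction, here the first coordinate axis.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D synthetic data
X = rng.multivariate_normal([0., 0.], [[1., 0.9], [0.9, 1.]], size=500)

# First principal component: leading eigenvector of the sample covariance
vals, vecs = np.linalg.eig(np.cov(X.T))
w = vecs[:, np.argmax(vals)]

# Variance of the projection onto the first PC vs. onto the first axis
var_pc1 = np.var(X @ w)
var_axis = np.var(X @ np.array([1., 0.]))
print(var_pc1 > var_axis)  # True: the first PC captures more variance
```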

#### Let’s now see a PCA example step by step

#### 1. Create a random toy data set

```python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

m1 = [4., -1.]
s1 = [[1, 0.9], [0.9, 1]]
c1 = np.random.multivariate_normal(m1, s1, 100)
plt.plot(c1[:, 0], c1[:, 1], 'r.')
```

Let’s plot the data set and compute the PCA. The red dots in the figure below show the considered data; the blue arrow shows the eigenvector with the largest eigenvalue.

```python
vaps, veps = np.linalg.eig(np.cov(c1.T))
idx = np.argmax(vaps)

plt.plot(c1[:, 0], c1[:, 1], 'r.')
# plt.arrow takes only (x, y, dx, dy) positionally
plt.arrow(np.mean(c1[:, 0]), np.mean(c1[:, 1]),
          vaps[idx]*veps[0, idx], vaps[idx]*veps[1, idx],
          linewidth=1, head_width=0.1, color='blue')
```

#### Now that we have visualized it, let’s code the closed-form solution for the PCA

The first step is to standardize the data. We are going to use scikit-learn:

```python
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(c1)
```
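Under the hood, `StandardScaler` simply subtracts each column’s mean and divides by its (population) standard deviation. A minimal NumPy sketch of the same computation, using synthetic data standing in for `c1`:

```python
import numpy as np

rng = np.random.default_rng(1)
c1 = rng.multivariate_normal([4., -1.], [[1., 0.9], [0.9, 1.]], size=100)

# Equivalent of StandardScaler().fit_transform(c1):
# subtract the column means, divide by the column standard deviations (ddof=0)
X_std = (c1 - c1.mean(axis=0)) / c1.std(axis=0)

print(np.allclose(X_std.mean(axis=0), 0.0))  # columns now have zero mean
print(np.allclose(X_std.std(axis=0), 1.0))   # ... and unit variance
```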

#### Eigendecomposition — Computing Eigenvectors and Eigenvalues

The eigenvectors determine the directions of the new feature space, and the eigenvalues determine their magnitude. In other words, the eigenvalues explain the variance of the data along the new feature axes.

```python
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0] - 1)
print('Covariance Matrix \n%s' % cov_mat)
```

```
Covariance Matrix
[[ 1.01010101  0.88512031]
 [ 0.88512031  1.01010101]]
```

#### Let’s now check our covariance matrix against NumPy’s built-in function

```python
# Let's print our Covariance Matrix
print('NumPy Covariance Matrix: \n%s' % np.cov(X_std.T))
```

```
NumPy Covariance Matrix:
[[ 1.01010101  0.88512031]
 [ 0.88512031  1.01010101]]
```

Now we perform an eigendecomposition on the covariance matrix

```python
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)
```

```
Eigenvectors
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]

Eigenvalues
[ 1.89522132  0.1249807 ]
```

#### Let’s check that the eigenvectors have unit norm

```python
# Check that every eigenvector (a column of eig_vecs) has unit length
for ev in eig_vecs.T:
    np.testing.assert_array_almost_equal(1.0, np.linalg.norm(ev))
print('Everything ok!')
```

```
Everything ok!
```

#### Now we make a list of the (eigenvalue, eigenvector) tuples and sort them from high to low

```python
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])
```

```
Eigenvalues in descending order:
1.89522131626
0.124980703938
```
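The sorted eigenvalues tell us how much of the variance each component explains. A small sketch of the explained-variance ratio, hardcoding the eigenvalues obtained above for illustration:

```python
import numpy as np

# Eigenvalues obtained above (hardcoded here for illustration)
eig_vals = np.array([1.89522132, 0.1249807])

# Fraction of the total variance explained by each component
explained = eig_vals / eig_vals.sum()
print(explained)  # ~[0.938, 0.062]: the first component explains ~94%
```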

#### Building a Projection Matrix

```python
# Choose the "top 2" eigenvectors with the highest eigenvalues;
# we are going to use these values to build the matrix W
matrix_w = np.hstack((eig_pairs[0][1].reshape(2, 1),
                      eig_pairs[1][1].reshape(2, 1)))
print('Matrix W:\n', matrix_w)
```

```
Matrix W:
 [[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
```

**We will use this data to plot our output later so we can compare with a custom gradient descent approach.**

There are several numerical techniques that allow us to find a point $latex (x^*, \lambda^*) $ that satisfies $latex \nabla_{x,\lambda} L(x^*, \lambda^*) = 0 $, the saddle point. One way to tackle the problem is to “construct a new function, related to the Lagrangian, that (ideally) has a minimum at $latex (x^*, \lambda^*) $.

This new function can be considered as ‘distorting’ the Lagrangian at infeasible points so as to create a minimum at $latex (x^*, \lambda^*) $. Unconstrained minimization techniques can then be applied to the new function. This approach can make it easier to guarantee convergence to a local solution, but there is the danger that the local convergence properties of the method can be damaged.

The ‘distortion’ of the Lagrangian function can lead to a ‘distortion’ in the Newton equations for the method. Hence the behavior of the method near the solution may be poor unless care is taken.” Another way to tackle the condition $latex \nabla_{x,\lambda} L(x, \lambda) = 0 $ is to maintain feasibility at every iteration, that is, to ensure that the updates $latex x^k $ follow the implicit curve $latex h(x) = 0 $. For the toy problem we are considering here this is relatively easy. Assume we start from a point $latex x^0 $ that satisfies $latex h(x^0) = 0 $, that is, it satisfies the constraint.
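For our PCA toy problem the Lagrangian can be written down explicitly. A short derivation of the multiplier estimate, assuming the form $latex L(w, \lambda) = -w^T A w - \lambda (w^T w - 1) $ used in the code below:

```latex
\nabla_w L(w, \lambda) = -(A + A^T)\,w - 2\lambda w = 0
\quad\Longrightarrow\quad
\lambda = -\frac{w^T (A + A^T)\, w}{2\, w^T w}
        = -\frac{w^T A w}{w^T w} \quad \text{(for symmetric } A\text{)}
```

At the optimum, $latex w $ is a unit eigenvector of $latex A $, so $latex -\lambda $ equals the corresponding (largest) eigenvalue.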

The algorithm can be summarized as follows:

- Compute the gradient $latex \nabla L(x^k) $ (observe that we compute the gradient of the Lagrangian with respect to $latex x $).
- Compute an estimate of $latex \lambda $ by computing the value of $latex \lambda $ that minimizes $latex \|\nabla L(x^k)\|^2 $.
- Assume that the update is $latex x^{k+1} = x^k - \alpha^k \nabla L(x^k) $. For each candidate update $latex x^{k+1} $, project it onto the constraint $latex h(x) = 0 $. Find the $latex \alpha^k $ value that decreases $latex L(x^{k+1}) $ with respect to $latex L(x^k) $.
- Go to step 1 and repeat until convergence.

Let’s now implement the KKT conditions to see if we are able to obtain the same result as the one obtained with the closed solution. We will use the projected gradient descent to obtain the solution.

#### Let A be our covariance matrix

```python
# A is the covariance matrix of the considered data
A = np.cov(c1.T)
A
```

**Now we set up our initial values**

```python
# Tolerance
tol = 1e-08
# Initial alpha value (line search)
alpha = 1.0
# Initial values of w. DO NOT CHOOSE w=(0,0)
w = np.array([1., 0.])
```

**Now we compute the eigenvalues and eigenvectors**

```python
# let's see now the eigvals and eigvects
eig_vals, eig_vecs = np.linalg.eig(A)
print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)
```

**Now we project w onto the constraint $latex w^T w = 1 $**

```python
# Project w onto the constraint w.T*w = 1 (normalize to unit length)
den = np.sqrt(np.dot(w.T, w))
w = w / den
```

**The next step is to compute lambda**

```python
# now we calculate lambda (derived from the gradient of the Lagrangian)
lam = -np.dot(np.dot(w.T, (A + A.T)), w) / (2 * np.dot(w.T, w))
# ... and the initial value of the Lagrangian itself
lag = -np.dot(w.T, np.dot(A, w)) - lam * (np.dot(w.T, w) - 1)
```

**Let’s review our initial values**

```python
print("Initial values")
print("Lagrangian value =", lag)
print(" w =", w)
print(" x =", m1)
print(" y =", s1)
```

```
Initial values
Lagrangian value = -0.858313040377
 w = [ 1.  0.]
 x = [4.0, -1.0]
 y = [[1, 0.9], [0.9, 1]]
```

#### Let’s now compute our function using gradient descent

```python
# let's now compute the entire values for our function
cont = 0  # iteration counter
while (alpha > tol) and (cont < 100000):
    cont = cont + 1
    # Gradient of the Lagrangian
    grw = -np.dot(w.T, (A + A.T)) - 2 * lam * w.T
    # Used to know if we finished line search
    finished = 0
    while (finished == 0) and (alpha > tol):
        # Update
        aux_w = w - alpha * grw
        # Our projection onto the constraint
        den = np.sqrt(np.dot(aux_w.T, aux_w))
        aux_w = aux_w / den
        # Compute new values of lambda and the Lagrangian
        aux_lam = -np.dot(np.dot(aux_w.T, (A + A.T)), aux_w) / (2 * np.dot(aux_w.T, aux_w))
        aux_lag = -np.dot(aux_w.T, np.dot(A, aux_w)) - lam * (np.dot(aux_w.T, aux_w) - 1)
        # Check if this is a descent
        if aux_lag < lag:
            w = aux_w
            lam = aux_lam
            lag = aux_lag
            alpha = 1.0
            finished = 1
        else:
            alpha = alpha / 2.0
```

**Let’s now review our final values**

```python
# Let's now review our final values!
print(" Our Final Values")
print(" Number of iterations", cont)
print(" Obtained values are w =", w)
# An eigenvector is a *column* of veps, hence veps[:, idx]
print(" Correct values are w =", veps[:, idx])
print(" Eigenvectors are =", eig_vecs)
```

```
 Our Final Values
 Number of iterations 22
 Obtained values are w = [ 0.71916397  0.69484041]
 Correct values are w = [ 0.71916398  0.6948404 ]
 Eigenvectors are = [[ 0.71916398 -0.6948404 ]
 [ 0.6948404   0.71916398]]
```
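Note that an eigenvector is a *column* of the eigenvector matrix, and it is only defined up to sign, so two implementations can return the same direction with flipped signs. A robust comparison uses the absolute cosine similarity; a small sketch hardcoding the values printed above:

```python
import numpy as np

# Values printed above (hardcoded here for illustration)
w_gd = np.array([0.71916397, 0.69484041])  # gradient-descent solution
w_ref = np.array([0.71916398, 0.6948404])  # leading eigenvector (first column)

# |cosine similarity|: 1.0 means the same direction, possibly with flipped sign
cos = abs(w_gd @ w_ref) / (np.linalg.norm(w_gd) * np.linalg.norm(w_ref))
print(cos)  # very close to 1.0
```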

**Let’s compare our new values vs the ones obtained by the closed solution**

```python
# Full comparison
print(" Gradient Descent values w =", w)
print(" PCA analysis approach w =", matrix_w)
print(" Closed Solution w =", veps[:, idx])
print(" Closed Solution w =", veps, vaps)
```

```
 Gradient Descent values w = [ 0.71916397  0.69484041]
 PCA analysis approach w = [[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
 Closed Solution w = [ 0.71916398  0.6948404 ]
 Closed Solution w = [[ 0.71916398 -0.6948404 ]
 [ 0.6948404   0.71916398]] [ 1.56340502  0.10299214]
```

**Very close! Let’s plot it to visualize the new values versus the ones obtained with scikit-learn**

```python
import seaborn as sns

plt.plot(c1[:, 0], c1[:, 1], 'r.')
plt.arrow(np.mean(c1[:, 0]), np.mean(c1[:, 1]),
          vaps[idx]*veps[0, idx], vaps[idx]*veps[1, idx],
          linewidth=1, head_width=0.1, color='blue')
plt.arrow(np.mean(c1[:, 0]), np.mean(c1[:, 1]),
          vaps[idx]*w[0], vaps[idx]*w[1],
          linewidth=1, head_width=0.1, color='lightblue')
```
