Principal Component Analysis (PCA): A Practical Example

Let’s first define what a is PCA.

Principal Component Analysis or PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

In other words, Principal Component Analysis (PCA) is a technique to detect the main components of a data set in order to reduce into fewer dimensions retaining the relevant information.

To put an example, Let $latex X \in\mathbb{R}^{mxn} $ a data set with zero mean, that is, the matrix formed by $latex n$ observations of $latex m $ variables. Where the elements of $latex X $ are denoted as usual by $latex x_ij $ meaning that it contains the value of the observable $latex i $ of the $latex j-th $ observation experiment.

A principal component is a linear combination of the variables so that maximizes the variance.

Let’s now see a PCA example step by step

1. Create a random toy data set

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
m1 = [4.,-1.]
s1 = [[1,0.9],[0.9,1]]
c1 = np.random.multivariate_normal(m1,s1,100)

Let’s plot the data set and compute the PCA. The red dots of the figure show below the considered data, the blue arrow shows the eigenvector of maximum eigenvalue.

vaps,veps = np.linalg.eig(np.cov(c1.T))
idx = np.argmax(vaps)
Now that we have visualize it, let’s code the closed solution for the PCA

Now that we have visualize it, let’s code the closed solution for the PCA

First step is to standardize the data. We are going to use Scikit-learn

from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(c1)

Eigendecomposition — Computing Eigenvectors and Eigenvalues

The eigenvectors determine the directions of the new feature space, and the eigenvalues determine their magnitude. In other words, the eigenvalues explain the variance of the data along the new feature axes.

mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec) - mean_vec)) / (X_std.shape[0]-1)
print('Covariance Matrix \n%s' %cov_mat)
Covariance Matrix 
[[ 1.01010101 0.88512031]
[ 0.88512031 1.01010101]]

Let’s now print our Covariance Matrix

#Let's print our Covariance Matrix
print('NumPy Covariance Matrix: \n%s' %np.cov(X_std.T))
NumPy Covariance Matrix: 
[[ 1.01010101 0.88512031]
[ 0.88512031 1.01010101]]

Now we perform an eigendecomposition on the covariance matrix

cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)
[[ 0.70710678 -0.70710678]
[ 0.70710678 0.70710678]]
[ 1.89522132 0.1249807 ]

Let’s sort the eigenvalues to see if everything is ok

# let's sort the eig values to see if everything is ok
for ev in eig_vecs:
np.testing.assert_array_almost_equal(1.0, np.linalg.norm(ev))
print('Everything ok!')
Everything ok!

Now we need to make a list of the eigenvalue, eigenvectors tuples and sort them from high to low.

# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)
# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
Eigenvalues in descending order:

Building a Projection Matrix

# Choose the "top 2" eigenvectors with the highest eigenvalues 
# we are going to use this values to matrix W.
matrix_w = np.hstack((eig_pairs[0][1].reshape(2,1),
print('Matrix W:\n', matrix_w)
('Matrix W:\n', array([[ 0.70710678, -0.70710678],
[ 0.70710678, 0.70710678]]))

We will use this data to plot our output later so we can compare with a custom gradient descent approach.

There are several numerical techniques that allow to find a point $latex x^* $ that corresponds too $latex \nabla x, \lambda L (x^*, \lambda ^*) = 0 $ , the saddle point. One way to tackle the problem is to “construct a new function, related to the Lagrangian, that (ideally) has a minimum at $latex (x^*, \lambda ^*) $

This new function can be considered as ’distorting’ the Lagrangian at infeasible points so as to create a minimum at $latex (x^*, \lambda ^*) $. Unconstrained minimization techniques can then be applied to the new function. This approach can make it easier to guarantee convergence to a local solution, but there is the danger that the local convergence properties of the method can be damaged.

The ’distortion’ of the Lagrangian function can lead to a ’distortion’ in the Newton equations for the method. Hence the behavior of the method near the solution may be poor unless care is taken.” Another way to tackle the condition $latex \nabla x, \lambda L (x, \lambda) = 0 $ is to maintain feasibility at every iteration. That is, to ensure that the updates $latex xk $ follow the implicit curve $latex h(x) = 0 $. For the toy problem we are considering here it is relatively easy. Assume we start from a point x 0 that satisfies $latex h(x 0 ) = 0 $, that is it satisfies the constraint.

The algorithm can be summarized as follows:

  1. Compute the gradient $latex \nabla L (x^k) $ (observe that we compute the gradient of the Lagrangian with respect to $latex x $).
  2. Compute an estimate of $latex \lambda $ by computing the value of $latex \lambda $ that minimizes $latex \nabla L (x^k)² $.
  3. Assume that the update is $latex x^{k+1} = x^k — \alpha ^k \nabla L (x^k) $. For each candidate update $latex x k+1 $, project it over the constraint $latex h(x) = 0 $. Find the α k value that decreases the $latex L (x^{k+1}) $ with respect to $latex \nabla L (x^k) $.
  4. Goto step 1 and repeat until convergence.

Let’s now implement the KKT conditions to see if we are able to obtain the same result as the one obtained with the closed solution. We will use the projected gradient descent to obtain the solution.

Let’s A be our covariance matrix

# A is the covariance matrix of the considered data
A = np.cov(c1.T)

Now we set up our initial values

# Tolerance
# Initial alpha value (line search)
# Initial values of w. DO NOT CHOOSE w=(0,0)
w = np.array([1., 0.])

Now we compute the eigenvalues and eigenvectors

# let's see now the eigvals and eigvects
eig_vals, eig_vecs = np.linalg.eig(A)
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)

Now, we compute the projection for the function w=w.T*w

#now let's compute the projection for the function. w = w.T*w
den = np.sqrt(,w))
w = w / den

Next step is to compute lambda

# now we calculate lambda
lam = ( (w.T,(A + w.T) ),w) / 2 *,w)

Let’s review our initial values

print "Initial values"
print "Lagrangian value =", lag
print " w =", w
print " x =", m1
print " y =", s1
Initial values
Lagrangian value = -0.858313040377
w = [ 1. 0.]
x = [4.0, -1.0]
y = [[1, 0.9], [0.9, 1]]

Let’s now compute our function using gradient descent

# let's now compute the entire values for our function
while ((alpha > tol) and (cont < 100000)):
cont = cont+1

# Gradient of the Lagrangian
grw = (w.T,(A + w.T) ) - 2 * lam * w.T

# Used to know if we finished line search
finished = 0

while ((finished == 0) and (alpha > tol)):
# Update
aux_w = w - alpha * grw

# Our Projection
den = np.sqrt(,aux_w))
aux_w = aux_w / den
# Compute new value of the Lagrangian.
aux_lam = (,(A+w.T)),aux_w) / 2 * (aux_w.T,aux_w)
aux_lag = (aux_w.T,,aux_w)) - lam * (,aux_w) - 1)

# Check if this is a descent
if aux_lag < lag:
w = aux_w
lam = aux_lam
lag = aux_lag
alpha = 1.0
finished = 1
alpha = alpha / 2.0

Let’s now review our final values

# Let's now review our final values!
print " Our Final Values"
print " Number of iterations", cont
print " Obtained values are w =", w
print " Correct values are w =", veps[idx]
print " Eigenvectors are =", eig_vecs
Our Final Values
Number of iterations 22
Obtained values are w = [ 0.71916397 0.69484041]
Correct values are w = [ 0.71916398 -0.6948404 ]
Eigenvectors are = [[ 0.71916398 -0.6948404 ]
[ 0.6948404 0.71916398]]

Let’s compare our new values vs the ones obtained by the closed solution

# Full comparition
print " Gradient Descent values w =", w
print " PCA analysis approach w =", matrix_w
print " Closed Solution w =", veps[idx]
print " Closed Solution w =", veps,vaps
Gradient Descent values   w = [ 0.71916397  0.69484041]
PCA analysis approach w = [[ 0.70710678 -0.70710678]
[ 0.70710678 0.70710678]]
Closed Solution w = [ 0.71916398 -0.6948404 ]
Closed Solution w = [[ 0.71916398 -0.6948404 ]
[ 0.6948404 0.71916398]] [ 1.56340502 0.10299214]

Very close! Let’s print it to visualize the new values versus the ones obtaine with sci-kit learn

import seaborn as sns 

Very close! Let’s print it to visualize the new values versus the ones obtaine with sci-kit learn


Would you like a demo for 3Blades? Enroll for one by clicking on the button below!

Request Demo

Data Science Python Pca