I get like, ~100 visitors daily, and I know it’s a lot for a person who’s just started, and I’m not even blogging for visitors, because this blog is basically my study notes, but I think, if you look hard enough, if you slant your eyes, you may find something of value in it. So why not introduce it to your friends? Buddies? Comrades? Homies? Compadres? Dusts? Rafighs? Ais? Do it!

I wanna be honest with you. One thing I never understood is Principal Component Analysis. It’s very hard. At least to me. Because my college doesn’t offer any courses in the way of statistics (yes, I’m an Associate’s student!), I thought to myself, “Shieeet, I’m gonna read every resource there is. From YouTube, which I kinda feel is the worst resource for learning (at least, sometimes), to a book about Computer Vision!” and ta-dah! Here we are. A book about Computer Vision is about to clear up everything about PCA. Let us commence learning! Put your learning hat and goggles on, and wear your diaphragm, because we’re gonna get screwed!

The following is based on Computer Vision: Principles, Algorithms, Applications, Learning by E. R. Davies.

One powerful way of approaching the task of data representation is Principal Component Analysis, or PCA. This involves finding the mean of a cluster of points in feature space, and then finding the principal axes of the cluster in the following way: first, an axis is found which passes through the mean position and which gives the maximum variance when the data are projected onto it. Then, a second such axis is found which maximizes variance in a direction normal to the first. This process is carried out until a total of N principal axes have been found for an N-dimensional feature space. The process is illustrated in a figure in the book.
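
To make the “axis of maximum variance” idea concrete, here’s a minimal NumPy sketch of my own (not from the book): it brute-forces the direction through the mean that maximizes the projected variance of a 2D point cloud, and checks that it agrees with the leading eigenvector of the covariance matrix, which is exactly where the math below is headed.

    import numpy as np

    rng = np.random.default_rng(0)
    # An elongated 2D point cloud (toy data, purely for illustration)
    X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.5, 0.5]])
    mu = X.mean(axis=0)
    Xc = X - mu                                    # centre the cluster on its mean

    # Scan unit directions through the mean, keep the one with maximum projected variance
    angles = np.linspace(0.0, np.pi, 1800, endpoint=False)
    dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # unit direction vectors
    variances = ((Xc @ dirs.T) ** 2).mean(axis=0)  # variance of the projection onto each direction
    best_axis = dirs[np.argmax(variances)]

    # The same axis falls out of the covariance matrix as its leading eigenvector
    Sigma = Xc.T @ Xc / len(X)
    eigvals, eigvecs = np.linalg.eigh(Sigma)       # eigenvalues in ascending order
    print(best_axis, eigvecs[:, -1])               # same direction (possibly flipped in sign)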

In fact, the process is entirely mathematical and need not be undertaken in the strict sequence indicated above; it merely involves finding a set of orthogonal axes which diagonalize the covariance matrix. (A reminder: an n × n matrix is diagonalizable over a field F if it has n distinct eigenvalues in F; that condition is sufficient, though not necessary.) Eigenvalues and eigenvectors are defined by the following relation (I know what eigenvalues are, this is just for people reading this post in the future who don’t wanna google):

    \[ A\vec{v} = \lambda \vec{v} \]

Where A is the matrix (in our case, the covariance matrix), \vec{v} is an eigenvector, and \lambda is the corresponding eigenvalue.
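
In NumPy terms (a quick sanity check of my own, with a made-up symmetric matrix standing in for the covariance matrix):

    import numpy as np

    A = np.array([[4.0, 1.0],
                  [1.0, 3.0]])                 # a small symmetric matrix
    eigvals, eigvecs = np.linalg.eigh(A)       # eigh handles symmetric (Hermitian) matrices
    for lam, v in zip(eigvals, eigvecs.T):     # the eigenvectors are the *columns* of eigvecs
        print(np.allclose(A @ v, lam * v))     # A v = lambda v, so each line prints True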

The covariance matrix for the input population is defined as:

    \[ \Sigma = \mathbb{E}((x_{(p)} - \mu)(x_{(p)}-\mu)^T) \]

where x_{(p)} is the location of the pth data point, \mu is the mean of the P data points, and \mathbb{E}(\cdot) indicates the expectation value over the underlying population. We can estimate \Sigma from the equations:

    \[ \Sigma = \frac{1}{P}\sum_{p=1}^{P}x_{(p)}x_{(p)}^T - \mu\mu^T \]

Where:

    \[ \mu = \frac{1}{P}\sum_{p=1}^{P}x_{(p)} \]
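
Here’s how that estimate looks in NumPy (my own sketch, with one data point x_{(p)} per row of X). The \mu\mu^T term is an outer product, and the result should match np.cov with bias=True, i.e. dividing by P rather than P − 1:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))             # P = 200 points in an N = 3 dimensional feature space
    P = len(X)

    mu = X.sum(axis=0) / P                    # mean of the P data points
    Sigma = X.T @ X / P - np.outer(mu, mu)    # (1/P) sum of x x^T, minus mu mu^T

    # Cross-check against NumPy's own estimator (bias=True divides by P, not P - 1)
    print(np.allclose(Sigma, np.cov(X, rowvar=False, bias=True)))   # True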

As \Sigma is real and symmetric, it is possible to diagonalize it using a suitable orthogonal transformation matrix A, obtaining a set of N orthonormal eigenvectors u_i with eigenvalues \lambda_i given by:

    \[ \Sigma u_i = \lambda_i u_i \quad (i = 1, 2, \ldots, N) \]

The vectors u_i are derived from the original vectors x_i by:

    \[ u_i = A(x_i - \mu) \]

And the inverse transformation needed to recover the original data vectors is:

    \[ x_i = \mu + A^T u_i \]

Here, we have recalled that for an orthogonal matrix

    \[ A^{-1} = A^T \]
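
Here’s that round trip in NumPy (a sketch of my own, with data points stored as rows, and A built from the eigenvectors of \Sigma, which is exactly what the next paragraph says A turns out to be):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False, bias=True)

    eigvals, eigvecs = np.linalg.eigh(Sigma)
    A = eigvecs.T                              # rows of A are the eigenvectors of Sigma

    U = (X - mu) @ A.T                         # u_i = A (x_i - mu), one transformed point per row
    X_back = mu + U @ A                        # x_i = mu + A^T u_i
    print(np.allclose(A @ A.T, np.eye(3)))     # A is orthogonal, so A^{-1} = A^T: True
    print(np.allclose(X_back, X))              # the original data are recovered exactly: True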

In fact, it may be shown that A is the matrix whose rows are formed from the eigenvectors of \Sigma, and that the diagonalized covariance matrix \Sigma^\prime is given by:

    \[ \Sigma^\prime = A\Sigma A^T \]

So that:

    \[ \Sigma^\prime = \left[\begin{array}{cccc} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_N \end{array}\right] \]
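
A quick numerical check of my own (same conventions as the earlier sketches): form A \Sigma A^T and the off-diagonal entries vanish, leaving the eigenvalues on the diagonal.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))
    Sigma = np.cov(X, rowvar=False, bias=True)

    eigvals, eigvecs = np.linalg.eigh(Sigma)
    A = eigvecs.T                                        # eigenvectors of Sigma as the rows of A
    Sigma_prime = A @ Sigma @ A.T                        # the diagonalized covariance matrix
    print(np.allclose(Sigma_prime, np.diag(eigvals)))    # True: only the diagonal survives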

In what follows, we shall assume that the eigenvalues have been placed in an ordered sequence, starting with the largest. In that case, \lambda_1 represents the most significant characteristic of the set of data points, with later eigenvalues representing successively less significant characteristics. We could even go so far as to say that, in some sense, \lambda_1 represents the most interesting characteristic of the data, whereas \lambda_N would be devoid of interest. More practically, if we ignored \lambda_N we might not lose much information. And indeed the last few eigenvalues frequently represent characteristics which are not statistically significant and are essentially noise. For these reasons, PCA is commonly used to reduce the dimensionality of the feature space from N to some lower value N^\prime.
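
Here’s what that reduction looks like in practice (again a sketch of my own, not the book’s code): sort the eigenvalues in descending order, keep only the first N^\prime eigenvectors as the rows of A, and accept a small reconstruction error, which (with this 1/P covariance estimate) works out to the sum of the discarded eigenvalues.

    import numpy as np

    rng = np.random.default_rng(2)
    # 100 points in N = 5 dimensions, with most of the variance in a couple of directions
    X = rng.normal(size=(100, 5)) @ np.diag([5.0, 3.0, 1.0, 0.2, 0.1])
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False, bias=True)

    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    N_prime = 2                                # keep only the top N' components
    A_reduced = eigvecs[:, :N_prime].T         # N' x N: the first N' eigenvectors as rows
    U = (X - mu) @ A_reduced.T                 # reduced representation, shape (P, N')
    X_approx = mu + U @ A_reduced              # back-projection into the original feature space

    mse = ((X - X_approx) ** 2).sum(axis=1).mean()
    print(mse, eigvals[N_prime:].sum())        # reconstruction error = sum of discarded eigenvalues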

Well, this was a short blog post. I have understood the concept behind PCA; I just haven’t understood how to actually calculate it. No worries! There’s always NumPy… unless I’m in C++, in which case I’m screwed!
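
For future me (and anyone else stuck at the “how do I actually calculate it” stage), here’s the whole recipe wrapped into one small NumPy function. It’s a sketch under the conventions used above (data points as rows, kept eigenvectors as the rows of A), not battle-tested library code; in practice you’d probably reach for something like sklearn.decomposition.PCA, or the Eigen library’s eigensolvers if you really are in C++.

    import numpy as np

    def pca(X, n_components):
        """Tiny PCA sketch: rows of X are data points; returns (mu, A, projected data)."""
        mu = X.mean(axis=0)
        Sigma = np.cov(X, rowvar=False, bias=True)
        eigvals, eigvecs = np.linalg.eigh(Sigma)
        order = np.argsort(eigvals)[::-1]          # sort eigenvalues, largest first
        A = eigvecs[:, order[:n_components]].T     # kept eigenvectors as the rows of A
        return mu, A, (X - mu) @ A.T               # u_i = A (x_i - mu)

    # Toy usage
    X = np.random.default_rng(3).normal(size=(50, 4))
    mu, A, U = pca(X, n_components=2)
    print(U.shape)                                 # (50, 2)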