## A Short Stint with Principal Component Analysis

I get like, ~100 visitors daily, and I know it’s a lot for a person who’s just started, and I’m not even blogging for visitors, because this blog is basically my study notes, but I think, if you look hard enough, if you slant your eyes, you may find something of value in it. So why not introduce it to your friends? Buddies? Comrades? Homies? Compadres? Dusts? Rafighs? Ais? Do it!

I wanna be honest with you. One thing that I never understood is Principal Component Analysis. It’s very hard. At least to me. Because my college doesn’t offer any courses in the way of statistics (yes, I’m an Associates student!) I thought to myself, “Shieeet, I’m gonna read every resource there is. From Youtube, which I kinda feel like is the worst resource for learning – At least, sometimes – to a book about Computer Vision!” and ta-dah! Here we are. A book about Computer Vision is about to clear everything PCA. Let us commence learning! Put your learning hat and goggles on, and wear your diaphragm, because we’re gonna get screwed!

The following is based on Computer Vision, Principles, Algorithms, Applications, Learning by E. R. Davies.

One powerful way of approaching the task of data representation is Principal Component Analysis, or PCA. This involves finding the mean of a cluster of points in feature space, and then finding the principal axes of the cluster in the following way: First, an axis is found which passes through the mean position and which gives the maximum variance when the data are projected onto it. Then, a second such axis is found which maximizes variance in a directional normal to the first. This process is carried out until a total of N principal axes have been found for N-dimensional feature space. The process is illustrated in this figure:

In fact, the process is entirely mathematical and need not be undertaken in the strict sequence indicated above, it merely involves finding a set of orthogonal axes which diagonalize the covariance matrix. A matrix is diagonalizable if it has distinct eigenvalues in , the eigenspace. Eigenvalues are defined as (I know what eigenvalues are, this is just for people who are reading this post in the future and don’t wanna google):

Where is the matrix, in this case, the covariance matrix, is the eigenvector, and is the eigenvalue.

The covariance matrix for the input population is defined as:

where is the location of the th data point and is the mean of data points, and indicates expectation value for the underlying population. We can estimate from the equations:

Where:

As is real and symmetric, it is possible to diagonalize it using a suitable orthogonal transformation matrix , obtaining a set of orthonormal eigenvectors with eigenvalues given by:

The vectors are derived from the original vectors by:

And the inverse transformation needed to recover the original data vectors is:

Here, we have recalled that for an orthogonal matrix

In fact, it may be shown that is the matrix whose rows are formed from the eigenvectors of , and that the diagonalized covariance matrix is given by:

So that:

In what follows, we shall assume that eigenvalues have been placed in an ordered sequence, starting with the largest. In that case, represents the most characteristic of the set data points, with later eigenvalues representing successively less significant characteristics. We could even go so far as to say that, in some sense, represents the most interesting characteristic of the data, whereas would be devoid of interest. More practically, if we ignored we might not lose much information. And indeed the last few eigenvalues would frequently represent characteristics which are not statistically significant and are essentially noise. For these reasons, PCA is commonly used for reduction in dimensionality of the feature space to some lower value .

Well, this was a short blog post. I have understood the concept behind PCA, I just haven’t understood how I can calculate it. No worries! There’s always NumPy… Unless I’m in C++, then I’m screwed!

## Monte Carlo Integration in Rendering

When I started this blog, it was graphics, but now I mostly post ML and DSP stuff and I’m really sad about it. I haven’t touched graphics programming in 2 months, and that’s something which I would like to classify as bad (who says that?). But wait! It can still be about graphics programming, I just have to do graphics stuff that has ML in it, and not the other way around, because I want ML to be my de jur my specialty.

So what do you wanna learn today, Chubak? “Something graphics-related.” But what? “Um… I wanna learn about something graphics-related that’s also about probability, because it’ll help me ease into my life as an ML student.” So Chubak, learn about Monte Carlo Integration! “I can’t play blackjack! I mean, I once coded a blackjack game, but I’ve forgotten all about the rules!” Oh Chubakabra, you’re so unfunny I could kill you! Seriously!

So I have this book, Computer Graphics: Principals and Practice. I love it. It’s written by a slew of well-to-do programmers who would eat me just by looking at me (if I were a girl, that would mean something different). I will quote and paraphrase this book to learn. Let’s learn together.

We have all learned about numerical methods in our high school math class. Methods such as integration, interpolation, sequences, and so on. There are two types of numerical methods: Deterministic, and randomized.

A typical deterministic method for integrating a function for integrating a function over an interval is to take equally spaced points in the interval, , evaluate at midpoint of each of the intervals these points define, sum up the results, and multiply by , the width of each interval. For sufficiently continuous functions , as increases this will converge to correct value.

A probabilistic or Monte Carlo method for computing, or more precisely, for estimating the integral of over the interval is this: Let be randomly chosen points in the interval . Then is a random value whose average is the integral . To implement this we use a random number generator to generate points in the interval , evaluate at each, average the results, and multiply by . This gives an estimate of the integral, as shown in the figure below, with 20 samples approximates the correct result, which is .

Of course, each time we compute such an estimate we’ll get a different value (duh!). The variance in these values depends on the shapes of the function . If we generate the random points nonuniformly, we must slightly alter the formula. But in using a nonuniform distribution of points we gain an enormous advantage: By making our non nonuniform distribution favor points where is large, we can drastically reduce the variance of our estimate. This nonuniform sampling approach is called importance sampling, as you see before you.

Because major shift in rendering methods over the past several decades was from deterministic to randomized, we’ll be discussing randomized approaches to solving rendering equations. To do so means we’re using random variables, expectation, and variance. We deal with discrete values, because computers are inherently discrete. Continuous values deal with probability density function but we’re not gonna discuss that. We’re going to discuss probability mass function. PMF has two properties:

1- For every we have .

2-

The first property is called non-negativity. The second one is called normality. The intuition is that represents a set of outcomes of some experiment, and is probability outcome of , a member of . An event is a subset of a probability space. The probability of an event is the sum of PMFs of elements of that event, insomuch as:

A random variable is a function, usually denoted by a capital letter, from a probability space to the real numbers:

Keep in mind that the function is not a variable, its a real-valued function. It’s not random either, is a single real number for any outcome .

A random variable is used to define events. For example, the set of outcome for which , that is, if ht and th are a set of strings representing heads or tails,

and

is an event with probability . We write this as , we use the predicate as a shorthand for the event defined by the predicate.

Let’s take a look at a piece of code for simulating the experiment represented by the formalities above:

headcount = 0
if (randb()): // first coin flip
if (randb()): // second coin flip


Here we assume that ranb() is a Boolean function that returns true half the time. How is it related to our abstraction? Well, imagine the set of all possible executions of the program, declaring two executions to be the same values returned by ranb are pairwise identical. This means that there are four possible program executions, in which the two ranb() invocations return TT, TF, FT, and FF. Our belief and experience is that these four executions are equally likely, that is, each occurs just about one quarter of the time.

Now, the analogy becomes clearer. The set of possible program executions, with associated probabilities, is a probability space. The variables in the program that depend on ranb invocations are random variables. I hope it’s all clear to understand now.

Let’s talk about the expected value, also called the mean. It’s basically the summation of the product of PMF and the random variable:

Imagine h is heads and t is tails. We saw ht and th. There’s also hh and tt. So the expected value of it would be:

You may wonder where we got the s from. Well, something I meant to tell you is that is diminutive, meaning that we shall assign the value to it ourselves. Here, we have assigned 1 to h and 0 to t. The is 2 because it has 2 s.

Let’s talk about distribution. Probability distribution is a function that provides the probabilities of occurrences of different possible outcomes of an event (this is the Wikipedia wording because it’s 7AM and I’m toast and I can’t word it myself).

When we say random variable has the distribution we shall denote .

The dispersion of values clustered around is denoted by variance of it, and is defined as:

Where is the mean, or average of . I won’t insult your intelligence by defining the average. Geez!

is called standard deviation. The random variables and are called independent if:

The important properties of independent random variables are:

1)

2)

When I started talking about probability, I talked about continuous probability vs. discrete probability. We covered discrete probability. Now to cover the difference between continuous probability and discrete probability:

1) Values are continuous. Meaning that the numbers are infinite.

2) Certain aspects of the analysis involves mathematical subtleties like measurablity.

3) Our probability space will be infinite. We must use probability density function instead of PMF.

PDF properties are:

1- For every we have .

2-

But if the distribution of is uniform, the PDF is defined as:

In continuous probability, is defined as:

Now let’s compare the definitions of PMF and PDF:

In continuous probability, random variables are better to be called random points. Because if is a probability space, and is a mapping to another space rather than , then we must call a random point rather than a random variable. The notion of probability density applies here as you can consider for any we have:

Now let’s apply what we’ve learned to the sphere. A sphere has three coordinates, longitude, latitude, and colatitude. We only use longitude and colatitude in a , 2D Carthesian coordinates applied to random variable , turning it into . The particularization of it being:

We’ll start with the uniform probability density on or . Scroll up to see the formula for uniform probability density. For notational convenience, well write .

We have the intuitive sense that if we pick points uniformly, randomly in the unit square and use to convert them to points in the unit sphere, they’ll be clustered near the pole. This means that the induced probability density on will not be uniform. The image below shows this.

We’ll now talk about ways to estimate the expected value of a continuous random variable, and using these to evaluate integrals. This is important because in rendering, we’re going to want to evaluate the reflectence integral:

for various value of and . is the direction of incoming light. A code that generates a random number uniformly distributed on and takes square root produces value between 0 and 1. If we use the PDF on it, as it is a uniform value, the expected value is . This also happens to be the average value of on that interval. What does it all mean?

Let’s take a look at Theorem 3.48 of the book. It states that if is a real-valued function, and is a uniform random variable on the interval , then is a random variable whose expected value is:

What does it say? It says that we can use a randomized algorithm to compute the value of an integral, at least if we run the code often enough and average the results.

In general, we’ve got some quantity , like the integral we saw above, that we’d like to evaluate, and some randomized algorithm that produces an estimate of . We call such a random variable and estimator for the quantity. The estimator is unbiased if its expected value is actually . Generally, unbiased estimators are to be preferred over biased ones.

We talked about discrete and continuous probabilities. There’s a third one, which we shall call mixed probabilities, that comes up in rendering. They arise exactly from the impulses in bidirectional scattering distribution functions, which we shall talk about one day (perhaps, if I’m alive tomorrow) or the impulses caused by point lights. These are probabilities that are defined on a continuum, like interval , but are not defined strictly by PDF. Consider the following program:

if uniform(0, 1) > 0.6 :
return 0.3
else :
return uniform(0, 1)


Sixty percent of the time this program reports 0.3, the other 40% of the time it returns a value uniformly distributed on . The return value is a random variable that has a probability mass of 0.6 at 0.3, and a PDF given by at all other points. We shall define the PDF as:

Overall, a random variable with mixed probability is one for which there is a finite set of points in the domain at which the PDF is defined, and vice versa, uniform points where PMF is undefined.

Wow that was a long post! But I’ve learned so much. This blog has four focus subjects, and that’s a lot. But it’s the summer, and until the universities open, I have a lot of free time on my hands. So I’ll try to post um, about 1~2 blog posts about graphics programming a week.

Did I mention that I have a Fiverr gig now? Well , I do! Here’s the link. Enjoy your day, while I ponder about what shall I learn next!