I get like, ~100 visitors daily, and I know it’s a lot for a person who’s just started, and I’m not even blogging for visitors, because this blog is basically my study notes, but I think, if you look hard enough, if you slant your eyes, you may find something of value in it. So why not introduce it to your friends? Buddies? Comrades? Homies? Compadres? Dusts? Rafighs? Ais? Do it!

I wanna be honest with you. One thing that I never understood is Principal Component Analysis. It’s very hard. At least to me. Because my college doesn’t offer any courses in the way of statistics (yes, I’m an Associates student!) I thought to myself, “Shieeet, I’m gonna read every resource there is. From Youtube, which I kinda feel like is the worst resource for learning – At least, sometimes – to a book about Computer Vision!” and ta-dah! Here we are. A book about Computer Vision is about to clear everything PCA. Let us commence learning! Put your learning hat and goggles on, and wear your diaphragm, because we’re gonna get screwed!

The following is based on Computer Vision, Principles, Algorithms, Applications, Learning by E. R. Davies.

One powerful way of approaching the task of data representation is Principal Component Analysis, or PCA. This involves finding the mean of a cluster of points in feature space, and then finding the principal axes of the cluster in the following way: First, an axis is found which passes through the mean position and which gives the maximum variance when the data are projected onto it. Then, a second such axis is found which maximizes variance in a directional normal to the first. This process is carried out until a total of N principal axes have been found for N-dimensional feature space. The process is illustrated in this figure:

 In fact, the process is entirely mathematical and need not be undertaken in the strict sequence indicated above, it merely involves finding a set of orthogonal axes which diagonalize the covariance matrix. A matrix is diagonalizable if it has n distinct eigenvalues in F, the eigenspace. Eigenvalues are defined as (I know what eigenvalues are, this is just for people who are reading this post in the future and don’t wanna google):

    \[ A\vec{v} = \lambda \vect{v} \]

Where A is the matrix, in this case, the covariance matrix, \vec{v} is the eigenvector, and \lambda is the eigenvalue.

The covariance matrix for the input population is defined as:

    \[ \Sigma = \mathbb{E}((x_{(p)} - \mu)(x_{(p)}-\mu)^T) \]

where x_{(p)} is the location of the pth data point and \mu is the mean of P data points, and \mathbb{E}(.) indicates expectation value for the underlying population. We can estimate \Sigma from the equations:

    \[ \Sigma = \frac{1}{P}\sum_{p=1}^{P}x_{(p)}x_{(p)}^T - \mu\mu^T \]

Where:

    \[ \mu = \frac{1}{P}\sum_{p=1}^{P}x_{(p)} \]

As \Sigma is real and symmetric, it is possible to diagonalize it using a suitable orthogonal transformation matrix A, obtaining a set of N orthonormal eigenvectors u_i with eigenvalues \lambda_i given by:

    \[ \Sigma u_i = \lambda_i u_i \quad (I = 1, 2, \ldots, N) \]

The vectors u_i are derived from the original vectors x_i by:

    \[ u_i = A(x_i - \mu) \]

And the inverse transformation needed to recover the original data vectors is:

    \[ x_i = \mu + A^T\u_i \]

Here, we have recalled that for an orthogonal matrix

    \[ A^{-1} = A^T \]

In fact, it may be shown that A is the matrix whose rows are formed from the eigenvectors of \Sigma, and that the diagonalized covariance matrix \Sigma^\prime is given by:

    \[ \Sigma^\prime = A\SigmaA^T \]

So that:

    \[ \Sigma^\prime =\left[\begin{array}{cccc}\lambda_1 & 0 & \hdots & 0 \\0 & \lambda_2 && 0 \\\vdots && \ddots & \vdots \\0 & 0 & \hdots & \lambda_N\end{array}\right] \]

In what follows, we shall assume that eigenvalues have been placed in an ordered sequence, starting with the largest. In that case, \lambda_1 represents the most characteristic of the set data points, with later eigenvalues representing successively less significant characteristics. We could even go so far as to say that, in some sense, \lambda_1 represents the most interesting characteristic of the data, whereas \lambda_N would be devoid of interest. More practically, if we ignored \lambda_N we might not lose much information. And indeed the last few eigenvalues would frequently represent characteristics which are not statistically significant and are essentially noise. For these reasons, PCA is commonly used for reduction in dimensionality of the feature space N to some lower value N^\prime.

Well, this was a short blog post. I have understood the concept behind PCA, I just haven’t understood how I can calculate it. No worries! There’s always NumPy… Unless I’m in C++, then I’m screwed!

When I started this blog, it was graphics, but now I mostly post ML and DSP stuff and I’m really sad about it. I haven’t touched graphics programming in 2 months, and that’s something which I would like to classify as bad (who says that?). But wait! It can still be about graphics programming, I just have to do graphics stuff that has ML in it, and not the other way around, because I want ML to be my de jur my specialty.

So what do you wanna learn today, Chubak? “Something graphics-related.” But what? “Um… I wanna learn about something graphics-related that’s also about probability, because it’ll help me ease into my life as an ML student.” So Chubak, learn about Monte Carlo Integration! “I can’t play blackjack! I mean, I once coded a blackjack game, but I’ve forgotten all about the rules!” Oh Chubakabra, you’re so unfunny I could kill you! Seriously!

So I have this book, Computer Graphics: Principals and Practice. I love it. It’s written by a slew of well-to-do programmers who would eat me just by looking at me (if I were a girl, that would mean something different). I will quote and paraphrase this book to learn. Let’s learn together.

 

We have all learned about numerical methods in our high school math class. Methods such as integration, interpolation, sequences, and so on. There are two types of numerical methods: Deterministic, and randomized.

A typical deterministic method for integrating a function for integrating a function f over an interval [a, b] is to take n + 1 equally spaced points in the interval, t_0 = a, t_1 = a + \frac{b - a }{n}, \ldots, t_n - b, evaluate f at midpoint \frac{t_i + t_{i + 1}}{2} of each of the intervals these points define, sum up the results, and multiply by \frac{b -a}{b}, the width of each interval. For sufficiently continuous functions f, as n increases this will converge to correct value.

A probabilistic or Monte Carlo method for computing, or more precisely, for estimating the integral of f over the interval [a, b] is this: Let X_1, \ldots, X_n be randomly chosen points in the interval [a, b]. Then Y = (b - a) \frac{1}{n}\sum_{i = 1}^{n}f(X_i) is a random value whose average is the integral \int_{[a,b]}f. To implement this we use a random number generator to generate n points in the interval [a, b], evaluate f at each, average the results, and multiply by b-a. This gives an estimate of the integral, as shown in the figure below, \int_{-1}^{1}\sqrt{1 - x^2} dx with 20 samples approximates the correct result, which is \frac{\pi}{2}.

 

Of course, each time we compute such an estimate we’ll get a different value (duh!). The variance in these values depends on the shapes of the function f. If we generate the random points x_i nonuniformly, we must slightly alter the formula. But in using a nonuniform distribution of points we gain an enormous advantage: By making our non nonuniform distribution favor points x_i where f(x) is large, we can drastically reduce the variance of our estimate. This nonuniform sampling approach is called importance sampling, as you see before you.

 

Because major shift in rendering methods over the past several decades was from deterministic to randomized, we’ll be discussing randomized approaches to solving rendering equations. To do so means we’re using random variables, expectation, and variance. We deal with discrete values, because computers are inherently discrete. Continuous values deal with probability density function but we’re not gonna discuss that. We’re going to discuss probability mass function. PMF has two properties:

1- For every s \in S we have p(s) \geq 0.

2- \sum_{s \in S}p(s) = 1

The first property is called non-negativity. The second one is called normality. The intuition is that S represents a set of outcomes of some experiment, and p(s) is probability outcome of s, a member of S. An event is a subset of a probability space. The probability of an event is the sum of PMFs of elements of that event, insomuch as:

    \[ Pr\{E\} = \sum_{s \in S} p(s) \]

A random variable is a function, usually denoted by a capital letter, from a probability space to the real numbers:

    \[ X: S \rightarrow \boldsymbol{R}. \]

Keep in mind that the function X is not a variable, its a real-valued function. It’s not random either, X(s) is a single real number for any outcome s\inS.

A random variable is used to define events. For example, the set of outcome s for which X(s) = 1, that is, if ht and th are a set of strings representing heads or tails,

    \[ E = {s \in S : X(s) = 1} \]

and

    \[ = {ht, th} \]

is an event with probability \frac{1}{2}. We write this as Pr\{X=1\} = \frac{1}{2}, we use the predicate X=1 as a shorthand for the event defined by the predicate.

Let’s take a look at a piece of code for simulating the experiment represented by the formalities above:

 

headcount = 0
if (randb()): // first coin flip
    headcount++
if (randb()): // second coin flip
    headcount++
return headcount

Here we assume that ranb() is a Boolean function that returns true half the time. How is it related to our abstraction? Well, imagine the set S of all possible executions of the program, declaring two executions to be the same values returned by ranb are pairwise identical. This means that there are four possible program executions, in which the two ranb() invocations return TT, TF, FT, and FF. Our belief and experience is that these four executions are equally likely, that is, each occurs just about one quarter of the time.

Now, the analogy becomes clearer. The set of possible program executions, with associated probabilities, is a probability space. The variables in the program that depend on ranb invocations are random variables. I hope it’s all clear to understand now.

Let’s talk about the expected value, also called the mean. It’s basically the summation of the product of PMF and the random variable:

    \[ E[X] = \sum_{s\inS} p(s)X(s) \]

Imagine h is heads and t is tails. We saw ht and th. There’s also hh and tt. So the expected value of it would be:

    \[ E[X] = p(hh)X(hh) + p(ht)X(ht) + p(th)X(th) + p(tt)X(tt) \]

    \[ = \frac{1}{4}. 2 +\frac{1}{4} . 1 + \frac{1}{4} . 1 + \frac{1}{4} .0 \]

    \[ = 1 \text{QED} \]

You may wonder where we got the Xs from. Well, something I meant to tell you is that X is diminutive, meaning that we shall assign the value to it ourselves. Here, we have assigned 1 to h and 0 to t. The X(hh) is 2 because it has 2 hs.

Let’s talk about distribution. Probability distribution is a function that provides the probabilities of occurrences of different possible outcomes of an event (this is the Wikipedia wording because it’s 7AM and I’m toast and I can’t word it myself).

When we say random variable X has the distribution f we shall denote X \sim f.

The dispersion of values clustered around X is denoted by variance of it, and is defined as:

    \[ \boldsymbol{Var}[X] = E[(X - \bar{X})^2] \]

Where \bar{X} is the mean, or average of X. I won’t insult your intelligence by defining the average. Geez!

\sqrt{\boldsymbol{Var}} is called standard deviation. The random variables X and Y are called independent if:

    \[ Pr\{X = x \text{ and } Y = y\} = Pr\{X = x\}.Pr\{Y = y\} \]

The important properties of independent random variables are:

1)

    \[ E[XY] = E[X]E[Y] \]

2)

    \[ \boldsymbol{Var}[X + Y] = \boldsymbol{Var}[X] + \boldsymbol{Var}[Y] \]

When I started talking about probability, I talked about continuous probability vs. discrete probability. We covered discrete probability. Now to cover the difference between continuous probability and discrete probability:

1) Values are continuous. Meaning that the numbers are infinite.

2) Certain aspects of the analysis involves mathematical subtleties like measurablity.

3) Our probability space will be infinite. We must use probability density function instead of PMF.

PDF properties are:

1- For every s \in S we have p(s) \geq 0.

2- \int_{s\inS}p(s) = 1

But if the distribution of S is uniform, the PDF is defined as:

    \[ p(x) = \left\{ \begin{array}{lr}\frac{1}{b-a} & \text{ for } a \leq b \\ 0 & \text{ for } x < a \text{ or } x > b \end{array}\]

In continuous probability, E[X] is defined as:

    \[ E[X] := \int_{s\inS} p(s)X(s) \]

Now let’s compare the definitions of PMF and PDF:

    \[ \mathbb{PMF} \rightarrow p_y(t) = Pr\{Y = t\} \text{ for } t \in T \]

    \[ \mathbb{PDF} \rightarrow Pr\{a\leq X \leq b\} = \int_a^bp(r)dr \]

In continuous probability, random variables are better to be called random points. Because if S is a probability space, and Y : S \rightarrow T is a mapping to another space rather than \mathbb{R}, then we must call Y a random point rather than a random variable. The notion of probability density applies here as you can consider for any U \subset T we have:

    \[ Pr\{Y\in U} = \int_{u \in U} P_Y(u)du \]

Now let’s apply what we’ve learned to the sphere. A sphere has three coordinates, longitude, latitude, and colatitude. We only use longitude and colatitude in a \mathbb{R}^2, 2D Carthesian coordinates applied to random variable S, turning it into S^2. The particularization of it being:

    \[ Y : [0, 1] \times [0, 1] \rightarrow S^2 : (u, v) \rightarrow (\cos(2\pi u)\sin(\pi v), \cos(\pi v) \sin( 2\pi u) sin(\pi v)) \]

We’ll start with the uniform probability density p on [0, 1] \times [0, 1] or p(u, v) = 1. Scroll up to see the formula for uniform probability density. For notational convenience, well write (x, y, z) = Y(u, v).

We have the intuitive sense that if we pick points uniformly, randomly in the unit square and use f to convert them to points in the unit sphere, they’ll be clustered near the pole. This means that the induced probability density on T will not be uniform. The image below shows this.

We’ll now talk about ways to estimate the expected value of a continuous random variable, and using these to evaluate integrals. This is important because in rendering, we’re going to want to evaluate the reflectence integral:

    \[ L^{ref}(P, \omega_o) = \int_{\omega_i \in S_{+}^{2}}L(P, - \omega_i)f_s(P,\omega_i,\omega_0)\omega_i . \boldsymbol{n}d\omega_i, \]

for various value of P and \omega_0. \omega is the direction of incoming light. A code that generates a random number uniformly distributed on [0, 1] and takes square root produces value between 0 and 1. If we use the PDF on it, as it is a uniform value, the expected value is \frac{2}{3}. This also happens to be the average value of f(x) = \sqrt{x} on that interval. What does it all mean?

Let’s take a look at Theorem 3.48 of the book. It states that if f : [a, b] \rightarrow \mathbb{R} is a real-valued function, and X \sim \boldsymbol{U}(a, b) is a uniform random variable on the interval [a, b], then (b-a)f(x) is a random variable whose expected value is:

    \[ E[(b-a)f(x)] = \int_a^b f(x)dx . \]

What does it say? It says that we can use a randomized algorithm to compute the value of an integral, at least if we run the code often enough and average the results.

In general, we’ve got some quantity C, like the integral we saw above, that we’d like to evaluate, and some randomized algorithm that produces an estimate of C. We call such a random variable and estimator for the quantity. The estimator is unbiased if its expected value is actually C. Generally, unbiased estimators are to be preferred over biased ones.

We talked about discrete and continuous probabilities. There’s a third one, which we shall call mixed probabilities, that comes up in rendering. They arise exactly from the impulses in bidirectional scattering distribution functions, which we shall talk about one day (perhaps, if I’m alive tomorrow) or the impulses caused by point lights. These are probabilities that are defined on a continuum, like interval [0, 1], but are not defined strictly by PDF. Consider the following program:

if uniform(0, 1) > 0.6 :
    return 0.3
else :
    return uniform(0, 1)

Sixty percent of the time this program reports 0.3, the other 40% of the time it returns a value uniformly distributed on [0, 1]. The return value is a random variable that has a probability mass of 0.6 at 0.3, and a PDF given by d(x) = 0.4 at all other points. We shall define the PDF as:

    \[ d(x) = \left\{ \begin{array}{lr}0.4 & \text{ for } x \neq 0.3 \\ 0.6\times\infty & \text{ for } x = 0.3 \end{array}\]

Overall, a random variable with mixed probability is one for which there is a finite set of points in the domain at which the PDF is defined, and vice versa, uniform points where PMF is undefined.

Wow that was a long post! But I’ve learned so much. This blog has four focus subjects, and that’s a lot. But it’s the summer, and until the universities open, I have a lot of free time on my hands. So I’ll try to post um, about 1~2 blog posts about graphics programming a week.

Did I mention that I have a Fiverr gig now? Well , I do! Here’s the link. Enjoy your day, while I ponder about what shall I learn next!