When I started this blog, it was graphics, but now I mostly post ML and DSP stuff and I’m really sad about it. I haven’t touched graphics programming in 2 months, and that’s something which I would like to classify as bad (who says that?). But wait! It can still be about graphics programming, I just have to do graphics stuff that has ML in it, and not the other way around, because I want ML to be my de jur my specialty.

So what do you wanna learn today, Chubak? “Something graphics-related.” But what? “Um… I wanna learn about something graphics-related that’s also about probability, because it’ll help me ease into my life as an ML student.” So Chubak, learn about Monte Carlo Integration! “I can’t play blackjack! I mean, I once coded a blackjack game, but I’ve forgotten all about the rules!” Oh Chubakabra, you’re so unfunny I could kill you! Seriously!

So I have this book, Computer Graphics: Principals and Practice. I love it. It’s written by a slew of well-to-do programmers who would eat me just by looking at me (if I were a girl, that would mean something different). I will quote and paraphrase this book to learn. Let’s learn together.


We have all learned about numerical methods in our high school math class. Methods such as integration, interpolation, sequences, and so on. There are two types of numerical methods: Deterministic, and randomized.

A typical deterministic method for integrating a function for integrating a function f over an interval [a, b] is to take n + 1 equally spaced points in the interval, t_0 = a, t_1 = a + \frac{b - a }{n}, \ldots, t_n - b, evaluate f at midpoint \frac{t_i + t_{i + 1}}{2} of each of the intervals these points define, sum up the results, and multiply by \frac{b -a}{b}, the width of each interval. For sufficiently continuous functions f, as n increases this will converge to correct value.

A probabilistic or Monte Carlo method for computing, or more precisely, for estimating the integral of f over the interval [a, b] is this: Let X_1, \ldots, X_n be randomly chosen points in the interval [a, b]. Then Y = (b - a) \frac{1}{n}\sum_{i = 1}^{n}f(X_i) is a random value whose average is the integral \int_{[a,b]}f. To implement this we use a random number generator to generate n points in the interval [a, b], evaluate f at each, average the results, and multiply by b-a. This gives an estimate of the integral, as shown in the figure below, \int_{-1}^{1}\sqrt{1 - x^2} dx with 20 samples approximates the correct result, which is \frac{\pi}{2}.


Of course, each time we compute such an estimate we’ll get a different value (duh!). The variance in these values depends on the shapes of the function f. If we generate the random points x_i nonuniformly, we must slightly alter the formula. But in using a nonuniform distribution of points we gain an enormous advantage: By making our non nonuniform distribution favor points x_i where f(x) is large, we can drastically reduce the variance of our estimate. This nonuniform sampling approach is called importance sampling, as you see before you.


Because major shift in rendering methods over the past several decades was from deterministic to randomized, we’ll be discussing randomized approaches to solving rendering equations. To do so means we’re using random variables, expectation, and variance. We deal with discrete values, because computers are inherently discrete. Continuous values deal with probability density function but we’re not gonna discuss that. We’re going to discuss probability mass function. PMF has two properties:

1- For every s \in S we have p(s) \geq 0.

2- \sum_{s \in S}p(s) = 1

The first property is called non-negativity. The second one is called normality. The intuition is that S represents a set of outcomes of some experiment, and p(s) is probability outcome of s, a member of S. An event is a subset of a probability space. The probability of an event is the sum of PMFs of elements of that event, insomuch as:

    \[ Pr\{E\} = \sum_{s \in S} p(s) \]

A random variable is a function, usually denoted by a capital letter, from a probability space to the real numbers:

    \[ X: S \rightarrow \boldsymbol{R}. \]

Keep in mind that the function X is not a variable, its a real-valued function. It’s not random either, X(s) is a single real number for any outcome s\inS.

A random variable is used to define events. For example, the set of outcome s for which X(s) = 1, that is, if ht and th are a set of strings representing heads or tails,

    \[ E = {s \in S : X(s) = 1} \]


    \[ = {ht, th} \]

is an event with probability \frac{1}{2}. We write this as Pr\{X=1\} = \frac{1}{2}, we use the predicate X=1 as a shorthand for the event defined by the predicate.

Let’s take a look at a piece of code for simulating the experiment represented by the formalities above:


headcount = 0
if (randb()): // first coin flip
if (randb()): // second coin flip
return headcount

Here we assume that ranb() is a Boolean function that returns true half the time. How is it related to our abstraction? Well, imagine the set S of all possible executions of the program, declaring two executions to be the same values returned by ranb are pairwise identical. This means that there are four possible program executions, in which the two ranb() invocations return TT, TF, FT, and FF. Our belief and experience is that these four executions are equally likely, that is, each occurs just about one quarter of the time.

Now, the analogy becomes clearer. The set of possible program executions, with associated probabilities, is a probability space. The variables in the program that depend on ranb invocations are random variables. I hope it’s all clear to understand now.

Let’s talk about the expected value, also called the mean. It’s basically the summation of the product of PMF and the random variable:

    \[ E[X] = \sum_{s\inS} p(s)X(s) \]

Imagine h is heads and t is tails. We saw ht and th. There’s also hh and tt. So the expected value of it would be:

    \[ E[X] = p(hh)X(hh) + p(ht)X(ht) + p(th)X(th) + p(tt)X(tt) \]

    \[ = \frac{1}{4}. 2 +\frac{1}{4} . 1 + \frac{1}{4} . 1 + \frac{1}{4} .0 \]

    \[ = 1 \text{QED} \]

You may wonder where we got the Xs from. Well, something I meant to tell you is that X is diminutive, meaning that we shall assign the value to it ourselves. Here, we have assigned 1 to h and 0 to t. The X(hh) is 2 because it has 2 hs.

Let’s talk about distribution. Probability distribution is a function that provides the probabilities of occurrences of different possible outcomes of an event (this is the Wikipedia wording because it’s 7AM and I’m toast and I can’t word it myself).

When we say random variable X has the distribution f we shall denote X \sim f.

The dispersion of values clustered around X is denoted by variance of it, and is defined as:

    \[ \boldsymbol{Var}[X] = E[(X - \bar{X})^2] \]

Where \bar{X} is the mean, or average of X. I won’t insult your intelligence by defining the average. Geez!

\sqrt{\boldsymbol{Var}} is called standard deviation. The random variables X and Y are called independent if:

    \[ Pr\{X = x \text{ and } Y = y\} = Pr\{X = x\}.Pr\{Y = y\} \]

The important properties of independent random variables are:


    \[ E[XY] = E[X]E[Y] \]


    \[ \boldsymbol{Var}[X + Y] = \boldsymbol{Var}[X] + \boldsymbol{Var}[Y] \]

When I started talking about probability, I talked about continuous probability vs. discrete probability. We covered discrete probability. Now to cover the difference between continuous probability and discrete probability:

1) Values are continuous. Meaning that the numbers are infinite.

2) Certain aspects of the analysis involves mathematical subtleties like measurablity.

3) Our probability space will be infinite. We must use probability density function instead of PMF.

PDF properties are:

1- For every s \in S we have p(s) \geq 0.

2- \int_{s\inS}p(s) = 1

But if the distribution of S is uniform, the PDF is defined as:

    \[ p(x) = \left\{ \begin{array}{lr}\frac{1}{b-a} & \text{ for } a \leq b \\ 0 & \text{ for } x < a \text{ or } x > b \end{array}\]

In continuous probability, E[X] is defined as:

    \[ E[X] := \int_{s\inS} p(s)X(s) \]

Now let’s compare the definitions of PMF and PDF:

    \[ \mathbb{PMF} \rightarrow p_y(t) = Pr\{Y = t\} \text{ for } t \in T \]

    \[ \mathbb{PDF} \rightarrow Pr\{a\leq X \leq b\} = \int_a^bp(r)dr \]

In continuous probability, random variables are better to be called random points. Because if S is a probability space, and Y : S \rightarrow T is a mapping to another space rather than \mathbb{R}, then we must call Y a random point rather than a random variable. The notion of probability density applies here as you can consider for any U \subset T we have:

    \[ Pr\{Y\in U} = \int_{u \in U} P_Y(u)du \]

Now let’s apply what we’ve learned to the sphere. A sphere has three coordinates, longitude, latitude, and colatitude. We only use longitude and colatitude in a \mathbb{R}^2, 2D Carthesian coordinates applied to random variable S, turning it into S^2. The particularization of it being:

    \[ Y : [0, 1] \times [0, 1] \rightarrow S^2 : (u, v) \rightarrow (\cos(2\pi u)\sin(\pi v), \cos(\pi v) \sin( 2\pi u) sin(\pi v)) \]

We’ll start with the uniform probability density p on [0, 1] \times [0, 1] or p(u, v) = 1. Scroll up to see the formula for uniform probability density. For notational convenience, well write (x, y, z) = Y(u, v).

We have the intuitive sense that if we pick points uniformly, randomly in the unit square and use f to convert them to points in the unit sphere, they’ll be clustered near the pole. This means that the induced probability density on T will not be uniform. The image below shows this.

We’ll now talk about ways to estimate the expected value of a continuous random variable, and using these to evaluate integrals. This is important because in rendering, we’re going to want to evaluate the reflectence integral:

    \[ L^{ref}(P, \omega_o) = \int_{\omega_i \in S_{+}^{2}}L(P, - \omega_i)f_s(P,\omega_i,\omega_0)\omega_i . \boldsymbol{n}d\omega_i, \]

for various value of P and \omega_0. \omega is the direction of incoming light. A code that generates a random number uniformly distributed on [0, 1] and takes square root produces value between 0 and 1. If we use the PDF on it, as it is a uniform value, the expected value is \frac{2}{3}. This also happens to be the average value of f(x) = \sqrt{x} on that interval. What does it all mean?

Let’s take a look at Theorem 3.48 of the book. It states that if f : [a, b] \rightarrow \mathbb{R} is a real-valued function, and X \sim \boldsymbol{U}(a, b) is a uniform random variable on the interval [a, b], then (b-a)f(x) is a random variable whose expected value is:

    \[ E[(b-a)f(x)] = \int_a^b f(x)dx . \]

What does it say? It says that we can use a randomized algorithm to compute the value of an integral, at least if we run the code often enough and average the results.

In general, we’ve got some quantity C, like the integral we saw above, that we’d like to evaluate, and some randomized algorithm that produces an estimate of C. We call such a random variable and estimator for the quantity. The estimator is unbiased if its expected value is actually C. Generally, unbiased estimators are to be preferred over biased ones.

We talked about discrete and continuous probabilities. There’s a third one, which we shall call mixed probabilities, that comes up in rendering. They arise exactly from the impulses in bidirectional scattering distribution functions, which we shall talk about one day (perhaps, if I’m alive tomorrow) or the impulses caused by point lights. These are probabilities that are defined on a continuum, like interval [0, 1], but are not defined strictly by PDF. Consider the following program:

if uniform(0, 1) > 0.6 :
    return 0.3
else :
    return uniform(0, 1)

Sixty percent of the time this program reports 0.3, the other 40% of the time it returns a value uniformly distributed on [0, 1]. The return value is a random variable that has a probability mass of 0.6 at 0.3, and a PDF given by d(x) = 0.4 at all other points. We shall define the PDF as:

    \[ d(x) = \left\{ \begin{array}{lr}0.4 & \text{ for } x \neq 0.3 \\ 0.6\times\infty & \text{ for } x = 0.3 \end{array}\]

Overall, a random variable with mixed probability is one for which there is a finite set of points in the domain at which the PDF is defined, and vice versa, uniform points where PMF is undefined.

Wow that was a long post! But I’ve learned so much. This blog has four focus subjects, and that’s a lot. But it’s the summer, and until the universities open, I have a lot of free time on my hands. So I’ll try to post um, about 1~2 blog posts about graphics programming a week.

Did I mention that I have a Fiverr gig now? Well , I do! Here’s the link. Enjoy your day, while I ponder about what shall I learn next!

If Nearest Neighbor tires you, what would SVM do?” this sentence paraphrases Jeremiah 12:5; or it alludes to 1971 Christexploitation movie by Ron Ormond. Either way, if you’re stuck at NN, how are you going to deal with Support Vector Machines?!

Never fret. I’m here to paraphrase and quote from E. R. Davies’ Computer Vision: Principles, Algorithms, Application. I’ve sucked up to E. R. Davies before, I think he’s a wonderful person, and his contributions to the field of computer and machine vision and pattern recognition shan’t be let adrift from out minds. His books are perfect because he’s too… industrial. He’s not about theory, he’s all about application!

This post looks at classification through vision. So you’ve been warned.

When the objects that appear in an image have simple shapes, just one stage of processing may be required, – as in the case of circular washers on a conveyor belt. For more complex objects such as flat brackets and hinges, locating them requires at least two stages – as when graph matching methods are used. For situations where the full complexity of three dimensions occurs, more subtle procedures are usually required. Indeed, the very ambiguity involved in interpreting 2-D images from a set of 3-D objects generally requires cues to be sought and hypotheses to be proposed before any serious attempt can be made with the task. Thus, cues are vital to keying into complex data structures of many images. However, for simple situations, concentration on small features is valuable in permitting image interpretation to be carried out efficiently and rapidly. Neither must it be forgotten that in many applications of computer vision, the task is made simpler by the fact that the main interest lies in specific types of object.

In practical situations, measurements of prominent features allow most objects to be classified straightforwardly. In fact, this is most commonly achieved with varying degrees of certainty, but by comparing object features with those of a great many other known objects, we arrive at classifcations that are statistically the most likely ones. Hence, this sort of classification procedure is called SPR.

A good number of early pattern recognition techniques and algorithms followed the SPR approach without proceeding to the next stage – that of determing the solution that is mathematically the most probable one. Thus, an important goal has been to move the sujbects from SPR to PPR → Probable Pattern Recognition. Indeed, the key aim of ML is to progress towards probabilistic pattern recognition. In this blog post, we’ll study SPR. We’ve already discussed PPR in the last post, one of the aspects of it, at least, called the EM algorithm. We’ll unravel more aspects of PPR later.

Let’s talk about NN: Nearest Neighbor algorithm. The principle of the NN algorithm is that of comparing input image patterns against a number of paradigms and then classifying them according to the class of the paradigm that gives the closest match. Below yo see a principle component analyzed dataset wherein the dataset, in blue, is closer in paradigm to green rather than red.


An instructive but rather trivial example is shown below:

Here, a number of binary patterns are presented to the computer in the training phase of the algorithm: then the test patterns are presented one at a time and compared bit by bit against each of the training patterns. It is clear that this gives a generally reasonable result, the main problems arise when 1) training patterns of different classes are close together in Hamming distance – Hamming distance begin the distance between the image and its anomalies – and 2) minor translations, rotations, or noise cause variations that inhibit accurate recognition. More generally, problem 2 means that the training patterns are insufficiently representative of what will appear during the test phase. The latter statement encapsulates an exceptionally important principle, and it implies that there must be sufficient patterns in the training set for the algorithm to be able to generalize over all possible patterns of each class. However, problem 1 implies that patterns of two different classes may in some case be so similar as to be indistinguishable by any algorithm; and then it is inevitable that erroneous classifications will be made. It is seen below that this is because the underlying distribution in feature space overlap.

The basis of Bayes’ decision theory will now be examined. If we are trying to get a computer to classify objects a sound approach is to get it to measure some prominent feature of each object such as its length and to use this feature as an aid to classification. Sometimes, such a feature may give very little indication of the pattern class – perhaps because of the effects of manufacturing variation. For example, a hand-written character may be so ill formed that its features are of little help in interpreting it; it then becomes much more reliable to make use of the known relative frequencies of letters, or to invoke context. In fact, either of these strategies can give a greatly increased probability of correct interpretation. In other words, when feature measurement are found to giving an error rate above a certain threshold, it is more reliable to employ a priori probability of a given pattern appearing.

The next step in improving recognition performance is to combine the information from feature measurements and from a priori probabilities, this is achieved by applying Bayes’ rule. For a single feature x, this takes the form:

    \[ P(C_i|x) = \frac{p(x|C_i)P(C_i)}{p(x)} \]


    \[ p(x) = \sum_{j}p(x|C_j)P(C_j) \]

The vertical line | denotes conditional probability.

Mathematically, the variables here are 1) the a priori probability of class Ci, P(Ci); 2) the probability density for feature x, p(x); 3) the class-conditional probability density for feature x in class Ci, p(x|Ci), and 4) the a posteriori probability of class Ci when x is observed, P(Ci|x).

But what is a priori? What is a posteriori? Let this image I’ve created explain it all!


The notation P(Ci|x) is a standard one, being defined as the probability that the class is Ci when the feature is known to have the value x. Bayes’ rule says that to find the class of an object, we need to know two sets of information about the objects that might be viewed: the first is the basic probability P(Ci) that a particular class might arise; the second is the distribution of values of the feature x for each class.

Many common image analysis techniques give features that may be used to help identify or classify object. Such as area of the object, its perimeter, the number of holes it possesses, and so on. Generally, increasing number of features helps to resolve the classification problem more quickly. In the figure below, the first image has a much lower error rate than the second image. Keep in mind the cardinal rule of machine learning: you can’t have an error rate of 0!

Many classification methods, including NN and Bayes’ classifier, can involve substantial amounts of storage and computation the amount of training is to be sufficient to achieve low error rates. Hence, there is considerable value in employing methods that minimize computation cost. This is the goal of our lovely and stout Naïve Bayes’ classifier.

In Naïve Bayes’ algorithm, we multiply the a priori of different features and thus get a a posteriori with low error rate:

    \[ p(x|C_i)P(C_i) = p(x_1|C_i)p(x_2|Ci)\ldots p(x_N|C_i).P(C_i) = \prod_{j}p(x_j|C_i).P(C_i) \]

Let’s talk about the optimum number of features in an image. An important factor to consider is hat the optimum number of features depends on the amount of training a classifier receives. If the number of training set patterns is increased, more evidence is available to support the determination of a greater number of features, and hence to provide more accurate classification of test patterns.

Alright. Probability is nice and all but we shan’t go around all willy nilly – we need to calculate the cost of our mistakes! This is where the cost function comes into play. First, let’s take a look at a function known as the conditional risk function:

    \[ R(C_i|x) = \sum_j L(C_i|C_j)P(C_j|x) \]

This function expresses the expected cost of deciding on class Ci when x is observed. As it is wished to minimize this function, we decide on class Ci only if:

    \[ R(C_i|x) < R(C_i|x) \text{for all} j \neq u \]

If we were to choose a particularily simple cost function, of the form:


    \[L(C_i|C_j) = \left\{\begin{array}{lr}0, & \text{for } i=j\\1, & \text{for }i\neqj\\\end{array} \]

Cost functions permit classifications to be biased in favor oa safe decision in a rigorous, predetermined, and controlled manner. One way of minimizing costs is for the classifier to recognize when it is “doubtful” about a particular classification, because two or more classes are most equally likely. Then, one solution is to make a safe decision, the decision plane in feature space being biased away from its position for maximum probability classification. An alternative is to reject the pattern, i.e. place it into an “unknown” category, in that case, some other means can be employed for making appropriate classification.

Alright buckaroos! That is it for today! Of course one topic remains, and that’s SVMs: Support Vector Machines. But they, along with clustering, require a post of their own.

I’m kinda tired today… And I’m working on this massive project, which is to be revealed soon! Very, very, very soon! So buckle up, because when Chubak rides the bus, you can bet your butt it’s gonna be a bumpy ride!