When I started this blog, it was about graphics, but now I mostly post ML and DSP stuff, and I’m really sad about it. I haven’t touched graphics programming in two months, and that’s something which I would like to classify as bad (who says that?). But wait! It can still be about graphics programming; I just have to do graphics stuff that has ML in it, and not the other way around, because I want ML to be my de jure specialty.

So what do you wanna learn today, Chubak? “Something graphics-related.” But what? “Um… I wanna learn about something graphics-related that’s also about probability, because it’ll help me ease into my life as an ML student.” So Chubak, learn about Monte Carlo Integration! “I can’t play blackjack! I mean, I once coded a blackjack game, but I’ve forgotten all about the rules!” Oh Chubakabra, you’re so unfunny I could kill you! Seriously!

So I have this book, Computer Graphics: Principles and Practice. I love it. It’s written by a slew of well-to-do programmers who would eat me just by looking at me (if I were a girl, that would mean something different). I will quote and paraphrase this book to learn. Let’s learn together.

 

We have all learned about numerical methods in our high school math classes: methods such as integration, interpolation, sequences, and so on. There are two types of numerical methods: deterministic and randomized.

A typical deterministic method for integrating a function f over an interval [a, b] is to take n + 1 equally spaced points in the interval, t_0 = a, t_1 = a + \frac{b - a}{n}, \ldots, t_n = b, evaluate f at the midpoint \frac{t_i + t_{i + 1}}{2} of each of the intervals these points define, sum up the results, and multiply by \frac{b - a}{n}, the width of each interval. For sufficiently continuous functions f, as n increases this will converge to the correct value.
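
As a quick illustration, here’s a minimal Python sketch of that midpoint rule (the function names are mine, not the book’s):

import math

def midpoint_integrate(f, a, b, n):
    # evaluate f at the midpoint of each of the n equal subintervals,
    # sum the results, and multiply by the subinterval width (b - a) / n
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

print(midpoint_integrate(lambda x: math.sqrt(1 - x * x), -1, 1, 1000))
# prints ~1.5708, converging to pi/2 as n grows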

A probabilistic, or Monte Carlo, method for computing, or more precisely, for estimating the integral of f over the interval [a, b] is this: Let X_1, \ldots, X_n be randomly chosen points in the interval [a, b]. Then Y = (b - a) \frac{1}{n}\sum_{i = 1}^{n}f(X_i) is a random value whose average is the integral \int_{[a,b]}f. To implement this, we use a random number generator to generate n points in the interval [a, b], evaluate f at each, average the results, and multiply by b - a. This gives an estimate of the integral; as shown in the figure below, estimating \int_{-1}^{1}\sqrt{1 - x^2} dx with 20 samples approximates the correct result, which is \frac{\pi}{2}.
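
And a matching Python sketch of the Monte Carlo estimator (again, the names are mine), applied to the same semicircle integral:

import math, random

def mc_integrate(f, a, b, n):
    # average f at n uniform random points in [a, b], then scale by (b - a)
    return (b - a) * sum(f(random.uniform(a, b)) for _ in range(n)) / n

print(mc_integrate(lambda x: math.sqrt(1 - x * x), -1, 1, 20))       # noisy estimate of pi/2
print(mc_integrate(lambda x: math.sqrt(1 - x * x), -1, 1, 100_000))  # much closer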

 

Of course, each time we compute such an estimate we’ll get a different value (duh!). The variance in these values depends on the shape of the function f. If we generate the random points x_i nonuniformly, we must slightly alter the formula. But in using a nonuniform distribution of points we gain an enormous advantage: by making our nonuniform distribution favor points x_i where f(x) is large, we can drastically reduce the variance of our estimate. This nonuniform sampling approach is called importance sampling.
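
Here’s a minimal Python sketch of importance sampling. If the sampling PDF is proportional to f, the estimator’s variance collapses to zero; the toy integrand and PDF below are my own choices, not from the book:

import random

# toy integrand f(x) = x^2 on [0, 1]; sampling PDF p(x) = 3x^2 matches its shape,
# and is sampled by inverse transform: x = u^(1/3) for u ~ U(0, 1)
f = lambda x: x * x
pdf = lambda x: 3 * x * x

def importance_sample(n):
    # estimator: average of f(X_i) / p(X_i) with X_i drawn from p
    return sum(f(x) / pdf(x) for x in (random.random() ** (1 / 3) for _ in range(n))) / n

print(importance_sample(10))  # exactly 1/3 every run: zero variance, since f/p is constant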

 

Because the major shift in rendering methods over the past several decades was from deterministic to randomized, we’ll be discussing randomized approaches to solving rendering equations. To do so means we’re using random variables, expectation, and variance. We start with discrete values, because computers are inherently discrete. Continuous values involve the probability density function, but we’ll get to that later. First we’re going to discuss the probability mass function. A PMF has two properties:

1- For every s \in S we have p(s) \geq 0.

2- \sum_{s \in S}p(s) = 1

The first property is called non-negativity. The second one is called normality. The intuition is that S represents a set of outcomes of some experiment, and p(s) is the probability of outcome s, a member of S. An event is a subset of a probability space. The probability of an event is the sum of the PMFs of the elements of that event:

    \[ Pr\{E\} = \sum_{s \in E} p(s) \]

A random variable is a function, usually denoted by a capital letter, from a probability space to the real numbers:

    \[ X: S \rightarrow \mathbb{R}. \]

Keep in mind that the function X is not a variable; it’s a real-valued function. It’s not random either: X(s) is a single real number for any outcome s \in S.

A random variable is used to define events. For example, take the two-coin-flip experiment, with outcomes written as strings like ht and th for heads and tails, and let X count the heads. Then the set of outcomes s for which X(s) = 1,

    \[ E = \{s \in S : X(s) = 1\} \]

and

    \[ = \{ht, th\} \]

is an event with probability \frac{1}{2}. We write this as Pr\{X=1\} = \frac{1}{2}; we use the predicate X = 1 as a shorthand for the event defined by that predicate.

Let’s take a look at a piece of code for simulating the experiment represented by the formalities above:

 

headcount = 0
if ranb():  # first coin flip
    headcount += 1
if ranb():  # second coin flip
    headcount += 1
return headcount

Here we assume that ranb() is a Boolean function that returns true half the time. How is it related to our abstraction? Well, imagine the set S of all possible executions of the program, declaring two executions to be the same if the values returned by ranb are pairwise identical. This means that there are four possible program executions, in which the two ranb() invocations return TT, TF, FT, and FF. Our belief and experience is that these four executions are equally likely, that is, each occurs just about one quarter of the time.

Now the analogy becomes clearer. The set of possible program executions, with associated probabilities, is a probability space. The variables in the program that depend on ranb invocations are random variables. I hope it’s all clear now.

Let’s talk about the expected value, also called the mean. It’s the sum, over all outcomes, of the product of the PMF and the random variable:

    \[ E[X] = \sum_{s \in S} p(s)X(s) \]

Imagine h is heads and t is tails. We saw ht and th. There’s also hh and tt. So the expected value would be:

    \[ E[X] = p(hh)X(hh) + p(ht)X(ht) + p(th)X(th) + p(tt)X(tt) \]

    \[ = \frac{1}{4} \cdot 2 + \frac{1}{4} \cdot 1 + \frac{1}{4} \cdot 1 + \frac{1}{4} \cdot 0 \]

    \[ = 1 \quad \text{QED} \]

You may wonder where we got the Xs from. Well, something I meant to tell you is that X is ours to define: we assign its values ourselves. Here, we have assigned 1 to h and 0 to t, so X counts heads. X(hh) is 2 because it has two hs.
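
A quick Python simulation of the two-coin experiment confirms this (ranb here is my own stand-in for the book’s function):

import random

def ranb():
    return random.random() < 0.5  # true half the time

trials = 100_000
total = sum(int(ranb()) + int(ranb()) for _ in range(trials))
print(total / trials)  # hovers around 1.0, matching E[X] = 1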

Let’s talk about distribution. Probability distribution is a function that provides the probabilities of occurrences of different possible outcomes of an event (this is the Wikipedia wording because it’s 7AM and I’m toast and I can’t word it myself).

When we say random variable X has the distribution f we shall denote X \sim f.

The dispersion of values of X around its mean is measured by its variance, which is defined as:

    \[ \boldsymbol{Var}[X] = E[(X - \bar{X})^2] \]

Where \bar{X} is the mean, or average of X. I won’t insult your intelligence by defining the average. Geez!

\sqrt{\boldsymbol{Var}[X]} is called the standard deviation. The random variables X and Y are called independent if:

    \[ Pr\{X = x \text{ and } Y = y\} = Pr\{X = x\} \cdot Pr\{Y = y\} \]

The important properties of independent random variables are (verified numerically in the sketch after the list):

1)

    \[ E[XY] = E[X]E[Y] \]

2)

    \[ \boldsymbol{Var}[X + Y] = \boldsymbol{Var}[X] + \boldsymbol{Var}[Y] \]
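
Both properties are easy to check numerically; here’s a small Python sketch of my own with two independent uniform variables:

import random

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return mean([(t - m) ** 2 for t in v])

n = 100_000
xs = [random.random() for _ in range(n)]  # X ~ U(0, 1)
ys = [random.random() for _ in range(n)]  # Y ~ U(0, 1), independent of X

print(mean([x * y for x, y in zip(xs, ys)]), mean(xs) * mean(ys))  # E[XY] vs E[X]E[Y]
print(var([x + y for x, y in zip(xs, ys)]), var(xs) + var(ys))     # Var[X+Y] vs Var[X]+Var[Y]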

When I started talking about probability, I talked about continuous probability vs. discrete probability. We covered discrete probability. Now, here’s what changes when we move to continuous probability:

1) Values are continuous, meaning there are infinitely many possible outcomes.

2) Certain aspects of the analysis involve mathematical subtleties like measurability.

3) Our probability space will be infinite, so we must use a probability density function instead of a PMF.

PDF properties are:

1- For every s \in S we have p(s) \geq 0.

2- \int_{S} p(s) \, ds = 1

If the distribution on S = [a, b] is uniform, the PDF is defined as:

    \[ p(x) = \left\{ \begin{array}{lr}\frac{1}{b-a} & \text{ for } a \leq x \leq b \\ 0 & \text{ for } x < a \text{ or } x > b \end{array}\right. \]

In continuous probability, E[X] is defined as:

    \[ E[X] := \int_{S} p(s)X(s) \, ds \]

Now let’s compare the definitions of PMF and PDF:

    \[ \text{PMF} \rightarrow p_Y(t) = Pr\{Y = t\} \text{ for } t \in T \]

    \[ \text{PDF} \rightarrow Pr\{a \leq X \leq b\} = \int_a^b p(r) \, dr \]

In continuous probability, some random variables are better called random points: if S is a probability space and Y : S \rightarrow T is a mapping to a space other than \mathbb{R}, then we call Y a random point rather than a random variable. The notion of probability density still applies, since for any U \subset T we have:

    \[ Pr\{Y \in U\} = \int_{u \in U} p_Y(u) \, du \]

Now let’s apply what we’ve learned to the sphere. A point on a sphere can be described by longitude, latitude, and colatitude; we only need longitude and colatitude to parameterize the unit sphere S^2, mapping the 2-D Cartesian unit square onto it. The particularization of it being:

    \[ Y : [0, 1] \times [0, 1] \rightarrow S^2 : (u, v) \mapsto (\cos(2\pi u)\sin(\pi v), \cos(\pi v), \sin(2\pi u)\sin(\pi v)) \]

We’ll start with the uniform probability density p on [0, 1] \times [0, 1], i.e., p(u, v) = 1. Scroll up to see the formula for the uniform probability density. For notational convenience, we’ll write (x, y, z) = Y(u, v).

We have the intuitive sense that if we pick points uniformly at random in the unit square and use Y to convert them to points on the unit sphere, they’ll be clustered near the poles. This means that the induced probability density on T will not be uniform. The image below shows this.
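
You can verify the clustering numerically with a quick Python sketch (the 0.9 cutoff is an arbitrary choice of mine):

import math, random

def Y(u, v):
    # longitude 2*pi*u, colatitude pi*v
    return (math.cos(2 * math.pi * u) * math.sin(math.pi * v),
            math.cos(math.pi * v),
            math.sin(2 * math.pi * u) * math.sin(math.pi * v))

samples = [Y(random.random(), random.random()) for _ in range(100_000)]
# the two polar caps with |y| > 0.9 cover 10% of the sphere's area,
# but receive far more than 10% of the mapped samples
in_caps = sum(1 for (x, y, z) in samples if abs(y) > 0.9)
print(in_caps / len(samples))  # about 0.29 rather than 0.10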

We’ll now talk about ways to estimate the expected value of a continuous random variable, and use these to evaluate integrals. This is important because in rendering, we’re going to want to evaluate the reflectance integral:

    \[ L^{ref}(P, \omega_o) = \int_{\omega_i \in S_{+}^{2}}L(P, - \omega_i)f_s(P,\omega_i,\omega_o)(\omega_i \cdot \boldsymbol{n}) \, d\omega_i, \]

for various values of P and \omega_o, where \omega_i is the direction of incoming light. Consider code that generates a random number uniformly distributed on [0, 1] and takes its square root: it produces a value between 0 and 1, and since the input is uniform, the expected value of the output is \frac{2}{3}. This also happens to be the average value of f(x) = \sqrt{x} on that interval. What does it all mean?

Let’s take a look at Theorem 3.48 of the book. It states that if f : [a, b] \rightarrow \mathbb{R} is a real-valued function, and X \sim \boldsymbol{U}(a, b) is a uniform random variable on the interval [a, b], then (b-a)f(X) is a random variable whose expected value is:

    \[ E[(b-a)f(X)] = \int_a^b f(x) \, dx. \]

What does it say? It says that we can use a randomized algorithm to compute the value of an integral, at least if we run the code often enough and average the results.

In general, we’ve got some quantity C, like the integral we saw above, that we’d like to evaluate, and some randomized algorithm that produces an estimate of C. We call such a random variable an estimator for the quantity. The estimator is unbiased if its expected value is actually C. Generally, unbiased estimators are to be preferred over biased ones.

We talked about discrete and continuous probabilities. There’s a third kind, which we shall call mixed probabilities, that comes up in rendering. They arise from the impulses in bidirectional scattering distribution functions, which we shall talk about one day (perhaps, if I’m alive tomorrow), or the impulses caused by point lights. These are probabilities that are defined on a continuum, like the interval [0, 1], but are not strictly defined by a PDF. Consider the following program:

if uniform(0, 1) < 0.6:   # true sixty percent of the time
    return 0.3
else:
    return uniform(0, 1)

Sixty percent of the time this program reports 0.3, the other 40% of the time it returns a value uniformly distributed on [0, 1]. The return value is a random variable that has a probability mass of 0.6 at 0.3, and a PDF given by d(x) = 0.4 at all other points. We shall define the PDF as:

    \[ d(x) = \left\{ \begin{array}{lr}0.4 & \text{ for } x \neq 0.3 \\ 0.6\times\infty & \text{ for } x = 0.3 \end{array}\right. \]

Overall, a random variable with mixed probability is one whose distribution has a finite set of points carrying nonzero probability mass, with an ordinary PDF describing the rest of the domain.

Wow that was a long post! But I’ve learned so much. This blog has four focus subjects, and that’s a lot. But it’s the summer, and until the universities open, I have a lot of free time on my hands. So I’ll try to post um, about 1~2 blog posts about graphics programming a week.

Did I mention that I have a Fiverr gig now? Well, I do! Here’s the link. Enjoy your day, while I ponder what I shall learn next!

“If Nearest Neighbor tires you, what would SVM do?” This sentence paraphrases Jeremiah 12:5; or it alludes to the 1971 Christsploitation movie by Ron Ormond. Either way, if you’re stuck at NN, how are you going to deal with Support Vector Machines?!

Never fret. I’m here to paraphrase and quote from E. R. Davies’ Computer Vision: Principles, Algorithms, Applications, Learning. I’ve sucked up to E. R. Davies before; I think he’s a wonderful person, and his contributions to the field of computer and machine vision and pattern recognition shan’t be let adrift from our minds. His books are perfect because he’s too… industrial. He’s not about theory, he’s all about application!

This post looks at classification through vision. So you’ve been warned.

When the objects that appear in an image have simple shapes, just one stage of processing may be required – as in the case of circular washers on a conveyor belt. For more complex objects such as flat brackets and hinges, locating them requires at least two stages – as when graph matching methods are used. For situations where the full complexity of three dimensions occurs, more subtle procedures are usually required. Indeed, the very ambiguity involved in interpreting 2-D images from a set of 3-D objects generally requires cues to be sought and hypotheses to be proposed before any serious attempt can be made with the task. Thus, cues are vital to keying into the complex data structures of many images. However, for simple situations, concentration on small features is valuable in permitting image interpretation to be carried out efficiently and rapidly. Neither must it be forgotten that in many applications of computer vision, the task is made simpler by the fact that the main interest lies in specific types of object.

In practical situations, measurements of prominent features allow most objects to be classified straightforwardly. In fact, this is most commonly achieved with varying degrees of certainty, but by comparing object features with those of a great many other known objects, we arrive at classifications that are statistically the most likely ones. Hence, this sort of classification procedure is called statistical pattern recognition, or SPR.

A good number of early pattern recognition techniques and algorithms followed the SPR approach without proceeding to the next stage – that of determining the solution that is mathematically the most probable one. Thus, an important goal has been to move the subject from SPR to PPR → Probabilistic Pattern Recognition. Indeed, the key aim of ML is to progress towards probabilistic pattern recognition. In this blog post, we’ll study SPR. We’ve already discussed one aspect of PPR in the last post, at least – the EM algorithm. We’ll unravel more aspects of PPR later.

Let’s talk about NN: the Nearest Neighbor algorithm. The principle of the NN algorithm is that of comparing input image patterns against a number of paradigms and then classifying them according to the class of the paradigm that gives the closest match. Below you see a principal-component-analyzed dataset wherein the data point, in blue, is closer in paradigm to green rather than red.

 

An instructive but rather trivial example is shown below:

Here, a number of binary patterns are presented to the computer in the training phase of the algorithm; then the test patterns are presented one at a time and compared bit by bit against each of the training patterns. It is clear that this gives a generally reasonable result. The main problems arise when 1) training patterns of different classes are close together in Hamming distance – the Hamming distance being the number of bits in which two patterns differ – and 2) minor translations, rotations, or noise cause variations that inhibit accurate recognition. More generally, problem 2 means that the training patterns are insufficiently representative of what will appear during the test phase. The latter statement encapsulates an exceptionally important principle, and it implies that there must be sufficient patterns in the training set for the algorithm to be able to generalize over all possible patterns of each class. However, problem 1 implies that patterns of two different classes may in some cases be so similar as to be indistinguishable by any algorithm; and then it is inevitable that erroneous classifications will be made. It is seen below that this is because the underlying distributions in feature space overlap.
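
Here’s a minimal Python sketch of NN classification with Hamming distance (the binary paradigms below are hypothetical, just to show the mechanics):

def hamming(a, b):
    # number of bit positions at which two binary patterns differ
    return sum(x != y for x, y in zip(a, b))

def nearest_neighbor(test, training):
    # return the class of the training paradigm closest to the test pattern
    return min(training, key=lambda item: hamming(test, item[0]))[1]

training = [
    ((1, 1, 1, 0, 0, 0), "bar"),
    ((1, 0, 0, 1, 0, 0), "corner"),
]
print(nearest_neighbor((1, 1, 0, 0, 0, 0), training))  # "bar" (Hamming distance 1 vs 2)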

The basis of Bayes’ decision theory will now be examined. If we are trying to get a computer to classify objects, a sound approach is to get it to measure some prominent feature of each object, such as its length, and to use this feature as an aid to classification. Sometimes, such a feature may give very little indication of the pattern class – perhaps because of the effects of manufacturing variation. For example, a hand-written character may be so ill formed that its features are of little help in interpreting it; it then becomes much more reliable to make use of the known relative frequencies of letters, or to invoke context. In fact, either of these strategies can give a greatly increased probability of correct interpretation. In other words, when feature measurements are found to give an error rate above a certain threshold, it is more reliable to employ the a priori probability of a given pattern appearing.

The next step in improving recognition performance is to combine the information from feature measurements and from a priori probabilities; this is achieved by applying Bayes’ rule. For a single feature x, this takes the form:

    \[ P(C_i|x) = \frac{p(x|C_i)P(C_i)}{p(x)} \]

Where:

    \[ p(x) = \sum_{j}p(x|C_j)P(C_j) \]

The vertical line | denotes conditional probability.

Mathematically, the variables here are 1) the a priori probability of class Ci, P(Ci); 2) the probability density for feature x, p(x); 3) the class-conditional probability density for feature x in class Ci, p(x|Ci), and 4) the a posteriori probability of class Ci when x is observed, P(Ci|x).
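
As a quick sanity check of Bayes’ rule, here’s a minimal Python sketch (the priors and class-conditional densities are made-up numbers of mine):

def posteriors(priors, class_conditionals):
    # priors[i] = P(C_i); class_conditionals[i] = p(x | C_i) for the observed x
    joint = [p * l for p, l in zip(priors, class_conditionals)]
    p_x = sum(joint)  # p(x) = sum_j p(x | C_j) P(C_j)
    return [j / p_x for j in joint]  # P(C_i | x)

# class 0 is common a priori, but the observed x is unusual for it
print(posteriors([0.9, 0.1], [0.05, 0.60]))  # -> [~0.43, ~0.57]: the evidence flips the decision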

But what is a priori? What is a posteriori? Let this image I’ve created explain it all!

 

The notation P(Ci|x) is a standard one, being defined as the probability that the class is Ci when the feature is known to have the value x. Bayes’ rule says that to find the class of an object, we need to know two sets of information about the objects that might be viewed: the first is the basic probability P(Ci) that a particular class might arise; the second is the distribution of values of the feature x for each class.

Many common image analysis techniques give features that may be used to help identify or classify objects – such as the area of the object, its perimeter, the number of holes it possesses, and so on. Generally, increasing the number of features helps to resolve the classification problem more quickly. In the figure below, the first image has a much lower error rate than the second image. Keep in mind the cardinal rule of machine learning: you can’t have an error rate of 0!

Many classification methods, including NN and Bayes’ classifier, can involve substantial amounts of storage and computation if the amount of training is to be sufficient to achieve low error rates. Hence, there is considerable value in employing methods that minimize computation cost. This is the goal of our lovely and stout Naïve Bayes classifier.

In the Naïve Bayes algorithm, we assume the features are independent given the class, multiply their class-conditional probabilities together with the prior, and thus get an a posteriori estimate at low computational cost:

    \[ p(x|C_i)P(C_i) = p(x_1|C_i)p(x_2|C_i)\ldots p(x_N|C_i) \cdot P(C_i) = \prod_{j}p(x_j|C_i) \cdot P(C_i) \]
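
A minimal sketch of that computation in Python, done in log space to avoid underflow (the feature tables and priors below are hypothetical):

import math

def naive_bayes_score(feature_values, cond_tables, prior):
    # unnormalized log-score: log p(x | C_i) + log P(C_i), assuming
    # the features are independent given the class
    s = math.log(prior)
    for j, x_j in enumerate(feature_values):
        s += math.log(cond_tables[j][x_j])  # log p(x_j | C_i)
    return s

# two binary features; cond_tables[j][v] = p(x_j = v | class)
class_a = ([{0: 0.2, 1: 0.8}, {0: 0.7, 1: 0.3}], 0.6)
class_b = ([{0: 0.9, 1: 0.1}, {0: 0.4, 1: 0.6}], 0.4)
x = (1, 0)
scores = {name: naive_bayes_score(x, tbl, pr)
          for name, (tbl, pr) in {"A": class_a, "B": class_b}.items()}
print(max(scores, key=scores.get))  # "A": the class with the highest a posteriori score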


Let’s talk about the optimum number of features in an image. An important factor to consider is that the optimum number of features depends on the amount of training a classifier receives. If the number of training set patterns is increased, more evidence is available to support the determination of a greater number of features, and hence to provide more accurate classification of test patterns.

Alright. Probability is nice and all but we shan’t go around all willy nilly – we need to calculate the cost of our mistakes! This is where the cost function comes into play. First, let’s take a look at a function known as the conditional risk function:

    \[ R(C_i|x) = \sum_j L(C_i|C_j)P(C_j|x) \]

This function expresses the expected cost of deciding on class Ci when x is observed. As it is wished to minimize this function, we decide on class Ci only if:

    \[ R(C_i|x) < R(C_j|x) \text{ for all } j \neq i \]

If we were to choose a particularly simple cost function, of the form:

    \[ L(C_i|C_j) = \left\{\begin{array}{lr}0, & \text{for } i=j\\1, & \text{for } i\neq j\end{array}\right. \]

then the risk is just the overall probability of misclassification, and minimizing it amounts to picking the class with the highest a posteriori probability.

Cost functions permit classifications to be biased in favor of a safe decision in a rigorous, predetermined, and controlled manner. One way of minimizing costs is for the classifier to recognize when it is “doubtful” about a particular classification, because two or more classes are almost equally likely. Then, one solution is to make a safe decision, the decision plane in feature space being biased away from its position for maximum probability classification. An alternative is to reject the pattern, i.e., place it into an “unknown” category; in that case, some other means can be employed for making the appropriate classification.
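
To make the risk computation concrete, here’s a tiny Python sketch (the loss matrix and posteriors are hypothetical numbers of my own choosing):

def conditional_risks(loss, posteriors):
    # R(C_i|x) = sum_j L(C_i|C_j) * P(C_j|x), for every candidate decision i
    return [sum(loss[i][j] * p for j, p in enumerate(posteriors))
            for i in range(len(loss))]

# two classes; loss[i][j] = cost of deciding i when the truth is j
loss = [[0.0, 1.0],
        [5.0, 0.0]]   # deciding class 1 when the truth is 0 is very costly
post = [0.3, 0.7]     # P(C_0|x), P(C_1|x)
print(conditional_risks(loss, post))  # [0.7, 1.5] -> choose class 0

Note how the costly misclassification drags the decision toward class 0 even though class 1 has the higher posterior – exactly the “safe decision” bias described above.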

Alright buckaroos! That is it for today! Of course one topic remains, and that’s SVMs: Support Vector Machines. But they, along with clustering, require a post of their own.

I’m kinda tired today… And I’m working on this massive project, which is to be revealed soon! Very, very, very soon! So buckle up, because when Chubak rides the bus, you can bet your butt it’s gonna be a bumpy ride!

Sherry is one of those beverages that only stuck-up yuppies drink. I’ve personally never tasted it, for obvious reasons, but I’d imagine it tastes like any other fermented grape drink – sharp and bitter. I think people drink these beverages mostly because they’re used to them! Aren’t I right? I think machine learning can be used in the alcoholic beverage industry… Like, how long should this cask sit in the cellar? I dunno.

I wanted to make another post based on Foundations of Machine Learning by MIT Press… But I decided to go the other way and be more succinct: make a general post about machine learning. In the last post I talked about chapter 14 of E. R. Davies’ book. This chapter is titled “Probabilistic Machine Learning”. I’m gonna make a post about this chapter, and just hope to Lord Almighty that I’ll learn something cool… Of course I’ll learn something cool! What am I saying? Let’s start talking about probabilistic machine learning, starting off with the EM algorithm – EM standing for Expectation Maximization.

Perhaps the main point about probabilistic optimization is that we are always in a situation where we have an absolute mathematical goal – to ensure that the solutions we are seeking are subject to ever-increasing probability. This is important because, when analyzing data involving a large component of randomness, we can never be sure whether any real improvement is being made. But if we can prove mathematically that the process of change can only increase the probability of correct interpretation, we have a crucially important tool at our fingertips!

This is all good and fine, but how exactly will we formulate probabilistic arguments in such a way as to achieve our aims? The answer to this question lies in the fact that by the 2010s many tools had been developed to let this happen, and at this very moment in time progress in this area is accelerating.

There are concrete methodologies; the most powerful of them is Bayes’ theory, the sine qua non in the area of applied statistics. There’s then Jensen’s inequality, and the Kullback-Leibler divergence formula, which gives a distance measure showing how different two probability distributions are. Then there is Newton’s method of approximation, which is fundamental, but which can be bettered in relevant cases by the Expectation Maximization algorithm. Among all this theory and methodology, we must not forget such basic probability ideas as the vertical bar notation, which allows probabilities to be re-expressed using the product rule, p(A, B) = p(A|B)p(B). These are all fine and dandy, but is there an algorithm that is based on the normal distribution, you may inquire? And I answer, yes, there is! And we’re talking about it.

Much of what we shall do in the probabilistic formalism is to make models of the input data. This is particularly true of the EM algorithm – the subject of our post – which is designed for generating accurate statistical models of data. But what types of models are to be used? The Gaussian distribution is key, because it accurately models the inaccuracies of measurement due to random noise.

Before proceeding to describe the Expectation Maximization algorithm, and its justification, it will be useful to look at the sort of problems that we will want to apply it to. In particular, suppose we have a 1-D distribution of data points which we wish to fit. Perhaps the most obvious way to model it is by using a set of individual Gaussian distributions, each of which will correspond to one of the peaks of the input distribution. Mathematically, we can model this as a mixture of Gaussians in which each Gaussian has its own mixture coefficient m. Furthermore, if we are to follow our probabilistic strategy, we will need to express both the input distribution and the result as probability distributions.

The first thing to do is to represent the Gaussian distribution as a probability distribution integrating to unity:

    \[ \mathcal{N}(x|\mu, \sigma) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right] \]


\mu and \sigma, respectively, are the mean and the standard deviation of the distribution. In addition, we follow standard usage in denoting the Gaussian by its alternate name, the normal distribution, using the symbol \mathcal{N} to represent it. The mixture distribution is then written as:

    \[ p(x) = \sum_{k = 1}^{K}m_k\mathcal{N}(x|\mu_k, \sigma_k) \]

If we take the integral of both sides we get:

    \[ \sum_{k = 1}^{K} m_k = 1 \]


So what does it all mean? It means that since each Gaussian integrates to 1, and p(x) must too, the mixture coefficients m_k sum up to 1.

If we introduce a latent indicator vector z = (z_1, …, z_K), in which exactly one z_k equals 1 and the rest are 0, and assume p(z_k = 1) = m_k, then according to Bayes’ theorem we have the marginal probability of x and the responsibility of each Gaussian component:

    \[ p(x) = \sum_{z}p(x|z)p(z) = \sum_{k = 1}^{K}m_k\mathcal{N}(x|\mu_k, \sigma_k) \]


    \[ \rho(z_k) = p(z_k = 1 | x) = \frac{p(x|z_k = 1)p(z_k = 1)}{\sum_{j = 1}^{K}p(x|z_j=1)p(z_j = 1)} = \frac{p(x|z)p(z)}{\sum_{z}p(x|z)p(z)} \]

Thus we’re done with the Expectation part of the EM algorithm. But don’t forget about the probability density function! When fitting data to a single Gaussian distribution, it is necessary to take the product of the PDFs of all the individual data points:

    \[ p(x_1, \ldots , x_N|\mu,\sigma) = \prod_{i = 1}^{N}p(x_i|\mu,\sigma) = \prod_{i = 1}^{N} \mathcal{N}(x_i|\mu, \sigma) \]

In the Maximization step of the EM algorithm, we take the mixture parameters to be fixed and solve for the Gaussian parameters \mu_k, \sigma_k. This step, along with the former step, is repeated as many times as necessary to proceed from an initial approximation to a final, much more accurate one.

We can generalize the EM algorithm to n dimensions like so:

    \[ \mathcal{N}(x | \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x-\mu)\right] \]
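
Here’s a bare-bones, unoptimized EM loop for a 1-D Gaussian mixture, just to make the E-step/M-step cycle concrete (my own sketch, not the book’s code; a real implementation would add convergence checks and better guards against degenerate components):

import math, random

def em_gmm_1d(xs, k, iters=100):
    mu = random.sample(xs, k)   # crude initialization from the data
    sigma = [1.0] * k
    m = [1.0 / k] * k           # mixture coefficients m_k
    for _ in range(iters):
        # E-step: responsibilities rho[i][j] = p(z_j = 1 | x_i)
        rho = []
        for x in xs:
            w = [m[j] * math.exp(-(x - mu[j]) ** 2 / (2 * sigma[j] ** 2))
                 / (sigma[j] * math.sqrt(2 * math.pi)) for j in range(k)]
            s = sum(w) or 1e-300
            rho.append([wj / s for wj in w])
        # M-step: re-estimate m_k, mu_k, sigma_k from the weighted data
        for j in range(k):
            nj = sum(r[j] for r in rho)
            m[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(rho, xs)) / nj
            sigma[j] = math.sqrt(sum(r[j] * (x - mu[j]) ** 2
                                     for r, x in zip(rho, xs)) / nj) or 1e-6
    return m, mu, sigma

data = [random.gauss(0, 1) for _ in range(300)] + [random.gauss(5, 0.5) for _ in range(300)]
print(em_gmm_1d(data, 2))  # coefficients near 0.5/0.5, means near 0 and 5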

Let’s put the EM algorithm to the test, shall we? Look at this figure:

 

Look at them contours! That’s what I want in a woman!

The mean positions of these distributions are (1, 1.5), (2, 5), (-2, 5) and their covariance matrices are:

    \[ \left( \begin{array}{cc}2 & 0 \\0 & 0.4 \\\end{array} \right)\]

    \[ \left( \begin{array}{cc}0.5 & 0 \\0 & 1.5 \\\end{array} \right)\]

    \[ \left( \begin{array}{cc}1 & -0.5 \\-0.5 & 1 \\\end{array} \right)\]

The 200 points randomly extracted from each of these Gaussians overlap in nontrivial ways, thereby providing a reasonably complex task for the EM algorithm. Next, we move on to a more immediately useful situation: we segment the samples into a number of subareas. It takes a lot of iterations to get a low error rate, but a powerful computer can do it – depending on the data structure that’s holding the Gaussians and many other factors. In fact, training an EM algorithm takes a long time – to test all the contours of the classifiers… And overall, the separability of the data plays a large role which shan’t be ignored.

Threshold… Sounds like a Thrash Metal band!

Above you see the effect of the EM algorithm in multilevel thresholding. The intensity histogram of image A is shown as a green trace in image C. The EM algorithm is used to obtain a GMM as shown in red by the six Gaussians in C. All pixels contributing to the green trace between adjacent Gaussian crossings are assigned to the mean intensity of the intervening Gaussian and reinserted into the image, as in B. The fit to the cloud intensities is naturally relatively poor, but the other intensities are reasonably matched.

That is it for this short blog post! I hope you’ve enjoyed it. I care neither about quality, nor quantity: I just wanna learn. Perhaps 50% of the post is mine, and the rest is credited to the author, so keep that in mind.

Whilst you’re sipping your sherries, I will be smoking Kent Reds and reading chapter 13 of Davies’ book: classification algorithms. I will make a post about it, I promise!

Enjoy life, as if it’s the last day of it – Drinking this much sherry, it might as well be!

Oh boy, I’m on fire yo! Two posts in one day… To be fair, I wrote the last post yesterday. Anyhow, I don’t think anyone cares. But you reap what you sow. By the time I’m finished with this post my knowledge will be richer, and more robust. So what am I complaining about? As I said before I write to learn. So be it!

But first, let’s talk about one of my heroes, E. Roy Davies. He’s one of the forerunners of computer and machine vision, a subset of pattern recognition – one of the three things my profession is about. This post is based on a few chapters of his opus, Computer Vision: Principles, Algorithms, Applications, Learning.

Handsome Devil Behind It All

Alright. Enough sucking up to the masters of the field. Let’s talk about Artificial Neural Networks.

ANNs were launched off in the 1950s, work continued well into the 60s, and research into them still continues to this day. Bledsoe and Browning developed the n-tuple type of classifier that involved bitwise recording and lookup of binary feature data, leading to the weightless or logical type of ANN. Rosenblatt’s perceptron, however, was more important than this algorithm. Let’s talk about this “perceptron”.

The simple perceptron is a linear classifier that classifies patterns into two classes. It takes a feature vector, x = (x_1, x_2, \ldots, x_N), as its input and produces a single scalar output \sum_{i = 1}^{N} w_i x_i, the classification process being completed by applying a threshold function at \theta. The mathematics is simplified by writing -\theta as w_0, and taking it to correspond to an input x_0 which is maintained at a constant value of unity. The output of the linear part of the classifier is then written in the form:

d = \sum_{i = 1}^{N} w_i x_i - \theta = \sum_{i = 1}^{N} w_i x_i + w_0 = \sum_{i = 0}^{N} w_i x_i

and the final output of the classifier is given by:

y = f(d) = f \left( \sum_{i = 0}^{N} w_i x_i \right)

This type of neuron – which, as we said before, is called a perceptron – can be trained using a variety of procedures, such as the fixed increment rule. The basic concept of this algorithm is to try to improve the overall error rate by moving the linear discriminant plane a fixed distance toward a position where no misclassification would occur – but only doing this when a classification error has occurred:

w_i(k + 1) = w_i(k) \quad \text{if} \quad y(k) = \omega(k)

w_i(k + 1) = w_i(k) + \eta[\omega(k) - y(k)]x_i(k) \quad \text{if} \quad y(k) \neq \omega(k)

Let me interpret all this mumbo jumbo and what it basically means. First, take a look at this image, courtesy of Towards Data Science:


So you can clearly see what happens. A bunch of inputs are given; each is multiplied by its weight, and the weighted inputs are summed along with the bias w_0. If this sum is less than a threshold value, it’s one binary choice; if it’s greater, it’s the other binary choice. This thresholding is a Heaviside function.
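
Here’s a minimal Python sketch of that forward pass plus the fixed increment rule (my own toy example on AND-gate data, not from the book):

def heaviside(d):
    return 1 if d >= 0 else 0

def predict(w, x):
    # x is augmented with a constant x0 = 1, so w[0] plays the role of -theta
    return heaviside(sum(wi * xi for wi, xi in zip(w, x)))

def train_fixed_increment(samples, n_features, eta=1.0, epochs=100):
    # adjust weights only on misclassification: w_i += eta * (omega - y) * x_i
    w = [0.0] * (n_features + 1)
    for _ in range(epochs):
        for x, target in samples:
            x = (1,) + tuple(x)
            y = predict(w, x)
            if y != target:
                w = [wi + eta * (target - y) * xi for wi, xi in zip(w, x)]
    return w

samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # AND gate: linearly separable
w = train_fixed_increment(samples, 2)
print([predict(w, (1,) + x) for x, _ in samples])  # [0, 0, 0, 1]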

The photo below shows the difference between separable and non-separable data. The hyperplane could do so much more if the data were separable!


So far, our perceptrons have been single-layered, meaning that only a single layer of weights – a single separating hyperplane – is available. The other concept is the multilayer perceptron, or MLP. Rosenblatt himself suggested such networks, but was unable to work out his magic and train them. Ten years later, in 1969, Minsky and Papert published their famous monograph in which they discussed MLPs. It wasn’t until 1986 that Rumelhart et al. were successful in proposing a systematic approach to the training of MLPs. Their solution is known as the back-propagation algorithm.

Let’s talk about it.

The problem of training an MLP can be simply stated: a general layer of an MLP obtains its feature data from the lower layers and receives its class data from higher layers. Hence, if all the weights in the MLP are potentially changeable, the information reaching a particular layer cannot be relied upon. There is no reason why training a layer in isolation would lead to overall convergence of the MLP toward an ideal classifier. Although it might be thought that this is a rather minor difficulty, in fact, this is not so; indeed, this is but one example of the so-called credit assignment problem. What is this problem? Well, it’s correctly determining the local origins of global properties and making the right assignment of rewards, punishments, corrections, and so on.

The key to solving these problems was to modify the perceptrons composing the MLP by giving them a less hard activation function than the Heaviside thresholding function (the one with threshold \theta, whose negation we call w_0): we started changing the Heaviside function to a sigmoid shape, such as the tanh function.

Once these softer activation functions were used, it became possible for each layer of the MLP to feel the data more precisely, and thus training procedures could be set up on a systematic basis. In particular, the rate of change of the data at each individual neuron could be communicated to other layers that could then be trained appropriately – though only on an incremental basis. I’m not going to bore you with the mathematical details, just some points regarding the algorithm:

1) The outputs of one node are the inputs of the next, and an arbitrary choice is made to label all variables as output (y) parameters rather than input (x) variables; all output parameters are in the range 0 to 1 (because of the sigmoid, duh!)

2) The class parameter \omega has been generalized as the target value t of the output variable y.

3) For all except the final outputs, the quantity \delta_j has to be calculated using the formula \delta_j = y_j (1 - y_j)\sum_{m}\delta_m w_{jm}, the summation having to be taken over all the nodes in the layer above node j (see the sketch after this list).

4) The sequence for computing the node weights involves starting with the output nodes and then proceeding downwards one layer at a time.
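
A small Python sketch of points 2 through 4 (the exact update conventions vary between textbooks; this is one common form, with made-up numbers in the example call):

# hidden-layer delta, per point 3: delta_j = y_j * (1 - y_j) * sum_m(delta_m * w_jm)
def hidden_delta(y_j, deltas_above, w_j_to_above):
    return y_j * (1 - y_j) * sum(d * w for d, w in zip(deltas_above, w_j_to_above))

# output-layer delta, with t the target value from point 2
def output_delta(y, t):
    return y * (1 - y) * (t - y)

# weight update on the connection from node i (output y_i) into node j
def updated_weight(w_ij, eta, delta_j, y_i):
    return w_ij + eta * delta_j * y_i

print(hidden_delta(0.6, [0.2, 0.1], [0.5, -0.3]))  # 0.24 * 0.07 = 0.0168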

In the figure below you can see the difference between Heaviside activator, linear activator, and Sigmoid activator.

I have prepared a short video for people who are visual learners. Let’s hope it helps them:

 

Ok. That’s it! I’m not sure what the next post is going to be about. But a little birdie tells me it’s going to be about Deep Learning! So if you really, really read my blog, look out for it!

What am I going to do? Well, first I’m going to a casino and use my X-Ray Auto Blackjack Aviator Specs to win some hands. Then I’m gonna read chapter 14 of Davies’ book. It’s a refresher on basic machine learning concepts. I hope I don’t fall asleep; I’ve been awake for 14 hours!

Hiyaaaa! It’s been almost a week. I know if I’m going to post my mundane mendacity, which all bring me much satiety, in form of a blog post in an arbitrary time pattern, it’ll take a machine learning algorithm to predict whenever I’m going to make one next, but wait! Let’s drop every mundane hackneyed thing I was gonna talk about and stick with machine learning!

I fell in love with machine learning a few years back, and I just recently bought one of the most intricate books on this subject, MIT Press’s Foundations of Machine Learning, written by two Persians and one Indian. Well, whilst they are waiting at the TSA checkpoints, let us not fret, and make a series of posts on the damn book! It would be my privilege, nay, my honor! First episode: the PAC framework. I don’t mean to take credit for what I’ve not written myself, so you should know that almost everything is taken from this book. If you enjoy this post, please purchase this book. Of course it’s not the entire book, because that would be illegal; it’s just tidbits from it. I did so because I can’t learn without a purpose – I can’t do passive learning, learning’s gotta be active – meaning I shan’t be content with learning something without doing something along with it, so I make blog posts about it! But I have made a visualization of PAC, in Cinema 4D, which you can see when we get to it! Well, at least I’m not a content pirate! (Yeah, Chubak, keep telling yourself that…)

Let the explanations commence!

When we are designing a machine learning algorithm, several fundamental questions shall occupy a healthy man’s mind: what can be learned efficiently, what is inherently hard or easy to learn, how many examples are needed, whether there’s a general model for learning, and so on and so forth. In this episode, we begin by formalizing an answer to these questions, in the form of the PAC framework.

PAC stands for Probably Approximately Correct. The PAC framework helps define the class of learnable concepts in terms of the number of sample points needed to achieve an approximate solution, meaning that it introduces sample complexity and the time and space complexity of the learning algorithm, all in one package. Imagine the normal attributes of a machine learning algorithm as some sort of Chinese food, all in separate packages, and this so-called PAC as a nice turkey sandwich you can take to work.

Let’s first introduce several definitions and the notation needed to represent the PAC model, which we’ll also use in later posts.

We denote the set of all possible examples or instances as 𝒳, which is also sometimes referred to as the input space. The set of all possible labels or target values is denoted by ყ. In this episode ყ is limited to 0 and 1, which corresponds to the so-called binary classification.

𝒳 maps to ყ: x \rightarrow y. Let’s call this mapping a concept, or Շ. You can see how we denote it in equation 1-1. Since ყ = {0, 1}, we can identify Շ with the subset of 𝒳 over which Շ equals 1. A concept may be a triangle, or a rectangle, or a hexagon. You can see it more clearly in figure 1-1, but we’re not going to talk about figure 1-1 yet – I just said this to give you an idea.

Let’s assume that examples are i.i.d.: independently and identically distributed according to some fixed but unknown distribution, 𝔇. The learning problem is then formulated as follows: the learner considers a fixed set of possible concepts, 𝓗, called a hypothesis set, which may or may not coincide with the concept class, Շ. It receives a sample S = (x1, …, xm) drawn i.i.d. according to 𝔇, as well as the labels (c(x1), …, c(xm)), which are based on a specific target concept c that is a member of Շ. The task is then to use the labeled sample S to select a hypothesis hS that is a member of 𝓗 and has a small generalization error with respect to the concept c. The generalization error of a hypothesis h that is a member of 𝓗 – also referred to as the risk, or true error, or simply the error of h – is denoted by R(h) and summed up in the equation below.

R(h) = \underset{x \sim D}{P}[h(x) \neq c(x)] = \underset{x \sim D}{E}[1_{h(x) \neq c(x)}]

Let’s have a headcount. h(x) is the hypothesis applied to the input, c(x) is the target concept applied to the input, 𝔇 is the distribution, and 1_w is the indicator function of the event w. Indicator functions are 1 when the event occurs, and 0 otherwise. P is probability, and E is the expected value. In the upcoming visualization, the marbles which fall into the container are 1 and the marbles which fall out are 0.

The generalization error of a hypothesis is not directly accessible to the learner since both distribution 𝔇 and the target concept Շ are unknown. However, the learner can measure the empirical error of a hypothesis on the labeled sample S. You can see this empirical risk, for sample S = (x1, … , xm), formalized in equation:

\hat{R}_S(h) = \frac{1}{m} \sum_{i = 1}^{m}1_{h(x_i) \neq c(x_i)}.

Thus, the empirical error of h ∈ 𝓗 is its average error over the sample S, while the generalization error is its expected error based on the distribution 𝔇. We can already note that for a fixed h ∈ 𝓗, the expectation of the empirical error based on an i.i.d sample S is equal to the generalization error, as you can see it notarized in equation below. Remember that m is the maximum index of the sample – hence, size of the sample.

\underset{S \sim D^m}{E} [\hat{R}_S(h) ] = R(h)

But what is this distribution? The distribution for a discretely-valued space like this is basically the list of probabilities for each outcome in 𝒳.

So let’s introduce PAC, or in other words, Probably Approximately Correct. Let n be a number such that the computational cost of representing any element x ∈ 𝒳 is at most O(n), and denote by size(c) the maximal cost of the computational representation of c ∈ Շ. For example, 𝒳 may be a vector in Rn, for which the cost of any array-based representation would be in O(n). In addition, let hS denote the hypothesis returned by algorithm 𝒜 after receiving a labeled sample S. To keep notation simple, the dependency of hS on 𝒜 is not explicitly indicated.

So, a concept class Շ is said to be PAC-learnable if there exists an algorithm 𝒜 and a polynomial function poly(·,·,·,·) such that for any ε > 0 and 𝛿 > 0, for all distributions 𝔇 on 𝒳 and for any target concept c ∈ Շ, the following holds true for any sample size m ≥ poly(1/ε, 1/𝛿, n, size(c)):

\underset{S \sim D^m} {P} [R(h_S) \leq \epsilon] \geq 1 - \delta

When such an algorithm 𝒜 exists, it’s said that Շ has a PAC-learning algorithm.

A concept class Շ is thus PAC-learnable if the hypothesis returned by the algorithm 𝒜 after observing a number of points polynomial in 1/ε and 1/𝛿 is approximately correct – meaning the error is at most ε – with high probability (at least 1 – 𝛿), which justifies the PAC terminology. The parameter 𝛿 > 0 is used to define the confidence 1 – 𝛿, and ε > 0 the accuracy 1 – ε; so the error is at most ε with probability at least 1 – 𝛿. Note that if the running time of the algorithm is polynomial in 1/ε and 1/𝛿, then the sample size m must also be polynomial if the full sample is received by the algorithm.

As you might have noticed, there are two things in PAC definition which correspond with its name. The probability, and the error rate. The first one is for the P in PAC, the second one is for the AC in PAC.

Several key points of the PAC definition are worth emphasizing. First, the PAC framework is a distribution-free model: no particular assumption is made about the distribution 𝔇 from which examples are drawn. Second, the training sample and the test examples used to define the error are drawn according to the same distribution 𝔇. This is a natural and necessary assumption for generalization to be possible in general. It can be relaxed to include favorable domain adaptation problems. Finally, the PAC framework deals with the question of learnability for a concept class Շ and not a particular concept. Note that the concept class Շ is known to the algorithm 𝒜, but of course the target concept c ∈ Շ is unknown.

Now let’s take a look at this visualization I have cooked up in Cinema 4D:


Figure 1-1

Let me explain how it works. The marbles are the samples. We strain the samples, and the ones that land inside the shared container between 𝓗 and Շ are the expectation of PAC, the correct part. The ones that fall out are the error, the probably part. And the ones that fall in the containers that aren’t shared are the approximate part.

Sometimes, the hypothesis hS returned by the algorithm is always consistent, that is, it admits no error on the training sample S. In this part of the blog post, we present a general sample complexity bound, or equivalently, a generalization bound, for consistent hypotheses, in the case where the cardinality |𝓗| of the hypothesis set is finite. Since we consider consistent hypotheses, we will assume that the target concept c is in 𝓗.

So, let 𝓗 be a finite set of functions mapping from 𝒳 to ყ. Let 𝒜 be an algorithm that for any target concept c ∈ 𝓗 and i.i.d. sample S returns a consistent hypothesis hS, i.e., R̂S(hS) = 0. Then for any ε, 𝛿 > 0, the inequality:

\underset{S \sim D^m} {P} [R(h_S) \leq \epsilon] \geq 1 - \delta

holds if:

m \geq \frac{1}{\epsilon} \left(\log|\mathcal{H}| + \log \frac{1}{\delta}\right)

The price to pay for coming up with a consistent algorithm is the use of a larger hypothesis set 𝓗 containing the target concepts. Of course, the upper bound increases with |𝓗|. However, that dependency is only logarithmic. Note that the term log|𝓗|, or the related term log2|𝓗| from which it differs by a constant factor, can be interpreted as the number of bits needed to represent 𝓗. Thus the generalization guarantee of the theorem is controlled by the ratio of this number of bits, log2|𝓗|, and the sample size m.
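
Plugging numbers into that bound is a one-liner; here’s a Python sketch (the |𝓗|, ε, and 𝛿 values are hypothetical):

import math

def pac_sample_size(h_cardinality, epsilon, delta):
    # smallest integer m with m >= (1/eps) * (ln|H| + ln(1/delta))
    return math.ceil((math.log(h_cardinality) + math.log(1 / delta)) / epsilon)

print(pac_sample_size(2 ** 20, 0.05, 0.01))  # |H| = 2^20, 5% error, 99% confidence -> 370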

Along with consistent hypotheses, we also have inconsistent hypotheses. Some argue that they aren’t useful, but dare I say that they may be? I’m not sure.

Let’s talk about the Bayes error. Given a distribution 𝔇 over 𝒳 × ყ, the Bayes error R* is defined as:

R^\star = \underset{h \text{ measurable}}{\inf} R(h)

Finally, let’s talk about noise. The noise at a point x is the minimum of the conditional probabilities of the labels of ყ given x, i.e., min{Pr[y = 1|x], Pr[y = 0|x]}.

Well, that’s it for today! My next blog post is going to be about something completely unrelated to machine learning, yes, another signal processing post! Keep in mind that I write to learn, but if I appease you, so be it!

If you have any questions, ask, and I will try my best to answer. If you see any errors in this post, tell me and I’ll add it to the addendum just below this very line. Thank you for reading my blog post!

The Podcast. Play it whilst you read the blogpost!

IN: 01

Subject: 2-Dimensional Fourier Transform

Written by Chubak Bidpaa

The source for this episode is Gonzalez and Woods’ Chapter 04 Page 240 onward. We’ll also use a few other sources, which will be highlighted in the blog post.

Have you watched the movie Tron? Some of you may have, some of you may have not. Well, let’s not fret over it. The subject of this movie is that a ragtag group of individuals are transported into a video game world, to fight a battle. The keyword here, at least for us, is transform. A Fourier transform is something akin to this: signals are transformed from their cozy spatial domain into a cryptic frequency domain. By way of this transformation, we can apply operations that are impossible to think of in the spatial domain.

 

Fun Fact: Tron was one of the first movies to make extensive use of computer-generated imagery.

But pray tell, what is this Fourier transform you speak of? Well, it’s basically a complex function – don’t forget, complex – in which we use Euler’s formula to generate a sinusoidal periodic value for our analogue function. But that’s just it – analogue. Analogue signals are continuous, meaning they are infinite; however, the digital signals we deal with in computers every day are discrete. So how do we compensate for this difference?

It all boils down to the difference between integration and summation. Integrals are basically the area under the curve of a function, so they are infinite and continuous; summations, however, are finite and discrete. So by replacing the integral symbol with the sigma symbol, we can have ourselves a nice and dandy equation for calculating the discrete Fourier transform in 2-D. Refer to the blog post to see this equation, equation 1-1.

Equation 1-1:

F(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)e^{-j2\pi(ux/M+vy/N)}

To revert this transformation back into the spatial domain, we’ll use the Inverse Discrete Fourier Transform. Refer to equation 1-2 to get a grasp of what the IDFT holds in the sack. As you may notice, the inverse DFT averages over the transform using Euler’s formula with the opposite sign.

Equation 1-2:

f(x, y) = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F(u, v)e^{j2\pi(ux/M+vy/N)}
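
To make equation 1-1 concrete, here’s a naive Python/NumPy sketch that evaluates the double sum directly and checks it against the library FFT (illustration only – it’s O(M²N²) and far too slow for real images):

import numpy as np

def dft2(f):
    # direct evaluation of equation 1-1
    M, N = f.shape
    F = np.zeros((M, N), dtype=complex)
    for u in range(M):
        for v in range(N):
            for x in range(M):
                for y in range(N):
                    F[u, v] += f[x, y] * np.exp(-2j * np.pi * (u * x / M + v * y / N))
    return F

f = np.random.rand(8, 8)
print(np.allclose(dft2(f), np.fft.fft2(f)))  # True: matches the library FFT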

But what is this Euler’s formula you speak of? Well, refer to equation 1-3 to see the complex interpretation of Euler’s formula. We talk about complex numbers as if knowing about them were a rite of passage. No! It would, however, be my privilege to talk about complex numbers, for those who are otherwise unaware!

Equation 1-3:

e^{j\theta} = \cos{\theta} + j\sin{\theta}

Well, let me start off by saying that there are basically two sorts of numbers: real, and complex. Complex numbers have two sections, a real section and an imaginary section. Refer to equation 1-4 to find out what an “imaginary” number is.

Equation 1-4:

a (\text{real part}) + bi (\text{imaginary part})

i = \sqrt{-1}

Let’s ask ourselves this question: we have a frequency interval, and then we have a spatial interval. How are these two related? Well, easy! You can see their relation in formulas 1-5 and 1-6: the separations between frequency samples, \Delta u and \Delta v, are inversely proportional to the number of samples times the spatial separation.

Equations 1-5 and 1-6:

\Delta u = \frac{1}{M \Delta T}

\Delta v = \frac{1}{N \Delta Z}

Another property of the DFT is the fact that you can rotate and translate the domain as easily as one, two, three. Another property is its periodicity, meaning that both the DFT and IDFT are infinitely periodic in both the u and v directions, as you can see in figure 1-1 in the blog post.


Figure 1-1

In equations 1-7 through 1-9, you can see the polar form of the transform, the Fourier spectrum, and the phase angle. The magnitude |F(u, v)| is what we in signal processing simply call the Fourier spectrum. What’s the use? Visualization, my fam! And if you remember Winamp, an audio player from the early Aughts in which we used to play music files we downloaded off Limewire, you’d remember the power spectrum. The power spectrum is the Fourier spectrum squared. But what is it, really? Let’s not fret; it’s basically the Fourier transform in polar coordinates. In equation 1-9 you see the equation for the phase angle. And in figure 1-2 you can see it in action.

Equations 1-7, 1-8, and 1-9:

F(u, v) = R(u, v) + jI(u, v) = |F(u, v)|e^{j\phi (u, v)}

|F(u, v)| = [R^2(u, v) + I^2(u, v)]^{1/2}

\phi(u, v) = \arctan\left[\frac{I(u, v)}{R(u, v)}\right]


Figure 1-2-1
Figure 1-2-2

Let’s talk about convolution. We have two very different sorts of convolution in signal processing: convolution in the spatial domain, and convolution in the frequency domain. We’re not concerned with the spatial domain right now, so let’s talk about the frequency domain. Equation 1-10 defines the circular convolution of two signals in the spatial domain. Equation 1-11 then says that this circular convolution in the spatial domain corresponds to plain multiplication of the two Fourier transforms in the frequency domain. And take a look at equation 1-12 – looks familiar? Yes, it’s the dual: multiplication of two signals in the spatial domain corresponds to a circular convolution of their transforms, scaled by 1/MN.

Equation 1-10, 1-11, and 1-12:

(f \star h) (x, y) = \sum_{m=0}^{M - 1} \sum_{n=0}^{N-1} f(m, n)h(x - m, y - n)

(f \star h) (x, y) \Leftrightarrow (F \bullet H) (u, v)

(f \bullet h) (x, y) \Leftrightarrow \frac{1}{MN}(F \star H)(u, v)
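
Here’s a small NumPy sketch verifying equations 1-10 and 1-11: the circular convolution computed by the double sum matches multiplying the DFTs and inverting:

import numpy as np

M = N = 16
f = np.random.rand(M, N)
h = np.random.rand(M, N)

# circular convolution evaluated directly from equation 1-10 (indices wrap around)
conv = np.zeros((M, N))
for x in range(M):
    for y in range(N):
        for m in range(M):
            for n in range(N):
                conv[x, y] += f[m, n] * h[(x - m) % M, (y - n) % N]

# the same thing via equation 1-11: multiply the DFTs, then invert
conv_fft = np.real(np.fft.ifft2(np.fft.fft2(f) * np.fft.fft2(h)))
print(np.allclose(conv, conv_fft))  # True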

Correlation is another one of the DFT’s properties. We denote correlation with a hollow star. Correlation is similar to convolution, except correlation is used to compare two signals.

What’s important about the DFT is that it’s separable. Pray tell, mister man, what does it mean? It simply means that you can perform the 2-D transform as 1-D transforms on the rows and then on the columns, and mix and match them together.

But how do you apply the DFT? The formula shows that it sums up a complex operation over all rows and all columns. So how does it get to be so fast, insomuch as it has been in practice for ages, and I can apply a DFT to a large image or an audio file? The answer lies in the Fast Fourier Transform, or FFT. We’ll discuss FFT soon.

I hope you’ve enjoyed the first episode of my podcast plus blog post. I’ve spent a lot of time on it, and remember that I make these podcasts episodes plus blog posts to learn, more than to teach. So if there are inaccuracies, please forgive me. If you have any corrections to this podcast episode or blog post, just tell me, and I’ll add it to the addendum.

Let’s hope Partly Shaderly podcast slash blog takes off!


As I’m writing this, the following video is being rendered. Framator is a very simple plugin, born out of a lack of ideas, and it creates four rectangles around your screen. Now, you can go ahead and use them with the default frame settings, or play around with them and create some stunning MoGraph.

If you came here from the video, you’ll probably know that I, Chubak Bidpaa, have created this plugin. The parameters are self explanatory, but I’ll just give them a quick review just in case:

The parameter groups and their uses:

Height and Width: the height and the width of the rectangles
Offsets: the position of the rectangles
Colors: the colors of the gradient
Angle: the angle of the gradient
Stops: the stops of the gradient

If you want an effect similar to mine, just use time in the angle’s expression box.

No more loitering. Here’s the download link.

Donation

I’m not sure if anyone wants to donate via Bitcoin, but in case you’ve enjoyed my plugin, and wish to give something back in return, use this Wallet to donate:

bitcoincash:qrt4fx3npgef7yczj6cxzfdfrtqztzekjqdf52rtgj

Thank you. Thank you all. Chubak Out.

This is not something that you’d find in a journal, or a book: this is something I myself have cooked up. But what do I mean by “Parametric Approach to OpenGL”, and by “Using After Effects”?

 

Browse below the latest posts to find a download link

The above video is an After Effects plugin by yours truly. Everything is based upon one fragment shader (which I got from a pastebin, but it looks like it’s from this repo), which is as follows:

 

#version 330
uniform sampler2D videoTexture;
uniform float sliderVal;
uniform float multiplier16bit;
in vec4 out_pos;
in vec2 out_uvs;
out vec4 colourOut;

#define PI 3.14159265358979323846


float random (in vec2 _st) {
    return fract(sin(dot(_st.xy,
                         vec2(12.9898,78.233)))*
        43758.5453123);
}

vec2 truchetPattern(in vec2 _st, in float _index){
    _index = fract(((_index-0.5)*2.0));
    if (_index > 0.75) {
        _st = vec2(1.0) - _st;
    } else if (_index > 0.5) {
        _st = vec2(1.0-_st.x,_st.y);
    } else if (_index > 0.25) {
        _st = 1.0-vec2(1.0-_st.x,_st.y);
    }
    return _st;
}

void main() {
    vec2 st = gl_FragCoord.xy/sliderVal;
    st /= 10.0;

    vec2 ipos = floor(st);  // integer part: which tile we are in
    vec2 fpos = fract(st);  // fractional part: where we are inside the tile

    vec2 tile = truchetPattern(fpos, random( ipos ));

    float color = step(tile.x, tile.y);

    // #version 330 has no gl_FragColor; write to the declared output
    colourOut = vec4(vec3(color), 1.0);
}

But you may say “so what, what’s so parametric about it?” Before doing anything or saying anything, let’s scrutinize the basis of an After Effects OpenGL plugin.

All AE OpenGL plugins are based on one example in the SDK: GLator. This example was removed for a long time, maybe 4-5 years, and returned in 2017. Without it, image manipulation must be done using Adobe’s own mangled suites, something I wouldn’t wish on my worst enemies. The After Effects SDK is based on a few command selectors:

 

switch (cmd) {
			case PF_Cmd_ABOUT:
				err = About(in_data,
							out_data,
							params,
							output);
				break;
				
			case PF_Cmd_GLOBAL_SETUP:
				err = GlobalSetup(	in_data,
									out_data,
									params,
									output);
				break;
				
			case PF_Cmd_PARAMS_SETUP:
				err = ParamsSetup(	in_data,
									out_data,
									params,
									output);
				break;
				
			case PF_Cmd_GLOBAL_SETDOWN:
				err = GlobalSetdown(	in_data,
										out_data,
										params,
										output);
				break;

			case  PF_Cmd_SMART_PRE_RENDER:
				err = PreRender(in_data, out_data, reinterpret_cast<PF_PreRenderExtra*>(extra));
				break;

			case  PF_Cmd_SMART_RENDER:
				err = SmartRender(in_data, out_data, reinterpret_cast<PF_SmartRenderExtra*>(extra));
				break;
		}

These selectors are engulfed in a try...catch statement. You may have noticed the function calls below each command selector. Those are the bread and mead of an After Effects plugin. Our concern is with PreRender() and SmartRender(). PreRender() is not that important:

static PF_Err
PreRender(
	PF_InData				*in_data,
	PF_OutData				*out_data,
	PF_PreRenderExtra		*extra)
{
	PF_Err	err = PF_Err_NONE,
			err2 = PF_Err_NONE;

	PF_ParamDef slider_param;

	PF_RenderRequest req = extra->input->output_request;
	PF_CheckoutResult in_result;

	AEFX_CLR_STRUCT(slider_param);

	ERR(PF_CHECKOUT_PARAM(in_data,
		GLATOR_SLIDER,
		in_data->current_time,
		in_data->time_step,
		in_data->time_scale,
		&slider_param));

	ERR(extra->cb->checkout_layer(in_data->effect_ref,
		GLATOR_INPUT,
		GLATOR_INPUT,
		&req,
		in_data->current_time,
		in_data->time_step,
		in_data->time_scale,
		&in_result));

	if (!err){
		UnionLRect(&in_result.result_rect, &extra->output->result_rect);
		UnionLRect(&in_result.max_result_rect, &extra->output->max_result_rect);
	}
	ERR2(PF_CHECKIN_PARAM(in_data, &slider_param));
	return err;
}

You can see a call to PF_CHECKOUT_PARAM in the code. “Obtains parameter values, or the source video layer, at a specified time. After Effects makes caching decisions based on the checkout state of parameters.”1 This PARAM has been defined beforehand in ParamsSetup():

static PF_Err 
ParamsSetup (	
	PF_InData		*in_data,
	PF_OutData		*out_data,
	PF_ParamDef		*params[],
	PF_LayerDef		*output )
{
	PF_Err		err		= PF_Err_NONE;
	PF_ParamDef	def;	

	AEFX_CLR_STRUCT(def);

	PF_ADD_SLIDER(	STR(StrID_Name), 
					GLATOR_SLIDER_MIN, 
					GLATOR_SLIDER_MAX, 
					GLATOR_SLIDER_MIN, 
					GLATOR_SLIDER_MAX, 
					GLATOR_SLIDER_DFLT,
					SLIDER_DISK_ID);

	out_data->num_params = GLATOR_NUM_PARAMS;

	return err;
}

You can add as many PARAMs to your program as you wish and, later, pass them to your shaders as uniforms. But how? Well, for that, we must see what goes on in SmartRender():

static PF_Err
SmartRender(
	PF_InData				*in_data,
	PF_OutData				*out_data,
	PF_SmartRenderExtra		*extra)
{
	PF_Err				err = PF_Err_NONE,
						err2 = PF_Err_NONE;

	PF_EffectWorld		*input_worldP = NULL,
						*output_worldP = NULL;
	PF_WorldSuite2		*wsP = NULL;
	PF_PixelFormat		format = PF_PixelFormat_INVALID;
	PF_FpLong			sliderVal = 0;

	AEGP_SuiteHandler suites(in_data->pica_basicP);

	PF_ParamDef slider_param;
	AEFX_CLR_STRUCT(slider_param);

	ERR(PF_CHECKOUT_PARAM(in_data,
		GLATOR_SLIDER,
		in_data->current_time,
		in_data->time_step,
		in_data->time_scale,
		&slider_param));

	if (!err){
		sliderVal = slider_param.u.fd.value / 100.0f;
	}

	ERR((extra->cb->checkout_layer_pixels(in_data->effect_ref, GLATOR_INPUT, &input_worldP)));

	ERR(extra->cb->checkout_output(in_data->effect_ref, &output_worldP));

	ERR(AEFX_AcquireSuite(in_data,
		out_data,
		kPFWorldSuite,
		kPFWorldSuiteVersion2,
		"Couldn't load suite.",
		(void**)&wsP));

	if (!err){
		try
		{
			// always restore back AE's own OGL context
			SaveRestoreOGLContext oSavedContext;

			// our render specific context (one per thread)
			AESDK_OpenGL::AESDK_OpenGL_EffectRenderDataPtr renderContext = GetCurrentRenderContext();

			if (!renderContext->mInitialized) {
				//Now comes the OpenGL part - OS specific loading to start with
				AESDK_OpenGL_Startup(*renderContext.get(), S_GLator_EffectCommonData.get());

				renderContext->mInitialized = true;
			}

			renderContext->SetPluginContext();
			
			// - Gremedy OpenGL debugger
			// - Example of using a OpenGL extension
			bool hasGremedy = renderContext->mExtensions.find(gl::GLextension::GL_GREMEDY_frame_terminator) != renderContext->mExtensions.end();

			A_long				widthL = input_worldP->width;
			A_long				heightL = input_worldP->height;

			//loading OpenGL resources
			AESDK_OpenGL_InitResources(*renderContext.get(), widthL, heightL, S_ResourcePath);

			CHECK(wsP->PF_GetPixelFormat(input_worldP, &format));

			// upload the input world to a texture
			size_t pixSize;
			gl::GLenum glFmt;
			float multiplier16bit;
			gl::GLuint inputFrameTexture = UploadTexture(suites, format, input_worldP, output_worldP, in_data, pixSize, glFmt, multiplier16bit);
			
			// Set up the frame-buffer object just like a window.
			AESDK_OpenGL_MakeReadyToRender(*renderContext.get(), renderContext->mOutputFrameTexture);
			ReportIfErrorFramebuffer(in_data, out_data);

			glViewport(0, 0, widthL, heightL);
			glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
			glClear(GL_COLOR_BUFFER_BIT);
			
			// - simply blend the texture inside the frame buffer
			// - TODO: hack your own shader there
			RenderGL(renderContext, widthL, heightL, inputFrameTexture, sliderVal, multiplier16bit);

			// - we toggle PBO textures (we use the PBO we just created as an input)
			AESDK_OpenGL_MakeReadyToRender(*renderContext.get(), inputFrameTexture);
			ReportIfErrorFramebuffer(in_data, out_data);

			glClear(GL_COLOR_BUFFER_BIT);

			// swizzle using the previous output
			SwizzleGL(renderContext, widthL, heightL, renderContext->mOutputFrameTexture, multiplier16bit);

			if (hasGremedy) {
				gl::glFrameTerminatorGREMEDY();
			}

			// - get back to CPU the result, and inside the output world
			DownloadTexture(renderContext, suites, input_worldP, output_worldP, in_data,
				format, pixSize, glFmt);

			glBindFramebuffer(GL_FRAMEBUFFER, 0);
			glBindTexture(GL_TEXTURE_2D, 0);
			glDeleteTextures(1, &inputFrameTexture);
		}
		catch (PF_Err& thrown_err)
		{
			err = thrown_err;
		}
		catch (...)
		{
			err = PF_Err_OUT_OF_MEMORY;
		}
	}

	// If you have PF_ABORT or PF_PROG higher up, you must set
	// the AE context back before calling them, and then take it back again
	// if you want to call some more OpenGL.		
	ERR(PF_ABORT(in_data));

	ERR2(AEFX_ReleaseSuite(in_data,
		out_data,
		kPFWorldSuite,
		kPFWorldSuiteVersion2,
		"Couldn't release suite."));
	ERR2(PF_CHECKIN_PARAM(in_data, &slider_param));
	ERR2(extra->cb->checkin_layer_pixels(in_data->effect_ref, GLATOR_INPUT));

	return err;
}

“But Chubak, where do we pass the uniforms?” Patience, Constance dear Patience. We pass them in RenderGL():

void RenderGL(const AESDK_OpenGL::AESDK_OpenGL_EffectRenderDataPtr& renderContext,
				  A_long widthL, A_long heightL,
				  gl::GLuint		inputFrameTexture,
				  PF_FpLong			sliderVal,
				  float				multiplier16bit)
	{
		// - make sure we blend correctly inside the framebuffer
		// - even though we just cleared it, another effect may want to first
		// draw some kind of background to blend with
		glEnable(GL_BLEND);
		glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA);
		glBlendEquation(GL_FUNC_ADD);

		// view matrix, mimic windows coordinates
		vmath::Matrix4 ModelviewProjection = vmath::Matrix4::translation(vmath::Vector3(-1.0f, -1.0f, 0.0f)) *
			vmath::Matrix4::scale(vmath::Vector3(2.0 / float(widthL), 2.0 / float(heightL), 1.0f));

		glBindTexture(GL_TEXTURE_2D, inputFrameTexture);

		glUseProgram(renderContext->mProgramObjSu);

		// program uniforms
		GLint location = glGetUniformLocation(renderContext->mProgramObjSu, "ModelviewProjection");
		glUniformMatrix4fv(location, 1, GL_FALSE, (GLfloat*)&ModelviewProjection);
		location = glGetUniformLocation(renderContext->mProgramObjSu, "sliderVal");
		glUniform1f(location, sliderVal);
		location = glGetUniformLocation(renderContext->mProgramObjSu, "multiplier16bit");
		glUniform1f(location, multiplier16bit);

		// Identify the texture to use and bind it to texture unit 0
		AESDK_OpenGL_BindTextureToTarget(renderContext->mProgramObjSu, inputFrameTexture, std::string("videoTexture"));

		// render
		glBindVertexArray(renderContext->vao);
		RenderQuad(renderContext->quad);
		glBindVertexArray(0);

		glUseProgram(0);
		glDisable(GL_BLEND);
	}

RenderGL(), like SwizzleGL(), is from a series of functions defined at the top of the file.

So what does it all amount to? Go back to the top, to the very first listing, and you’ll see a sliderVal amongst the uniforms. That’s what I mean by Parametric OpenGL. Technically, every piece of OpenGL is parametric; however, this gives us a slider, or a point, or a value (depending on the type of the PARAM; I recommend reading the SDK manual). Ipso facto, parametric here means “something to mess around with”.
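For instance, here’s a hedged sketch of what adding a second slider would look like, mirroring the listings above. Mind you, GLATOR_SLIDER2, SLIDER2_DISK_ID, StrID_Slider2_Name and sliderVal2 are names I just made up; you would have to wire them into your own param enum, disk IDs and string table:

// In ParamsSetup(), next to the existing PF_ADD_SLIDER:
PF_ADD_SLIDER(STR(StrID_Slider2_Name),
              GLATOR_SLIDER_MIN,
              GLATOR_SLIDER_MAX,
              GLATOR_SLIDER_MIN,
              GLATOR_SLIDER_MAX,
              GLATOR_SLIDER_DFLT,
              SLIDER2_DISK_ID);

// In SmartRender(), check it out just like slider_param:
PF_ParamDef slider2_param;
AEFX_CLR_STRUCT(slider2_param);
ERR(PF_CHECKOUT_PARAM(in_data,
    GLATOR_SLIDER2,
    in_data->current_time,
    in_data->time_step,
    in_data->time_scale,
    &slider2_param));
PF_FpLong slider2Val = slider2_param.u.fd.value / 100.0f;
// ...and ERR2(PF_CHECKIN_PARAM(in_data, &slider2_param)); on the way out

// In RenderGL(), after the other uniforms:
GLint loc2 = glGetUniformLocation(renderContext->mProgramObjSu, "sliderVal2");
glUniform1f(loc2, slider2Val);

// And in the fragment shader:
// uniform float sliderVal2;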

There’s a considerable amount of money in After Effects plugin development, and this is, perhaps, the very first blog post about this SDK. I’m not sure. When I was a kid, I spent my own fair share of money on After Effects plugins. For a brief period of my life, they were my awe, my life and my livelihood. Make an After Effects plugin, make a kid happy!

I hope you enjoyed this very short post. I had a brouhaha with /r/EnglishLearning. They believe that my prose is, as my friend Tanami puts it, brash. If you believe so too, please tell me, so I can do something about it. Thank you.

 

I’ve found another book to mack on while you mack with your paramour. It’s called Fractal Worlds: Grown, Built and Imagined. I’m going to write a fractal generating software based on it, if I don’t die tomorrow. Chubak Out.

 

Further Reading

  • After Effects SDK Manual

Credits

Since giving credits is more important than anything else, I will do that first. This article is partially based on two papers: A Survey of Procedural Noise Functions by A. Lagae et al. and Efficient Computational Noise in GLSL by McEwan et al. Codes are sto-, ahem, taken from this gist, and the table at the beginning of the post is taken from this article. Quotes from the papers are in blue, and are attributed using footnotes. The cover photo is taken directly from the Survey paper.

What is Perlin Noise?

Perlin Noise was created by Ken Perlin (who won the Academy Award for his achievement) in 1983. See, back then, photorealism was something to be desired by everyone, but people always came up short. Ken Perlin came up with his noise algorithm to battle this wretched “computer-looking” appearance of 3D models. Noise is the random number generator of computer graphics. It is a random and unstructured pattern, and is useful wherever there is a need for a source of extensive detail that is nevertheless lacking in evident structure.1 Perlin noise is a multi-dimensional algorithm used in procedural generation, textures, terrain generation, map generation, surface generation, vertex generation, and so on and so forth. The following table 2 shows the extent of Perlin noise:

  • Dimension 1: things such as vectors which look hand-drawn
  • Dimension 2: things such as procedural textures, like fire
  • Dimension 3: Minecraft terrain uses Perlin noise

(raw grayscale noise thumbnails for each dimension omitted)

Perlin gave the following definition for noise: “…noise is an approximation to white noise band-limited to a single octave.”3 The formal definition of the noise is:

f_N(y_1, y_2, \cdots, y_n; x_1, x_2, \cdots, x_n) = P(N(x_1), N(x_2), \cdots, N(x_n))

Where N(x) is the noise function. Perlin noise is a procedural noise. “The adjective procedural is used in computer science to distinguish entities that are described by program code rather than by data structures.” 4 Meaning, something that is generated, not something that is hard-coded. What’s good about using a procedural noise, rather than, say, creating the noise ourselves? Well, a procedural noise is compact, meaning it occupies less space. It’s continuous, it’s non-periodic, it’s parametrized, it’s randomly accessible, and it has many other goodies that make the job of an artist easier… And isn’t that our end goal? To serve the artist? Yesterday, I made an After Effects plugin, which you can see below this post. I didn’t have the artist in mind, I had my own ego and id in mind, so nobody downloaded it. I learned my lesson: serve the artist, not yourself. Anyways…

 

Before Perlin came lattice gradient noises. They are generated by interpolating between random values at lattice points, and Perlin noise uses a cubic lattice with a splined interpolation on top. The pseudo-random gradient is given by hashing the lattice point and using the result to choose a gradient. 5 These hashes are turned into 12 vectors, and they are interpolated from center to edge using a quintic polynomial. A little difficult to imagine, right? No worries. We’ll provide pictures6 and pseudo-code 7.

“… they are interpolated from center to edge using a quintic polynomial. “

And a pseudo-code for Classic Perlin, sans the hash function:

 

// Function to linearly interpolate between a0 and a1
// Weight w should be in the range [0.0, 1.0]
float lerp(float a0, float a1, float w) {
    return (1.0 - w)*a0 + w*a1;

    // as an alternative, this slightly faster equivalent formula can be used:
    // return a0 + w*(a1 - a0);
}

// Computes the dot product of the distance and gradient vectors.
float dotGridGradient(int ix, int iy, float x, float y) {

    // Precomputed (or otherwise) gradient vectors at each grid node
    extern float Gradient[IYMAX][IXMAX][2];

    // Compute the distance vector
    float dx = x - (float)ix;
    float dy = y - (float)iy;

    // Compute the dot-product
    return (dx*Gradient[iy][ix][0] + dy*Gradient[iy][ix][1]);
}

// Compute Perlin noise at coordinates x, y
float perlin(float x, float y) {

    // Determine grid cell coordinates
    int x0 = int(x);
    int x1 = x0 + 1;
    int y0 = int(y);
    int y1 = y0 + 1;

    // Determine interpolation weights
    // Could also use higher order polynomial/s-curve here
    float sx = x - (float)x0;
    float sy = y - (float)y0;

    // Interpolate between grid point gradients
    float n0, n1, ix0, ix1, value;
    n0 = dotGridGradient(x0, y0, x, y);
    n1 = dotGridGradient(x1, y0, x, y);
    ix0 = lerp(n0, n1, sx);
    n0 = dotGridGradient(x0, y1, x, y);
    n1 = dotGridGradient(x1, y1, x, y);
    ix1 = lerp(n0, n1, sx);
    value = lerp(ix0, ix1, sy);

    return value;
}
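By the way, the pseudo-code above interpolates with plain linear weights; the quintic polynomial mentioned earlier is what Improved Perlin Noise (2002) uses instead. A minimal sketch of that fade curve, my own addition rather than something from the cited papers:

// The quintic fade from Improved Perlin Noise: 6t^5 - 15t^4 + 10t^3.
// Its first and second derivatives vanish at t = 0 and t = 1, which
// removes the grid artifacts that plain linear weights produce.
float fade(float t) {
    return t * t * t * (t * (t * 6.0f - 15.0f) + 10.0f);
}

// Swap it in where the interpolation weights are computed:
// float sx = fade(x - (float)x0);
// float sy = fade(y - (float)y0);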

It is worth knowing that “… All noise functions except Perlin noise and sparse convolution noise are approximately band-pass. Perlin noise is only weakly band-pass, which might lead to problems with aliasing and detail loss.” 8 Plus, Perlin noise does not have a Gaussian amplitude distribution. Meaning, the speckles of noise don’t scatter based on a Gaussian function, which is outside the scope of this article. There are many things that Perlin noise fares well at, but there are things at which it performs weakly. In the following table plate 9 you can witness this for yourself.

 


There are many things that Perlin noise fares well at, but there are things at which it performs weakly.

Perlin Noise in Practice: GLSL Implementations

Well, I said I will talk about Perlin noise in GLSL, and now I’m going to do it.

You can use Perlin noise as a wave, as a diffuse color, as a diffuse material, as a flickering light, and as some speckles on your texture. I personally used it in this instance as a color fizzler.

 

As I’m writing this, I’m contemplating an After Effects plugin which adds Perlin Noise functionality.

The simplest Perlin noise can be constructed10 as:

#define PI 3.14159265358979323846
uniform float screenWidth; // assumed to be fed in by the host program

float rand(vec2 c){
	return fract(sin(dot(c.xy ,vec2(12.9898,78.233))) * 43758.5453);
}

float noise(vec2 p, float freq ){
	float unit = screenWidth/freq;
	vec2 ij = floor(p/unit);
	vec2 xy = mod(p,unit)/unit;
	//xy = 3.*xy*xy-2.*xy*xy*xy;
	xy = .5*(1.-cos(PI*xy));
	float a = rand((ij+vec2(0.,0.)));
	float b = rand((ij+vec2(1.,0.)));
	float c = rand((ij+vec2(0.,1.)));
	float d = rand((ij+vec2(1.,1.)));
	float x1 = mix(a, b, xy.x);
	float x2 = mix(c, d, xy.x);
	return mix(x1, x2, xy.y);
}

float pNoise(vec2 p, int res){
	float persistance = .5;
	float n = 0.;
	float normK = 0.;
	float f = 4.;
	float amp = 1.;
	int iCount = 0;
	for (int i = 0; i<50; i++){
		n+=amp*noise(p, f);
		f*=2.;
		normK+=amp;
		amp*=persistance;
		if (iCount == res) break;
		iCount++;
	}
	float nf = n/normK;
	return nf*nf*nf*nf;
}
#define M_PI 3.14159265358979323846

float rand(vec2 co){return fract(sin(dot(co.xy ,vec2(12.9898,78.233))) * 43758.5453);}
float rand (vec2 co, float l) {return rand(vec2(rand(co), l));}
float rand (vec2 co, float l, float t) {return rand(vec2(rand(co, l), t));}

float perlin(vec2 p, float dim, float time) {
	vec2 pos = floor(p * dim);
	vec2 posx = pos + vec2(1.0, 0.0);
	vec2 posy = pos + vec2(0.0, 1.0);
	vec2 posxy = pos + vec2(1.0);
	
	float c = rand(pos, dim, time);
	float cx = rand(posx, dim, time);
	float cy = rand(posy, dim, time);
	float cxy = rand(posxy, dim, time);
	
	vec2 d = fract(p * dim);
	d = -0.5 * cos(d * M_PI) + 0.5;
	
	float ccx = mix(c, cx, d.x);
	float cycxy = mix(cy, cxy, d.x);
	float center = mix(ccx, cycxy, d.y);
	
	return center * 2.0 - 1.0;
}

// p must be normalized!
float perlin(vec2 p, float dim) {
	
	/*vec2 pos = floor(p * dim);
	vec2 posx = pos + vec2(1.0, 0.0);
	vec2 posy = pos + vec2(0.0, 1.0);
	vec2 posxy = pos + vec2(1.0);
	
	// For exclusively black/white noise
	/*float c = step(rand(pos, dim), 0.5);
	float cx = step(rand(posx, dim), 0.5);
	float cy = step(rand(posy, dim), 0.5);
	float cxy = step(rand(posxy, dim), 0.5);*/
	
	/*float c = rand(pos, dim);
	float cx = rand(posx, dim);
	float cy = rand(posy, dim);
	float cxy = rand(posxy, dim);
	
	vec2 d = fract(p * dim);
	d = -0.5 * cos(d * M_PI) + 0.5;
	
	float ccx = mix(c, cx, d.x);
	float cycxy = mix(cy, cxy, d.x);
	float center = mix(ccx, cycxy, d.y);
	
	return center * 2.0 - 1.0;*/
	return perlin(p, dim, 0.0);
}

That is, however, a version of Perlin noise reworked in 2002. Visit the Gist to see how to apply Classic Perlin Noise.

Well, that is it for today! Short post, I know, and it was lacking in original content, but I am running out of ideas as of now, because I haven’t read Real-Time Rendering yet. It’s a book that’s chock-full of concepts and ideas for learning, and for teaching. I love this damn book!

 

Another book I’m currently reading is Fundamentals of Computer Graphics. I got a little stuck reading about Implicit Curves, but I’m going to ask my cousin, who holds a PhD in mathematics, to help me with it. I hope he does, because he has a tendency to forgo my requests for intellectual assistance. Dunno, really. Aren’t cousins supposed to help each other? I would write him a visualization app in a minute if he asked… Au fait.

 

Let me tell you something. Please introduce my blog to your friends, because I want more visitors, and I want to reach more people. If they don’t want to visit my blog, tell them: “It’s as great as a queen who is virgo intacta and wants your whore!”

Dammit, I suck at French, at least I didn’t use Google Translate. Chubak Out.

Reference

  • A. Lagae et al., “A Survey of Procedural Noise Functions”, Computer Graphics Forum, Volume 29 (2010), Number 8, pp. 2579-2600
  • McEwan et al., “Efficient Computational Noise in GLSL”, Journal of Graphics Tools, Volume 16, No. 2, pp. 85-94

Good Evening everyone! I hope you’re all doing A-Ok. Yesterday I received a PM:

Hey. Do you know anything about game rendering? I’ve seen your blog you must know something about game rendering.

I think this person is referring to the Graphics Pipeline, which different APIs dub the “rendering application”. “The main function of the pipeline is to generate, or render, a two-dimensional image, given a virtual camera, three-dimensional objects, light sources, and more”. 1 Now, SIMD architecture, with its many cores, makes parallel calculations possible, and it is de facto the main reason we have GPUs today. The Graphics Pipeline is made up of shaders. Shaders are parallel programs that live in the GPU. OpenGL has five shader stages, and about seven application stages, in its pipeline 2. But what are the stages of this pipeline? What does pipeline even mean? Read this article and find out.

The History of the Word “Pipeline”

The word pipeline became popular when gas lamps were invented. “The first Canadian transmission pipeline was built in 1853. A 25 kilometre cast-iron pipe moving natural gas to Trois Rivières, QC. It was the longest pipeline in the world at the time.” 3

 

Source in Footnotes

What does pipeline mean? Simple. Imagine you wish to transmit something from one stage, to another stage, then another stage, until the output is unrecognizable from the input. For example, in OpenGL, this can be the input:

 

float vertices[] = {
    // first triangle
     0.5f,  0.5f, 0.0f,  // top right
     0.5f, -0.5f, 0.0f,  // bottom right
    -0.5f,  0.5f, 0.0f,  // top left 
    // second triangle
     0.5f, -0.5f, 0.0f,  // bottom right
    -0.5f, -0.5f, 0.0f,  // bottom left
    -0.5f,  0.5f, 0.0f   // top left
}; 

And this can be the output:

 

Photo and code courtesy of Joey De Vries at Learnopengl.com

Some stages of the OpenGL are mandatory, some stages are optional. You can see this pipeline in the following picture.

 

By Yours Truly

Based on two books, The OpenGL Red Book 9th Edition and Real-Time Rendering 4th Edition I wish to explain to you this so-called Pipeline. Turn off your phones, and pay attention!

1. Vertex Data

We saw an example of vertex attributes in the first code in this very post. In OpenGL, vertices are recorded in the Homogeneous Coordinate System, which is a four-component coordinate system. I’ll talk about this coordinate system in another post, very soon. Keep in mind that vertices can be hardcoded in the shader, but most of the time, they are passed to the vertex shader as a Vertex Buffer Object, along the lines of the sketch below.
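A hedged sketch of that upload, assuming a context and a function loader such as GLAD are already set up, and that vertices is the array from the beginning of the post:

GLuint vao, vbo;
glGenVertexArrays(1, &vao);
glGenBuffers(1, &vbo);

glBindVertexArray(vao);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);

// Attribute 0: three floats per vertex, tightly packed
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), (void*)0);
glEnableVertexAttribArray(0);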

2. Vertex Shader

Pièce de résistance: the first shader we deal with, and perhaps the most important shader. In the vertex shader, we access the data given to us through Vertex Buffer Objects in the main program using attributes, and uniforms passed to the shader are indexed using binding indices. What are uniforms, you may ask? Well, each vertex gets its own invocation of the shader, each with quite different values; a uniform, however, is constant across all those invocations. Each shader program has its own uniforms, and you can’t pass a uniform from one shader to another. However, other variables can be passed from one shader stage to the next: variables qualified by the keyword in come into the shader from the previous stage, and variables qualified by out are passed to the next one.
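To make the in/out/uniform traffic concrete, here is a minimal vertex shader of my own (aPos, mvp and vColor are names I picked purely for illustration), written as the source string a host program would compile:

const char* vertexShaderSrc = R"(
#version 330 core
layout (location = 0) in vec3 aPos;   // per-vertex attribute from the VBO
uniform mat4 mvp;                     // the same value for every vertex
out vec3 vColor;                      // handed on to the next stage

void main() {
    gl_Position = mvp * vec4(aPos, 1.0);
    vColor = aPos * 0.5 + 0.5;        // just something to pass along
}
)";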

In the vertex shader, the outgoing gl_Position is multiplied by three matrices. One is the Model matrix, the other is the View matrix, and at the end, we have the Projection matrix. You can clearly see them in the following picture:

Courtesy of Joey De Vries at Learnopengl.com

The model matrix deals with the object, placing it in the world. The view matrix deals with the world, placing it relative to the camera. The projection matrix deals with the camera, projecting everything onto the screen. One day, I’ll talk in detail about various camera projections.

 

3. Tessellation Control Shader

“Imagine you have a bouncing ball object. If you represent it with a single set of triangles, you can run into problems with quality or performance. Your ball may look good from 5 meters away, but up close the individual triangles, especially along the silhouette, become visible. If you make the ball with more triangles to improve quality, you may waste considerable processing time and memory when the ball is far away and covers only a few pixels on the screen. With tessellation, a curved surface can be generated with an appropriate number of triangles.” 4

Tessellation is basically the process of breaking higher order primitives, such as cubes, into many small sub-primitives in order to create a more detailed geometry. In OpenGL, instead of cubes, we use patches. TCS is the first of two Tessellation Shaders, the other one being…

 

4. Tessellation Evaluation Shader

Once the Tessellation Engine does its job, it produces a number of geometry vertices which are responsible for adding detail to the given patch. The Tessellation Evaluation Shader runs for each generated vertex, and adds a lot of overhead. That’s why programmers should be thrifty with the tessellation levels they set in the TCS.

 

5. Geometry Shader

The Geometry Shader looks like a Vertex Shader, and uses gl_Position, but it’s responsible for creating multiple new primitives through EmitVertex() and EndPrimitive(). “Each time we call EmitVertex the vector currently set to gl_Position is added to the primitive. Whenever EndPrimitive is called, all emitted vertices for this primitive are combined into the specified output render primitive. By repeatedly calling EndPrimitive after one or more EmitVertex calls multiple primitives can be generated. This particular case emits two vertices that were translated by a small offset from the original vertex position and then calls EndPrimitive, combining these two vertices into a single line strip of 2 vertices.” 5

 

6. Primitive Assembly

Primitive Assembly is a short stage. It is basically the grouping of vertices into primitives: lines and triangles. You may ask, well, what about points? Can’t they be assembled? The answer to your question is that yes, it does happen for points too, but it’s redundant.

7. Clipping

After gl_Position is converted to Cartesian coordinates and normalized, meaning its components are brought into the range -1 to 1, it is clipped for the screen. Meaning, the primitives which are not in the viewport propagated by the projection matrix are discarded. This stage is very important for overall performance reasons.
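In symbols, the perspective divide that carries clip coordinates into that -1 to 1 cube is:

(x_{ndc}, y_{ndc}, z_{ndc}) = \left( \frac{x_{clip}}{w_{clip}}, \frac{y_{clip}}{w_{clip}}, \frac{z_{clip}}{w_{clip}} \right)

and a vertex lies inside the view volume exactly when -w_{clip} \le x_{clip}, y_{clip}, z_{clip} \le w_{clip}.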

8. Culling

This is another step taken for performance. Each 3D primitive has a face toward the camera, and a back that is out of the camera’s view. So it’s entirely useless to render the back of the primitive as well. Culling is the process of discarding the back in favour of the face of each primitive, and in OpenGL it boils down to the state calls below.
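These are standard GL calls, with the defaults spelled out; which side counts as the “back” is decided by the winding order of the triangle’s vertices:

glEnable(GL_CULL_FACE);
glCullFace(GL_BACK);    // discard back faces (the default)
glFrontFace(GL_CCW);    // counter-clockwise winding is the front (the default)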

9. Rasterization

Rasterization is the process of converting 3D objects in the computer’s memory into 2D objects to be displayed on the monitor, or rendered to a file. In other words, rasterization is the process of determining which fragments might be covered by a triangle or a line. Rasterization has two stages: triangle setup, and triangle traversal. In the former, “… edge equations, and other data for the triangle are computed. These data may be used for triangle traversal, as well as for interpolation of the various shading data produced by the geometry stage. Fixed function hardware is used for this task.” 6 In the latter, we find which pixels, or samples, are covered by each triangle, and a fragment is generated for each of them. If the number of samples is low, aliasing happens. Attributes are interpolated across the triangles, and the fragments are sent to the next stage, which is the most important stage: the fragment shader.

10. Fragment Shader

Crème de la crème, it is the last programmable stage in the pipeline, and does operations such as coloring, texturing, shadowing, antialiasing, raytracing (if possible), etc. Perhaps the most important thing done in this stage is texturing.

 

Texturing, picture courtesy of Real-Time Rendering 4th Edition

If you are interested in learning more about fragment shaders and what you can do in them, you can read The Book of Shaders, a free internet book that, although outdated, teaches a lot of tricks of the trade.

11. Z-Buffer, Stencil Buffer, and Color Buffer

The Z-Buffer is the depth buffer of the program, and it lives in an FBO (framebuffer object). An FBO can also hold a stencil buffer, which, as the name suggests, masks out parts of the screen. An FBO can also hold a color buffer, which carries the actual colors, including the background color of the program. A sketch of building one follows below.
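A hedged sketch of such an FBO, with a color attachment and a combined depth-stencil attachment; width and height are assumed to be your render size:

GLuint fbo, colorTex, depthStencilRbo;
glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);

// Color buffer as a texture we can sample later
glGenTextures(1, &colorTex);
glBindTexture(GL_TEXTURE_2D, colorTex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, colorTex, 0);

// Depth and stencil in one renderbuffer
glGenRenderbuffers(1, &depthStencilRbo);
glBindRenderbuffer(GL_RENDERBUFFER, depthStencilRbo);
glRenderbufferStorage(GL_RENDERBUFFER, GL_DEPTH24_STENCIL8, width, height);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_STENCIL_ATTACHMENT,
                          GL_RENDERBUFFER, depthStencilRbo);

if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE) {
    // handle the incomplete framebuffer here
}
glBindFramebuffer(GL_FRAMEBUFFER, 0);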

12. Compute Shaders, and Beyond!

Disney don’t sue please.

Compute Shaders are a part of GPGPU, or General-Purpose GPU computing. GPUs, with their SIMD architecture, are the bee’s knees for high-performance computing. So in modern OpenGL, Khronos has provided us with a way to harness the power of the GPU using shaders. Compute Shaders use GLSL, and are good for things like image processing. However, Compute Shaders are not a part of the pipeline, and in fact, a shader program cannot contain both types of shaders. However, a regular program can contain both. Dispatching one looks roughly like the snippet below.
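A sketch, assuming computeProgram was compiled and linked from a compute shader declaring local_size_x = 16, local_size_y = 16, and that it writes to an image bound beforehand:

glUseProgram(computeProgram);
glDispatchCompute((width + 15) / 16, (height + 15) / 16, 1);

// Make the image writes visible to whoever samples that texture next
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);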

There are many things to do after Fragment Shader. Such as collision detection, animation, physics, music, etc. But they are not a part of OpenGL.

 

So that’s it! I hope you’ve enjoyed this post. Since there was a brouhaha last night about me not citing the sources, this time I made sure to cite everyone. People were right, I should have cited. Thanks people!

Please share my post with your interested friends. It goes a long way. Thank you, Chubak Out.