I’m extremely agitated today. I dunno why. Maybe because there was some convulsion in the peaceful tidings of the house I live in, or the fact that I’m kinda hungry at the moment. Anyways, I don’t have time for chitchat. Let’s get to the studying.

The following is taken from Foundations of Machine Learning by Rostamyar, et al.

Support Vector Machines are the most theoretically well motivated and practically most effective classification algorithms in modern machine learning.

Consider an input space $\mathcal{X}$ that is a subset of $\mathbb{R}^N$ with $N \geq 1$, and the output or target space $\mathcal{Y}=\{-1, +1\}$, and let $f : \mathcal{X} \rightarrow \mathcal{Y} $ be the target function. Given a hypothesis set $\mathcal{H}$ of functions mapping $\mathcal{X}$ to $\mathcal{Y}$, the binary classification task is formulated as follows:

The learner receives a training sample $S$ of size $m$ drawn independently and identically from $\mathcal{X}$ to some unknown distribution $\mathcal{D}$, $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X}\times\mathcal{Y})^m$, with $y_i = f(x_i) $ for all $i \in [m]$. The problem consists of determining a hypothesis $ h \in \mathcal{H}$, a binary classifier, with small generalization error: The probability that hypothesis set is not the target function is our error rate.

\[ R_{\mathcal{D}} = \underset{x\sim\mathcal{D}}{\mathbb{P}} [h(x) \neq f(x)]. \]

Different hypothesis sets $\mathcal{H}$ can be selected for this task. Hypothesis sets with smaller complexity provide better learning guarantees, everything else being equal. A natural hypothesis set with relatively small complexity is that of a linear classifier, or hyperplanes, which can be defined as follows:

\[ \mathcal{H}= \{x \rightarrow sign(w.x+b) : w \in \mathbb{R}^N, b \in r\} \]

The learning problem is then referred to as a linear classification problem. The general equation of a hyperplane in $\mathbb{R}^N$is $w.x+b=0$ where $w\in\mathbb{R}^N$ is a non-zero vector normal to the hyperplane $b\in\mathbb{R}$ a scalar. A hypothesisol. of the form $x\rightarrow sign(w.x+b)$ thus labels positively all points falling on one side of the hyperplane $w.x+b=0$ and negatively all others.

From now until we say so, we’ll assume that the training sample $S$ can be linearly separated, that is, we assume the existence of a hyperplane that perfectly separates the training samples into two populations of positively and negatively labeled points, as illustrated by the left panel of figure below. This is equivalent to the existence of $ (\boldsymbol{w}, b) \in (\mathbb{R}^N – \boldsymbol{\{0\}}) \times \mathbb{R}$such that:

\[ \forall i \in [m], \quad y_i(\boldsymbol{w}.x_i + b) \geq 0 \]

 

But, as you can see above, there are then infinitely many such separating hyperplane. Which hyperplane should a learning algorithm select? The definition of SVM solution is based on the notion of geometric margin.

Let’s define what we just came up with: The geometric margin $\rho_h(x)$ pf a linear classifier $h:\rightarrow \boldsymbol{w.x} + b $ at a point $x$ is its Euclidean distance to the hyperplane $\boldsymbol{w.x}+b=0$:

\[ \rho_h(x) = \frac{w.x+b}{||w||_2} \]

The geometric margin of $\rho_h$ of a linear classifier h for a sample $S = (x_1, …, x_m) $ is the minimum geometric margin over the points in the sample, $\rho_h = min_{i\in[m]} \rho_h(x_i)$, that is the distance of hyperplane defining h to the closest sample points.

So what is the solution? It is that, the separating hyperplane with the maximum geometric margin is thus known as maximum-margin hyperplane. The right panel of the figure above illustrates the maximum-margin hyperplane returned by SVM algorithm is the separable case. We will present later in this chapter a theory that provides a strong justification for the solution. We can observe already, however, that the SVM solution can also be viewed as the safest choice in the following sense: a test point is classified correctly by separating hyperplanes with geometric margin $\rho$ even when it falls within a distance $\rho$ of the training samples sharing the same label: for the SVM solution, $\rho$ is the maximum geometric margin and thus the safest value.

We now derive the equations nd optimization problem that define the SVM solution. By definition of the geometric margin, the maximum margin of $\rho$ of a separating hyperplane is given by:

\[ \rho = \underset{w,b : y_i(w.x_i+b) \geq 0}{max}\underset{i\in[m]}{min}\frac{|w.x_i=b}{||w||} = \underset{w,b}{max}{min}\frac{y_(w.x_i+b)}{||w||} \]

The second quality follows from the fact that, since sample is linearly separable, for the maximizing pair $(w, b), y_i(w.x_i+b)$ must be non-negative for al $i\in[m]$. Now, observe that the last expression is invariant to multiplication of $(w, b)$ by a positive scalar. Thus, we can restrict ourselves to pairs $(\boldsymbol{w},b)$ scaled such that $min_{i\in[m]}(\boldsymbol{w}.x_i+b) = 1$:

\[ \rho = \underset{min_{i\in[m]}y_i(w.x_i+b)=1}{max}\frac{1}{||w||} = \underset{\forall i \in[m],y_i(w.x_i+b) \geq }{max}\frac{1}{||w||} \]

Figure below illustrates the solution $(w, b)$ of the maximization we just formalized. In addition to the maximum-margin hyperplane, it also shows the marginal hyperplanes, which are the hyperplanes parallel to the separating hyperplane and passing through the closest points on the negative or positive sides.

 

 

Since maximizing $1/||w||$ is equivalent to minimizing $\frac{1}{2}||w||^2$, in view of the equation above, the pair $(\boldsymbol{w}, b)$ returned by the SVM in the separable case is the solution of the following convex optimization problem:

\[ \underset{w, b}{min}\frac{1}{2}||w||^2 \]\[ \text{subject to}: y_i(\boldsymbol{w}.x_i+b) \geq 1, \forall i \in[m] \]

Since the objective function is quadratic and the constraints are affine (meaning they are greater or equal to) the optimization problem above is in fact a specific instance of quadratic programming (QP), a family of problems extensively studied in optimization. A variety of commercial and open-source solvers are available for solving convex QP problems. Additionally, motivated by the empirical success of SVMs along with its rich theoretical underpinnings, specialized methods have been developed to more efficiently solve this particular convex QP problem, notably the block coordinate descent algorithms with blocks of just two coordinates.

So what are support vectors? See the formula above, we note that constraints tare affine and thus qualified. The objective function as well as the affine constrains are convex and differentiable.

We introduce Lagrange variables $\alpha_i \geq 0, i\in[m]$, associated to the m constrains and denoted by $\boldsymbol{\alpha}$ the vector $(\alpha_1, \ldots, \alpha_m)^T$. The Lagrangian can then be defiend for all $\boldsymbol{w}\in\mathbb{R}^N,b\in\mathbb{R}$, and $\boldsymbol{\alpha}\in\mathbb{R}_+^m$ by:

\[ \mathcal{L}(\boldsymbol{w},b,\boldsymbol{\alpha} = \frac{1}{2}||w||^2 – \sum_{i = 1}^{m}\alpha_i[y_i(w.x_i+b) -1] \]

Support vectors fully define the maximum-margin hyperplane or SVM solution, which justifies the name of the algorithm. By definition, vectors not lying on the marginal hyperplanes do not affect the definiton of these hyperplanse – in their absence, the solution the solution to the SVM problem is unique, the support vectors are not. In dimensiosn $N, N+1$ points are sufficient to define a hyperplane. Thus when more then $N+1$ points lie on them marginal hyperplane, different choices are possible for $N+1$ support vectors.

But the points in the space are not always separable. In most practical settings, the training data is not linearly separable, which implies that for any hyperplane $\boldsymbol{w.x}+b=0$, there exists $x_i \in S$ such that:

\[ y_i[\boldsymbol{w.x_i}+b] \ngeq 1 \]

Thus, the constrains imposed on the linearly separable case cannot be hold simultaneously. However, a relaxed version of these constraints can indeed hold, that is, for each $i\in[m]$, there exists $\xi_i \geq 0$ such that:

\[ y_i[\boldsymbol{w.x_i}+b] \ngeq 1-\xi_i \]

The variables $\xi_i$ are known as slack variables and are commonly used in optimization to define relaxed versions of constraints. Here, a slack variable $\xi_i$ measures the distance by which vector $x_i$ violates the desires inequality, $y_i(\boldsymol{w.x_i} + b) \geq 1 . This figure illustrates this situation:

 

For hyperplane $y_i(w.x_i+b) = 1 $, a vector $x_i$ with $x_i > 0$ can be viewed as an outlier. Each $x_i$ must be positioned on the correct side the appropriate marginal hyperplane. Here’s the formula we use to optimize the non-separable cases:

\[ \underset{w, b, \xi}{min} \frac{1}{2}||w||^2 + C\sum_{i=1}^{m}\xi_i^p \]\[ \text{subject to} \quad y_i(w.x_i+b) \geq 1-\xi_i \wedge \xi_i \geq 0, i\in[m] \]

Okay! Alright! I think I understand it now. That’s enough classification for today. I’m going to study something FUN next. Altough I’m a bit drowsy… No matter! I have some energy drinks at home. Plus I have some methamphetamine which I have acquired to boost my eenrgy… Nah, kidding. I’m a cocaine man!

First, let’s take care of the pleasantries. I’m ok. Are you ok? That’s fine. How’s the missus? Oh, she is thinking about suicide? No worries, just chock a handful of Zoloft at her! How’s the kid? You don’t have a kid? Why? Because you take Risperidone? No comments!

I haven’t talked about DIP, for like, ever. I talked about DSP in one of my opi, but that was mostly about DSP rather than DIP. So it’s time dive into Gonzalez and Woods 4th edition, chapter 6, and learn about Pseudocolor Image Processing – Something I always wanted to learn about. Let’s hope this is going to be a short post because I wanna watch muh Veronica Mars1.

Pseudocolor image processing consists of assigning colors to gray values based on a specified criterion. The term pseudo or false color is used to differentiate the process of assigning colors to achromatic images from the process associated with true color images, a topic which I’m going to learn, and “teach” in the near future.

The principal use of pseudocolor is for human visualization and interpretation of grayscale events in a single image, or a sequence of images. But why do we want to convert a grayscale image to color? Are we all affected by the disease aptly called Ted Turneritus? No, we do this because humans can discern thousands of color shades, as compared to less than two dozen shades of gray.

The techniques of intensity slicing, and color coding are the simplest and earliest examples of pseudocolor processing of digital images. If an image is interpreted as a 3D function, like this:

 

the method can be viewed as one of placing planes parallel to the coordinate plane of the image; each plane then “slices” the function in the area of intersection. In the image below you see an example of using a plate at f(x) = l_i to slice the image intensity function into two levels.

 

If a different color is assigned to each side of the plane, any pixel whose intensity level is above the plane will be coded with one color, and any pixel below the plane will be coded with another. Levels that lie on the plane itself may be arbitrarily assigned one of the two colors, or they could be given a third color to highlight all the pixels at that level. The result is between two to three color images whose relative appearance can be controlled by moving the slice plane up and down in the intensity axis.

A simple but practical use of intensity slicing is shown below, with the grayscale image on the left and the pseudocolor image on the right:

This is an image of the Picker Thyroid Phantom, a radiation test pattern. Regions that appear of constant intensity in the grayscale image are actually quite variable, as shown by the various colors in the intensity image. For instance, the left lobe is a dull gray in the grayscale image, but has various colors of red, yellow and green in the intensity image. Below you see the representation of one of the grayscale intervals and its slicing:

 

In the preceding simple example the grayscale was divided into intervals, and a different color was assigned to each, with no regard for the meaning of the gray levels in the image. Interest in that case was simply to view the different gray levels constituting the image. Intensity slicing assumes a much more meaningful and useful role when subdivision of the grayscale is based on the physical characteristics of the image. For instance, in this image:

 

We see an X-ray image of a weld, containing several cracks and porosities. When there is porosity or crack in the weld, the full strength of the X-rays going through the object saturates the image sensor on the other side of the object. Thus, intensity values of 255, white, in an 8-bit image coming from such a system automatically imply a problem with the weld. If the human visual analysis is used to inspect welds, a simple color coding assigns one color to 255, and one color to the rest. Thus helping the inspectors to assess the problem better.

One of the things that interests geologists is measuring rainfall levels, especially in tropical regions of Earth. Tropical Rainfall Measuring Missions, TRMM satellite utilizes among others, three sensors especially designed to detect rain: a precipitation radar, a microwave imager, and a visible and infrared scanner. The results from the various rain sensors are processed, resulting in estimates of average rainfall over a given time period in the area monitored by the sensors. From these estimates, it is not difficult to generate grayscale images whose intensity values correspond directly to the rainfall, with each pixel representing a physical land area whose size depends on the resolution of the sensors. Such an intensity image is shown below:

Visual examination of such picture is prone to error. However, suppose that we code intensity levels from 0 to 255 using the colors shown here:

 

Which results in:

 

Thus making it easier for the scientists to meter up the rainfall levels.

There are other methods that are more general, and thus are capable of achieving a wider range of pseudocolor enhancement results than the simple slicing techniques discussed is the preceding section. Below you see an approach that is particularily attractive. Basically, the idea underlying this approach is to perform three independent transformations on the intensity of input levels. The three results are then fed separately into the red green, and blue channels of a color monitor. This method produces a composite image whose color content is modulated by the nature of the transformation function.

 

Now, look at this image:

It shows two monochrome images of a luggage obtained from an airport X-ray scanning system. The image on the left contains ordinary articles. The image on the right contains the same articles, as well as a block of simulated plastic explosives. The purpose of this example is to illustrate the use of intensity to color transformations to facilitate detection of explosives.

Look at this image:

 

It shows the transformation functions used. These sinusoidal functions contain regions of relatively constant value around the peaks as well as regions that change rapidly in the valleys. Changing the phase and frequency of each sinusoid can emphasize ranges in the grayscale. For instance, if all three transformations have the same phase and frequency, the output a grayscale image. A small change in the phase between the three transformations produces little change in pixels whose intensity correspond to peaks in the sinusoids, especially if the sinusoids have broad profiles (low frequencies). Pixels with intensity values in the steep section of the sinusoids are assigned a much stronger color content as a result of significant differences between amplitudes of the three sinusoids caused by the phase displacement between them.

The pseudocolored image of the luggage was obtained using the transformations we just saw in the above on the left, which shows the gray-level bands corresponding to the explosive, garment back, and background, respectively. Note that the explosive and background have quite different intensity levels, but they were both coded with approximately the same color as a result of the periodicity of the sine waves. The second pseudocolored image of the luggage was obtained with the transformation on the right. In this case, the explosives and garment bag intensity bands were mapped by similar transformations, and thus received essentially the same color assignment. Note that this mapping allows and observer to “see” through explosives.

It is often in our interest to combine several grayscale images, and then transform them to pseudocolor. Frequent use of this approach is in multispectral image processing, where different sensors produce individual grayscale images, each in a different spectral band.

Ok, that shall be it for tonight! I will be back tomorrow with a post about machine learning… Deep learning, perhaps? I’m not sure. There’s a lot to learn. There’s a lot to process.

My brother broke my hookah, and beat up my mom. I think, just like me, he’s got bipolarity. But well, he’s a proud 5th semester psychology student and there’s no way in hell we could get him to take medication for his mental handicap. We get bipolarity from our father’s side, and he gets it from his mother’s side. Both my uncles are bipolar, my youngest brother has Lithium deficiency with a wee bit of bipolarity, but my middle broter is… Kinda fucked up. The symptoms are showing. I don’t know why he’s so proud. Our second cousin once removed – my father’s cousin, has bipolarity. So do both my uncles, and they’re being medicated for it. Not many women get bipolarity. It’s just not in the evolution of our species. We’re bipolar because nature wanted us to hibernate during the winter, but it fucked up, so based on seasonal changes, bipolar people like me and many people in my family get depressed, it gets harder and harder for them to step out of their dwelling, they get depressed, and after depression passes, they become manic. It’s HELL.

Anyways, if you’re bipolar, I hope you’re getting the medication and the help you deserve. Many people in developing countries suffer for the cheapest of psych meds, even Lithium. This is bad.

Well, to the books!

When I started this blog, it was graphics, but now I mostly post ML and DSP stuff and I’m really sad about it. I haven’t touched graphics programming in 2 months, and that’s something which I would like to classify as bad (who says that?). But wait! It can still be about graphics programming, I just have to do graphics stuff that has ML in it, and not the other way around, because I want ML to be my de jur my specialty.

So what do you wanna learn today, Chubak? “Something graphics-related.” But what? “Um… I wanna learn about something graphics-related that’s also about probability, because it’ll help me ease into my life as an ML student.” So Chubak, learn about Monte Carlo Integration! “I can’t play blackjack! I mean, I once coded a blackjack game, but I’ve forgotten all about the rules!” Oh Chubakabra, you’re so unfunny I could kill you! Seriously!

So I have this book, Computer Graphics: Principals and Practice. I love it. It’s written by a slew of well-to-do programmers who would eat me just by looking at me (if I were a girl, that would mean something different). I will quote and paraphrase this book to learn. Let’s learn together.

 

We have all learned about numerical methods in our high school math class. Methods such as integration, interpolation, sequences, and so on. There are two types of numerical methods: Deterministic, and randomized.

A typical deterministic method for integrating a function for integrating a function f over an interval [a, b] is to take n + 1 equally spaced points in the interval, t_0 = a, t_1 = a + \frac{b - a }{n}, \ldots, t_n - b, evaluate f at midpoint \frac{t_i + t_{i + 1}}{2} of each of the intervals these points define, sum up the results, and multiply by \frac{b -a}{b}, the width of each interval. For sufficiently continuous functions f, as n increases this will converge to correct value.

A probabilistic or Monte Carlo method for computing, or more precisely, for estimating the integral of f over the interval [a, b] is this: Let X_1, \ldots, X_n be randomly chosen points in the interval [a, b]. Then Y = (b - a) \frac{1}{n}\sum_{i = 1}^{n}f(X_i) is a random value whose average is the integral \int_{[a,b]}f. To implement this we use a random number generator to generate n points in the interval [a, b], evaluate f at each, average the results, and multiply by b-a. This gives an estimate of the integral, as shown in the figure below, \int_{-1}^{1}\sqrt{1 - x^2} dx with 20 samples approximates the correct result, which is \frac{\pi}{2}.

 

Of course, each time we compute such an estimate we’ll get a different value (duh!). The variance in these values depends on the shapes of the function f. If we generate the random points x_i nonuniformly, we must slightly alter the formula. But in using a nonuniform distribution of points we gain an enormous advantage: By making our non nonuniform distribution favor points x_i where f(x) is large, we can drastically reduce the variance of our estimate. This nonuniform sampling approach is called importance sampling, as you see before you.

 

Because major shift in rendering methods over the past several decades was from deterministic to randomized, we’ll be discussing randomized approaches to solving rendering equations. To do so means we’re using random variables, expectation, and variance. We deal with discrete values, because computers are inherently discrete. Continuous values deal with probability density function but we’re not gonna discuss that. We’re going to discuss probability mass function. PMF has two properties:

1- For every s \in S we have p(s) \geq 0.

2- \sum_{s \in S}p(s) = 1

The first property is called non-negativity. The second one is called normality. The intuition is that S represents a set of outcomes of some experiment, and p(s) is probability outcome of s, a member of S. An event is a subset of a probability space. The probability of an event is the sum of PMFs of elements of that event, insomuch as:

    \[ Pr\{E\} = \sum_{s \in S} p(s) \]

A random variable is a function, usually denoted by a capital letter, from a probability space to the real numbers:

    \[ X: S \rightarrow \boldsymbol{R}. \]

Keep in mind that the function X is not a variable, its a real-valued function. It’s not random either, X(s) is a single real number for any outcome s\inS.

A random variable is used to define events. For example, the set of outcome s for which X(s) = 1, that is, if ht and th are a set of strings representing heads or tails,

    \[ E = {s \in S : X(s) = 1} \]

and

    \[ = {ht, th} \]

is an event with probability \frac{1}{2}. We write this as Pr\{X=1\} = \frac{1}{2}, we use the predicate X=1 as a shorthand for the event defined by the predicate.

Let’s take a look at a piece of code for simulating the experiment represented by the formalities above:

 

headcount = 0
if (randb()): // first coin flip
    headcount++
if (randb()): // second coin flip
    headcount++
return headcount

Here we assume that ranb() is a Boolean function that returns true half the time. How is it related to our abstraction? Well, imagine the set S of all possible executions of the program, declaring two executions to be the same values returned by ranb are pairwise identical. This means that there are four possible program executions, in which the two ranb() invocations return TT, TF, FT, and FF. Our belief and experience is that these four executions are equally likely, that is, each occurs just about one quarter of the time.

Now, the analogy becomes clearer. The set of possible program executions, with associated probabilities, is a probability space. The variables in the program that depend on ranb invocations are random variables. I hope it’s all clear to understand now.

Let’s talk about the expected value, also called the mean. It’s basically the summation of the product of PMF and the random variable:

    \[ E[X] = \sum_{s\inS} p(s)X(s) \]

Imagine h is heads and t is tails. We saw ht and th. There’s also hh and tt. So the expected value of it would be:

    \[ E[X] = p(hh)X(hh) + p(ht)X(ht) + p(th)X(th) + p(tt)X(tt) \]

    \[ = \frac{1}{4}. 2 +\frac{1}{4} . 1 + \frac{1}{4} . 1 + \frac{1}{4} .0 \]

    \[ = 1 \text{QED} \]

You may wonder where we got the Xs from. Well, something I meant to tell you is that X is diminutive, meaning that we shall assign the value to it ourselves. Here, we have assigned 1 to h and 0 to t. The X(hh) is 2 because it has 2 hs.

Let’s talk about distribution. Probability distribution is a function that provides the probabilities of occurrences of different possible outcomes of an event (this is the Wikipedia wording because it’s 7AM and I’m toast and I can’t word it myself).

When we say random variable X has the distribution f we shall denote X \sim f.

The dispersion of values clustered around X is denoted by variance of it, and is defined as:

    \[ \boldsymbol{Var}[X] = E[(X - \bar{X})^2] \]

Where \bar{X} is the mean, or average of X. I won’t insult your intelligence by defining the average. Geez!

\sqrt{\boldsymbol{Var}} is called standard deviation. The random variables X and Y are called independent if:

    \[ Pr\{X = x \text{ and } Y = y\} = Pr\{X = x\}.Pr\{Y = y\} \]

The important properties of independent random variables are:

1)

    \[ E[XY] = E[X]E[Y] \]

2)

    \[ \boldsymbol{Var}[X + Y] = \boldsymbol{Var}[X] + \boldsymbol{Var}[Y] \]

When I started talking about probability, I talked about continuous probability vs. discrete probability. We covered discrete probability. Now to cover the difference between continuous probability and discrete probability:

1) Values are continuous. Meaning that the numbers are infinite.

2) Certain aspects of the analysis involves mathematical subtleties like measurablity.

3) Our probability space will be infinite. We must use probability density function instead of PMF.

PDF properties are:

1- For every s \in S we have p(s) \geq 0.

2- \int_{s\inS}p(s) = 1

But if the distribution of S is uniform, the PDF is defined as:

    \[ p(x) = \left\{ \begin{array}{lr}\frac{1}{b-a} & \text{ for } a \leq b \\ 0 & \text{ for } x < a \text{ or } x > b \end{array}\]

In continuous probability, E[X] is defined as:

    \[ E[X] := \int_{s\inS} p(s)X(s) \]

Now let’s compare the definitions of PMF and PDF:

    \[ \mathbb{PMF} \rightarrow p_y(t) = Pr\{Y = t\} \text{ for } t \in T \]

    \[ \mathbb{PDF} \rightarrow Pr\{a\leq X \leq b\} = \int_a^bp(r)dr \]

In continuous probability, random variables are better to be called random points. Because if S is a probability space, and Y : S \rightarrow T is a mapping to another space rather than \mathbb{R}, then we must call Y a random point rather than a random variable. The notion of probability density applies here as you can consider for any U \subset T we have:

    \[ Pr\{Y\in U} = \int_{u \in U} P_Y(u)du \]

Now let’s apply what we’ve learned to the sphere. A sphere has three coordinates, longitude, latitude, and colatitude. We only use longitude and colatitude in a \mathbb{R}^2, 2D Carthesian coordinates applied to random variable S, turning it into S^2. The particularization of it being:

    \[ Y : [0, 1] \times [0, 1] \rightarrow S^2 : (u, v) \rightarrow (\cos(2\pi u)\sin(\pi v), \cos(\pi v) \sin( 2\pi u) sin(\pi v)) \]

We’ll start with the uniform probability density p on [0, 1] \times [0, 1] or p(u, v) = 1. Scroll up to see the formula for uniform probability density. For notational convenience, well write (x, y, z) = Y(u, v).

We have the intuitive sense that if we pick points uniformly, randomly in the unit square and use f to convert them to points in the unit sphere, they’ll be clustered near the pole. This means that the induced probability density on T will not be uniform. The image below shows this.

We’ll now talk about ways to estimate the expected value of a continuous random variable, and using these to evaluate integrals. This is important because in rendering, we’re going to want to evaluate the reflectence integral:

    \[ L^{ref}(P, \omega_o) = \int_{\omega_i \in S_{+}^{2}}L(P, - \omega_i)f_s(P,\omega_i,\omega_0)\omega_i . \boldsymbol{n}d\omega_i, \]

for various value of P and \omega_0. \omega is the direction of incoming light. A code that generates a random number uniformly distributed on [0, 1] and takes square root produces value between 0 and 1. If we use the PDF on it, as it is a uniform value, the expected value is \frac{2}{3}. This also happens to be the average value of f(x) = \sqrt{x} on that interval. What does it all mean?

Let’s take a look at Theorem 3.48 of the book. It states that if f : [a, b] \rightarrow \mathbb{R} is a real-valued function, and X \sim \boldsymbol{U}(a, b) is a uniform random variable on the interval [a, b], then (b-a)f(x) is a random variable whose expected value is:

    \[ E[(b-a)f(x)] = \int_a^b f(x)dx . \]

What does it say? It says that we can use a randomized algorithm to compute the value of an integral, at least if we run the code often enough and average the results.

In general, we’ve got some quantity C, like the integral we saw above, that we’d like to evaluate, and some randomized algorithm that produces an estimate of C. We call such a random variable and estimator for the quantity. The estimator is unbiased if its expected value is actually C. Generally, unbiased estimators are to be preferred over biased ones.

We talked about discrete and continuous probabilities. There’s a third one, which we shall call mixed probabilities, that comes up in rendering. They arise exactly from the impulses in bidirectional scattering distribution functions, which we shall talk about one day (perhaps, if I’m alive tomorrow) or the impulses caused by point lights. These are probabilities that are defined on a continuum, like interval [0, 1], but are not defined strictly by PDF. Consider the following program:

if uniform(0, 1) > 0.6 :
    return 0.3
else :
    return uniform(0, 1)

Sixty percent of the time this program reports 0.3, the other 40% of the time it returns a value uniformly distributed on [0, 1]. The return value is a random variable that has a probability mass of 0.6 at 0.3, and a PDF given by d(x) = 0.4 at all other points. We shall define the PDF as:

    \[ d(x) = \left\{ \begin{array}{lr}0.4 & \text{ for } x \neq 0.3 \\ 0.6\times\infty & \text{ for } x = 0.3 \end{array}\]

Overall, a random variable with mixed probability is one for which there is a finite set of points in the domain at which the PDF is defined, and vice versa, uniform points where PMF is undefined.

Wow that was a long post! But I’ve learned so much. This blog has four focus subjects, and that’s a lot. But it’s the summer, and until the universities open, I have a lot of free time on my hands. So I’ll try to post um, about 1~2 blog posts about graphics programming a week.

Did I mention that I have a Fiverr gig now? Well , I do! Here’s the link. Enjoy your day, while I ponder about what shall I learn next!

If Nearest Neighbor tires you, what would SVM do?” this sentence paraphrases Jeremiah 12:5; or it alludes to 1971 Christexploitation movie by Ron Ormond. Either way, if you’re stuck at NN, how are you going to deal with Support Vector Machines?!

Never fret. I’m here to paraphrase and quote from E. R. Davies’ Computer Vision: Principles, Algorithms, Application. I’ve sucked up to E. R. Davies before, I think he’s a wonderful person, and his contributions to the field of computer and machine vision and pattern recognition shan’t be let adrift from out minds. His books are perfect because he’s too… industrial. He’s not about theory, he’s all about application!

This post looks at classification through vision. So you’ve been warned.

When the objects that appear in an image have simple shapes, just one stage of processing may be required, – as in the case of circular washers on a conveyor belt. For more complex objects such as flat brackets and hinges, locating them requires at least two stages – as when graph matching methods are used. For situations where the full complexity of three dimensions occurs, more subtle procedures are usually required. Indeed, the very ambiguity involved in interpreting 2-D images from a set of 3-D objects generally requires cues to be sought and hypotheses to be proposed before any serious attempt can be made with the task. Thus, cues are vital to keying into complex data structures of many images. However, for simple situations, concentration on small features is valuable in permitting image interpretation to be carried out efficiently and rapidly. Neither must it be forgotten that in many applications of computer vision, the task is made simpler by the fact that the main interest lies in specific types of object.

In practical situations, measurements of prominent features allow most objects to be classified straightforwardly. In fact, this is most commonly achieved with varying degrees of certainty, but by comparing object features with those of a great many other known objects, we arrive at classifcations that are statistically the most likely ones. Hence, this sort of classification procedure is called SPR.

A good number of early pattern recognition techniques and algorithms followed the SPR approach without proceeding to the next stage – that of determing the solution that is mathematically the most probable one. Thus, an important goal has been to move the sujbects from SPR to PPR → Probable Pattern Recognition. Indeed, the key aim of ML is to progress towards probabilistic pattern recognition. In this blog post, we’ll study SPR. We’ve already discussed PPR in the last post, one of the aspects of it, at least, called the EM algorithm. We’ll unravel more aspects of PPR later.

Let’s talk about NN: Nearest Neighbor algorithm. The principle of the NN algorithm is that of comparing input image patterns against a number of paradigms and then classifying them according to the class of the paradigm that gives the closest match. Below yo see a principle component analyzed dataset wherein the dataset, in blue, is closer in paradigm to green rather than red.

 

An instructive but rather trivial example is shown below:

Here, a number of binary patterns are presented to the computer in the training phase of the algorithm: then the test patterns are presented one at a time and compared bit by bit against each of the training patterns. It is clear that this gives a generally reasonable result, the main problems arise when 1) training patterns of different classes are close together in Hamming distance – Hamming distance begin the distance between the image and its anomalies – and 2) minor translations, rotations, or noise cause variations that inhibit accurate recognition. More generally, problem 2 means that the training patterns are insufficiently representative of what will appear during the test phase. The latter statement encapsulates an exceptionally important principle, and it implies that there must be sufficient patterns in the training set for the algorithm to be able to generalize over all possible patterns of each class. However, problem 1 implies that patterns of two different classes may in some case be so similar as to be indistinguishable by any algorithm; and then it is inevitable that erroneous classifications will be made. It is seen below that this is because the underlying distribution in feature space overlap.

The basis of Bayes’ decision theory will now be examined. If we are trying to get a computer to classify objects a sound approach is to get it to measure some prominent feature of each object such as its length and to use this feature as an aid to classification. Sometimes, such a feature may give very little indication of the pattern class – perhaps because of the effects of manufacturing variation. For example, a hand-written character may be so ill formed that its features are of little help in interpreting it; it then becomes much more reliable to make use of the known relative frequencies of letters, or to invoke context. In fact, either of these strategies can give a greatly increased probability of correct interpretation. In other words, when feature measurement are found to giving an error rate above a certain threshold, it is more reliable to employ a priori probability of a given pattern appearing.

The next step in improving recognition performance is to combine the information from feature measurements and from a priori probabilities, this is achieved by applying Bayes’ rule. For a single feature x, this takes the form:

    \[ P(C_i|x) = \frac{p(x|C_i)P(C_i)}{p(x)} \]

Where:

    \[ p(x) = \sum_{j}p(x|C_j)P(C_j) \]

The vertical line | denotes conditional probability.

Mathematically, the variables here are 1) the a priori probability of class Ci, P(Ci); 2) the probability density for feature x, p(x); 3) the class-conditional probability density for feature x in class Ci, p(x|Ci), and 4) the a posteriori probability of class Ci when x is observed, P(Ci|x).

But what is a priori? What is a posteriori? Let this image I’ve created explain it all!

 

The notation P(Ci|x) is a standard one, being defined as the probability that the class is Ci when the feature is known to have the value x. Bayes’ rule says that to find the class of an object, we need to know two sets of information about the objects that might be viewed: the first is the basic probability P(Ci) that a particular class might arise; the second is the distribution of values of the feature x for each class.

Many common image analysis techniques give features that may be used to help identify or classify object. Such as area of the object, its perimeter, the number of holes it possesses, and so on. Generally, increasing number of features helps to resolve the classification problem more quickly. In the figure below, the first image has a much lower error rate than the second image. Keep in mind the cardinal rule of machine learning: you can’t have an error rate of 0!

Many classification methods, including NN and Bayes’ classifier, can involve substantial amounts of storage and computation the amount of training is to be sufficient to achieve low error rates. Hence, there is considerable value in employing methods that minimize computation cost. This is the goal of our lovely and stout Naïve Bayes’ classifier.

In Naïve Bayes’ algorithm, we multiply the a priori of different features and thus get a a posteriori with low error rate:

    \[ p(x|C_i)P(C_i) = p(x_1|C_i)p(x_2|Ci)\ldots p(x_N|C_i).P(C_i) = \prod_{j}p(x_j|C_i).P(C_i) \]


Let’s talk about the optimum number of features in an image. An important factor to consider is hat the optimum number of features depends on the amount of training a classifier receives. If the number of training set patterns is increased, more evidence is available to support the determination of a greater number of features, and hence to provide more accurate classification of test patterns.

Alright. Probability is nice and all but we shan’t go around all willy nilly – we need to calculate the cost of our mistakes! This is where the cost function comes into play. First, let’s take a look at a function known as the conditional risk function:

    \[ R(C_i|x) = \sum_j L(C_i|C_j)P(C_j|x) \]

This function expresses the expected cost of deciding on class Ci when x is observed. As it is wished to minimize this function, we decide on class Ci only if:

    \[ R(C_i|x) < R(C_i|x) \text{for all} j \neq u \]

If we were to choose a particularily simple cost function, of the form:

 

    \[L(C_i|C_j) = \left\{\begin{array}{lr}0, & \text{for } i=j\\1, & \text{for }i\neqj\\\end{array} \]

Cost functions permit classifications to be biased in favor oa safe decision in a rigorous, predetermined, and controlled manner. One way of minimizing costs is for the classifier to recognize when it is “doubtful” about a particular classification, because two or more classes are most equally likely. Then, one solution is to make a safe decision, the decision plane in feature space being biased away from its position for maximum probability classification. An alternative is to reject the pattern, i.e. place it into an “unknown” category, in that case, some other means can be employed for making appropriate classification.

Alright buckaroos! That is it for today! Of course one topic remains, and that’s SVMs: Support Vector Machines. But they, along with clustering, require a post of their own.

I’m kinda tired today… And I’m working on this massive project, which is to be revealed soon! Very, very, very soon! So buckle up, because when Chubak rides the bus, you can bet your butt it’s gonna be a bumpy ride!

Oh boy, I’m on fire yo! Two posts in one day… To be fair, I wrote the last post yesterday. Anyhow, I don’t think anyone cares. But you reap what you sow. By the time I’m finished with this post my knowledge will be richer, and more robust. So what am I complaining about? As I said before I write to learn. So be it!

But first, let’s talk about one of my heroes, E. Roy Davies. He’s one of the forerunners of computer and machine vision, a subset of pattern recognition – one of the three things my profession is about. This post is based on a few chapters of his opus, Computer Vision: Algorithms, Applications, Learning.

Handsome Devil Behind It All

Alright. Enough sucking up to the masters of the field. Let’s talk about Artificial Neural Networks.

ANNs were launched off in the 1950s, and continued well into the 60s, and research into them still continues to this day. Beldoe and Browning developed n-tuple type of classifier that involved bitwise recording and lookup of binary feature data, leading to weightless or logical type of ANN. Rosenblatt’s perceptron, although, was more important than this algorithm. Let’s talk about this “perceptron”.

The simple preceptron is a linear classifier that classifies patterns into two classes. It takes a feature vector, x = (x1, x2, …, xN) as its input and produces a single scalar output sum_{i = 1}^{N} w_i x_i, the classification process being completed by applying a threshold function at \theta. The mathematics is simplified by writing - \theta as w0, and taking it to correspond to an input x0 which is maintained at a constant value of unity. The output of the linear part of the classified is then written in the form:

d = sum_{i = 1}^{N} w_i x_i - \theta = sum_{i = 1}^{N} w_i x_i + w_0 = um_{i = 1}^{N} w_i x_i

and the final output of the classifier is given by:

y = f(d) f \left( sum_{i = 1}^{N} w_i x_i \right)

This type of neuron – which as we said before, is called preceptron, can be trained using a variety of procedures, such as the fixed increment rule. The basic concept of this algorithm was to try to improve the overall error rate by moving the linear disciminant planea fixed distance toward a position where no misclassifcation would occur – by only doing this when a classification error has occurred.

w_i(k + 1) = w_i(k) \t y(k) = \omega(k)

w_i(k + 1) = w_i(k) + \eta[\omega(k) - y(k)]x_i(k) y(k) \neq \omega(k)

Let me interpret all this mumbo jumbo. What it basically means. First, take a look at this image, courtesy of Towards Data Science:

 

 

So you can clearly see what happens. A bunch of inputs are given, then they are weighted by their threshold – the hyperplane if you will. Then they are summed up. If this number is less than a value, it’s one binary choice, if it’s greater than the value, it’s another binary choice. This is a Heaviside function.

The photo below you shows the difference between separable and non-separable data. The hyperplane could do so much more if the data are separable!

 

 

So far, our preceptrons have been single-layered. Meaning that only one layer of hyperplanes can be ousted. The other concept is multilayer preceptron, or MLP. Rosenblatt himself suggested such networks, but was unable to work out his magic and represent them. Ten years later, in 1969, Minsky and Papert published their famous monograph in which they discussed MLP. It wasn’t until 1986 that Rumbelhart et al were successful in proposing a systemic approach to training of MLPs. Their solution is known as the back-propagation algorithm.

Let’s talk about it.

The problem of trainign a MLP can be simply stated as: a general layer of a MLP obtains its first feature data from the lower layers and receives its class data from higher levels. Henece, if all the weights in the MLP are potentially changeable, the information reaching a particular layer cannot be relied upon. There is no reason why training a layer in isolation would lead to overall convergence of the MLP toward and ideal classifier. Although it might be thought that this is a rather minor difficulty, in fact, this is not so; indeed, this is but one example of the so-called credit assignment problem. What is this problem? Well, it’s correctly determinign the local origins or global properties and making the right assignment of rewards, punishments, corrections, and so on.

The key to solving these problems was to modify the preceptron cwomposing the MLP by giving them a less hard activation function that the Heaviside thresholding function whch results in \theta and we call its negation w0, the hyperplane – we give started changing the Heaviside function to a sigmoid shape, such as the tanh function.

Once these softer activation functions were used, it became possible for each layer of the MLP to feel the data more precisely, and thus training procedures could be set up on a systematic basis. In particular, the rate of change of the data at each individual neuron could be communicated to other layers that could then be trained appropriately – though only on an incremntal basis. I’m not going to bore you with the mathematical details, just some points regarding the algorithm:

1) The outputs of one node are the inputs of the next, and an arbitrary choice is made to label all variables as output (y) parameters rather than input (x) variables, all output parameters are in the range 0 to 1 (because of tanh, duh!)

2) The class parameter \omega has been generalized as the target value t of the output variable y.

3) For all except the final outputs, the quantity of \delta_j has to be calculated using the formula \delta_j = y_j (1 - y_j)(\sum_{\delta_m}{w_{jm}}), the summation having to be taken over all the nodes in the layer above node j.

4) The sequence for computing the node weights involves starting with the output nodes and the proceeding downwards one layer at the time.

In the figure below you can see the difference between Heaviside activator, linear activator, and Sigmoid activator.

I have prepared a short video for people who are visual learners. Let’s hope it helps them:

 

Ok. That’s it! I’m not sure what the next post is going to be about. But a little birdie tells me it’s going to be about Deep Learning! So if you really, really read my blog, look out for it!

What am I going to do? Well, first I’m going to a casino and use my X-Ray Auto Blackjack Aviator Specs to win some hands. Then I’m gonna read chapter 14 of Davies books. It’s a refresher on basic Machine learning concepts. I hope I don’t fall asleep, I’ve been awake for 14 hours!

Hiyaaaa! It’s been almost a week. I know if I’m going to post my mundane mendacity, which all bring me much satiety, in form of a blog post in an arbitrary time pattern, it’ll take a machine learning algorithm to predict whenever I’m going to make one next, but wait! Let’s drop every mundane hackneyed thing I was gonna talk about and stick with machine learning!

I fell in love with machine learning a few years back, and I just recently bought one of the most intricate books in this subject, MIT Press’s Foundations of Machine Learning, written by two Persians and one Indian. Well whilst they are waiting at the TSA checkpoints, let us not fret and make a series of posts on the damn book! It would be my privilege, nay, my honor! First episode: PAC framework. I don’t mean to credit for what I’ve not written myself, so you should know that almost everything is taken from this book. If you enjoy this post, please purchase this book. Of course it’s not the entire book, because that would be illegal, it’s just tidbits from the book. I did so because I can’t learn without a purpose, I need can’t do passive learning in my brain, learning’s gotta be active – meaning I shan’t be content with learning something without doing something along with it, so I make blog posts about it! But I have made a visualization of PAC, in Cinema 4D, which you can see when we get to it! Well, at least I’m not a content pirate! (Yeah, Chubak, keep telling yourself that…).

Let the explanations commence!

When we are designing a machine learning algorithm, several fundamental questions shall occupy a healthy man’s mind. Such as the efficiency of what can be learned, the fact that somethings are inherently hard or easy to learn, how many examples are needed, that if there’s a general model for learning, and so on and so forth. In this episode, we begin by formalizing an answer to these questions, in the form of PAC framework.

PAC stands for Probably Approximately Correct. The PAC framework helps define the class of learnable concepts in terms of number of sample points needed to achieve an approximate solution. Meaning that it introduces a sample complexity and the time and space complexity of the learning algorithm, all in one package. Imagine the normal attributes for a machine learning algorithm as some sort of a Chinese food, all in separate packages, and this so-called PAC as a nice Turkey sandwich you can take to work with.

Let’s first introduce several definitions and the notation needed to represent the PAC model, which we’ll also use in later posts as well.

We denote the set of all possible examples or instances as 𝒳, which is also sometimes referred to as the input space. The set of all possible labels or target values is denoted by ყ. In this episode ყ is limited to 0 and 1, which corresponds to the so-called binary classification.

𝒳 maps to ყ; x \rightarrow y. Let’s call this mapping a concept, or Շ. You can see how we denote it in equation 1-1. Since ყ = {0, 1}, we can identify Շ as a subset of 𝒳, in which Շ is either 0 or 1. 𝒜 concept may be a triangle, or a rectangle, or a hexagon. You can see it more clearly in figure 1-1, but we’re not going to talk about figure 1-1 yet – I just said this to give you an idea.

Let’s assume that examples are i.i.d: independently and identically distributed according to some fixed but unknown distribution, 𝔇. The learning problem is then formulated as follows: The learner considers a fixed set of possible concepts, 𝓗, called a hypothesis set, which may or may not coincide with the concept, Շ. It receives a sample S = (x1, …, xm) that is i.i.d according to 𝔇 as well as the labels (Շ(x1), …, Շ(xm)), which are based on a specific target concept that is a member of Շ. The task is then to use the labeled sample S to select a hypothesis hs that is a member of 𝓗 that has a small generalization error with respect to the concept Շ. The generalization error of a hypothesis h that is a member of 𝓗, also referred to as risk or true error or simply, error of h and is denoted by R(h) and summed up in equation below.

R(h) = \underset{x \sim D}{P}[h(x) \neq c(x)] = \underset{x \sim D}{E}[1_{h(x) \neq c(x)}]

Let’s have a headcount. h(x) is the hypothesis of the input, Շ(x) is the concept of the input, 𝔇 is the distribution, and 1w is the indicator function of the even w. Indicator functions are 1 for all the elements of the function, and 0 for anything else. P is probability, and E is the expected value. In the upcoming visualization, the marbles which fall into the container are 1 and the marbles which fall out are 0.

The generalization error of a hypothesis is not directly accessible to the learner since both distribution 𝔇 and the target concept Շ are unknown. However, the learner can measure the empirical error of a hypothesis on the labeled sample S. You can see this empirical risk, for sample S = (x1, … , xm), formalized in equation:

\hat{R}_S(h) = \frac{1}{m} \sum_{i = 1}^{m}1_{h(x_i) \neq c(x_i) .

Thus, the empirical error of h ∈ 𝓗 is its average error over the sample S, while the generalization error is its expected error based on the distribution 𝔇. We can already note that for a fixed h ∈ 𝓗, the expectation of the empirical error based on an i.i.d sample S is equal to the generalization error, as you can see it notarized in equation below. Remember that m is the maximum index of the sample – hence, size of the sample.

\underset{S \sim D^m}{E} [\hat{R}_S(h) ] = R(h)

But what is this distribution? Distribution for a discretely-valued function like this is basically the list of all the probabilities for each outcome of 𝒳.

So let’s introduce PAC, or in other words, Probably Approximately Correct: Let n be a number such that the computational cost of representing any element, x ∈ 𝒳 is at most O(n) and denote by size(c) the maximal cost of the computational representation of c ∈ Շ. For example, 𝒳 may be a vector in Rn. For which the cost of any array-based representation would be in O(n). In addition, let hs denote the hypothesis returned by algorithm 𝒜 after received a labeled sample S. To keep notation simple, the dependency of hs on 𝒜 is not explicitly indicated.

So, a concept class Շ is said to be PAC-learnale if there exists an algorithm 𝒜 and a polynomial functiony poly(.,.,.,.) such that for any ε > 0 and > 0, for all distributions 𝔇 on 𝒳 and for any target concept c ∈ Շ, the following holds true for any sample size m ≥ poly(1/ ε, 1 /𝛿, n, size(c):

\underset{S \sim D^m} {P} [R(h_S) \leq \epsilon] \geq 1 - \delta

When such an algorithm 𝒜 exists, it’s said thatՇ has a PAC-learning algorithm.

A concept class Շ is thus PAC-learnable if the hypothesis returned by the algorithm 𝒜 after observing a number of points polynomial in 1/ ε and 1 /𝛿 is approximately correct – meaning the error is at most ε, with the high probability (at least 1 – 𝛿), which justifies the PAC terminology. The parameter 𝛿 > 0 is used to define confidence 1 – 𝛿 and ε > 0 the accuracy 1 – ε. So error rate must be between delta and epsilon. Note that if the running time of the algorithm is polynomial in 1 / ε and 1 / 𝛿, then the sample size m must also be polynomial if the full sample is received by the algorithm.

As you might have noticed, there are two things in PAC definition which correspond with its name. The probability, and the error rate. The first one is for the P in PAC, the second one is for the AC in PAC.

Several key points of the PAC definition are worth emphasizing. First, the PAC framework is a distribution-free model: no particular assumption is made about the distribution 𝔇 from which examples are drawn. Second, the training sample and the test examples used to define the error are drawn according to the same distribution 𝔇. This is a natural and necessary assumption for generalization to be possible in general. It can be relaxed to include favorable domain adaptation problems. Finally, the PAC framework deals with the questionm of learnability for a concept class Շ and not a particular concept. Note that the concept class Շ is known to the algorithm 𝒜, but of course the target concept c ∈ Շ is unknown.

Now let’s take a look at this visualization I have cooked up in Cinema 4D:

 

 

Figure 1-1

Let me explain how it works. The marbles are the samples. We strain the samples, and the ones that land inside the shared container between 𝓗 and Շ are the expectation of PAC, the correct part. The ones that fall out are the error, the probably part. And the ones that fall in the containers that aren’t shared are the approximate part.

Sometimes, the hypothesis hS returned by the algorithm is always consistent, that is, it admits no error on the training sample S. In this part of the blog post, we present a general sample complexity bound, or equivalently, a generalization bound, for consistent hypotheses, in the case where cardinality |𝓗| of the hypothesis set is infinite. Since we consider consistent hypotheses, we will assume that the target concept c is in 𝓗.

So, let 𝓗 be a finite set of functions mapping from 𝒳 to ყ – Let 𝒜 be an algorithm that for any target concept c ∈ 𝓗 and i.i.dsample S returns a consistent hypothesis hS : RS (hS) = 0. Then for any ε, 𝛿 > 0, the inequality:

\underset{S \sim D^m} {P} [R(h_S) \leq \epsilon] \geq 1 - \delta

holds if:

m \leq \frac{1}{\epsilon} \left(log|H| + log \frac{1}{\delta}\right)

The price to pay for coming up with a consistent algorithm is the use of a larger hypothesis set 𝓗 containing target concepts. Of course, the upper bound increases with |𝓗|, the cardinal rule of PAC. However, that dependency is only logarithmic. Note that the term log|𝓗|,o or the related term log2|𝓗| from which it differs by a constant factor, can be interpreted as the number of bits needed to present 𝓗. Thus the generalization, guarantee of the theorem is controlled by the ratio of this number of bits, log2|𝓗|, and the sample size m.

Along with consistent hypotheses, we also have inconsistent hypotheses. Some argue that they aren’t useful, but dare I say that they may be? I’m not sure.

Let’s talk about Baye’s error. Given a distribution 𝔇 over 𝒳 ×ყ , , the Baye’s error R* is defined as:

R^\star = \underset{h - measurable}{inf} R(h)

Finally, let’s talk about noise. Noise is the minimum of the conditional probability of ყ with 𝒳.

Well, that’s it for today! My next blog post is going to be about something completely unrelated to machine learning, yes, another signal processing post! Keep in mind that I write to learn, but if I appease you, so be it!

If you have any questions, ask, and I will try my best to answer. If you see any errors in this post, tell me and I’ll add it to the addendum just below this very line. Thank you for reading my blog post!

As I’m writing this, the following video is being rendered. Framator is a very simple plugin, born out of lack of ideas, and it creates four rectangles around your screen. Now, you can go ahead and use them as the default frame settings, or play around with them and create some stunning Mograph.  

If you came here from the video, you’ll probably know that I, Chubak Bidpaa, have created this plugin. The parameters are self explanatory, but I’ll just give them a quick review just in case:

Group Uses
Height and Width The Height and the Width of the Rectangles
Offsets The position of the rectangles
Colors The color of the gradient
Angle The angle of the gradient
Stops The stops of the gradient

If you want an effect similar to mine, just use time in the angle’s expression box.

No more loitering. Here’s the download link.

Donation

I’m not sure if anyone wants to donate via Bitcoin, but in case you’ve enjoyed my plugin, and wish to give something back in return, use this Wallet to donate:

bitcoincash:qrt4fx3npgef7yczj6cxzfdfrtqztzekjqdf52rtgj

Thank you. Thank you all. Chubak Out.

The following article is based on a journal by Jorge Jimenez, Jose J. Echevarria, Tiago Sousa and Diego Gutierrez.

You can view their SMAA demo here. It runs fine on my GTX 960 2GB.

Old Methods of Antialiasing

For years, MSAA (Multisampling Antialiasing) and SSAA (Supersampling Antialiasing) have been de facto the methods of antialiasing. In fact, these two still retain the highest quality amongst the various modern antialiasing methods. As we know, aliasing is caused by the lack of samples, in spatial level (jagged lines) and in temporal level (flickering), usually around the edges and high/low contrast regions of the picture. To battle these, we have two main ways that were once the only way around, Supersampling, and Multisampling:  In Supersampling, we blow up the picture, then downsampling for the final resolution. It works fine because, as my uncle puts it, it’s a pincer attack, because it covers every basi s of the problem and surrounds it. Multisampling is similarly pincer-ish. In this method, each sample gets duplicated based on the given coefficient. In today’s high resolutions, it would require a rather fiendish graphics card to achieve this. Therefore, we need new methods of antialiasing, both in spatial, and temporal level. All these methods rely on one algorithm to do their job: edge detection algorithm. But they rely on other things as well.

 

Modern Ways of Antialiasing

There are many modern filter-based methods for antialiasing which all, although inferior to the former and latter, do their job. FXAA,  DEAA, GPAA, GBAA, CSAA, EQAA, DLAA… In this article, we’ll talk about SMAA, and its predecessor, MLAA. These modern filter-based methods have their own problems:

  • Most edge detection methods, which are the basis of these methods, only take into account numerical differences between pixels, ignoring how they appear to the viewer.
  • The original shape of the object is not always preserved, an overall rounding of corners is most of the times clearly visible in text, sharp corners, and subpixel features.
  • Most approaches are designed to handle horizontal or vertical patterns only, ignoring the vericals.
  • Real subpixel features and subpixel motion are not properly handled. Specular and shading aliasing is not completely removed.

You’ve guessed right: We raise these issues because aim to decimate them.

Morphological Antialiasing (MLAA)

MLAA tries to estimate the coverage of the original geometry. To accurately rasterize an anitalised triangle, the coverage area for each pixel inside the triangle must be calculated to blend it properly with the background. MLAA begins the image without antialiasing, and it reverses the process by re-vectorizing the silhouettes, in order to estimate such coverage areas. Then, since the background cannot be known after rasterization, MLAA blends with a neighbor, assuming that its value is similar with the original background. In other words, The algorithm detects borders (either using color or depth information) and then finds specific patterns in these. Anti-aliasing is achieved by blending pixels in the borders intelligently. MLAA has implementations in DirectX 10 and Mono Game (XNA). Games such as Fable II use it faithfully. From the creators of MLAA, comes SMAA, or Enhanced Subpixel Morphological Antialiasing, which is the main point of our post.

MLAA in action

Enhanced Subpixel Morphological Antialiasing (SMAA)

 

Comparison between SMAA and other methods in Crysis 2

SMAA offers reliable edge detection, and a simple and effective way to handle sharp geometric features and diagonal lines. Besides, SMAA doesn’t change the shape of the geometry, as many other methods do.

 

Top) No AA

Middle) MLAA

Bottom) SMAA

SMAA builds on MLAA pipeline, improving or redefining at every step. In particular, the edge detection is improved by using color information with local contrast adaptation for cleaner edges. It extends the number of patterns handled for preservation of sharp geometric features and diagonals. And lastly, it shows how morphological antialiasing can be accurately combined with multisampling or supersampling and temporal reprojection.

Edge Detection

Edge detection is vital, because undetected edges remain aliased. On the other hand, too many filtered edges can reduce the quality of the antialiased image. Different information can be used to detect he edge: Chroma, luma, depth, surface normal, or a combination of them. For four reasons, SMAA uses luma:

  1.   Less artifacts.
  2. Luma is always visible.
  3. It can handle shading aliasing.
  4. And finally, it’s faster than chroma.

 

 

Left and Center: Other Methods of Edge Detection, causing red crossing and artifacts

Right: SMAA edges, clean as a whistle

Have this image in mind. Here’s how edge detection works: the final calculated value is a boolean called left edge boundary. Boolean values for the top edge is similarly calculated. The formula is:

c_{max }= max\left(c_l, c_r, c_b, c_i, c_{2l}\right)

e_l^\prime =  e_l \wedge c_l > 0.5.c_{max}

All the c values are called contrast deltas.

Pattern Handling

SMAA pattern detection allows to preserve sharp geometric features like corners, deals with diagonal and enables accurate distance searches.

Sharp Geometric Features: The re-vectorization of silhouettes in MLAA tensd to round corners. To avoid this, SMAA makes the observation that crossing the edges in contour lines have a maximum size of oen pixel, whereas for sharp corners this length will most likely be longer. Thusly, SMAA fetches two-pixel-long crossing edges instead, this allows less aggressive corners processing.

Diagonal Patterns: We introduce a novel diogonal pattern detection. It consists of the following steps:

  1. Search for the diagonal distance d_l and d_r to the left and the right of the diagonal lines.
  2. Fetch the crossing edges e_1 and e_2.
  3. Use this input information, defining the specific diagonal patter, to access the precomputed area texture, yielding the areas a_t and and a_b.

If the diagonal pattern detection fails, then the orthogonal detection is triggered.

Accurate Distance Search: Key to pattern detection and classification is obtaining accurate edge distance (lengths to both end of the lien) MLAA makes extensive use of hardware interpolation to speed up this process. Hardware bilinear filtering can be used as a way of fetching and encoding up to four different values with a single memory access. This linear interpolation of two binary values (that is, bilinear) producing a single floating point value is shown as:

f_x(b_1,b_2,x) = x.b_1(1-x).b_

Where b_1 and b_2 are two binary values (either 0 or 1) and x is the interpolation value.

Results

MLAA works with a single sample per pixel. This translates to subsampling, which makes it impossible to recover real subpixel features.

 

MLAA vs SMAA vs No AA

SMAA, however, works in the subpixel level. This results in:

  • Local contrast
  • Diagonal pattern detection
  • Sharp geometric features
  • Accurate searches

You can view all these in the following image, with these features compared to other methods. In fact, SMAA can produce results close to SSAA 16x.

 

The overhead produced by each of the solutions is negligible. In particular, local contrast adaptation is only 0.08ms, the sharp geometric features detection adn accurate distance takes 0.01ms, and diagonals processing produces an overhead of 0.12ms. In short, SMAA is rather fast, slower than SSAA and MSAA, but more fruitful, and less resource-intensive. 

Well, thanks for reading the article! And thanks for the writers of the journal which I used for the majority of the article. I hope you guys have a good day, and also, go on reading scientific articles on your own. It’s simple, just head to libgen.io and search for what you like — not necessarily graphics programming. Read, read and read! Don’t watch too many Youtube tutorials, it kills your senses. I can’t stress that enough. I don’t intend to tell you what to do, these are all just suggestions. I’m currently studying Structured Computer Organization and enjoying it very, very much. I recommend it for everyone, even as bedtime reading.

Please, please, please tweet my blog, introduce it to your friends, and share it to people whom you want to enjoy life, and learn about graphics programming. Thank you, thanks a lot. Chubak out.

Radiosity: Hauntingly Beautiful

I saw “Radiosity Maps” option in Cinema 4D many years ago, but I didn’t understand what it was. Now, I understand, and I want you, faithful reader, to understand them as well. I’m basing this post on an old article I found on Libgen — an article from 1986 the trio of Donald P. Greenberg, Micheal F. Cohen, and Kenneth E. Torrance. I hope they don’t mind me basing my blog post on their article from 34 years ago!

We know that there’s three categories of light: ambient, diffuse, and specular. In Phong shading, which we talked about before, a simple formula is used to simulate lighting in the given scene. This, however, is not sufficient. It is enough for real-time rendering, where performance is the main concern, but when time is not of concern, we can use more complex formulae to calculate the lighting in our scene.

In 1979, Whitted published a paper in which he explained and talked about ray tracing, a method for producing images of excellent quality.

 

Raytracing, a noble way for noble men.

However, raytracing procedure is limited and can only model intra-environment reflectionss in specular direction. Additionally, shadows always exhibit sharp boundaries, and plus, since raytracing is view-dependant, for every view, there must be a new pass.

This, however, is not true of the radiosity method. This method renders Global Illumination independent of views.

In SIGGRAPH convention of 1984, Lady Cindy Goral exhibited a metohd that sparked a lot of interest and turned quite a few heads: Derived from thermal engineering, the field of Heat Transfer, they revealed a new Global Illumination method based on energy equilibrium, and called it the Radiosity Method.

 

The difference between Direct Illumination, i.e. Phong Lighting, and Radiosity is evident in the given picture. Bright lights, no more of the demented color bleeding which are caused by diffuse reflections, and softer shadows.

Two other papers were published in 1985 which introduced concepts such as hemi-cube which we’ll discuss later. For now, let’s talk about Radiosity, and what it exactly really is.

Radiosity Formulae

 The radiosity method explains energy equilibrium inside an enclosure. The light leaving the surface (its radiosity) consists of self-emitted light and reflected or transmitted incident light. The amount of light arriving at a surface requires complete specification of geometric relationships among all reflecting and transmitting surfaces, along with the light leaving every other surface. The formula of this relationship is:

Formula 1: Radiosity essence

B_i = E_i + \rho_i \sum_{j=1}^{N} B_j F_{ij} | i = 1 to N

Factors and coefficients are as follows:

Radiosity (B): The total rate of energy leaving a surface.

Emission (E): The rate of energy (light) emitted from a surface

Reflectivity (\rho): The fraction of incident light which is reflected back into the environment

Form-facor (F): The fraction of the energy leaving one surface which lands on another surface.

N = the number of discrete surfaces of patches.

What does this equation state, you may ask? Well, from this equation we understand that the amount of nergy leaving a particular surface is equal to the self-emitted light plus the reflected light. The reflected light is equal to the light leaving every other surface multiplied by both the fraction of that light which reaches the surface in question, and the reflectivity of the receiving surface.

We said that the form-factor is the fraction of the energy leaving one surface which lands on another surface. The formula for this is:

Formula 2: Form-factor essence

F_{ij} = \frac{1}{A_i} \int_{A_i} \int_{A_j} \frac{\cos\theta_i\cos\theta_j}{\pi r^2} HID_{ij}dA_jDA_i

Factors and coefficients can be seen in this figure:

 

Also, HID is a function which calculates the area of i and j.

The Hemi-Cube Method

We talked about the Hemi-Cube, but what is it? The Hemi-Cube Algorithm provides a numerical integration (you know, integrals) technique for evaluating Formula 2.  Instead of projecting into sphere, which is the normal procedure for lighting a scene, an imaginary cube is constructed around the center of the receiving patch. The environment is transformed to set the patch’s center at the origin with patch’s normal coinciding with the positive Z-axis (in other words, the tangent space!).

 

Then, the Hemi-Cube is divided into orhogonal mesh of pixels (the article puts “pixels” in quotes, as it was a novel word at the time!) at any desired resolution.

 

The Hemi-Cube divided into these so-called pixels… Witchcraft!

Ipso facto, the total value of the form-factor from the patch at the center of the hemi-cube to any patch j can be determined by the summation of these pixels.

Radiosity… GI?

As we’ve learned, Radiosity is a subset of GI, but it’s different from raytracing. Today, most 3D applications use Monte Carlo for their GI engine, which retains some aspects of Radiosity. Examples include Cinema 4D. It used to have Radiosity, but it changed it to GI, so people, mostly C4D users, think GI and Radiosity are different things.

Well, that is it! Dear reader, I appreciate your keen interest in my blog. Please leave a comment. If I’ve made an error, please alert me. I’m not perfect, I make mistakes.

Another thing. Please, if you’ve liked my blog, tweet it to your friends so they can enjoy it as well. Even if it marks you as the nerd of the group, please do it, chicks dig nerds these days. If you have a girlfriend, just take some MDMA and read my blog to her out loud, she’ll return the favor, I rather not say how. Besides tweeting you can reddit my blog. I post my blog in all the relevant subreddits, but I may miss one or two, or a million. So please, help me grow my blog’s readership. Thanks, Chubk.

What is a shadow, even?

Well, let me start off by saying that I love shadows more than light, and I have chosen the one room in the house where every being of the room occludes light. Occludes light… Hmm… How do we know what’s behind the light frustum, and what’s in front of the light frustum, or frusta?

 

Let us scrutinize shadows first. What are shadows? When you shine a light on an object with a wall behind it, what happens?

Look at this picture:

 

Umbra: Full, hard-edged shadow.

Penumbra: Half, soft-edged shadow.

What happens if we take the light source furher and further away, until the light frustum become so wide that umbra does not seem feasible anymore? Then, umbra turns into antumbra, where different penumbras meet.

Shadow Mapping

The concept was popularized by Lance Williams in 1978. Lance Williams is also the person behind mip-mapping, which we’ll discuss in another post, moon-god-willingly. So what is Shadow Mapping? It’s by far the only feasible real-time shadowing solution, as raytracing is rather resource-intensive and volume shadows are not suiable for real-time rendering.

 

Such a happy man!

Native API support is also another reason to choose shadow mapping. GLSL, for example, has a native texture type for receiving the shadow map as a Sampler2D Uniform, which we’ll see. 

But what is shadow mapping?

 Imagine angle of the light is less than 180 degrees. If we choose camera position as the light position, camera direction as the direction of the light, and shine the light frustum on the scene then the resulting depth buffer is our shadow map.

 

Left: Light Frustum Shone on the Scene | Right: The Resulting Framebuffer becomes out shadow map
Depth as seen from a ligh

The moment of truth

We said all those things, and we showed you the result, but how exactly do we know that an object occludes light at a certain position? Let’s see what happens when it doesn’t and then we’ll see what happens when it does so.

Imagine this spot on our Utah Teapot:

 

Imagine if Z_{sm} is the depth value point on the teapot. And imagine Z_{ls} is the depth value of the length of the light source to the lit point.

 

There will not be a shadow in this case, because Z_{ls} = Z_{ms}. However, imagine if Z_{ms} was behind the teapot:

 

In this case, when Z_{ls} < Z_{ms} there this point is definitely in the shadow, and since each point is a fragment, this fragment should be shaded In an Octopus’ Garden in the Shade!

Again, I’m burying the lede, and I’m sorry for that. Octopus’ Garden in the Shade has nothing to do with shaders! Anyways, we must factor in a bias when we do compare the depths. Because if we don’t, as you see in the photo, there will be surface acne.

In a scene with dueling frasta, i.e. more than one light source and shadow map, the size of the shadow map shan’t be different from each other, otherwise, as a result, the shadow will be pixelated.

 

Shadow Map: Pros vs. Cons

Shadow Map Pros are:

  1. No matter how heavy they may look, they are still better for real-time rendering than the alternatives (e.g. RayTracing).
  2. You can adjust the size of the texture which holds the depth data, ultimately, add it as an option to your game, or demo.

 And the Cons are:

  1. Aliasing occurs in low-res textures.
  2. Textures are heavy and occupy a lot of memory.
  3. Effects of self-occlusion may be visible in the output as sparkling. This can be fixed by polygon offset.

 Finally, let’s implement it in the code using OpenGL.

OpenGL Implementation

With all that said, how exactly is this implemented in OpenGL? Here’s how. The following code is taken from OpenGL Superbible 7th Edition.

Listing 1: First, we create the shadow depth buffer.

 

GLuint shadow_buffer;
GLuint shadow_tex;

glGenFramebuffers(1, &amp;shadow_buffer);
glBindFramebuffer(GL_FRAMEBUFFER, shadow_buffer);

glGenTextures(1, &amp;shadow_tex);
glBindTexture(GL_TEXTURE_2D, shadow_tex);
glTexStorage2D(GL_TEXTURE_2D, 1, GL_DEPTH_COMPONENT32,
               DEPTH_TEX_WIDTH, DEPTH_TEX_HEIGHT);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_COMPARE_MODE,
                GL_COMPARE_REF_TO_TEXTURE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_COMPARE_FUNC, GL_LEQUAL);
glFramebufferTexture(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                     shadow_tex, 0);

glBindFramebuffer(GL_FRAMEBUFFER, 0);

Listing 2: Then, we create model-view-projection matrices, but this time, instead of the camera, we use the light!

 

vmath::mat4 model_matrix = vmath::rotate(currentTime, 0.0f, 1.0f, 0.0f);
vmath::mat4 light_view_matrix =
    vmath::lookat(light_pos,
                  vmath::vec3(0.0f),
                  vmath::vec3(0.0f, 1.0f, 0.0f);
vmath::mat4 light_proj_matrix =
   vmath::frustum(-1.0f, 1.0f, -1.0f, 1.0f,
                  1.0f, 1000.0f);
vmath::mat4 light_mvp_matrix = light_projection_matrix *
                               light_view_matrix * model_matrix;

Listing 3: Thusly, we generate a shadow matrix.

const vmath::mat4 scale_bias_matrix =
     vmath::mat4(vmath::vec4(0.5f, 0.0f, 0.0f, 0.0f),
                 vmath::vec4(0.0f, 0.5f, 0.0f, 0.0f),
                 vmath::vec4(0.0f, 0.0f, 0.5f, 0.0f),
                 vmath::vec4(0.5f, 0.5f, 0.5f, 1.0f));
vmath::mat4 shadow_matrix = scale_bias_matrix *
                            light_proj_matrix *
                            light_view_matrix *
                            model_matrix;

List 4: Nothing can be done without shaders, so we implement the vertex shader for this shadw. Not much different than any other vertex shaders.

#version 420 core

uniform mat4 mv_matrix;
uniform mat4 proj_matrix;
uniform mat4 shadow_matrix;
layout (location = 0) in vec4 position;

out VS_OUT
{
    vec4 shadow_coord;
} vs_out;

void main(void)
{
    gl_Position = proj_matrix * mv_matrix * position;
    vs_out.shadow_coord = shadow_matrix * position;
}

As you can see, we’ve got two model-view-projection matrices, one we use for ourselves, and one (the shadwo matrix) we pass to the fragment shader to see if each fragment is in the shadow, or in the light.

Listing 5: the fragment shader of it all.

#version 420 core

layout (location = 0) out vec4 color;

layout (binding = 0) uniform sampler2DShadow shadow_tex;

in VS_OUT
{
    vec4 shadow_coord;
} fs_in;

void main(void)
{
    color = textureProj(shadow_tex, fs_in.shadow_coord) * vec4(1.0);
}

That’s it! Notice something that we’ve never before seen in this blog, sampler2DShadow. It’s something recent, and not found in 3.3. This is why I say Superbible is better than LearnopenGl.com. It’s up to you, do you want something outdated written by a fan, or something up-to-date written by ARB? You choice!

 

Well, that is it for today! I might make another post, but it would be for the Europeans, because by the time I post my second entry of today, Americans will be asleep. A lot of people have visited my blog, this blog is not far from generating profit. So thank you! Don’t forget to leave a comment. I love each and everyone of you, if that’s not creepy. Really. I love everyone who reads the drivel I write. This gives me a feeling of self-importance. Anyways, before we have engaged in coitus, goodbye for now!