I know I promised a post on regression, but then I realized I only have a shallow understanding of Boosting and AdaBoost. So I biked to the nearest public library, went to the index cards, searched for ‘Boost’, and after perusing hundreds of self-help books, I found the greatest resource on AdaBoost: “How to Boost Your Spirit by Ada MacNally”. Nah, kidding. No such book exists. The following, as always, is my study notes, taken from Foundations of Machine Learning by MIT Press and Machine Learning in Action by Peter Harrington, published by Manning. The first book is heavy on math. The second book is more of a fluff book and is much simpler than the first one.

It is often difficult, for a non-trivial learning task, to directly devise an accurate algorithm satisfying the strong PAC-learning requirements that we saw in the PAC-learnable algorithm post I wrote before (link), but there can be more hope for finding simple predictors guaranteed only to perform slightly better than random. The following gives a formal definition of such weak learners. As in the PAC-learning post, we let $n$ be a number such that the computational cost of representing any element $x \in \mathcal{X}$ is at most $O(n)$ and denote by size(c) the maximal cost of the computational representation of $c \in \mathcal{C}$.

Let’s define weak learning. A concept class $\mathcal{C}$ is said to be weakly PAC-learnable if there exist an algorithm $\mathcal{A}$, $\gamma > 0$, and a polynomial function $poly(., . , .)$ such that for any $\delta > 0$, for all distributions $\mathcal{D}$ on $\mathcal{X}$, and for any target concept $c \in \mathcal{C}$, the following holds for any sample size $m \geq poly(1/\delta,n,size(c))$:

\[ \underset{\mathcal{S} \sim \mathcal{D}^m}{\mathbb{P}}\left[R(h_S) \leq \frac{1}{2} - \gamma \right] \geq 1 - \delta, \]

where $h_S$ is the hypothesis returned by algorithm $\mathcal{A}$ when trained on sample $S$. When such an algorithm $\mathcal{A}$ exists, it is called a weak learning algorithm for $\mathcal{C}$ or a weak learner. The hypothesis returned by a weak learning algorithm is called a base classifier.

The key idea behind boosting techniques is to use a weak learning algorithm to build a strong learner, that is, an accurate PAC-learning algorithm. To do so, boosting techniques use an ensemble method: they combine different base classifiers returned by a weak learner to create more accurate predictors. But which base classifiers should be used and how should they be combined?

One answer is AdaBoost, whose pseudocode works as follows:

The algorithm takes as input a labeled sample $S=((x_1, y_1),\ldots,(x_m, y_m))$, with $(x_i, y_i) \in \mathcal{X} \times \{-1, +1\}$ for all $i \in [m]$, and maintains a distribution over the indices $\{1, \ldots, m\}$. Initially (lines 1-2), the distribution is uniform ($\mathcal{D}_1$). At each round of boosting, that is, each iteration $t \in [T]$ of the loop in lines 3-8, a new base classifier $ h_t \in \mathcal{H} $ is selected that minimizes the error on the training sample weighted by the distribution $ \mathcal{D}_t$:

\[ h_t \in \underset{h \in \mathcal{H}}{argmin}\underset{i \sim \mathcal{D}_t}{\mathbb{P}} [h(x_i) \neq y_i] = \underset{h \in \mathcal{H}}{argmin}\sum_{i=1}^{m}\mathcal{D}_t(i)1_{h(x_i) \neq y_i} .\]

$Z_t$ is simply a normalization factor to ensure that the weights $D_{t+1}(i)$ sum to one. The precise reason for the definition of the coefficient $\alpha_t$ will become clear later. For now, observe that if $\epsilon_t$, the error of the base classifier, is less than $\frac{1}{2}$, then $\frac{1-\epsilon_t}{\epsilon_t} > 1$ and $\alpha_t$ is positive. Thus, the new distribution $\mathcal{D}_{t+1}$ is defined from $\mathcal{D}_t$ by substantially increasing the weight on $i$ if point $x_i$ is incorrectly classified, and, on the contrary, decreasing it if $x_i$ is correctly classified. This has the effect of focusing more, at the next round of boosting, on the points incorrectly classified by $h_t$, and less on those correctly classified.
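To make the update concrete, here is a minimal NumPy sketch of a single round's reweighting, assuming the base classifier's predictions and the current distribution are given as arrays (the names are mine, not the book's):

    import numpy as np

    def adaboost_round_update(D_t, y, h_t_pred):
        # One round of AdaBoost reweighting (a sketch).
        # D_t      -- current distribution over the m training indices (sums to 1)
        # y        -- true labels in {-1, +1}
        # h_t_pred -- predictions of the base classifier h_t in {-1, +1}
        eps_t = np.sum(D_t * (h_t_pred != y))            # weighted error under D_t
        # alpha_t = (1/2) log((1 - eps_t) / eps_t); positive whenever eps_t < 1/2
        alpha_t = 0.5 * np.log((1.0 - eps_t) / eps_t)
        # Increase the weight on misclassified points, decrease it on correct ones
        D_next = D_t * np.exp(-alpha_t * y * h_t_pred)
        Z_t = D_next.sum()                               # normalization factor
        return alpha_t, D_next / Z_t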

After $T$ rounds of boosting, the classifier returned by AdaBoost is based on the sign of the function $f$, which is a non-negative linear combination of the base classifiers $h_t$. The weight $\alpha_t$ assigned to $h_t$ in that sum is a logarithmic function of the ratio of the accuracy $1-\epsilon_t$ and error $\epsilon_t$ of $h_t$. Thus, more accurate base classifiers are assigned a larger weight in that sum.

For any $t \in [T]$, we will denote by $f_t$ the linear combination of the base classifiers after $t$ rounds of boosting: $f_t = \sum_{s=1}^{t}\alpha_s h_s$. In particular, we have $f_T = f$. The distribution $\mathcal{D}_{t+1}$ can be expressed in terms of $f_t$ and the normalization factors $Z_s$, $s \in [t]$, as follows:

\[ \forall i \in [m], \quad \mathcal{D}_{t+1}(i)=\frac{e^{-y_if_t(x_i)}}{ m \prod_{s=1}^{t}Z_s} .\]

The AdaBoost algorithm can be generalized in several ways:

– Instead of a hypothesis with minimal weighted error, $h_t$ can more generally be the base classifier returned by a weak learning algorithm trained on $\mathcal{D}_t$;

– The range of the base classifiers could be $[-1, +1]$, or more generally a bounded subset of $\mathbb{R}$. The coefficients $\alpha_t$ can then be different and may not even admit a closed form. In general, they are chosen to minimize an upper bound on the empirical error, as discussed in the next section. Of course, in that general case the hypotheses $h_t$ are not binary classifiers, but their sign could define the label and their magnitude could be interpreted as a measure of confidence.

AdaBoost was originally designed to address the theoretical question of whether a weak learning algorithm could be used to derive a strong learning one. Here, we will show that it coincides in fact with a very simple algorithm, which consists of applying a general coordinate descent technique to a convex and differentiable objective function.

For simplicity, in this section, we assume that the base classifier set $\mathcal{H}$ is finite, with cardinality $N$: $\mathcal{H} = \{h_1, \ldots, h_N\}$. An ensemble function $f$ such as the one returned by AdaBoost can then be written as $f = \sum_{j=1}^N\bar{\alpha}_j h_j$. Given a labeled sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$, let $F$ be the objective function defined for all $\boldsymbol{\bar{\alpha}} = (\bar{\alpha}_1, \ldots, \bar{\alpha}_N) \in \mathbb{R}^N$ by:

\[ F(\boldsymbol{\bar{\alpha}}) = \frac{1}{m}\sum_{i=1}^{m}e^{-y_i f(x_i)} = \frac{1}{m}\sum_{i=1}^m e^{-y_i \sum_{j=1}^{N}\bar{\alpha}_j h_j(x_i)}. \]

$F$ is a convex function of $\boldsymbol{\bar{\alpha}}$ since it is a sum of convex functions, each obtained by composition of the (convex) exponential function with an affine function of $\boldsymbol{\bar{\alpha}}$.
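As a quick sanity check, this objective is easy to evaluate directly; the sketch below (my own names, not from either book) computes $F$ given a matrix of base-classifier predictions:

    import numpy as np

    def exponential_loss(alpha_bar, H_preds, y):
        # Objective F above (a sketch): average exponential loss of the ensemble.
        # alpha_bar -- shape (N,), ensemble coefficients
        # H_preds   -- shape (m, N), H_preds[i, j] = h_j(x_i) in {-1, +1}
        # y         -- shape (m,), labels in {-1, +1}
        f_values = H_preds @ alpha_bar          # f(x_i) for every sample point
        return np.mean(np.exp(-y * f_values))   # (1/m) sum_i exp(-y_i f(x_i))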

The AdaBoost algorithm takes as input the dataset, the class labels, and the number of iterations $T$. The number of iterations is typically the only parameter you need to specify in most ML libraries.
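For instance, if scikit-learn happens to be the library at hand, a hedged sketch of that usage looks like this (its AdaBoostClassifier defaults to depth-one decision trees as base learners; X_train, y_train, X_test, y_test are assumed to already exist):

    from sklearn.ensemble import AdaBoostClassifier

    clf = AdaBoostClassifier(n_estimators=50)   # 50 rounds of boosting
    clf.fit(X_train, y_train)                   # features and labels
    print(clf.score(X_test, y_test))            # mean accuracy on held-out data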

Here, we briefly describe the standard practical use of AdaBoost. An important requirement for the algorithm is the choice of the base classifiers or that of the weak learner. The family of base classifiers typically used with AdaBoost in practice is that of decision trees, which are equivalent to hierarchical partitions of the space. Among decision trees, those of depth one, also known as stumps, are by far the most frequently used base classifiers.

Boosting stumps are threshold functions associated to a single feature. Thus, a stump corresponds to a single axis-aligned partition of the space. If the data is in $\mathbb{R}^N$, we can associate a stump to each of the N components. Thus, to determine the stump with the minimal weighted error at each round of boosting, the best component and the best threshold for each component must be computed.

Now, let’s create a weak learner with a decision stump, and then implement the full AdaBoost algorithm using it. If you’re familiar with decision trees, you’ll understand this part easily. However, if you’re not, either learn about them, or wait until I cover them tomorrow. A decision tree with only one split is a decision stump.

 

The pseudocode to generate a simple decision stump looks like this:

Set minError to +∞
For every feature in the dataset:
    For every step (candidate threshold):
        For each inequality (less than / greater than):
            Build a decision stump and test it with the weighted dataset
            If the error is less than minError:
                Set this stump as the best stump
Return the best stump
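A minimal NumPy sketch of that pseudocode might look like the following (function and variable names are my own, loosely following Harrington’s buildStump, and not meant as the book’s exact code):

    import numpy as np

    def stump_classify(X, feature, threshold, inequality):
        # Classify every sample with a single decision stump
        pred = np.ones(X.shape[0])
        if inequality == 'lt':
            pred[X[:, feature] <= threshold] = -1.0
        else:
            pred[X[:, feature] > threshold] = -1.0
        return pred

    def build_stump(X, y, D, num_steps=10):
        # Find the stump with minimal weighted error under the distribution D
        m, n = X.shape
        best_stump, best_pred = {}, None
        min_error = np.inf                        # set minError to +infinity
        for feature in range(n):                  # for every feature
            lo, hi = X[:, feature].min(), X[:, feature].max()
            step = (hi - lo) / num_steps
            for i in range(-1, num_steps + 1):    # for every candidate threshold
                for inequality in ('lt', 'gt'):   # for each inequality
                    threshold = lo + i * step
                    pred = stump_classify(X, feature, threshold, inequality)
                    weighted_error = np.sum(D * (pred != y))
                    if weighted_error < min_error:
                        min_error = weighted_error
                        best_pred = pred.copy()
                        best_stump = {'feature': feature,
                                      'threshold': threshold,
                                      'inequality': inequality}
        return best_stump, min_error, best_pred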

Now that we have generated a decision stump, let’s train it:

For each iteration:
    Find the best stump using buildStump()
    Add the best stump to the stump array
    Calculate α
    Calculate the new weight vector D
    Update the aggregate class estimate
    If the error rate is 0:
        Break out of the loop
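Again as a sketch, and reusing the build_stump function from above, the training loop could be written like so:

    def adaboost_train(X, y, num_iterations=40):
        # Train AdaBoost with decision stumps, following the pseudocode above
        m = X.shape[0]
        D = np.full(m, 1.0 / m)                  # start with a uniform distribution
        stumps = []
        aggregate = np.zeros(m)                  # running combination f_t on the training set
        for _ in range(num_iterations):
            stump, error, pred = build_stump(X, y, D)   # find the best stump
            alpha = 0.5 * np.log((1.0 - error) / max(error, 1e-16))
            stump['alpha'] = alpha
            stumps.append(stump)                 # add the best stump to the array
            D = D * np.exp(-alpha * y * pred)    # calculate the new weight vector
            D /= D.sum()
            aggregate += alpha * pred            # update the aggregate class estimate
            if np.mean(np.sign(aggregate) != y) == 0.0:
                break                            # training error hit 0, stop early
        return stumps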

Well, that is it for today! I know I keep promising one thing, and deliver another, but this time I’m really going to talk about decision trees!

Semper Fudge!

 

I’m extremely agitated today. I dunno why. Maybe because there was some convulsion in the peaceful tidings of the house I live in, or the fact that I’m kinda hungry at the moment. Anyways, I don’t have time for chitchat. Let’s get to the studying.

The following is taken from Foundations of Machine Learning by Mohri, Rostamizadeh, and Talwalkar.

Support Vector Machines are among the most theoretically well-motivated and practically effective classification algorithms in modern machine learning.

Consider an input space $\mathcal{X}$ that is a subset of $\mathbb{R}^N$ with $N \geq 1$, and the output or target space $\mathcal{Y}=\{-1, +1\}$, and let $f : \mathcal{X} \rightarrow \mathcal{Y} $ be the target function. Given a hypothesis set $\mathcal{H}$ of functions mapping $\mathcal{X}$ to $\mathcal{Y}$, the binary classification task is formulated as follows:

The learner receives a training sample $S$ of size $m$ drawn independently and identically from $\mathcal{X}$ according to some unknown distribution $\mathcal{D}$, $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X}\times\mathcal{Y})^m$, with $y_i = f(x_i) $ for all $i \in [m]$. The problem consists of determining a hypothesis $ h \in \mathcal{H}$, a binary classifier, with small generalization error: the probability that the hypothesis disagrees with the target function is our error rate:

\[ R_{\mathcal{D}}(h) = \underset{x\sim\mathcal{D}}{\mathbb{P}} [h(x) \neq f(x)]. \]

Different hypothesis sets $\mathcal{H}$ can be selected for this task. Hypothesis sets with smaller complexity provide better learning guarantees, everything else being equal. A natural hypothesis set with relatively small complexity is that of linear classifiers, or hyperplanes, which can be defined as follows:

\[ \mathcal{H}= \{x \rightarrow sign(w.x+b) : w \in \mathbb{R}^N, b \in \mathbb{R}\} \]

The learning problem is then referred to as a linear classification problem. The general equation of a hyperplane in $\mathbb{R}^N$ is $w.x+b=0$, where $w\in\mathbb{R}^N$ is a non-zero vector normal to the hyperplane and $b\in\mathbb{R}$ a scalar. A hypothesis of the form $x\rightarrow sign(w.x+b)$ thus labels positively all points falling on one side of the hyperplane $w.x+b=0$ and negatively all others.

From now until we say so, we’ll assume that the training sample $S$ can be linearly separated, that is, we assume the existence of a hyperplane that perfectly separates the training sample into two populations of positively and negatively labeled points, as illustrated by the left panel of the figure below. This is equivalent to the existence of $ (\boldsymbol{w}, b) \in (\mathbb{R}^N - \boldsymbol{\{0\}}) \times \mathbb{R}$ such that:

\[ \forall i \in [m], \quad y_i(\boldsymbol{w}.x_i + b) \geq 0 \]

 

But, as you can see above, there are then infinitely many such separating hyperplanes. Which hyperplane should a learning algorithm select? The definition of the SVM solution is based on the notion of geometric margin.

Let’s define what we just came up with: The geometric margin $\rho_h(x)$ of a linear classifier $h: x \rightarrow \boldsymbol{w.x} + b $ at a point $x$ is its Euclidean distance to the hyperplane $\boldsymbol{w.x}+b=0$:

\[ \rho_h(x) = \frac{|w.x+b|}{||w||_2} \]

The geometric margin $\rho_h$ of a linear classifier $h$ for a sample $S = (x_1, …, x_m) $ is the minimum geometric margin over the points in the sample, $\rho_h = min_{i\in[m]} \rho_h(x_i)$, that is, the distance from the hyperplane defining $h$ to the closest sample points.
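As a quick NumPy sketch (names mine), the margin of a given candidate $(w, b)$ on a correctly classified sample can be computed directly from this definition:

    import numpy as np

    def geometric_margin(w, b, X, y):
        # Signed distance y_i * (w . x_i + b) / ||w||_2 for every point;
        # on a correctly classified sample this equals |w . x_i + b| / ||w||_2,
        # and the sample margin is the minimum over the points.
        distances = y * (X @ w + b) / np.linalg.norm(w)
        return distances.min()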

So what is the solution? The separating hyperplane with the maximum geometric margin, known as the maximum-margin hyperplane. The right panel of the figure above illustrates the maximum-margin hyperplane returned by the SVM algorithm in the separable case. We will present later a theory that provides a strong justification for this solution. We can observe already, however, that the SVM solution can also be viewed as the safest choice in the following sense: a test point is classified correctly by a separating hyperplane with geometric margin $\rho$ even when it falls within a distance $\rho$ of the training samples sharing the same label; for the SVM solution, $\rho$ is the maximum geometric margin and thus the safest value.

We now derive the equations and optimization problem that define the SVM solution. By definition of the geometric margin, the maximum margin $\rho$ of a separating hyperplane is given by:

\[ \rho = \underset{w,b : y_i(w.x_i+b) \geq 0}{max}\,\underset{i\in[m]}{min}\frac{|w.x_i+b|}{||w||} = \underset{w,b}{max}\,\underset{i\in[m]}{min}\frac{y_i(w.x_i+b)}{||w||} \]

The second equality follows from the fact that, since the sample is linearly separable, for the maximizing pair $(w, b)$, $y_i(w.x_i+b)$ must be non-negative for all $i\in[m]$. Now, observe that the last expression is invariant to multiplication of $(w, b)$ by a positive scalar. Thus, we can restrict ourselves to pairs $(\boldsymbol{w},b)$ scaled such that $min_{i\in[m]}y_i(\boldsymbol{w}.x_i+b) = 1$:

\[ \rho = \underset{min_{i\in[m]}y_i(w.x_i+b)=1}{max}\frac{1}{||w||} = \underset{\forall i \in[m],\, y_i(w.x_i+b) \geq 1}{max}\frac{1}{||w||} \]

Figure below illustrates the solution $(w, b)$ of the maximization we just formalized. In addition to the maximum-margin hyperplane, it also shows the marginal hyperplanes, which are the hyperplanes parallel to the separating hyperplane and passing through the closest points on the negative or positive sides.

 

 

Since maximizing $1/||w||$ is equivalent to minimizing $\frac{1}{2}||w||^2$, in view of the equation above, the pair $(\boldsymbol{w}, b)$ returned by the SVM in the separable case is the solution of the following convex optimization problem:

\[ \underset{w, b}{min}\frac{1}{2}||w||^2 \]\[ \text{subject to}: y_i(\boldsymbol{w}.x_i+b) \geq 1, \forall i \in[m] \]

Since the objective function is quadratic and the constraints are affine, the optimization problem above is in fact a specific instance of quadratic programming (QP), a family of problems extensively studied in optimization. A variety of commercial and open-source solvers are available for solving convex QP problems. Additionally, motivated by the empirical success of SVMs along with their rich theoretical underpinnings, specialized methods have been developed to solve this particular convex QP problem more efficiently, notably block coordinate descent algorithms with blocks of just two coordinates.
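To make the QP concrete, here is a hedged sketch of the separable-case problem using the cvxpy modeling library (assuming it is installed; a dedicated SVM solver would be the usual choice in practice):

    import numpy as np
    import cvxpy as cp

    def hard_margin_svm(X, y):
        # Solve: minimize (1/2)||w||^2  subject to  y_i (w . x_i + b) >= 1
        m, N = X.shape
        w = cp.Variable(N)
        b = cp.Variable()
        objective = cp.Minimize(0.5 * cp.sum_squares(w))
        constraints = [cp.multiply(y, X @ w + b) >= 1]
        cp.Problem(objective, constraints).solve()
        return w.value, b.value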

So what are support vectors? Looking at the formulation above, we note that the constraints are affine and thus qualified. The objective function as well as the affine constraints are convex and differentiable.

We introduce Lagrange variables $\alpha_i \geq 0, i\in[m]$, associated to the $m$ constraints, and denote by $\boldsymbol{\alpha}$ the vector $(\alpha_1, \ldots, \alpha_m)^T$. The Lagrangian can then be defined for all $\boldsymbol{w}\in\mathbb{R}^N,b\in\mathbb{R}$, and $\boldsymbol{\alpha}\in\mathbb{R}_+^m$ by:

\[ \mathcal{L}(\boldsymbol{w},b,\boldsymbol{\alpha}) = \frac{1}{2}||w||^2 - \sum_{i = 1}^{m}\alpha_i[y_i(w.x_i+b) -1] \]

Support vectors fully define the maximum-margin hyperplane or SVM solution, which justifies the name of the algorithm. By definition, vectors not lying on the marginal hyperplanes do not affect the definition of these hyperplanes; in their absence, the solution to the SVM problem would remain unchanged. Note that while the solution to the SVM problem is unique, the support vectors are not. In dimension $N$, $N+1$ points are sufficient to define a hyperplane. Thus, when more than $N+1$ points lie on a marginal hyperplane, different choices are possible for the $N+1$ support vectors.

But the points in the space are not always separable. In most practical settings, the training data is not linearly separable, which implies that for any hyperplane $\boldsymbol{w.x}+b=0$, there exists $x_i \in S$ such that:

\[ y_i[\boldsymbol{w.x_i}+b] \ngeq 1 \]

Thus, the constraints imposed in the linearly separable case cannot all hold simultaneously. However, a relaxed version of these constraints can indeed hold, that is, for each $i\in[m]$, there exists $\xi_i \geq 0$ such that:

\[ y_i[\boldsymbol{w.x_i}+b] \geq 1-\xi_i \]

The variables $\xi_i$ are known as slack variables and are commonly used in optimization to define relaxed versions of constraints. Here, a slack variable $\xi_i$ measures the distance by which vector $x_i$ violates the desired inequality, $y_i(\boldsymbol{w.x_i} + b) \geq 1$. This figure illustrates the situation:

 

For the hyperplane $y_i(w.x_i+b) = 1 $, a vector $x_i$ with $\xi_i > 0$ can be viewed as an outlier. Each $x_i$ must be positioned on the correct side of the appropriate marginal hyperplane to not be considered an outlier. Here’s the formulation we use to optimize the non-separable case:

\[ \underset{w, b, \xi}{min} \frac{1}{2}||w||^2 + C\sum_{i=1}^{m}\xi_i^p \]\[ \text{subject to} \quad y_i(w.x_i+b) \geq 1-\xi_i \wedge \xi_i \geq 0, i\in[m] \]
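As a hedged, practical aside, in scikit-learn this trade-off corresponds to the C parameter of a linear-kernel SVC (larger C penalizes slack more heavily); X_train and y_train below are assumed to already exist:

    from sklearn.svm import SVC

    clf = SVC(kernel='linear', C=1.0)   # soft-margin SVM with slack penalty C
    clf.fit(X_train, y_train)
    print(clf.support_vectors_)         # the support vectors found by the solver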

Okay! Alright! I think I understand it now. That’s enough classification for today. I’m going to study something FUN next. Although I’m a bit drowsy… No matter! I have some energy drinks at home. Plus I have some methamphetamine which I have acquired to boost my energy… Nah, kidding. I’m a cocaine man!

I get like, ~100 visitors daily, and I know it’s a lot for a person who’s just started, and I’m not even blogging for visitors, because this blog is basically my study notes, but I think, if you look hard enough, if you slant your eyes, you may find something of value in it. So why not introduce it to your friends? Buddies? Comrades? Homies? Compadres? Dusts? Rafighs? Ais? Do it!

I wanna be honest with you. One thing that I never understood is Principal Component Analysis. It’s very hard. At least to me. Because my college doesn’t offer any courses in the way of statistics (yes, I’m an Associates student!) I thought to myself, “Shieeet, I’m gonna read every resource there is. From Youtube, which I kinda feel like is the worst resource for learning – at least, sometimes – to a book about Computer Vision!” and ta-dah! Here we are. A book about Computer Vision is about to clear up everything about PCA. Let us commence learning! Put your learning hat and goggles on, and wear your diaphragm, because we’re gonna get screwed!

The following is based on Computer Vision, Principles, Algorithms, Applications, Learning by E. R. Davies.

One powerful way of approaching the task of data representation is Principal Component Analysis, or PCA. This involves finding the mean of a cluster of points in feature space, and then finding the principal axes of the cluster in the following way: First, an axis is found which passes through the mean position and which gives the maximum variance when the data are projected onto it. Then, a second such axis is found which maximizes variance in a direction normal to the first. This process is carried out until a total of N principal axes have been found for an N-dimensional feature space. The process is illustrated in this figure:

In fact, the process is entirely mathematical and need not be undertaken in the strict sequence indicated above; it merely involves finding a set of orthogonal axes which diagonalize the covariance matrix. A matrix is diagonalizable over a field F if it has n linearly independent eigenvectors; having n distinct eigenvalues in F is a sufficient condition. Eigenvalues are defined as follows (I know what eigenvalues are, this is just for people who are reading this post in the future and don’t wanna google):

    \[ A\vec{v} = \lambda \vec{v} \]

where $A$ is the matrix (in this case, the covariance matrix), $\vec{v}$ is the eigenvector, and $\lambda$ is the eigenvalue.

The covariance matrix for the input population is defined as:

    \[ \Sigma = \mathbb{E}((x_{(p)} - \mu)(x_{(p)}-\mu)^T) \]

where $x_{(p)}$ is the location of the $p$th data point, $\mu$ is the mean of the $P$ data points, and $\mathbb{E}(.)$ indicates the expectation value for the underlying population. We can estimate $\Sigma$ from the equations:

    \[ \Sigma = \frac{1}{P}\sum_{p=1}^{P}x_{(p)}x_{(p)}^T - \mu\mu^T \]

Where:

    \[ \mu = \frac{1}{P}\sum_{p=1}^{P}x_{(p)} \]

As $\Sigma$ is real and symmetric, it is possible to diagonalize it using a suitable orthogonal transformation matrix $A$, obtaining a set of $N$ orthonormal eigenvectors $u_i$ with eigenvalues $\lambda_i$ given by:

    \[ \Sigma u_i = \lambda_i u_i \quad (i = 1, 2, \ldots, N) \]

The vectors $u_i$ are derived from the original vectors $x_i$ by:

    \[ u_i = A(x_i - \mu) \]

And the inverse transformation needed to recover the original data vectors is:

    \[ x_i = \mu + A^T u_i \]

Here, we have recalled that for an orthogonal matrix

    \[ A^{-1} = A^T \]

In fact, it may be shown that $A$ is the matrix whose rows are formed from the eigenvectors of $\Sigma$, and that the diagonalized covariance matrix $\Sigma^\prime$ is given by:

    \[ \Sigma^\prime = A\Sigma A^T \]

So that:

    \[ \Sigma^\prime =\left[\begin{array}{cccc}\lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_N\end{array}\right] \]

In what follows, we shall assume that the eigenvalues have been placed in an ordered sequence, starting with the largest. In that case, $\lambda_1$ represents the most significant characteristic of the set of data points, with later eigenvalues representing successively less significant characteristics. We could even go so far as to say that, in some sense, $\lambda_1$ represents the most interesting characteristic of the data, whereas $\lambda_N$ would be devoid of interest. More practically, if we ignored $\lambda_N$ we might not lose much information. And indeed the last few eigenvalues would frequently represent characteristics which are not statistically significant and are essentially noise. For these reasons, PCA is commonly used to reduce the dimensionality of the feature space from $N$ to some lower value $N^\prime$.

Well, this was a short blog post. I have understood the concept behind PCA, I just haven’t understood how I can calculate it. No worries! There’s always NumPy… Unless I’m in C++, then I’m screwed!
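For the record, a minimal NumPy sketch of that calculation (my own naming, and a sketch rather than a polished implementation) would look like this:

    import numpy as np

    def pca(X, n_components):
        # PCA via eigendecomposition of the covariance matrix.
        # X has one data point per row; returns the projected data,
        # the principal axes (rows of A), and the mean.
        mu = X.mean(axis=0)
        X_centered = X - mu
        Sigma = np.cov(X_centered, rowvar=False)     # covariance matrix
        eigvals, eigvecs = np.linalg.eigh(Sigma)     # symmetric matrix => eigh
        order = np.argsort(eigvals)[::-1]            # largest eigenvalue first
        A = eigvecs[:, order].T                      # rows are the eigenvectors u_i
        projected = X_centered @ A[:n_components].T  # u = A (x - mu), truncated to N'
        return projected, A[:n_components], mu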

Sherry is one of those beverages that only stuck-up yuppies drink. I’ve personally never tasted it, for obvious reasons, but I’d imagine it’ll taste like any other fermented grape drink – stingy and bitter. I think people drink these beverages mostly because they’re used to them! Aren’t I right? I think machine learning can be used in the alcoholic beverage industry… Like, how long should this cask sit in the cellar? I dunno.

I wanted to make another post based on Foundations of Machine Learning by MIT Press… But I decided to go the other way and be more succinct: make a general post about machine learning. In the last post I talked about E. R. Davies’ book. Chapter 14 of this book is titled “Probabilistic Machine Learning”. I’m gonna make a post about this chapter, and just hope to Lord Almighty that I’ll learn something cool… Of course I’ll learn something cool! What am I saying? Let’s start talking about probabilistic machine learning, starting off with the EM algorithm, EM standing for Expectation Maximization.

Perhaps the main point about probabilistic optimization is that we are always in a situation where we have an absolute mathematical goal – to ensure that the solutions we are seeking are subject to ever-increasing probability. This is important because, when analyzing data involving a large component of randomness, we can never be sure whether any real improvement is being made. But if we can prove mathematically that the process of change can only increase the probability of correct interpretation, we have a crucially important tool at our fingertips!

This is all good and fine, but how exactly will we formulate probabilistic arguments in such a way as to achieve our aims? The answer to this question lies in the fact that by the 2010s many tools had been developed to let this happen, and at this very moment in time progress in this area is accelerating.

There are concrete methodologies; the most powerful of them is Bayes’ theorem, the sine qua non in the area of applied statistics. There’s then Jensen’s inequality, and the Kullback-Leibler divergence formula, which gives a distance measure showing how different two probability distributions are. Then there is Newton’s method of approximation, which is fundamental, but which can be bettered in relevant cases by the Expectation Maximization algorithm. Among all this theory and methodology, we must not forget such basic probability ideas as the vertical bar notation, which allows probabilities to be re-expressed using the product rule, p(A, B) = p(A|B)p(B). These are all fine and dandy, but is there an algorithm that is based on the normal distribution, you may inquire? And I answer, yes, there is! And we’re talking about it.

Much of what we shall do in the probabilistic formalism is to make models of the input data – this being particularly true of the EM algorithm which, as the subject of this post, is designed for generating accurate statistical models of data. But what types of models are to be used? The Gaussian distribution is key, because it accurately models the inaccuracies of measurement due to random noise.

Before proceeding to describe the Expectation Maximization algorithm, and its justification, it will be useful to look at the sort of problems that we will want to apply it to. In particular, suppose we have a 1-D distribution of data points which we wish to fit. Perhaps the most obvious way to model it is by using a set of individual Gaussian distributions, each of which will correspond to one of the peaks of the input distribution. Mathematically, we can model this as a mixture of Gaussians in which each Gaussian has its own mixing coefficient $m_k$. Furthermore, if we are to follow our probabilistic strategy, we will need to express both the input distribution and the result as probability distributions.

The first thing to do is to represent the Gaussian distribution as a probability distribution integrating to unity:

    \[ \mathcal{N}(x|\mu, \sigma) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right] \]


$\mu$ and $\sigma$, respectively, are the mean and the standard deviation of the distribution. In addition, we follow standard usage in denoting the Gaussian by its alternate name, the normal distribution, using the symbol $\mathcal{N}$ to represent it. The probability for the mixture distribution is then written as:

    \[ p(x) = \sum_{k = 1}^{K}m_k\mathcal{N}(x|\mu_k, \sigma_k) \]

If we take the integral of both sides we get:

    \[ \sum_{k = 1}^{K} m_k = 1 \]


So what does it all mean? It basically means that the mixing coefficients must sum to 1, so that the mixture is itself a valid probability distribution.

If we introduce a latent vector $z = (z_1, …, z_K)$ with components $z_k \in \{0, 1\}$, exactly one of which is 1, and take $p(z_k = 1) = m_k$, then according to Bayes’ theorem we have the mixture density and the responsibilities (the conditional probabilities of the components given a sample):

    \[ p(x) = \sum_{z}p(x|z)p(z) = \sum_{k = 1}^{K}m_k\mathcal{N}(x|\mu_k, \sigma_k) \]


    \[ \rho(z_k) = p(z_k = 1 | x) = \frac{p(x|z_k = 1)p(z_k = 1)}{\sum_{j = 1}^{K}p(x|z_j=1)p(z_j = 1)} = \frac{p(x|z)p(z)}{\sum_{z}p(x|z)p(z)} \]

Thus we’re done with the Expectation part of the EM algorithm. But don’t forget about the Probability Density Function! When fitting data to a single Gaussian distribution, we take the product of the PDFs of all the individual data points:

    \[ p(x_1, … , x_I|\mu,\sigma) = \prod_{i = 1}^{I}p(x_i|\mu,\sigma) = \prod_{i = 1}^{I} \mathcal{N}(x_i|\mu, \sigma) \]

In the Maximization step of the EM algorithm we take the mixing coefficients to be fixed and solve for the Gaussian parameters $\mu_k, \sigma_k$. This step, along with the former step, is recycled as many times as necessary to proceed from an initial approximation to a final, much more accurate one.

We can generalize the Gaussian to $n$ dimensions like so:

    \[ \mathcal{N}(x | \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}exp\left[-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x-\mu)\right] \]
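To tie the E-step and M-step together, here is a bare-bones NumPy/SciPy sketch of EM for a Gaussian mixture (my own naming, a sketch under the assumption that SciPy is available, not the book’s code):

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iter=100):
        # Bare-bones EM for a mixture of K Gaussians; X has one point per row.
        n, d = X.shape
        m = np.full(K, 1.0 / K)                        # mixing coefficients
        mu = X[np.random.choice(n, K, replace=False)]  # crude initialization
        Sigma = np.array([np.cov(X, rowvar=False)] * K)
        for _ in range(n_iter):
            # E-step: responsibilities rho(z_k) for every point
            dens = np.column_stack([
                m[k] * multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)])
            rho = dens / dens.sum(axis=1, keepdims=True)
            # M-step: re-estimate mixing coefficients and Gaussian parameters
            Nk = rho.sum(axis=0)
            m = Nk / n
            mu = (rho.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - mu[k]
                Sigma[k] = (rho[:, k, None] * diff).T @ diff / Nk[k]
        return m, mu, Sigma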

Let’s put EM algorithm to test, shall we? Look at this figure:

 

Look at them contours! That’s what I want in a woman!

The mean positions of these distributions are (1, 1.5), (2, 5), (-2, 5), and their covariance matrices are:

    \[ \left( \begin{array}{ccc}2 & 0 \\0 & 0.4 \\\end{array} \right)\]

    \[ \left( \begin{array}{ccc}0.5 & 0 \\0 & 1.5 \\\end{array} \right)\]

    \[ \left( \begin{array}{ccc}1 & -0.5 \\-0.5 & 1 \\\end{array} \right)\]

The 200 points randomly extracted from each of these Gaussians overlap in nontrivial ways, thereby providing a reasonably complex task for the EM algorithm. Next, we move on to a more immediately useful situation: we segment the samples into a number of subareas. It takes a lot of iterations to get a low error rate, but a powerful computer can do it – depending on the data structure that’s holding the Gaussians and many other factors. In fact, training an EM algorithm takes a long time – to test all the contours of the classifiers… And overall, the separability of the data plays a large role which shan’t be ignored.

Threshold… Sounds like a Thrash Metal band!

Above you see the effect of the EM algorithm in multilevel thresholding. The intensity histogram of image A is shown as a green trace in image C. The EM algorithm is used to obtain a GMM, as shown in red by the six Gaussians in C. All pixels contributing to the green trace between adjacent Gaussian crossings are assigned to the mean intensity of the intervening Gaussian and reinserted into the image, as in B. The fit to the cloud intensities is naturally relatively poor, but the other intensities are reasonably matched.

That is it for this short blog post! I hope you’ve enjoyed it. I care neither about quality, nor quantity: I just wanna learn. Perhaps 50% of the post is mine, and the rest is credited to the author, so keep that in mind.

Whilst you’re sipping your sherries, I will be smoking Kent Red and reading chapter 13 of Davies’ book: Classification algorithms. I will make a post about it, I promise!

Enjoy life, as if it’s the last day of it – Drinking this much sherry, it might as well be!

Hiyaaaa! It’s been almost a week. I know if I’m going to post my mundane mendacity, which all bring me much satiety, in form of a blog post in an arbitrary time pattern, it’ll take a machine learning algorithm to predict whenever I’m going to make one next, but wait! Let’s drop every mundane hackneyed thing I was gonna talk about and stick with machine learning!

I fell in love with machine learning a few years back, and I just recently bought one of the most intricate books on this subject, MIT Press’s Foundations of Machine Learning, written by two Persians and one Indian. Well, whilst they are waiting at the TSA checkpoints, let us not fret and make a series of posts on the damn book! It would be my privilege, nay, my honor! First episode: the PAC framework. I don’t mean to take credit for what I’ve not written myself, so you should know that almost everything is taken from this book. If you enjoy this post, please purchase this book. Of course it’s not the entire book, because that would be illegal, it’s just tidbits from the book. I did so because I can’t learn without a purpose; I can’t do passive learning in my brain, learning’s gotta be active – meaning I shan’t be content with learning something without doing something along with it, so I make blog posts about it! But I have made a visualization of PAC, in Cinema 4D, which you can see when we get to it! Well, at least I’m not a content pirate! (Yeah, Chubak, keep telling yourself that…).

Let the explanations commence!

When we are designing a machine learning algorithm, several fundamental questions shall occupy a healthy man’s mind, such as what can be learned efficiently, the fact that some things are inherently hard or easy to learn, how many examples are needed, and whether there is a general model for learning, and so on and so forth. In this episode, we begin by formalizing an answer to these questions, in the form of the PAC framework.

PAC stands for Probably Approximately Correct. The PAC framework helps define the class of learnable concepts in terms of the number of sample points needed to achieve an approximate solution, meaning that it introduces sample complexity and the time and space complexity of the learning algorithm, all in one package. Imagine the normal attributes for a machine learning algorithm as some sort of Chinese food, all in separate packages, and this so-called PAC as a nice turkey sandwich you can take to work with you.

Let’s first introduce several definitions and the notation needed to present the PAC model, which we’ll also use in later posts.

We denote the set of all possible examples or instances as 𝒳, which is also sometimes referred to as the input space. The set of all possible labels or target values is denoted by ყ. In this episode ყ is limited to 0 and 1, which corresponds to the so-called binary classification.

𝒳 maps to ყ: x → y. Let’s call this mapping a concept, or Շ. You can see how we denote it in equation 1-1. Since ყ = {0, 1}, we can identify Շ with the subset of 𝒳 over which it takes the value 1. A concept may be a triangle, or a rectangle, or a hexagon. You can see it more clearly in figure 1-1, but we’re not going to talk about figure 1-1 yet – I just said this to give you an idea.

Let’s assume that examples are i.i.d: independently and identically distributed according to some fixed but unknown distribution, 𝔇. The learning problem is then formulated as follows: The learner considers a fixed set of possible concepts, 𝓗, called a hypothesis set, which may or may not coincide with the concept class. It receives a sample S = (x1, …, xm) drawn i.i.d according to 𝔇, as well as the labels (Շ(x1), …, Շ(xm)), which are given by a specific target concept Շ. The task is then to use the labeled sample S to select a hypothesis hS from 𝓗 that has a small generalization error with respect to the concept Շ. The generalization error of a hypothesis h in 𝓗, also referred to as the risk or true error (or simply the error) of h, is denoted by R(h) and summed up in the equation below.

\[ R(h) = \underset{x \sim \mathcal{D}}{\mathbb{P}}[h(x) \neq c(x)] = \underset{x \sim \mathcal{D}}{\mathbb{E}}[1_{h(x) \neq c(x)}] \]

Let’s have a headcount. h(x) is the hypothesis on the input, Շ(x) is the concept of the input, 𝔇 is the distribution, and 1w is the indicator function of the event w. The indicator function of an event is 1 when the event occurs and 0 otherwise. P is probability, and E is the expected value. In the upcoming visualization, the marbles which fall into the container are 1 and the marbles which fall out are 0.

The generalization error of a hypothesis is not directly accessible to the learner since both the distribution 𝔇 and the target concept Շ are unknown. However, the learner can measure the empirical error of a hypothesis on the labeled sample S. You can see this empirical risk, for the sample S = (x1, … , xm), formalized in the equation below:

\[ \hat{R}_S(h) = \frac{1}{m} \sum_{i = 1}^{m}1_{h(x_i) \neq c(x_i)} . \]

Thus, the empirical error of h ∈ 𝓗 is its average error over the sample S, while the generalization error is its expected error based on the distribution 𝔇. We can already note that for a fixed h ∈ 𝓗, the expectation of the empirical error over an i.i.d sample S is equal to the generalization error, as formalized in the equation below. Remember that m is the maximum index of the sample – hence, the size of the sample.

\[ \underset{S \sim \mathcal{D}^m}{\mathbb{E}} [\hat{R}_S(h) ] = R(h) \]
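As a tiny sketch (names mine), the empirical error is just the fraction of sample points on which the hypothesis disagrees with the target concept:

    import numpy as np

    def empirical_error(h, c, sample):
        # Fraction of sample points where the hypothesis h disagrees with the concept c
        return np.mean([h(x) != c(x) for x in sample])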

But what is this distribution? For a discrete input space like this, the distribution is basically the list of probabilities assigned to each possible element of 𝒳.

So let’s introduce PAC, or in other words, Probably Approximately Correct: Let n be a number such that the computational cost of representing any element x ∈ 𝒳 is at most O(n), and denote by size(c) the maximal cost of the computational representation of c ∈ Շ. For example, 𝒳 may be a vector in Rn, for which the cost of any array-based representation would be in O(n). In addition, let hS denote the hypothesis returned by algorithm 𝒜 after receiving a labeled sample S. To keep notation simple, the dependency of hS on 𝒜 is not explicitly indicated.

So, a concept class Շ is said to be PAC-learnable if there exists an algorithm 𝒜 and a polynomial function poly(.,.,.,.) such that for any ε > 0 and 𝛿 > 0, for all distributions 𝔇 on 𝒳 and for any target concept c ∈ Շ, the following holds true for any sample size m ≥ poly(1/ε, 1/𝛿, n, size(c)):

\[ \underset{S \sim \mathcal{D}^m}{\mathbb{P}} [R(h_S) \leq \epsilon] \geq 1 - \delta \]

When such an algorithm 𝒜 exists, it’s said that Շ has a PAC-learning algorithm.

A concept class Շ is thus PAC-learnable if the hypothesis returned by the algorithm 𝒜 after observing a number of points polynomial in 1/ε and 1/𝛿 is approximately correct – meaning the error is at most ε – with high probability (at least 1 – 𝛿), which justifies the PAC terminology. The parameter 𝛿 > 0 is used to define the confidence 1 – 𝛿 and ε > 0 the accuracy 1 – ε. So the error is at most ε with probability at least 1 – 𝛿. Note that if the running time of the algorithm is polynomial in 1/ε and 1/𝛿, then the sample size m must also be polynomial if the full sample is received by the algorithm.

As you might have noticed, there are two things in PAC definition which correspond with its name. The probability, and the error rate. The first one is for the P in PAC, the second one is for the AC in PAC.

Several key points of the PAC definition are worth emphasizing. First, the PAC framework is a distribution-free model: no particular assumption is made about the distribution 𝔇 from which examples are drawn. Second, the training sample and the test examples used to define the error are drawn according to the same distribution 𝔇. This is a natural and necessary assumption for generalization to be possible in general. It can be relaxed to include favorable domain adaptation problems. Finally, the PAC framework deals with the question of learnability for a concept class Շ and not a particular concept. Note that the concept class Շ is known to the algorithm 𝒜, but of course the target concept c ∈ Շ is unknown.

Now let’s take a look at this visualization I have cooked up in Cinema 4D:

 

 

Figure 1-1

Let me explain how it works. The marbles are the samples. We strain the samples, and the ones that land inside the shared container between 𝓗 and Շ are the expectation of PAC, the correct part. The ones that fall out are the error, the probably part. And the ones that fall in the containers that aren’t shared are the approximate part.

Sometimes, the hypothesis hS returned by the algorithm is consistent, that is, it admits no error on the training sample S. In this part of the blog post, we present a general sample complexity bound, or equivalently, a generalization bound, for consistent hypotheses, in the case where the cardinality |𝓗| of the hypothesis set is finite. Since we consider consistent hypotheses, we will assume that the target concept c is in 𝓗.

So, let 𝓗 be a finite set of functions mapping from 𝒳 to ყ, and let 𝒜 be an algorithm that for any target concept c ∈ 𝓗 and i.i.d. sample S returns a consistent hypothesis hS, that is, one with zero empirical error on S. Then for any ε, 𝛿 > 0, the inequality:

\[ \underset{S \sim \mathcal{D}^m}{\mathbb{P}} [R(h_S) \leq \epsilon] \geq 1 - \delta \]

holds if:

\[ m \geq \frac{1}{\epsilon} \left(\log|\mathcal{H}| + \log \frac{1}{\delta}\right) \]

The price to pay for coming up with a consistent algorithm is the use of a larger hypothesis set 𝓗 containing the target concepts. Of course, the upper bound increases with |𝓗|, the cardinality of 𝓗. However, that dependency is only logarithmic. Note that the term log|𝓗|, or the related term log2|𝓗| from which it differs by a constant factor, can be interpreted as the number of bits needed to represent 𝓗. Thus the generalization guarantee of the theorem is controlled by the ratio of this number of bits, log2|𝓗|, and the sample size m.
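As a tiny worked sketch of that bound (the numbers below are made up purely for illustration):

    import numpy as np

    def sample_complexity(H_size, epsilon, delta):
        # Smallest m satisfying m >= (1/epsilon) * (ln|H| + ln(1/delta))
        return int(np.ceil((np.log(H_size) + np.log(1.0 / delta)) / epsilon))

    # e.g. |H| = 2**20 hypotheses, epsilon = 0.05, delta = 0.01
    print(sample_complexity(2**20, 0.05, 0.01))   # about 370 examples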

Along with consistent hypotheses, we also have inconsistent hypotheses. Some argue that they aren’t useful, but dare I say that they may be? I’m not sure.

Let’s talk about the Bayes error. Given a distribution 𝔇 over 𝒳 × ყ, the Bayes error R* is defined as:

\[ R^\star = \underset{h\,\mathrm{measurable}}{\inf}\, R(h) \]

Finally, let’s talk about noise. The noise at a point x ∈ 𝒳 is the minimum of the conditional probabilities of each label given x, min{P[y = 1|x], P[y = 0|x]}.

Well, that’s it for today! My next blog post is going to be about something completely unrelated to machine learning, yes, another signal processing post! Keep in mind that I write to learn, but if I appease you, so be it!

If you have any questions, ask, and I will try my best to answer. If you see any errors in this post, tell me and I’ll add it to the addendum just below this very line. Thank you for reading my blog post!