## A Snifter of Sherry along with the EM Algorithm

Sherry is one of those beverages that only stuck-up yuppies drink. I’ve personally never tasted it, for obvious reasons, but I’d imagine it’ll taste like any other fermented grape drink – stinging and bitter. I think people drink these beverages mostly because they’re used to them! Aren’t I right? I think machine learning can be used in the alcoholic beverage industry… Like, how long should this cask sit in the cellar? I dunno.

I wanted to make another post based on Foundations of Machine Learning by MIT Press… But I decided to go the other way and be more succinct: make a general post about machine learning. In the last post I talked about chapter 14 of E. R. Davies’ book. This chapter is titled “Probabilistic Machine Learning”. I’m gonna make a post about this chapter, and just hope to Lord Almighty that I’ll learn something cool… Of course I’ll learn something cool! What am I saying? Let’s start talking about probabilistic machine learning, starting off with the EM algorithm, EM standing for Expectation Maximization.

Perhaps the main point about probabilistic optimization is that we are always in a situation where we have an absolute mathematical goal – to ensure that the solutions we are seeking are subject to ever-increasing probability. This is important because, when analyzing data involving a large component of randomness, we can never be sure whether any real improvement is being made. But if we can prove mathematically that the process of change can only increase the probability of correct interpretation, we have a crucially important tool at our fingertips!

This is all good and fine, but how exactly will we formulate probabilistic arguments in such a way as to achieve our aims? The answer to this question lies in the fact that by the 2010s many tools had been developed to let this happen, and at this very moment progress in this area is accelerating.

There are concrete methodologies for this. The most powerful of them is Bayes’ theorem, the sine qua non of applied statistics. Then there’s Jensen’s inequality, and the Kullback-Leibler divergence formula, which gives a distance-like measure of how different two probability distributions are. Then there is Newton’s method of approximation, which is fundamental, but which can be bettered in relevant cases by the Expectation Maximization algorithm. Among all this theory and methodology, we must not forget such basic probability ideas as the vertical bar notation, which allows probabilities to be re-expressed using the product rule, P(A, B) = P(A | B) P(B). These are all fine and dandy, but is there an algorithm that is based on the normal distribution, you may inquire? And I answer, yes, there is! And we’re talking about it.
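To make the Kullback-Leibler divergence a little more tangible, here is a minimal sketch for two discrete distributions. The function name `kl_divergence` and the toy distributions `p` and `q` are my own illustrative choices, and the sketch assumes `q` is nonzero wherever `p` is nonzero.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    Terms with p_i == 0 contribute nothing; q_i is assumed positive elsewhere.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]  # hypothetical fair coin
q = [0.9, 0.1]  # hypothetical biased coin

print(kl_divergence(p, p))  # identical distributions give zero divergence
print(kl_divergence(p, q))
print(kl_divergence(q, p))  # differs from the line above
```

The last two values differ, which is why the divergence is better thought of as distance-like than as a true distance: it is not symmetric in its arguments.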

Much of what we shall do in the probabilistic formalism is to make models of the input data. This is particularly true of the EM algorithm, the subject of our post, which is designed for generating accurate statistical models of data. But what types of models are to be used? The Gaussian distribution is key, because it accurately models the inaccuracies of measurement due to random noise.

Before proceeding to describe the Expectation Maximization algorithm and its justification, it will be useful to look at the sort of problems that we will want to apply it to. In particular, suppose we have a 1-D distribution of data points which we wish to fit. Perhaps the most obvious way to model it is by using a set of individual Gaussian distributions, each of which will correspond to one of the peaks of the input distribution. Mathematically, we can model this as a mixture of Gaussians in which each Gaussian has its own mixture coefficient m. Furthermore, if we are to follow our probabilistic strategy, we will need to express both the input distribution and the result as probability distributions.

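The first thing to do is to represent the Gaussian distribution as a probability distribution integrating to unity:

$$\mathcal{N}(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Here μ and σ are, respectively, the mean and the standard deviation of the distribution. In addition, we follow standard usage in denoting the Gaussian by its alternate name, the normal distribution, using the symbol 𝒩 to represent it. The probability density of this distribution is denoted as:

$$p(x) = \mathcal{N}(x \mid \mu, \sigma)$$

If we take the integral of both sides we get:

$$\int_{-\infty}^{\infty} p(x)\,dx = \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma)\,dx = 1$$

So what does it all mean? It basically means that the total area under the normal, or Gaussian, density is 1.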
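The "integrates to 1" claim is easy to check numerically. Below is a rough sketch using a midpoint-rule integrator; the helper names (`normal_pdf`, `mixture_pdf`, `integrate`) and all parameter values are hypothetical choices of mine. It also checks a two-component mixture whose mixture coefficients sum to 1.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def mixture_pdf(x, weights, mus, sigmas):
    """Weighted sum of Gaussians; the weights are assumed to sum to 1."""
    return sum(m * normal_pdf(x, mu, s) for m, mu, s in zip(weights, mus, sigmas))


def integrate(f, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


# A single Gaussian, integrated over ten standard deviations either side.
single_area = integrate(lambda x: normal_pdf(x, 1.5, 0.7), 1.5 - 7.0, 1.5 + 7.0)

# A two-component mixture with coefficients 0.3 and 0.7.
mixture_area = integrate(
    lambda x: mixture_pdf(x, [0.3, 0.7], [0.0, 5.0], [1.0, 2.0]), -20.0, 25.0
)

print(single_area, mixture_area)  # both should be very close to 1
```

Because each component integrates to 1 and the coefficients sum to 1, the mixture is itself a valid probability density.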

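If we assume that each component zk of the EM latent indicator vector z = {z1, …, zK} contributes a probability factor of mk if zk = 1 and 1 if zk = 0, then according to Bayes’ theorem we have the conditional probability of component k given a sample x drawn from the Gaussian mixture:

$$p(z_k = 1 \mid x) = \frac{m_k\, \mathcal{N}(x \mid \mu_k, \sigma_k)}{\sum_{j=1}^{K} m_j\, \mathcal{N}(x \mid \mu_j, \sigma_j)}$$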
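Here is a small sketch of that Bayes’ theorem step for a 1-D mixture, assuming the standard form in which the membership probability of component k is its weighted Gaussian density divided by the sum of all weighted densities. The function `responsibilities` and the toy parameters are illustrative names of mine, not something taken from the book.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def responsibilities(x, weights, mus, sigmas):
    """Posterior probability that x came from each mixture component."""
    weighted = [m * normal_pdf(x, mu, s) for m, mu, s in zip(weights, mus, sigmas)]
    total = sum(weighted)  # assumed nonzero, i.e. x is not absurdly far from every component
    return [w / total for w in weighted]


# Hypothetical mixture: two equally weighted unit-variance Gaussians.
weights, mus, sigmas = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]

r_mid = responsibilities(2.0, weights, mus, sigmas)   # equidistant from both means
r_left = responsibilities(0.0, weights, mus, sigmas)  # sits on the first mean

print(r_mid)
print(r_left)
```

A point halfway between the two means is shared equally, while a point sitting on the first mean is assigned almost entirely to the first component.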

Thus we’re done with the Expectation part of the EM algorithm. But don’t forget about the probability density function (PDF)! When fitting data to a single Gaussian distribution, it is necessary to take the product of the PDFs of all the individual data points x1, …, xN:

$$L(\mu, \sigma) = \prod_{i=1}^{N} \mathcal{N}(x_i \mid \mu, \sigma)$$

In the Maximization step of the EM algorithm we take the component membership probabilities from the Expectation step to be fixed and solve for the Gaussian parameters μk and σk, along with the mixture coefficients mk. This step, along with the former step, is repeated as many times as necessary to proceed from an initial approximation to a final, much more accurate one.
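The two alternating steps can be sketched for a 1-D mixture as follows. This is an illustrative sketch under my own assumptions (the standard weighted-mean and weighted-variance updates, a fixed iteration count instead of a convergence test, and a hand-picked initial guess), not the book's implementation. The names `em_gmm_1d` and `normal_pdf` and the synthetic data are hypothetical.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def em_gmm_1d(data, weights, mus, sigmas, iterations=100):
    """Fit a 1-D Gaussian mixture by alternating E and M steps."""
    weights, mus, sigmas = list(weights), list(mus), list(sigmas)
    n, k = len(data), len(weights)
    for _ in range(iterations):
        # E-step: membership probability of every point for every component,
        # computed with the current Gaussian parameters held fixed.
        resp = []
        for x in data:
            weighted = [weights[j] * normal_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(weighted)  # assumed nonzero for this well-behaved data
            resp.append([w / total for w in weighted])
        # M-step: re-estimate the parameters with the memberships held fixed.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = math.sqrt(max(var, 1e-12))  # floor avoids a zero-width Gaussian
            weights[j] = nj / n
    return weights, mus, sigmas


# Synthetic data: 300 points from each of two Gaussians (made-up parameters).
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(300)]
data += [random.gauss(6.0, 1.5) for _ in range(300)]

weights, mus, sigmas = em_gmm_1d(data, [0.5, 0.5], [1.0, 4.0], [1.0, 1.0])
print(weights, mus, sigmas)
```

Starting from deliberately wrong means of 1 and 4, the estimates should drift toward the true means of 0 and 6. A real implementation would normally work with log-densities and stop once the log-likelihood stops improving.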

We can generalize the EM algorithm to n dimensions like so:
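One ingredient of any multi-dimensional version is a Gaussian density with a mean vector and a covariance matrix in place of a scalar mean and standard deviation. As a hedged illustration, here is the standard multivariate normal density specialised to two dimensions, so that the 2×2 inverse and determinant can be written by hand. The function name `normal_pdf_2d` and the identity covariance in the example are assumptions of mine.

```python
import math


def normal_pdf_2d(x, mean, cov):
    """Density of a 2-D Gaussian with mean vector `mean` and 2x2 covariance `cov`."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx0, dx1 = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - mean)^T cov^{-1} (x - mean), using the closed-form
    # inverse of a 2x2 matrix: (1 / det) * [[d, -b], [-c, a]].
    quad = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


identity = [[1.0, 0.0], [0.0, 1.0]]  # assumed covariance, for illustration only
peak = normal_pdf_2d((1.0, 1.5), (1.0, 1.5), identity)
print(peak)  # the density at the mean of a unit-covariance Gaussian is 1 / (2 * pi)
```

With this density swapped in for the 1-D one, the Expectation step is unchanged in form, while the Maximization step estimates mean vectors and covariance matrices rather than scalars.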

Let’s put the EM algorithm to the test, shall we? Look at this figure:

The mean positions of these distributions are (1, 1.5), (2, 5), (-2, 5) and the covariance matrix for them is:

The 200 points randomly extracted from each of these Gaussians overlap in nontrivial ways, thereby providing a reasonably complex task for the EM algorithm. Next, we move on to a more immediately useful situation: we segment the samples into a number of subareas. It takes a lot of iterations to get a low error rate, but a powerful computer can do it – based on the data structure that’s holding the Gaussians and so many other factors. In fact, training an EM algorithm takes a long time – to test all the contours of the classifiers… And overall, the separability of the data plays a large role which shan’t be ignored.

Above you see the effect of the EM algorithm in multilevel thresholding. The intensity histogram of image A is shown as a green trace in image C. The EM algorithm is used to obtain a GMM (Gaussian mixture model), shown in red by the six Gaussians in C. All pixels contributing to the green trace between adjacent Gaussian crossings are assigned the mean intensity of the intervening Gaussian and reinserted into the image, as in B. The fit to the cloud intensities is naturally relatively poor, but the other intensities are reasonably matched.
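The thresholding step described above can be sketched as a lookup table: each grey level is assigned to whichever weighted Gaussian is tallest at that level, which puts the boundaries at the crossings between adjacent Gaussians. The three-component mixture below is made up for illustration (the figure uses six Gaussians fitted by EM), and `threshold_lookup` is a hypothetical helper name.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def threshold_lookup(weights, mus, sigmas, levels=256):
    """Map every grey level to the mean of the dominant mixture component."""
    table = []
    for g in range(levels):
        scores = [m * normal_pdf(g, mu, s) for m, mu, s in zip(weights, mus, sigmas)]
        best = max(range(len(scores)), key=scores.__getitem__)
        table.append(round(mus[best]))
    return table


# Made-up GMM parameters standing in for the output of an EM fit.
weights = [0.3, 0.4, 0.3]
mus = [40.0, 120.0, 210.0]
sigmas = [12.0, 20.0, 15.0]

lookup = threshold_lookup(weights, mus, sigmas)
pixels = [10, 45, 118, 130, 200, 250]    # a few sample intensities
quantised = [lookup[p] for p in pixels]  # each replaced by its component mean
print(quantised)
```

One caveat: when the Gaussians have different widths, a wide component can become dominant again far out in the tails, so two components may cross more than once.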

That is it for this short blog post! I hope you’ve enjoyed it. I care neither about quality, nor quantity: I just wanna learn. Perhaps 50% of the post is mine, and the rest is credited to the author, so keep that in mind.

Whilst you’re sipping your sherries, I will be smoking Kent Red and reading chapter 13 of Davies’ book: Classification algorithms. I will make a post about it, I promise!

Enjoy life, as if it’s the last day of it – Drinking this much sherry, it might as well be!

## PAC-Learnable Algorithms: Probably Approximately Correct

Hiyaaaa! It’s been almost a week. I know if I’m going to post my mundane mendacity, which all bring me much satiety, in form of a blog post in an arbitrary time pattern, it’ll take a machine learning algorithm to predict whenever I’m going to make one next, but wait! Let’s drop every mundane hackneyed thing I was gonna talk about and stick with machine learning!

I fell in love with machine learning a few years back, and I just recently bought one of the most intricate books in this subject, MIT Press’s Foundations of Machine Learning, written by two Persians and one Indian. Well whilst they are waiting at the TSA checkpoints, let us not fret and make a series of posts on the damn book! It would be my privilege, nay, my honor! First episode: the PAC framework. I don’t mean to take credit for what I’ve not written myself, so you should know that almost everything is taken from this book. If you enjoy this post, please purchase this book. Of course it’s not the entire book, because that would be illegal, it’s just tidbits from the book. I did so because I can’t learn without a purpose, I can’t do passive learning in my brain, learning’s gotta be active – meaning I shan’t be content with learning something without doing something along with it, so I make blog posts about it! But I have made a visualization of PAC, in Cinema 4D, which you can see when we get to it! Well, at least I’m not a content pirate! (Yeah, Chubak, keep telling yourself that…).

Let the explanations commence!

When we are designing a machine learning algorithm, several fundamental questions shall occupy a healthy man’s mind, such as what can be learned efficiently, what is inherently hard or easy to learn, how many examples are needed, whether there’s a general model for learning, and so on and so forth. In this episode, we begin by formalizing an answer to these questions, in the form of the PAC framework.

PAC stands for Probably Approximately Correct. The PAC framework helps define the class of learnable concepts in terms of the number of sample points needed to achieve an approximate solution. Meaning that it introduces a sample complexity and the time and space complexity of the learning algorithm, all in one package. Imagine the normal attributes for a machine learning algorithm as some sort of Chinese food, all in separate packages, and this so-called PAC as a nice turkey sandwich you can take to work with you.

Let’s first introduce several definitions and the notation needed to represent the PAC model, which we’ll also use in later posts.

We denote the set of all possible examples or instances as 𝒳, which is also sometimes referred to as the input space. The set of all possible labels or target values is denoted by ყ. In this episode ყ is limited to 0 and 1, which corresponds to the so-called binary classification.

A concept is a mapping from 𝒳 to ყ, written c : 𝒳 → ყ (equation 1-1). Since ყ = {0, 1}, we can identify a concept c with the subset of 𝒳 on which it takes the value 1. The set of concepts we may wish to learn is called a concept class, denoted Շ. A concept may be a triangle, or a rectangle, or a hexagon. You can see it more clearly in figure 1-1, but we’re not going to talk about figure 1-1 yet – I just said this to give you an idea.

Let’s assume that examples are i.i.d., that is, independently and identically distributed according to some fixed but unknown distribution, 𝔇. The learning problem is then formulated as follows: the learner considers a fixed set of possible concepts, 𝓗, called a hypothesis set, which may or may not coincide with the concept class, Շ. It receives a sample S = (x1, …, xm) drawn i.i.d. according to 𝔇, as well as the labels (c(x1), …, c(xm)), which are based on a specific target concept c that is a member of Շ. The task is then to use the labeled sample S to select a hypothesis hS that is a member of 𝓗 and has a small generalization error with respect to the concept c. The generalization error of a hypothesis h that is a member of 𝓗, also referred to as the risk, the true error, or simply the error of h, is denoted by R(h) and summed up in the equation below.

$$R(h) = P_{x \sim \mathfrak{D}}\left[h(x) \neq c(x)\right] = E_{x \sim \mathfrak{D}}\left[1_{h(x) \neq c(x)}\right]$$

Let’s have a headcount. h(x) is the hypothesis applied to the input, c(x) is the target concept (c ∈ Շ) applied to the input, 𝔇 is the distribution, and 1ω is the indicator function of the event ω. An indicator function is 1 when its event occurs and 0 otherwise. P is probability, and E is the expected value. In the upcoming visualization, the marbles which fall into the container are 1 and the marbles which fall out are 0.
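When the distribution is small, discrete and known, the generalization error can be computed exactly by adding up the probability of every input on which the hypothesis and the target concept disagree. The toy setup below (ten equally likely inputs, threshold-style `c` and `h`) is entirely hypothetical and only there to make the definition concrete.

```python
def generalization_error(h, c, distribution):
    """Total probability mass of the inputs where h and c disagree."""
    return sum(p for x, p in distribution.items() if h(x) != c(x))


# Hypothetical input space {0, ..., 9} with a uniform distribution.
distribution = {x: 0.1 for x in range(10)}


def c(x):
    """Target concept: label 1 for inputs of 5 and above."""
    return 1 if x >= 5 else 0


def h(x):
    """Hypothesis: label 1 for inputs of 7 and above."""
    return 1 if x >= 7 else 0


risk = generalization_error(h, c, distribution)
print(risk)  # h and c disagree only on the inputs 5 and 6
```

In a real learning problem neither the distribution nor the target concept is known, so this quantity cannot be computed directly.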

The generalization error of a hypothesis is not directly accessible to the learner, since both the distribution 𝔇 and the target concept c ∈ Շ are unknown. However, the learner can measure the empirical error of a hypothesis on the labeled sample S. You can see this empirical risk, for a sample S = (x1, … , xm), formalized in the equation:

$$\hat{R}_S(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{h(x_i) \neq c(x_i)}$$

Thus, the empirical error of h ∈ 𝓗 is its average error over the sample S, while the generalization error is its expected error based on the distribution 𝔇. We can already note that for a fixed h ∈ 𝓗, the expectation of the empirical error based on an i.i.d. sample S is equal to the generalization error, as you can see in the equation below. Remember that m is the maximum index of the sample, and hence the size of the sample.

$$E_{S \sim \mathfrak{D}^m}\left[\hat{R}_S(h)\right] = R(h)$$
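A quick simulation can illustrate that the empirical error, averaged over many independent samples, settles near the generalization error. The toy concept `c`, hypothesis `h`, uniform distribution over {0, …, 9}, sample size and trial count below are all made-up assumptions. In this toy setup the true error is 0.2, because the two functions disagree only on the inputs 5 and 6.

```python
import random


def c(x):
    """Hypothetical target concept: label 1 for inputs of 5 and above."""
    return 1 if x >= 5 else 0


def h(x):
    """Hypothetical fixed hypothesis: label 1 for inputs of 7 and above."""
    return 1 if x >= 7 else 0


def empirical_error(h, c, sample):
    """Fraction of the sample on which h disagrees with c."""
    return sum(1 for x in sample if h(x) != c(x)) / len(sample)


random.seed(1)
m = 20         # sample size
trials = 5000  # number of independent samples drawn

errors = []
for _ in range(trials):
    sample = [random.randrange(10) for _ in range(m)]  # i.i.d. uniform draws
    errors.append(empirical_error(h, c, sample))

average_error = sum(errors) / trials
print(average_error)  # should land near the true error of 0.2
```

Any single sample of 20 points can be noticeably off. It is only the average over samples that matches the true error, and this holds for a hypothesis fixed in advance rather than one chosen after looking at the sample.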

But what is this distribution? Distribution for a discretely-valued function like this is basically the list of all the probabilities for each outcome of 𝒳.

So let’s introduce PAC, or in other words, Probably Approximately Correct: Let n be a number such that the computational cost of representing any element x ∈ 𝒳 is at most O(n), and denote by size(c) the maximal cost of the computational representation of c ∈ Շ. For example, x may be a vector in ℝⁿ, for which the cost of any array-based representation would be in O(n). In addition, let hS denote the hypothesis returned by algorithm 𝒜 after receiving a labeled sample S. To keep notation simple, the dependency of hS on 𝒜 is not explicitly indicated.

So, a concept class Շ is said to be PAC-learnable if there exists an algorithm 𝒜 and a polynomial function poly(·,·,·,·) such that for any ε > 0 and 𝛿 > 0, for all distributions 𝔇 on 𝒳 and for any target concept c ∈ Շ, the following holds true for any sample size m ≥ poly(1/ε, 1/𝛿, n, size(c)):

$$P_{S \sim \mathfrak{D}^m}\left[R(h_S) \le \varepsilon\right] \ge 1 - \delta$$

When such an algorithm 𝒜 exists, it’s said that Շ has a PAC-learning algorithm.

A concept class Շ is thus PAC-learnable if the hypothesis returned by the algorithm 𝒜 after observing a number of points polynomial in 1/ε and 1/𝛿 is approximately correct, meaning the error is at most ε, with high probability (at least 1 – 𝛿), which justifies the PAC terminology. The parameter 𝛿 > 0 is used to define the confidence 1 – 𝛿 and ε > 0 the accuracy 1 – ε. So, with probability at least 1 – 𝛿, the error rate is at most ε. Note that if the running time of the algorithm is polynomial in 1/ε and 1/𝛿, then the sample size m must also be polynomial if the full sample is received by the algorithm.

As you might have noticed, there are two things in the PAC definition which correspond with its name: the probability and the error rate. The first one is for the P in PAC, the second one is for the AC in PAC.

Several key points of the PAC definition are worth emphasizing. First, the PAC framework is a distribution-free model: no particular assumption is made about the distribution 𝔇 from which examples are drawn. Second, the training sample and the test examples used to define the error are drawn according to the same distribution 𝔇. This is a natural and necessary assumption for generalization to be possible in general. It can be relaxed to include favorable domain adaptation problems. Finally, the PAC framework deals with the question of learnability for a concept class Շ and not a particular concept. Note that the concept class Շ is known to the algorithm 𝒜, but of course the target concept c ∈ Շ is unknown.

Now let’s take a look at this visualization I have cooked up in Cinema 4D:

Let me explain how it works. The marbles are the samples. We strain the samples, and the ones that land inside the shared container between 𝓗 and Շ are the expectation of PAC, the correct part. The ones that fall out are the error, the probably part. And the ones that fall in the containers that aren’t shared are the approximate part.

In some cases, the hypothesis hS returned by the algorithm is always consistent, that is, it admits no error on the training sample S. In this part of the blog post, we present a general sample complexity bound, or equivalently, a generalization bound, for consistent hypotheses, in the case where the cardinality |𝓗| of the hypothesis set is finite. Since we consider consistent hypotheses, we will assume that the target concept c is in 𝓗.

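So, let 𝓗 be a finite set of functions mapping from 𝒳 to ყ. Let 𝒜 be an algorithm that for any target concept c ∈ 𝓗 and i.i.d. sample S returns a consistent hypothesis hS, that is, one whose empirical error on S is zero: R̂S(hS) = 0. Then for any ε, 𝛿 > 0, the inequality:

$$P_{S \sim \mathfrak{D}^m}\left[R(h_S) \le \varepsilon\right] \ge 1 - \delta$$

holds if:

$$m \ge \frac{1}{\varepsilon}\left(\log|\mathcal{H}| + \log\frac{1}{\delta}\right)$$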

The price to pay for coming up with a consistent algorithm is the use of a larger hypothesis set 𝓗 containing target concepts. Of course, the upper bound increases with |𝓗|, the cardinal rule of PAC. However, that dependency is only logarithmic. Note that the term log|𝓗|, or the related term log2|𝓗| from which it differs by a constant factor, can be interpreted as the number of bits needed to represent 𝓗. Thus the generalization guarantee of the theorem is controlled by the ratio of this number of bits, log2|𝓗|, and the sample size m.
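To get a feel for how mild the logarithmic dependency is, here is a small sketch that evaluates a sample-size bound of the form m ≥ (1/ε)(log|𝓗| + log(1/𝛿)), using the natural logarithm. The function name `consistent_sample_bound` and the example numbers are mine.

```python
import math


def consistent_sample_bound(h_size, epsilon, delta):
    """Smallest integer m satisfying m >= (ln|H| + ln(1/delta)) / epsilon."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)


# Error at most 5% with confidence 99%, for two hypothesis-set sizes.
small = consistent_sample_bound(1000, epsilon=0.05, delta=0.01)
large = consistent_sample_bound(1000 ** 2, epsilon=0.05, delta=0.01)

print(small)  # 231
print(large)  # 369
```

Squaring the size of the hypothesis set from a thousand to a million only raises the required sample size from 231 to 369, whereas halving ε would roughly double it.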

Along with consistent hypotheses, we also have inconsistent hypotheses. Some argue that they aren’t useful, but dare I say that they may be? I’m not sure.

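Let’s talk about the Bayes error. Given a distribution 𝔇 over 𝒳 × ყ, the Bayes error R* is defined as the infimum of the errors achieved by measurable functions h : 𝒳 → ყ:

$$R^{*} = \inf_{h \text{ measurable}} R(h)$$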

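Finally, let’s talk about noise. Given a distribution 𝔇 over 𝒳 × ყ, the noise at a point x ∈ 𝒳 is the smaller of the two conditional probabilities of the labels given x:

$$\text{noise}(x) = \min\{P[1 \mid x],\; P[0 \mid x]\}$$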
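Both quantities are easy to compute when the joint distribution over inputs and labels is small and known. The sketch below uses a made-up joint distribution over two inputs, "a" and "b", and relies on the fact that the Bayes error equals the expected noise, since the best possible classifier predicts the more likely label at every point. The names `joint`, `noise` and `bayes_error` are illustrative.

```python
# Hypothetical joint distribution P(x, y) over inputs {"a", "b"} and labels {0, 1}.
joint = {
    ("a", 0): 0.30,
    ("a", 1): 0.10,
    ("b", 0): 0.05,
    ("b", 1): 0.55,
}


def marginal(x):
    """P(x): total probability of input x over both labels."""
    return joint[(x, 0)] + joint[(x, 1)]


def noise(x):
    """min(P[1 | x], P[0 | x]): how ambiguous the label of x is."""
    return min(joint[(x, 0)], joint[(x, 1)]) / marginal(x)


def bayes_error(inputs):
    """Expected noise, i.e. the error of always predicting the likelier label."""
    return sum(marginal(x) * noise(x) for x in inputs)


print(noise("a"))                # 0.10 / 0.40
print(noise("b"))                # 0.05 / 0.60
print(bayes_error(["a", "b"]))   # 0.10 + 0.05
```

A noise value of 0 means the label of that point is deterministic, and 0.5 is the worst case, where the label is a coin flip.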

Well, that’s it for today! My next blog post is going to be about something completely unrelated to machine learning, yes, another signal processing post! Keep in mind that I write to learn, but if I appease you, so be it!

If you have any questions, ask, and I will try my best to answer. If you see any errors in this post, tell me and I’ll add it to the addendum just below this very line. Thank you for reading my blog post!