Hiyaaaa! It’s been almost a week. I know if I’m going to post my mundane mendacity, which all bring me much satiety, in form of a blog post in an arbitrary time pattern, it’ll take a machine learning algorithm to predict whenever I’m going to make one next, but wait! Let’s drop every mundane hackneyed thing I was gonna talk about and stick with machine learning!

I fell in love with machine learning a few years back, and I just recently bought one of the most intricate books in this subject, MIT Press’s Foundations of Machine Learning, written by two Persians and one Indian. Well whilst they are waiting at the TSA checkpoints, let us not fret and make a series of posts on the damn book! It would be my privilege, nay, my honor! First episode: PAC framework. I don’t mean to credit for what I’ve not written myself, so you should know that almost everything is taken from this book. If you enjoy this post, please purchase this book. Of course it’s not the entire book, because that would be illegal, it’s just tidbits from the book. I did so because I can’t learn without a purpose, I need can’t do passive learning in my brain, learning’s gotta be active – meaning I shan’t be content with learning something without doing something along with it, so I make blog posts about it! But I have made a visualization of PAC, in Cinema 4D, which you can see when we get to it! Well, at least I’m not a content pirate! (Yeah, Chubak, keep telling yourself that…).

Let the explanations commence!

When we are designing a machine learning algorithm, several fundamental questions shall occupy a healthy man’s mind. Such as the efficiency of what can be learned, the fact that somethings are inherently hard or easy to learn, how many examples are needed, that if there’s a general model for learning, and so on and so forth. In this episode, we begin by formalizing an answer to these questions, in the form of PAC framework.

PAC stands for Probably Approximately Correct. The PAC framework helps define the class of learnable concepts in terms of number of sample points needed to achieve an approximate solution. Meaning that it introduces a sample complexity and the time and space complexity of the learning algorithm, all in one package. Imagine the normal attributes for a machine learning algorithm as some sort of a Chinese food, all in separate packages, and this so-called PAC as a nice Turkey sandwich you can take to work with.

Let’s first introduce several definitions and the notation needed to represent the PAC model, which we’ll also use in later posts as well.

We denote the set of all possible examples or instances as 𝒳, which is also sometimes referred to as the input space. The set of all possible labels or target values is denoted by ყ. In this episode ყ is limited to 0 and 1, which corresponds to the so-called binary classification.

𝒳 maps to ყ; x \rightarrow y. Let’s call this mapping a concept, or Շ. You can see how we denote it in equation 1-1. Since ყ = {0, 1}, we can identify Շ as a subset of 𝒳, in which Շ is either 0 or 1. 𝒜 concept may be a triangle, or a rectangle, or a hexagon. You can see it more clearly in figure 1-1, but we’re not going to talk about figure 1-1 yet – I just said this to give you an idea.

Let’s assume that examples are i.i.d: independently and identically distributed according to some fixed but unknown distribution, 𝔇. The learning problem is then formulated as follows: The learner considers a fixed set of possible concepts, 𝓗, called a hypothesis set, which may or may not coincide with the concept, Շ. It receives a sample S = (x1, …, xm) that is i.i.d according to 𝔇 as well as the labels (Շ(x1), …, Շ(xm)), which are based on a specific target concept that is a member of Շ. The task is then to use the labeled sample S to select a hypothesis hs that is a member of 𝓗 that has a small generalization error with respect to the concept Շ. The generalization error of a hypothesis h that is a member of 𝓗, also referred to as risk or true error or simply, error of h and is denoted by R(h) and summed up in equation below.

R(h) = \underset{x \sim D}{P}[h(x) \neq c(x)] = \underset{x \sim D}{E}[1_{h(x) \neq c(x)}]

Let’s have a headcount. h(x) is the hypothesis of the input, Շ(x) is the concept of the input, 𝔇 is the distribution, and 1w is the indicator function of the even w. Indicator functions are 1 for all the elements of the function, and 0 for anything else. P is probability, and E is the expected value. In the upcoming visualization, the marbles which fall into the container are 1 and the marbles which fall out are 0.

The generalization error of a hypothesis is not directly accessible to the learner since both distribution 𝔇 and the target concept Շ are unknown. However, the learner can measure the empirical error of a hypothesis on the labeled sample S. You can see this empirical risk, for sample S = (x1, … , xm), formalized in equation:

\hat{R}_S(h) = \frac{1}{m} \sum_{i = 1}^{m}1_{h(x_i) \neq c(x_i) .

Thus, the empirical error of h ∈ 𝓗 is its average error over the sample S, while the generalization error is its expected error based on the distribution 𝔇. We can already note that for a fixed h ∈ 𝓗, the expectation of the empirical error based on an i.i.d sample S is equal to the generalization error, as you can see it notarized in equation below. Remember that m is the maximum index of the sample – hence, size of the sample.

\underset{S \sim D^m}{E} [\hat{R}_S(h) ] = R(h)

But what is this distribution? Distribution for a discretely-valued function like this is basically the list of all the probabilities for each outcome of 𝒳.

So let’s introduce PAC, or in other words, Probably Approximately Correct: Let n be a number such that the computational cost of representing any element, x ∈ 𝒳 is at most O(n) and denote by size(c) the maximal cost of the computational representation of c ∈ Շ. For example, 𝒳 may be a vector in Rn. For which the cost of any array-based representation would be in O(n). In addition, let hs denote the hypothesis returned by algorithm 𝒜 after received a labeled sample S. To keep notation simple, the dependency of hs on 𝒜 is not explicitly indicated.

So, a concept class Շ is said to be PAC-learnale if there exists an algorithm 𝒜 and a polynomial functiony poly(.,.,.,.) such that for any ε > 0 and > 0, for all distributions 𝔇 on 𝒳 and for any target concept c ∈ Շ, the following holds true for any sample size m ≥ poly(1/ ε, 1 /𝛿, n, size(c):

\underset{S \sim D^m} {P} [R(h_S) \leq \epsilon] \geq 1 - \delta

When such an algorithm 𝒜 exists, it’s said thatՇ has a PAC-learning algorithm.

A concept class Շ is thus PAC-learnable if the hypothesis returned by the algorithm 𝒜 after observing a number of points polynomial in 1/ ε and 1 /𝛿 is approximately correct – meaning the error is at most ε, with the high probability (at least 1 – 𝛿), which justifies the PAC terminology. The parameter 𝛿 > 0 is used to define confidence 1 – 𝛿 and ε > 0 the accuracy 1 – ε. So error rate must be between delta and epsilon. Note that if the running time of the algorithm is polynomial in 1 / ε and 1 / 𝛿, then the sample size m must also be polynomial if the full sample is received by the algorithm.

As you might have noticed, there are two things in PAC definition which correspond with its name. The probability, and the error rate. The first one is for the P in PAC, the second one is for the AC in PAC.

Several key points of the PAC definition are worth emphasizing. First, the PAC framework is a distribution-free model: no particular assumption is made about the distribution 𝔇 from which examples are drawn. Second, the training sample and the test examples used to define the error are drawn according to the same distribution 𝔇. This is a natural and necessary assumption for generalization to be possible in general. It can be relaxed to include favorable domain adaptation problems. Finally, the PAC framework deals with the questionm of learnability for a concept class Շ and not a particular concept. Note that the concept class Շ is known to the algorithm 𝒜, but of course the target concept c ∈ Շ is unknown.

Now let’s take a look at this visualization I have cooked up in Cinema 4D:

 

 

Figure 1-1

Let me explain how it works. The marbles are the samples. We strain the samples, and the ones that land inside the shared container between 𝓗 and Շ are the expectation of PAC, the correct part. The ones that fall out are the error, the probably part. And the ones that fall in the containers that aren’t shared are the approximate part.

Sometimes, the hypothesis hS returned by the algorithm is always consistent, that is, it admits no error on the training sample S. In this part of the blog post, we present a general sample complexity bound, or equivalently, a generalization bound, for consistent hypotheses, in the case where cardinality |𝓗| of the hypothesis set is infinite. Since we consider consistent hypotheses, we will assume that the target concept c is in 𝓗.

So, let 𝓗 be a finite set of functions mapping from 𝒳 to ყ – Let 𝒜 be an algorithm that for any target concept c ∈ 𝓗 and i.i.dsample S returns a consistent hypothesis hS : RS (hS) = 0. Then for any ε, 𝛿 > 0, the inequality:

\underset{S \sim D^m} {P} [R(h_S) \leq \epsilon] \geq 1 - \delta

holds if:

m \leq \frac{1}{\epsilon} \left(log|H| + log \frac{1}{\delta}\right)

The price to pay for coming up with a consistent algorithm is the use of a larger hypothesis set 𝓗 containing target concepts. Of course, the upper bound increases with |𝓗|, the cardinal rule of PAC. However, that dependency is only logarithmic. Note that the term log|𝓗|,o or the related term log2|𝓗| from which it differs by a constant factor, can be interpreted as the number of bits needed to present 𝓗. Thus the generalization, guarantee of the theorem is controlled by the ratio of this number of bits, log2|𝓗|, and the sample size m.

Along with consistent hypotheses, we also have inconsistent hypotheses. Some argue that they aren’t useful, but dare I say that they may be? I’m not sure.

Let’s talk about Baye’s error. Given a distribution 𝔇 over 𝒳 ×ყ , , the Baye’s error R* is defined as:

R^\star = \underset{h - measurable}{inf} R(h)

Finally, let’s talk about noise. Noise is the minimum of the conditional probability of ყ with 𝒳.

Well, that’s it for today! My next blog post is going to be about something completely unrelated to machine learning, yes, another signal processing post! Keep in mind that I write to learn, but if I appease you, so be it!

If you have any questions, ask, and I will try my best to answer. If you see any errors in this post, tell me and I’ll add it to the addendum just below this very line. Thank you for reading my blog post!