## Introduction to Boosting and AdaBoost

I know I promised a post on regression, but then I realized I only have a shallow understanding of Boosting and AdaBoost. So I biked to the nearest public library, when to the index cards, search for ‘Boost’ and after perusing through hundreds of self-help books, I found the greatest resource on AdaBoost: “How to Boost Your Spirit by Ada MacNally”. Nah, kidding. Such book doesn’t exist. The following, as always, is my study notes, taken from Foundations of Machine Learning by MIT Press and Machine Learning in Action by Peter Harrington published by Manning. The first book is heavy on math. The second book is more of a fluff book and is much simpler than the first one.

It is often difficult, for a non-trivial learning task, to directly devise an accurate algorithm satisfying the strong PAC-learning requirements that we saw in the PAC-learnable algorithm post I wrote before (link) but, there can be more hope for finding simple predictors guaranteed only to perform slightly better than random. The following gives a formal definition of such weak learners. As in the PAC-learning post, we let $n$ be a number such that the computational cost of representing any element $x \in \mathcal{X}$ is at most $O(n)$ and denote by size(c) the maximal cost of computational representation of $c \in \mathcal{C}$.

Let’s define weak learning. A concept class $\mathcal{C}$ is said to be weakly PAC-learnable if there exists an algorithm $\mathcal{A}, \gamma > 0$ and a polynomial function $poly(., . , .)$ such that for any $\delta > 0$, for all distributions $\mathcal{D}$ on $\mathcal{X}$ and for any target concept $c \in \mathcal{C}$, the following holds for any sample size $m \geq poly(1/\delta,n,size(c))$:

$\underset{\mathcal{S} \sim \mathcal{D}^m}{\mathbb{P}}\left[R(h_s) \leq \frac{1}{2} – \gamma \right] \leq 1 – \delta,$

where $h_s$ is the hypothesis returned by algorithm $\mathcal{A}$ when trained on sample $S$. When such an algorithm $\mathcal{A}$ exists, it is called a weak learning algorithm for $\mathcal{C}$ or a weak learner. The hypothesis returned by a weak learning algorithm is called a base classifier.

The key idea behind boosting techniques is to use a weak learning algorithm to build a strong learner. That is, an accurate PAC-learning algorithm. To do so, boosting techniques use an ensemble method: they combine different base classifiers returned by a weak learner by a weak learner to create more accurate predictors. But which base classifier should be used and how should they be combined?

Another visualization shall be:

The algorithm takes as input a labeled sample $S=((x_1, y_1),\ldots,(x_m, y_m))$, with $(x_i, y_i) \in \mathcal{X} \times \{-1, +1\}$ for all $i \in [m]$, and maintains a distribution over the indices $\{-1, \ldots, m\}$. Initially, lines 1-2, the distribution uniform ($\mathcal{D}_1$). At each round of boosting, that is each iteration $t \in [T]$, of the loop in lines 3-8, a new base classifier $h_t \in \mathcal{H}$ is selected that minimizes the error on the training sample weighted by the distribution $\mathcal{D}_t$:

$h_t \in \underset{h \in \mathcal{H}}{argmin}\underset{i \sim \mathcal{D}_t}{\mathbb{P}} [h(x_i) \neq y_i] = \underset{h \in \mathcal{H}}{argmin}\sum_{i=1}^{m}\mathcal{D}_t(i)1_{h(x_i) \neq y_i} .$

$Z_t$ is simply a normalization factor to ensure that the weights $D_{t+1}(i)$ sum to one. The precise reason for the definition of the coefficient at $\alpha_t$ will become clear later. For now, observe that if $\epsilon_t$, the error of the base classifier, is less than $\frac{1}{2}$, then $\frac{1-\epsilon_t}{\epsilon_t} > 1$ and $\alpha_t$ is positive. Thus, the new distribution $\mathcal{D}_{t+1}$ is defined from $\mathcal{D}_t$ by substantially increasing the weight on $i$ if point $x_i$ is incorrectly classified, and, on the contrary, decreasing it if $x_i$ is correctly classified. This has the effect of focusing more on the points incorrectly classified at the next round of boosting, less on those correctly classified by $h_t$.

After $T$ rounds of boosting, the classifier returned by AdaBoost is based on the sing of function $f$, which is a non-negative linear combination of the base classifiers $h_t$. The weight $\alpha_t$ assigned to $h_t$in that um is a logarithmic function of the ratio of the accuracy $1-\epsilon_t$ and error $\epsilon_t$ of $h_t$. Thus, more accurate base classifiers are assigned a larger weight in that sum.

For any $t \in [T]$, we will denote by $f_t$ the linear combination of the base classifiers after $t$ rounds of boosting: $f_t = \sum_{s=1}^{t}\alpha_sh_s$. In particular, we have,e $f_T = f$. The distribution $\mathcal{D}_{t+1}$ can be expressed in terms of $f_t$ and the normalization factors $Z_s$, $s \in [t]$ as follows:

$\forall i \in [m], \mathcal{D}_{t+1}(i)=\frac{e^{y_if_t(x_i)}}{ m \prod_{s=1}^{t}Z_s} .$

The AdaBoost algorithm can be generalized in several ways:

– Instead of a hypothesis with minimal weighted error, $h_t$ can be more generally the base classifier returned by a weak learner algorithm trained on $D_t$;

– The range of the base classifiers could be $[-1, +1]$, or more generally bounded subset of $\mathbb{R}$. The coefficients $\alpha_t$ can then be different and may not even admit a closed form. In general, they are chosen to minimize an upper bound on the empirical error, as discussed in the next section. Of course, in that general case the hypotheses $h_t$ are not binary classifiers, but their sign could define the label and their magnitude could be interpreted as a measure of confidence.

AdaBoost was originally designed to address the theoretical question of whether a weak learning algorithm could be be used to derive a strong learning one. Here, we will show that it coincides in fact with a very simple algorithm, which consists of applying a general coordinate descent technique to a convex and differentiable objective function.

For simplicity, in this section, we assume that the base classifier set $\mathcal{H}$ is finite, with cardinality $N: \mathcal{H} = /{h_1, \ldots, h_N/}$. An ensemble function $f$ such as the one returned by AdaBoost can then be written as $f = \sum_{j=1}^N\bar{\alpha}_j h_j$. Given a labeled sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$ let $F$ be the objective function defined for all $\boldsymbol{\bar{\alpha}} = (\bar{alpha}_1, \ldots, \bar{\alpha}_N) \in \mathbb{R}^N$ by:

$F(\boldsymbol{\bar{\alpha}}) = \frac{1}{m}\sum_{i=1}^{m}e^{y_i f(x_i)} = \frac{1}{m}\sum_{i=1}^m e^{y_i \sum_{j=1}{N}\bar{\alpha}h_j(x_i)}.$

$F$ is a convex function of $\boldsymbol{\bar{\alpha}}$ since it is a sum of convex functions, each obtained by composition of the (convex) exponential function with an affine function of $\boldsymbol{\bar{\alpha}}$.

The AdaBoost algorithm takes the input dataset, the class labels, and the number of iterations. This is the only parameter you need to specify in most ML libraries.

Here, we briefly describe the standard practical use of AdaBoost. An important requirement for the algorithm is the choice of the base classifiers or that of the weak learner. The family of base classifiers typically used with AdaBoost in practice is that of decision trees, which are equivalent to hierarchical partitions of the space. Among decision trees, those of depth one, also known as stumps, are by far the most frequently used base classifiers.

Boosting stumps are threshold functons associated to a single feature. Thus, a stump corresponds to a single axis-aligned partition of space. If the data is in $\mathbb{R}^N$, we can associate a stump to each ot he N components. Thus, to determine the stump with the minimal weighted error at each round of boosting, the best component and the best threshold for each component must be computed.

Now, let’s create a weak learner with a decision stump, and then implement a different AdaBoost algorithm using it. If you’re familiar with decision trees, that’s OK, you’ll understand this part. However, if you’re not, either learn about them, or wait until I cover them tomorrow. A decision tree with only one split is a decision stump.

The pseudocode to generate a simple decision stump looks like this:

Set the minError to +∞

     For every feature in dataset:

           For every step:

                    For each inequality:

                           Build a decision stump and test it with the weighted dataset

                            If the error is less than minError:

 Set this stump as the best stump

Return the best stump

Now that we have generated a decision stump, let’s train it:

For each iteration:

         Find the best stump using buildStump()

           Add the best stump to the stump array

           Calculate α

           Calculate the new eight vector

           Update the aggregate class estimate

           If the error rate is 0:

               Break out of the loop

Well, that is it for today! I know I keep promising one thing, and deliver another, but this time I’m really going to talk about decision trees!

Semper Fudge!