Although machine learning is great for shape classification, for shape recognition, we must still use the old methods. Methods such as Hough Transform, and RANSAC.

In this post, we’ll look into using Hough Transform for recognizing straight lines. The following is taken from E. R. Davies’ book, Computer Vision: Principals, Algorithms, Application, Learning and Image Digital Image Processing by Gonzalez and Woods.

Straight edges are amongst the most common features of the modern world, arising in perhaps the majority of manufactured objects and components – not least in the very buildings in which we live. Yet, it is arguable whether true straight lines ever arise in the natural state: possibly the only example of their appearance in virgin outdoor scenes is in the horizon – although even this is clearly seen from space as a circular boundary! The surface of water is essentially planar, although it is important to realize that this is a deduction: the fact remains that straight lines seldom appear in completely natural scenes. Be all this as it may, it is clearly vital both in city pictures and in the factory to have effective means of detecting straight edges. This chapter studies available methods for locating these important features.

Historically, HT has been the main means of detecting straight edges, and since the method was originally invented by Hough in 1962, it has been developed and refined for this purpose. We’re going to concentrate on it on this blog post, and this also prepares you to use HT to detect circles, ellipses, corners, etc, which we’ll talk about in the not-too-distant future. We start by examining the original Hough scheme, even thoupgh it is now seen to be wasteful in computation since it has evolved.

First, let us introduce Hough Transform. Often, we have to work in unstructured environments in which all we have is an edge map and no knowledge about where objects of interest might be. In such situations, all pixels are candidates for linking, and thus have to be accepted or eliminated based on predefined global properties. In this section, we develop an approach based on whether set of pixels lie on curves of a specified shape. Once detected, these curves form the edge or region boundaries of interest.

Given $n$ points in the image, suppose that we want to find subsets of these points that lie on straight lines. One possible soltion is to fine all lines determined by every pair of points, then find all subsets of points that are close to particular lines. This approach involves finding $n(n-1)/2 \sim n^2$ lines, then performing $(n)(n(n-1))/2 \sim n^3$ comparisons for every points to all lines. As you might have guessed, this is extremely computationally expensive task. Imagine this, we check every pixel for neighboring pixels and compare their distance to see if they form a straight line. Impossible!

Hough, as we said, in 1962, proposed an alternative approach to this scanline method. Commonly referred to as the Hough transform. Let $(x_i, y_i)$ denote p point in the xy-plane and consider the general equation of a straight line in slope-intercept form: $y_i = ax_i + b$. Infinitely many lines pass through $(x_i, y_i)$ but they all satisfy the equation we saw, for varying values of $a$ and $b$. However, writing this equation as $b = -x_i a+y_i$ and considering the ab-plane – also called parameter space, yields the equation of a single line in parameter space associated with it, which intersects the line associated with $(x_i, y_i)$ at some point $(a\prime, b\prime)$ in parameter space, where $a\prime$ is the slope and $b\prime$ is the intercept of the line containing the both $(x_i, y_i)$s in the xy-plane, and of course, we are assuming that lines are not parallel, in fact, all points on this line have lines in parameter space that intersect at $(a\prime, b\prime)$. Here, this figure illustrates s the concepts:


In principle, the parameter space lines corresponding to all points $(x_k, y_k)$ in the xy-plane could be plotted, and the principal (goddammit, principle, principal, fuck this language!) lines in that plane could be found by identifying points in parameter space where large numbers of parameter-space lines intersect. However, a difficulty with this approach is that $a$, approaches infinity as the lines approaches vertical direction. One way around this difficulty is to use the normal representation of a line:

\[ x \cos(\theta) + y sin(\theta) = \rho \]

Figure on the right below demonstrates the geometrical interpretation of parameters $\rho$ and $\theta$. A horizontal line has $\theta = 0^\circ$, with $\rho$ being equal to the positive x-intercept. Similarly, a vertical line has $\theta = 90^\circ$, with $\rho$ being equal to positive y-intercept. Each sinusoidal curve in the middle of the figure below represents the family of lines that pass through a particular point $(x_k. y_k)$ in xy-plane.

Let’s talk about the properties of Hough transform. Figure below illustrates the Hough transform based on the equation above.

On the top, you see an image of size $M\times M \bigvee M=101$ with five labeled white points, and below it shows each of these points mapped into the parameter space, $\rho\theta$-plane using subdivisions of one unit for the $\rho$ and $\theta$ axes. The range of $\theta$ values is $\pm 90^\circ$ and the range of $\rho$ values is $\pm \sqrt{2} M$ As the bottom image shows, each curve has a different sinusoidal shape. The horizontal line resulting from the mapping of point 1 is a sinusoid of zero amplitude.

The points labeled A and B in the image on the bottom illustrate the colinearity detection property of the Hough transform. For exampele, point B marks the intersection of the curves corresponding to points 2, 3, and 4 in the xy image plane. The location of point A indicates that these thre points line on a straight line passing through the origin $(\rho = 1)$ and oriented at $-45^\circ$. Similarly, the curves intersecting at point B in parameter space indicate that 2, 3, and 4 line on a straight like oriented at $45^\circ$, and whose distance from origin is $\rho = 71$. Finally, Q, R, and S illustrate the fact that Hough transform exhibits a reflective adjacency relationship at the right and left edges of the parameter space.

Now that we know the basics of HT and line detection using HT, let’s take a look at Longitudinal Line Localization.

The previous method is insensitive to where along the infinite idealized line an observed segment appear. He reason for this is that we only have two parameters, $\rho$ and $\theta$.There is some advantage to be gained in this, in parital occlusion of line does not prevent its detection: indeed, if several segments of a line are visible, they can all contribute to the peak in parameter space, hence improving senitivity. On the other hand, for full image interpretation, it is useful to have information about the longitudinal placement of line segments.

Ths is achieved by a further stage of processing. The additional stage involves finding which points contributed to each peak in the main parameter space, and carrying out connectivity analysis in each case. Some call this process xy-gruping. It is not vital that the line segments should be 4-connected (meaning, a neighborhood with only the vertical and horizontal neighbors) or 8-connected (with diagonal neighbors) – just that there should be sufficient points on them so that adjacent points are withing a threshold distance apart, i.e. groups of points arem erged if they are withing prespecified distance. Finally, segments shorter than a certain minimum length can be ignored too insignificant to help with image interpretation.

The alternative method for saving computation time is the Foot-of-Normal method. Created by the author of book I’m quoting from, it eliminates the use of trigonometric functions such as arctan by employing a different parametrization scheme. Both the methods we’ve described employ abstract parameter spaces in which poitns bear no immediately obvious relation to image space. N the alternative scheme, the parameter spaces in which points bear no immediately obvious visual relation to image space. In this alternative scheme, the parameter space is a second image space, whifcfh is congruent to image space.

This type of parameter space is obtained in the following way. First, each edge fragment in the image is produced much as required previously so that $\rho%” can be measured, but this time the foot of the normal from the origin is taken as a voting position in the parameter space. Taking %(x_0, y_0) as the foot of the normal from the origin to the relevant line, it is found that:

\[b/a = y_0/x_0 \]

\[(x-x_0)x_0 + (y-y_o)y_0 \]

Thes etwo equations are sufficient to compute the two coordinates, $(x_0, y_0)$. Solving for $x_0$ and $y_0$ gives:

\[ x_0 = va \]

\[y_0 = vb \]


\[ \frac{ax + by}{a^2 + b^2} \]

Well, we’re done for now! It’s time to take a shower, then study regression, as I’m done with classification. I’m going to write a post about regression, stay tuned!

Sherry is one of those beverages that only stuck-up yuppies drink. I’ve personally never tasted it, for obvious reasons, but I’d imagine it’ll taste like any other fermented grape drink – stingy and bitter. I think people drink these beverages mostly because they’re used to them! Aren’t I right? I think machine learning can be used in the alcoholic beverage industry… Like, how long should this casket sit in the cellar? I dunno.

I wanted to make another post based on Foundations of Machine Learning by MIT Press… But I decided to go the other way and be more succinct: make a general post about machine learning. In the last post I talked about chapter 14 of E. R. Davies book. This chapter is titled “Probabilistic Machine Learning”. I’m gonna make a post about this chapter, and just hope to Lord Almighty that I’ll learn something cool… Of course I’ll learn something cool! What am I saying? Let’s start talking about probabilistic machine learning. Starting off with EM algorithm, EM standing for Expectation Maximization.

Perhaps the main point about probabilistic optimization is that we are always in a situation where we have an absolute mathematical goal – to ensure that the solutions we are seeking are subject to ever-increasing probability. This is important because, when analyzing data involving a large component of randomness, we can never be sure whether any real improvement is being made. But if we can prove mathematically that the process of change can only increase the probability of correct interpretation, we have a crucially important tool at our fingertips!

This is all good and fine, but how exactly will we formulate probabilistic arguments in such a way as to achieve our aims? The answer to this questions lies in the fact that by the 2010s many tools have been developed to let this happne, and at this very moment in time progress in this area is accelerating.

There are concrete methodology, the most powerful of them is Baye’s theory, the sin qua non in the area of applied statistics. There’s then Jensen’s inequality, and Kulback-Leibler divergence formula, which gives a distance measure showing how two different probabilistic distributions are. Then there is Newton’s method of approximation, which is fundamental, but which can be bettered in relevant cases by the Expectation Maximization algorithm. Among all this theory and methodology, we must not forget such basic probability ideas as the vertical bar notation, which allows probabilities to be reexpressed using the product rule, p(A, B) = p(A|B)p(B). These are all fine and dandy, but is there an algorithm that is based on the normal distribution, you may inquire? And I answer, yes, there is! And we’re talking about it.

Much of what we shall do in the probabilistic formalism is to make models of the input data – this being particularly true of EM algorithm, and EM being the subject of our post, is designed for generating accurate statistical models of data. But what types of models are to be used? Gaussian distribution is key, this is because it accurately models the inaccuracies of measurement due to random noise.

Before proceeding to describe the Expectation Maximization algorithm, and its justification, it will be useful to look at the sort of problems that we will want to apply to it. In particular, suppose we have a 1-D distribution of data points which we wish to fit: Perhaps, the most obvious way to model it is by using a set of individual Gaussian distributions, each of which will correspond to one of the peaks of the input distribution. Mathematically, we can model this as a mixture of Gaussians in which each Gaussian has its own mixture coefficient m. Furthermore, if we are to follow our probabilistic strategy, we will need to express both the input distribution and the result as probability distributions.

The first thing to do is to represent the Gaussian distribution as a probability distribution integrating to unity:

    \[ \mathcal{N}(x|\mu, \sigma) = \frac{1}{(2\pi\sigma^2)^(1/2)}exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right] \]

\mu and \sigma respectively, are the mean and the standard deviation of the distribution. In addition, we follow standard usage in denoting the Gaussian by its alternate name, the normal distribution, using symbol \mathcal{N} to represent it. Probability of this distribution is denoted as:

    \[ p(x) = \sum_{k = 1}^{K}m_k\Nu(x|\mu_k, \sigma_k) \]

If we take the integral of both sides we get:

    \[ \sum_{k = 1}^{K} = 1 \]

So what does it all mean? It basically means that the normal or Gaussian distribution sums up to 1.

If we assume that the probability of EM sample distribution vector z = {z1, …, zk} is mk if zk = 1 and 1 if zk = 0, then according to Baye’s theorem we have the conditional probability of EM vector sample and the Gaussian distribution:

    \[ p(x) = \sum_{z}p(x|z)p(z) = \sum_{k = 1}^{K}m_k\mathcal{N}(x|\mu_k, \sigma_k) \]

    \[ \rho(z_k) = p(z_k = 1 | x) = \frac{p(x|z_k = 1)p(z_k = 1)}{\sum_{j = 1}^{K}}p(x|z_j=1)p(z_j = 1) = \frac{p(x|z)p(z)}{\sum_{z}p(x|z)p(z)} \]

Thus we’re done with the Expectation part of the EM algorithm. But don’t forget about the Probability Density Function! When fitting data to a single Gaussian distribution, it is very necessary to take a products of PDFs of all the individual data points:

    \[ p(x_1, … , x_I|\mu,\sigma) = \prod_{i = 1}^{N}p(x_i|\mu,\sigma) = \prod_{i = 1}^{N} \mathcal{N}(x_i|\mu, \sigma) \]

In the Maximization step of EM algorithm we take the mixture of parameters to be fixed and solve for the Gaussian parameters \mu_k, \sigma_k. This step, along with the former step, are recycled as many times as necessary to proceed from an initial approximation to a final, much more accurate one.

We can generalize EM algorithm for n-dimensions like so:

    \[ \mathcal{N}(x | \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}exp\left[-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x-\mu)\right] \]

Let’s put EM algorithm to test, shall we? Look at this figure:


Look at them contours! That’s what I want in a woman!

The mean positions of this distributions are (1, 1.5), (2, 5), (-2, 5) and the covariance matrix for it is:

    \[ \left( \begin{array}{ccc}2 & 0 \\0 & 0.4 \\\end{array} \right)\]

    \[ \left( \begin{array}{ccc}0.5 & 0 \\0 & 1.5 \\\end{array} \right)\]

    \[ \left( \begin{array}{ccc}1 & -0.5 \\-0.5 & 1 \\\end{array} \right)\]

The 200 points randomly extracted from each of these Gaussians overlap in nontrivial wways, thereby providing a reasonably complex task for the EM algorithm. Next, we move on to a more immediately useful situation: we segment the samples into a number of subareas. It takes a lot of iterations to get a low error rate, but a powerful computer can do it – based on the data structure that’s holding the Gaussians and so many other factors. In fact, training an EM algorithm takes a long time – to test all the contours of the classifiers… And overall, the separability of the data plays a large role which shan’t be ignored.

Threshold… Sounds like a Thrash Metal band!

Above you see the effect of EM algorithm in multilevel thresholding. The intensity of histogram of image A is shown s a green trace in image C. The EM algorithm is used to obtain a GMM as shown in red by the six Gaussians in C. All pixels contributing to the green trace between adjacent Gaussian crossings are assigned to the mean intensity of the intervening Gaussian and rensterted into he image, as in B. The fit to the cloud intensities is naturally relatively poor, but he other intensities are reasonably matched.

That is it for this short blog post! I hope you’ve enjoyed it. I care neither about quality, nor quantity: I just wanna learn. Perhaps 50% of the post is mine, and the rest is credited to the author, so keep that in mind.

Whilst you’re sipping your sherries, I will be smoking Kent Red and reading chapter 13 of Davies’ book: Classification algorithms. I will make a post about it, I promise!

Enjoy life, as if it’s the last day of it – Drinking this much sherry, it might as well be!

Oh boy, I’m on fire yo! Two posts in one day… To be fair, I wrote the last post yesterday. Anyhow, I don’t think anyone cares. But you reap what you sow. By the time I’m finished with this post my knowledge will be richer, and more robust. So what am I complaining about? As I said before I write to learn. So be it!

But first, let’s talk about one of my heroes, E. Roy Davies. He’s one of the forerunners of computer and machine vision, a subset of pattern recognition – one of the three things my profession is about. This post is based on a few chapters of his opus, Computer Vision: Algorithms, Applications, Learning.

Handsome Devil Behind It All

Alright. Enough sucking up to the masters of the field. Let’s talk about Artificial Neural Networks.

ANNs were launched off in the 1950s, and continued well into the 60s, and research into them still continues to this day. Beldoe and Browning developed n-tuple type of classifier that involved bitwise recording and lookup of binary feature data, leading to weightless or logical type of ANN. Rosenblatt’s perceptron, although, was more important than this algorithm. Let’s talk about this “perceptron”.

The simple preceptron is a linear classifier that classifies patterns into two classes. It takes a feature vector, x = (x1, x2, …, xN) as its input and produces a single scalar output sum_{i = 1}^{N} w_i x_i, the classification process being completed by applying a threshold function at \theta. The mathematics is simplified by writing - \theta as w0, and taking it to correspond to an input x0 which is maintained at a constant value of unity. The output of the linear part of the classified is then written in the form:

d = sum_{i = 1}^{N} w_i x_i - \theta = sum_{i = 1}^{N} w_i x_i + w_0 = um_{i = 1}^{N} w_i x_i

and the final output of the classifier is given by:

y = f(d) f \left( sum_{i = 1}^{N} w_i x_i \right)

This type of neuron – which as we said before, is called preceptron, can be trained using a variety of procedures, such as the fixed increment rule. The basic concept of this algorithm was to try to improve the overall error rate by moving the linear disciminant planea fixed distance toward a position where no misclassifcation would occur – by only doing this when a classification error has occurred.

w_i(k + 1) = w_i(k) \t y(k) = \omega(k)

w_i(k + 1) = w_i(k) + \eta[\omega(k) - y(k)]x_i(k) y(k) \neq \omega(k)

Let me interpret all this mumbo jumbo. What it basically means. First, take a look at this image, courtesy of Towards Data Science:



So you can clearly see what happens. A bunch of inputs are given, then they are weighted by their threshold – the hyperplane if you will. Then they are summed up. If this number is less than a value, it’s one binary choice, if it’s greater than the value, it’s another binary choice. This is a Heaviside function.

The photo below you shows the difference between separable and non-separable data. The hyperplane could do so much more if the data are separable!



So far, our preceptrons have been single-layered. Meaning that only one layer of hyperplanes can be ousted. The other concept is multilayer preceptron, or MLP. Rosenblatt himself suggested such networks, but was unable to work out his magic and represent them. Ten years later, in 1969, Minsky and Papert published their famous monograph in which they discussed MLP. It wasn’t until 1986 that Rumbelhart et al were successful in proposing a systemic approach to training of MLPs. Their solution is known as the back-propagation algorithm.

Let’s talk about it.

The problem of trainign a MLP can be simply stated as: a general layer of a MLP obtains its first feature data from the lower layers and receives its class data from higher levels. Henece, if all the weights in the MLP are potentially changeable, the information reaching a particular layer cannot be relied upon. There is no reason why training a layer in isolation would lead to overall convergence of the MLP toward and ideal classifier. Although it might be thought that this is a rather minor difficulty, in fact, this is not so; indeed, this is but one example of the so-called credit assignment problem. What is this problem? Well, it’s correctly determinign the local origins or global properties and making the right assignment of rewards, punishments, corrections, and so on.

The key to solving these problems was to modify the preceptron cwomposing the MLP by giving them a less hard activation function that the Heaviside thresholding function whch results in \theta and we call its negation w0, the hyperplane – we give started changing the Heaviside function to a sigmoid shape, such as the tanh function.

Once these softer activation functions were used, it became possible for each layer of the MLP to feel the data more precisely, and thus training procedures could be set up on a systematic basis. In particular, the rate of change of the data at each individual neuron could be communicated to other layers that could then be trained appropriately – though only on an incremntal basis. I’m not going to bore you with the mathematical details, just some points regarding the algorithm:

1) The outputs of one node are the inputs of the next, and an arbitrary choice is made to label all variables as output (y) parameters rather than input (x) variables, all output parameters are in the range 0 to 1 (because of tanh, duh!)

2) The class parameter \omega has been generalized as the target value t of the output variable y.

3) For all except the final outputs, the quantity of \delta_j has to be calculated using the formula \delta_j = y_j (1 - y_j)(\sum_{\delta_m}{w_{jm}}), the summation having to be taken over all the nodes in the layer above node j.

4) The sequence for computing the node weights involves starting with the output nodes and the proceeding downwards one layer at the time.

In the figure below you can see the difference between Heaviside activator, linear activator, and Sigmoid activator.

I have prepared a short video for people who are visual learners. Let’s hope it helps them:


Ok. That’s it! I’m not sure what the next post is going to be about. But a little birdie tells me it’s going to be about Deep Learning! So if you really, really read my blog, look out for it!

What am I going to do? Well, first I’m going to a casino and use my X-Ray Auto Blackjack Aviator Specs to win some hands. Then I’m gonna read chapter 14 of Davies books. It’s a refresher on basic Machine learning concepts. I hope I don’t fall asleep, I’ve been awake for 14 hours!