 The term “support vector machines” is a confusing name for a predictive analytics algorithm. The fact is this term is very much a misnomer: there is really no specialized machine (other than a basic PC) involved, and certainly no smart robot. But it is a very powerful algorithm that has been very successful in applications ranging from text analysis to pattern recognition.

There are a few basic definitions which one must digest. In this article we will cover these minimal definitions and highlight a fundamental advantage SVMs have over other data mining techniques.

Background

At the very basic level, a support vector machine is a classification method. It works on the principle of fitting a boundary to a region of points which are all alike (that is, belong to one class). Once a boundary is fitted (on the training sample), for any new points (test sample) that need to be classified, we must simply check whether they lie inside the boundary or not. The advantage of SVM is that once a boundary is established, most of the training data is redundant. All it needs is a core set of points which can help identify and set the boundary. These data points are called support vectors because they “support” the boundary. Why are they called vectors? Clearly because each data point (or observation) is a vector: that is, a row of data that contains values for a number of attributes.

This boundary is traditionally called a hyperplane. In a simple example of two dimensions (two attributes), this boundary can be a straight line or a curve (as shown above). In three dimensions it can be a plane or a complex surface. Higher dimensions are obviously impossible to visualize and a hyperplane is thus a generic name for the boundary in more than 3 dimensions.

As seen in the above image, a number of such hyperplanes can be found. Which one is the “correct” one? Clearly a boundary which cleanly separates the classes is the best one. Additionally, a boundary line which ensures that the average geometric distance between the two regions (or classes) is maximized is even better. This (n-dimensional) distance is called a margin. An SVM algorithm therefore essentially runs an optimization scheme to maximize this margin.

But it is not always possible to ensure that data can be cleanly separated. It may be rare to find that the data are linearly separable. When this happens, there may be many points within the margin. In this case the best hyperplane is the one which has a minimum number of such points within the margin. To ensure this, a penalty is charged for every “contaminant” inside the margin and the hyperplane which has a minimum aggregate penalty cost is chosen.

(There is one other important definition related to support vector machines: that of a basis function. We will skip that for now and cover it in a future article).

Avoiding over fitting

As stated earlier, once a hyperplane is found, most of the data other than the support vectors (which are points closest to the boundary) become redundant. This means that small changes to data cannot greatly affect the hyperplane and hence the SVM. Therefore SVMs tend to generalize very well. Consider the following example.

This is a simple two dimensional (2 attribute) dataset which has three classes: core, inner ring and outer ring. The job of a classifier is to separate the points into 3 classes. Clearly sophisticated rules could be established based on geometric properties of an ellipse to distinguish the rings.

If we use a decision tree with default parameter settings, it fails to separate the regions (try it for example with gain = 0.1, you will not see any branches in the resulting tree) and bins all points in the core – the resulting classification accuracy of 62% is pointless because the points already in the core probably accounted for nearly 33% of the data! We could however get a good separation (accuracy 93%) using a very low information gain (0.0001), but this leads to overfitting as shown below

Support vector machines are naturally resistant to overfitting and work very well when identifying boundary regions like in the 3 ring example above. A basic SVM with default settings will yield a classification accuracy of 95%. The chart below shows the quality of fit on the training data. 