Previously we discussed some of the advantages of support vector machines (SVMs) over other classification methods. We introduced several terms the analyst needs to be familiar with, such as hyperplane, margin, support vectors and the concept of "linearly separable" data. One key term was deliberately left undefined: the so-called kernel function, or basis function. In this article we demonstrate how choosing a suitable kernel function can significantly improve SVM performance.
In the picture below, the data points belong to two main classes: an inner ring and an outer ring. Your intuition will tell you, correctly, that these two classes are not "linearly separable". In other words, we cannot draw a straight line that splits the two classes. However, it is also intuitively clear that an elliptical or circular "hyperplane" can easily separate them.
In fact, if we were to run a simple linear SVM on this data, we would get a classification accuracy of around 46%. As seen in the result below, a linear SVM would classify about half the inner ring and half the outer ring correctly.
How can we classify more complex feature spaces? A simple trick is to transform the two variables x and y into a new feature space involving x (or y) and a new variable z defined as z = sqrt(x^2+y^2). Here z is simply the distance of a point from the origin, so all points with the same value of z lie on a circle of radius z. When the data is transformed in this way, the resulting feature space involving x and z will appear as shown below. The two clusters of data correspond to the two radii of the rings: the inner one with an average radius of around 5.0 and the outer one with an average radius of around 8.0.
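As a minimal sketch of this transformation, the Python snippet below generates synthetic ring data and computes z for every point. The radii of 5.0 and 8.0 come from the text above; the helper name make_rings, the noise width of 0.5, the sample size and the random seed are assumptions made for illustration and are not the data behind the figures.

```python
import math
import random


def make_rings(n_per_ring=100, seed=0):
    """Return (x, y, label) points on two noisy rings centred at the origin.

    Label 0 is the inner ring (radius about 5.0) and label 1 is the outer
    ring (radius about 8.0). The noise width of +/-0.5 is an assumption.
    """
    rng = random.Random(seed)
    points = []
    for label, radius in ((0, 5.0), (1, 8.0)):
        for _ in range(n_per_ring):
            angle = rng.uniform(0.0, 2.0 * math.pi)
            r = radius + rng.uniform(-0.5, 0.5)
            points.append((r * math.cos(angle), r * math.sin(angle), label))
    return points


points = make_rings()

# New feature: z = sqrt(x^2 + y^2), the distance of each point from the origin.
transformed = [(x, math.hypot(x, y), label) for x, y, label in points]

inner_z = [z for _, z, label in transformed if label == 0]
outer_z = [z for _, z, label in transformed if label == 1]

print(f"inner ring z: {min(inner_z):.2f} to {max(inner_z):.2f}")
print(f"outer ring z: {min(outer_z):.2f} to {max(outer_z):.2f}")
# If the two z ranges do not overlap, a single threshold on z splits the classes.
print("separable by a threshold on z:", max(inner_z) < min(outer_z))
```

In the (x, z) plane each ring collapses into a horizontal band, which is why a straight line can now separate the two classes.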
Clearly this new problem in the x and z dimensions is now linearly separable, and we can apply a standard SVM to do the classification. When we run a linear SVM on this transformed data, we get a classification accuracy of 100%. After classifying the transformed feature space, we can invert the transformation to get back to our original feature space.
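The effect of the transformation on a linear classifier can be sketched without any SVM library. The snippet below uses a plain perceptron as a stand-in for the linear SVM: it is a different, simpler algorithm, but it also learns a straight-line boundary, so it illustrates the same point. The synthetic data, the helper names and the epoch limits are assumptions for illustration.

```python
import math
import random


def make_rings(n_per_ring=100, seed=0):
    """Inner ring (label -1, radius ~5.0) and outer ring (label +1, radius ~8.0)."""
    rng = random.Random(seed)
    xy, labels = [], []
    for label, radius in ((-1, 5.0), (1, 8.0)):
        for _ in range(n_per_ring):
            angle = rng.uniform(0.0, 2.0 * math.pi)
            r = radius + rng.uniform(-0.5, 0.5)  # assumed noise width
            xy.append((r * math.cos(angle), r * math.sin(angle)))
            labels.append(label)
    return xy, labels


def train_perceptron(features, labels, max_epochs):
    """Fit weights w and bias b so that sign(w . f + b) matches the labels."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for f, t in zip(features, labels):
            score = sum(wi * fi for wi, fi in zip(w, f)) + b
            if t * score <= 0:  # misclassified: nudge the line towards f
                w = [wi + t * fi for wi, fi in zip(w, f)]
                b += t
                mistakes += 1
        if mistakes == 0:  # every point is on the correct side
            break
    return w, b


def accuracy(w, b, features, labels):
    """Fraction of points that fall on the correct side of the line."""
    correct = 0
    for f, t in zip(features, labels):
        score = sum(wi * fi for wi, fi in zip(w, f)) + b
        if t * score > 0:
            correct += 1
    return correct / len(labels)


xy, labels = make_rings()
xz = [(x, math.hypot(x, y)) for x, y in xy]  # z = sqrt(x^2 + y^2)

# Original space: no straight line separates the rings, so training never settles.
w, b = train_perceptron(xy, labels, max_epochs=200)
acc_xy = accuracy(w, b, xy, labels)

# Transformed space: the classes are linearly separable, so training converges.
w, b = train_perceptron(xz, labels, max_epochs=10000)
acc_xz = accuracy(w, b, xz, labels)

print(f"accuracy in (x, y): {acc_xy:.2f}")
print(f"accuracy in (x, z): {acc_xz:.2f}")
```

The learned line in the (x, z) plane has the form w_x * x + w_z * z + b = 0. Substituting z = sqrt(x^2 + y^2) maps it back to a curved boundary in the original (x, y) space.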
Kernel functions offer the user this option of transforming nonlinear spaces into linear ones. Most packages that offer SVM include several nonlinear kernels, ranging from simple polynomial basis functions to sigmoid functions. The user does not have to do the transformation beforehand, but simply has to select the appropriate kernel function, and the software will take care of transforming the data, classifying it and retransforming the results back into the original space.
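Why the user never has to perform the transformation can be illustrated with a small worked example. For the degree-2 polynomial kernel k(p, q) = (p . q)^2 on two-dimensional points, the kernel value equals the ordinary inner product of the points after mapping each one to (x^2, sqrt(2)*x*y, y^2). The function names and the sample points below are chosen for illustration only.

```python
import math


def phi(p):
    """Explicit quadratic feature map for a 2-D point p = (x, y)."""
    x, y = p
    return (x * x, math.sqrt(2.0) * x * y, y * y)


def dot(u, v):
    """Ordinary inner product of two equal-length tuples."""
    return sum(ui * vi for ui, vi in zip(u, v))


def quadratic_kernel(p, q):
    """Homogeneous polynomial kernel of degree 2: k(p, q) = (p . q)^2."""
    return dot(p, q) ** 2


a = (3.0, 4.0)
b = (-1.0, 2.0)

explicit = dot(phi(a), phi(b))      # transform both points, then multiply
implicit = quadratic_kernel(a, b)   # same quantity without transforming

print("explicit feature map:", explicit)
print("kernel function:     ", implicit)
print("agree:", math.isclose(explicit, implicit))
```

Note that the first and third components of phi add up to x^2 + y^2, which is z^2 from the ring example. A linear boundary in this implicit feature space can therefore describe a circle in the original space.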
Unfortunately, with a large number of attributes in a dataset it is difficult to know which kernel will work best. The most commonly used ones are polynomial and radial basis functions. From a practical standpoint, it is a good idea to start with a quadratic polynomial and work your way up to some of the more exotic kernel functions until you reach the desired accuracy level. This flexibility of support vector machines does come at a price: a higher computational cost.
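The "start simple and work your way up" strategy can be sketched as a loop over candidate kernels that stops at the first one to reach a target accuracy. To stay free of external libraries, the sketch uses a kernel perceptron as a stand-in for an SVM solver; it is not an SVM, but it accepts the same kernel functions. A linear kernel is included first as a baseline. The synthetic ring data, the GAMMA scale factor, the target of 0.99 and the epoch limit are all assumptions, and accuracy is measured on the training data for brevity, where a real study would use held-out data.

```python
import math
import random

GAMMA = 0.01  # assumed scale factor that keeps kernel values in a small range


def make_rings(n_per_ring=40, seed=0):
    """Inner ring (label -1, radius ~5.0) and outer ring (label +1, radius ~8.0)."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, radius in ((-1, 5.0), (1, 8.0)):
        for _ in range(n_per_ring):
            angle = rng.uniform(0.0, 2.0 * math.pi)
            r = radius + rng.uniform(-0.5, 0.5)
            points.append((r * math.cos(angle), r * math.sin(angle)))
            labels.append(label)
    return points, labels


def dot(p, q):
    return p[0] * q[0] + p[1] * q[1]


def linear(p, q):
    return 1.0 + GAMMA * dot(p, q)


def quadratic(p, q):
    return (1.0 + GAMMA * dot(p, q)) ** 2


def rbf(p, q):
    return math.exp(-GAMMA * ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2))


def kernel_perceptron_accuracy(kernel, points, labels, max_epochs=500):
    """Train a kernel perceptron and return its accuracy on the training set."""
    n = len(points)
    gram = [[kernel(p, q) for q in points] for p in points]
    scores = [0.0] * n  # current decision value for every training point
    for _ in range(max_epochs):
        mistakes = 0
        for j in range(n):
            if labels[j] * scores[j] <= 0:
                # Add point j to the model and refresh all decision values.
                for i in range(n):
                    scores[i] += labels[j] * gram[j][i]
                mistakes += 1
        if mistakes == 0:
            break
    return sum(labels[i] * scores[i] > 0 for i in range(n)) / n


points, labels = make_rings()
candidates = [
    ("linear", linear),
    ("quadratic polynomial", quadratic),
    ("radial basis", rbf),
]
target = 0.99
chosen = None
for name, kernel in candidates:
    acc = kernel_perceptron_accuracy(kernel, points, labels)
    print(f"{name}: training accuracy {acc:.2f}")
    if acc >= target:
        chosen = name
        break  # stop at the simplest kernel that is good enough

print("selected kernel:", chosen)
```

Stopping at the first adequate kernel keeps the model as simple as the data allows and avoids paying the extra computation that the more elaborate kernels would require.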
This article is a brief excerpt from our upcoming book on DataMining and Predictive Analytics using RapidMiner.