Feature selection (also called data dimension reduction or variable screening) in predictive analytics refers to the process of identifying the few most important variables or parameters that help in predicting the outcome. In today's charged-up world of high-speed computing, one might be forgiven for asking, why bother? The most important reasons are all practical ones.
Reason 1: If two or more of the independent variables (or predictors) are strongly correlated with each other, then the estimates of coefficients in a regression model tend to be unstable or counter-intuitive.
Example: y = 45 + 0.8x1 and y = 45 + 0.1x2 are two simple linear regression models that predict y. Both clearly indicate that y increases as the x's increase. If x1 and x2 are also strongly correlated with each other, then a multiple regression model fitted on both might look like y = 45 + 0.02 x1 - 0.4 x2. Because the two predictors carry nearly the same information, the model cannot cleanly separate their individual effects (a problem known as multicollinearity), and x2 ends up in a negative relationship with y, meaning the model predicts that y will decrease as x2 increases. This is not only the reverse of what was seen in the simple model, but is also counter-intuitive.
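To see this instability for yourself, here is a minimal simulation sketch in Python using NumPy. Everything in it is an assumption made for illustration: the data are synthetic, x2 is built as a noisy near-copy of x1, and the helper names fit_ols and simulate are hypothetical. It refits the simple model (y on x2 alone) and the multiple regression model (y on x1 and x2) on 200 fresh samples and compares how much the x2 coefficient moves around.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares; returns [intercept, slope(s)...]."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def simulate(rng, n=200):
    """Synthetic data: x2 is almost a copy of x1, and y depends on x1."""
    x1 = rng.normal(50, 10, n)
    x2 = x1 + rng.normal(0, 0.5, n)          # strongly correlated with x1
    y = 45 + 0.8 * x1 + rng.normal(0, 5, n)
    return x1, x2, y

rng = np.random.default_rng(0)
simple_slopes, joint_slopes = [], []
for _ in range(200):
    x1, x2, y = simulate(rng)
    simple_slopes.append(fit_ols(x2, y)[1])                           # y ~ x2
    joint_slopes.append(fit_ols(np.column_stack([x1, x2]), y)[1:])    # y ~ x1 + x2

simple_slopes = np.array(simple_slopes)
joint_slopes = np.array(joint_slopes)        # column 0: x1, column 1: x2

simple_spread = simple_slopes.std()          # spread of x2's slope on its own
joint_spread = joint_slopes[:, 1].std()      # spread of x2's slope next to x1
negative_share = (joint_slopes < 0).any(axis=1).mean()

print("spread of x2 slope, simple model:  ", simple_spread)
print("spread of x2 slope, multiple model:", joint_spread)
print("share of fits with a negative slope:", negative_share)
```

In the simple model the slope for x2 stays positive from sample to sample. Once x1 is in the model as well, the individual coefficients swing widely and a good share of the fits give one of the predictors a negative sign, even though the sum of the two coefficients stays close to the true effect.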
Reason 2: The larger the set of predictors, the higher the probability that any given case has a missing value for at least one of them. If we choose to delete cases which have missing values for some predictors, we may end up with a shortage of samples.
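A quick back-of-the-envelope calculation shows how fast this adds up. The sketch below assumes, purely for illustration, that every value is missing independently with the same probability; real data rarely behaves this neatly, and the function name complete_case_fraction is hypothetical.

```python
def complete_case_fraction(p, miss_rate):
    """Expected fraction of cases with no missing value across p predictors,
    assuming each value is missing independently with probability miss_rate."""
    return (1.0 - miss_rate) ** p

# Assumed missing rate of 2% per value, for a few predictor counts.
for p in (5, 20, 50):
    print(p, "predictors:", round(complete_case_fraction(p, 0.02), 3))
```

Under these assumptions, a 2% missing rate leaves roughly 90% of the cases complete with 5 predictors, but only about 36% with 50 predictors.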
Example: A practical rule of thumb used by data miners is to have at least 5(p+2) samples, where p is the number of predictors. If your data set is sufficiently large and this rule is easily satisfied, then you may not be risking much by deleting cases. But if your data comes from an expensive market survey, for example, a systematic procedure to reduce the number of variables first may mean you never have to face this problem of losing samples. It is better to lose variables which don't impact your prediction than to lose somewhat more expensive samples.
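The trade-off can be sketched in a few lines of Python with NumPy. The data here are synthetic and all numbers are assumptions chosen for illustration: 120 cases, 20 predictors, and about 3% of values knocked out at random. The helper names min_samples and complete_rows are hypothetical, and keeping the first 5 columns simply stands in for whatever subset a real feature selection procedure would pick.

```python
import numpy as np

def min_samples(p):
    """Rule of thumb from the text: at least 5(p+2) samples for p predictors."""
    return 5 * (p + 2)

def complete_rows(X):
    """Number of rows that contain no missing (NaN) value."""
    return int((~np.isnan(X).any(axis=1)).sum())

rng = np.random.default_rng(1)
n, p = 120, 20
X = rng.normal(size=(n, p))
X[rng.random((n, p)) < 0.03] = np.nan     # assumed 3% missing values

# Option A: keep all 20 predictors and delete incomplete cases.
kept_all = complete_rows(X)
needed_all = min_samples(p)

# Option B: keep only 5 predictors, then delete incomplete cases.
X_small = X[:, :5]
kept_small = complete_rows(X_small)
needed_small = min_samples(5)

print("20 predictors:", kept_all, "complete cases, rule asks for", needed_all)
print(" 5 predictors:", kept_small, "complete cases, rule asks for", needed_small)
```

With all 20 predictors, deleting incomplete cases leaves fewer samples than the rule of thumb asks for. With 5 predictors, far fewer cases are lost and the requirement itself is smaller, so the rule is met comfortably.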
There are several other, more technical reasons for reducing data dimensionality, which will be explored in subsequent articles. In the next article, we will discuss some common techniques for actually implementing this process.