Feature selection, or dimension reduction, is a data preparation activity for most predictive analytics and data science work. One can argue that feature selection is one side of the coin and extracting key performance indicators (KPIs) is the other. Both of these activities require us to parse through the available data and identify the big hitters or key players within the dataset. Where they differ is in the final objective: in data mining, the objective is simply to reduce the dimension of the data to streamline model building. In KPI analysis, the objective is to finalize which metrics to track.
This article gives an overview of common dimension reduction techniques in use today. There are two main classes of feature selection techniques: filter style and wrapper style. Wrapper-style algorithms "wrap" around the main model-building process, using the model itself to evaluate candidate variables. Filter-style algorithms work as pre-processors, scoring variables before any model is built. Both have their strengths and weaknesses. Data type and modeling objective (classification or numeric prediction) usually determine the appropriate technique to use.
The list that follows is by no means exhaustive; it is a collection of the most commonly employed feature selection techniques. Each technique is further categorized as "Numeric/Categorical" for data type, "Prediction/Classification" for predictive modeling objective, and "Wrapper/Filter" based on the actual mechanics of using it.
1. Forward Selection (Numeric, Prediction, Wrapper)
This is commonly employed with multiple linear regression models. It starts by building an initial single-predictor model using whichever of the k independent variables gives the best performance (measured by RMS error, adjusted R^2, t-ratio, or F-value). New variables are then added one at a time, and model performance is tested after each addition. If performance improves, the newly added variable is kept, the model becomes a two-predictor model, and the analysis moves to the next iteration. If performance degrades or remains unchanged, the variable is dropped (the model reverts to the single-predictor model) and the next candidate is tried.
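As a rough illustration, the loop above can be sketched in Python with scikit-learn. The synthetic data, the use of adjusted R^2 as the score, and the stopping rule are assumptions made for the example, not a prescribed implementation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(r2, n, p):
    # Adjusted R^2 penalizes predictors that add no real explanatory power
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def forward_selection(X, y):
    """Greedily add the predictor that most improves adjusted R^2."""
    n = len(y)
    selected, remaining = [], list(range(X.shape[1]))
    best = -np.inf
    while remaining:
        # Score every candidate addition with a freshly fit model
        scores = []
        for j in remaining:
            cols = selected + [j]
            r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
            scores.append((adjusted_r2(r2, n, len(cols)), j))
        score, j = max(scores)
        if score <= best:      # no improvement: drop the candidate and stop
            break
        best = score
        selected.append(j)
        remaining.remove(j)
    return selected

# Toy data: y depends only on columns 0 and 2
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=200)
print(forward_selection(X, y))  # columns 0 and 2 are picked first
```

Note that the score must penalize model size (hence adjusted R^2 rather than plain R^2); otherwise training performance never degrades and every variable gets added.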
2. Backward Elimination (Numeric, Prediction, Wrapper)
In this case, the initial model is the full model, which includes all k independent variables. The coefficient with the smallest t-ratio is identified, that variable is dropped, and a new (k-1)-predictor model is built. This process continues until all remaining independent variables have high t-ratios (very low p-values).
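A bare-bones sketch of this elimination loop, computing t-ratios directly with NumPy on synthetic data. The |t| >= 2 cutoff (roughly a 5% significance level) and the omitted intercept are simplifying assumptions for the example:

```python
import numpy as np

def backward_elimination(X, y, t_cutoff=2.0):
    """Repeatedly drop the predictor with the smallest |t-ratio|."""
    cols = list(range(X.shape[1]))
    while len(cols) > 1:
        Xc = X[:, cols]
        XtX_inv = np.linalg.inv(Xc.T @ Xc)
        beta = XtX_inv @ Xc.T @ y            # least-squares coefficients
        resid = y - Xc @ beta
        sigma2 = resid @ resid / (len(y) - len(cols))
        t = np.abs(beta) / np.sqrt(sigma2 * np.diag(XtX_inv))
        worst = int(np.argmin(t))
        if t[worst] >= t_cutoff:             # every survivor is significant
            break
        cols.pop(worst)                      # drop the weakest predictor
    return cols

# Toy data: only columns 0 and 2 carry signal
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=200)
print(backward_elimination(X, y))
```

The signal columns 0 and 2 survive the elimination because their t-ratios stay far above the cutoff, while the noise columns are pruned one by one.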
3. Principal Component Analysis (Numeric, Prediction, Filter)
Principal component analysis transforms the variables into an "orthogonal" space. Via this transformation, the technique identifies the axis along which the data shows the most variance. Imagine a two-variable scatter plot shaped roughly like an ellipse. The long axis of this ellipse can be described by a linear combination of the two variables; this is called the first principal component. The shorter axis is the second principal component. In this two-variable example, each PC is a function of both (original) variables. In a multidimensional dataset, each PC can be expressed as a function of several independent variables. Typically the first few PCs explain most of the variability in the data, and the problem is reduced from many (original) variables to a few PC variables.
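The two-variable ellipse example can be reproduced with a few lines of NumPy; the correlated toy data here is an assumption made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
# Two correlated variables form an elongated, roughly elliptical cloud
x1 = rng.normal(size=300)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=300)
X = np.column_stack([x1, x2])

Xc = X - X.mean(axis=0)            # center each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)    # fraction of variance per PC
scores = Xc @ Vt.T                 # data re-expressed in PC coordinates
print(explained)                   # the first PC dominates
```

The rows of `Vt` are the PC directions (the long and short axes of the ellipse); keeping only the first column of `scores` reduces the problem from two variables to one with little loss of variance.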
4. Decision Trees (Categorical or Numeric, Classification, Filter)
The idea of using decision trees to reduce dimensions is fairly straightforward. The root node and the top 3-4 high-level nodes are the variables around which the response separates most cleanly – and are therefore the most important variables for further analysis. Depending upon your data, you may use either regression trees (for a numeric target variable) or standard classification trees (for a categorical target variable).
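A quick sketch with scikit-learn: fit a shallow tree and read off which variables dominate the top splits via `feature_importances_`. The synthetic data and the depth limit are assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: the class label depends only on features 0 and 1
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 6))
y = ((X[:, 0] + X[:, 1]) > 0).astype(int)

# A shallow tree keeps only the top few, most separating splits
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
ranking = np.argsort(tree.feature_importances_)[::-1]
print(ranking[:2])  # features 0 and 1 rank highest
```

Capping the depth mimics looking only at the root and the top few nodes: the variables that appear there carry nearly all of the importance score.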
5. Mutual Information (Numeric, Prediction or Classification, Filter)
Mutual information works on logic very similar to that of decision trees. Decision trees use entropy to measure the purity of a split; mutual information uses entropy to measure how much knowing a variable reduces our uncertainty about the target: the larger the reduction, the more informative the variable. The basic idea is to select only those variables which account for the biggest chunk of information about the target – very much in the spirit of the 80-20 rule or Pareto law: select the few (20% or less) variables which account for 80% or more of the total information in the dataset.
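A simple plug-in estimate of mutual information from empirical counts illustrates the idea. The two toy variables (one that mostly copies the target, one that is pure noise) are assumptions for the example:

```python
import numpy as np

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), from empirical frequencies (in bits)."""
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))
    joint = [f"{a}|{b}" for a, b in zip(x, y)]   # paired outcomes for H(X,Y)
    return entropy(x) + entropy(y) - entropy(joint)

rng = np.random.default_rng(4)
y = rng.integers(0, 2, size=1000)
informative = y ^ (rng.random(1000) < 0.05)  # copies y about 95% of the time
noise = rng.integers(0, 2, size=1000)        # unrelated to y

print(mutual_information(informative, y))  # large: removes most uncertainty about y
print(mutual_information(noise, y))        # near zero
```

Ranking every candidate variable by this score and keeping the top scorers until they cover roughly 80% of the total implements the Pareto-style cut described above.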
Did we miss any other commonly used methods? Feel free to share your favorite technique in the comments below…
Originally posted on Wed, Sep 26, 2012 @ 09:05 AM