
Learning data science: feature engineering

Posted by Bala Deshpande on Tue, Jan 19, 2016 @ 08:00 AM

Far too often, even experienced data scientists get confused about what feature engineering really means. They may mistake it for feature selection or, worse, for adding new data sources. In my mind, feature engineering encompasses several different data preparation techniques. But before we get into it, we must define what a feature actually is.

For all machine learning models, the data must be presented in a format that allows a model to be built. Whether it is structured (numerical/categorical/nominal) data or unstructured (text, audio, images, video), the model ultimately requires a tabular format where observations (also called records or samples) form the rows and attributes (also called variables or features) form the columns. So in short, a feature is nothing more than a column in the data table for modeling.

What is feature engineering?

Feature engineering is the art and science of selecting and/or generating the columns in a data table for a machine learning model. When we prepare a table for modeling, not all columns are useful in their raw form. In fact, some columns (or attributes) may be useless for model building; one example is an ID-type attribute. Tools such as RapidMiner allow the data scientist to flag such attributes before applying a modeling technique, so that the algorithm ignores them and they are kept only for presentation or descriptive purposes.

Feature engineering has three subcategories of techniques: feature selection, dimension reduction, and feature generation. Quite frequently, the broader term is used to mean any one of these three subcategories, perhaps because not all of the techniques are required all of the time.

Feature Selection

Sometimes called feature ranking or feature importance, this is the process of ranking the attributes by their contribution to the predictive ability of a model. Algorithms such as decision trees automatically rank the attributes in the data set: the top few nodes in a decision tree are considered the most important features from a predictive standpoint. As part of a larger process, feature selection using entropy-based methods like decision trees can be employed to filter out less valuable attributes before feeding the reduced dataset to another modeling algorithm. Regression-type models usually employ methods such as forward selection or backward elimination to select the final set of attributes for a model.
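To make the entropy-based ranking idea concrete, here is a minimal sketch in plain Python. It does not build a full decision tree; it only computes information gain, the criterion many tree learners use to pick a split, and ranks the attributes by it. The toy rows and the attribute names (outlook, windy, play) are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, feature, target):
    """Reduction in target entropy obtained by splitting the rows on `feature`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    # Entropy remaining after the split, weighted by group size
    remaining = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remaining


def rank_features(rows, features, target):
    """Return (feature, gain) pairs sorted from most to least informative."""
    gains = [(f, information_gain(rows, f, target)) for f in features]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


# Toy, made-up data: each dict is one row of the modeling table
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

ranking = rank_features(rows, ["outlook", "windy"], "play")
for name, gain in ranking:
    print(f"{name}: {gain:.3f}")
```

In this toy table, outlook separates the classes far better than windy, so it ranks first. A filter step would keep only the top-ranked attributes before handing the table to another algorithm.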

Dimension Reduction

This is sometimes called feature extraction. The classic example of dimension reduction is principal component analysis, or PCA. PCA allows us to combine existing attributes into a new data frame consisting of a much smaller number of attributes by utilizing the variance in the data. The linear combinations of attributes that "explain" the highest amount of variance in the data form the first few principal components, and we can ignore the remaining components if data dimensionality is a problem from a computational standpoint. PCA results in a data table whose attributes do not look anything like the attributes of the raw dataset.
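As a rough illustration of how PCA works, the sketch below computes it with NumPy through an eigendecomposition of the covariance matrix. This is only one of several ways to compute PCA, and the function name pca and the synthetic three-column data set are assumptions made for this example.

```python
import numpy as np


def pca(X, n_components):
    """Project X (rows = observations) onto its top principal components.

    Returns the projected data and the share of total variance that each
    retained component explains.
    """
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)

    # eigh is meant for symmetric matrices; eigenvalues come back ascending
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]  # largest variance first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    components = eigenvectors[:, :n_components]
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return X_centered @ components, explained_ratio


# Synthetic data: column 2 is nearly a multiple of column 1, column 3 is small noise
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    2 * base + 0.05 * rng.normal(size=(200, 1)),
    rng.normal(scale=0.1, size=(200, 1)),
])

scores, explained_ratio = pca(X, n_components=1)
print(scores.shape)      # one new attribute per row instead of three
print(explained_ratio)   # share of variance captured by that attribute
```

Because two of the three columns carry almost the same information, a single principal component captures nearly all of the variance here. The new column is a weighted mix of the originals, which is why it no longer resembles any raw attribute.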

Feature Generation or Feature Construction

This technique is the one most people are actually referring to when they talk about feature engineering. Quite simply, this is the process of manually constructing new attributes from raw data. It involves intelligently (that is, using domain knowledge) combining or splitting existing raw attributes into new ones that have higher predictive power. For example, a date stamp may be used to generate two new attributes, such as AM and PM, which may be useful in discriminating whether day or night has a higher propensity to influence the response variable. We may want to convert noisy numerical attributes into simpler nominal attributes by calculating the mean value and determining whether a given row is above or below that mean. We may also generate a new attribute, such as the number of claims a member has filed in a given time period, by combining a date attribute and a nominal attribute such as claim_filed (Y/N). The possibilities are endless. Feature construction is essentially a data transformation process.
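The three examples above can be sketched in a few lines of plain Python. The records, the field names (member_id, timestamp, amount) and the helper add_features are hypothetical and chosen only for illustration; claim_filed follows the example in the text.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Hypothetical raw records; every value here is made up
records = [
    {"member_id": "A", "timestamp": "2016-01-04 09:15", "amount": 120.0, "claim_filed": "Y"},
    {"member_id": "A", "timestamp": "2016-01-11 21:40", "amount": 80.0, "claim_filed": "Y"},
    {"member_id": "B", "timestamp": "2016-01-05 14:05", "amount": 300.0, "claim_filed": "N"},
    {"member_id": "B", "timestamp": "2016-02-02 08:30", "amount": 100.0, "claim_filed": "Y"},
]


def add_features(records, period_start, period_end):
    """Return copies of the records with three constructed attributes added."""
    mean_amount = mean(r["amount"] for r in records)
    parsed = [datetime.strptime(r["timestamp"], "%Y-%m-%d %H:%M") for r in records]

    # Claims per member: combine the date attribute with the Y/N flag
    claims = Counter(
        r["member_id"]
        for r, ts in zip(records, parsed)
        if r["claim_filed"] == "Y" and period_start <= ts < period_end
    )

    enriched = []
    for r, ts in zip(records, parsed):
        enriched.append({
            **r,
            # Split the timestamp into a day-half indicator
            "am_pm": "AM" if ts.hour < 12 else "PM",
            # Nominal version of a noisy numeric attribute
            # ("below" here also covers values exactly equal to the mean)
            "amount_vs_mean": "above" if r["amount"] > mean_amount else "below",
            "claims_in_period": claims[r["member_id"]],
        })
    return enriched


features = add_features(records, datetime(2016, 1, 1), datetime(2016, 2, 1))
for row in features:
    print(row["member_id"], row["am_pm"], row["amount_vs_mean"], row["claims_in_period"])
```

Each constructed column is just a transformation of columns that were already in the table, which is the sense in which feature construction is a data transformation process.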

Here is a longer article on feature engineering which provides some excellent links and further readings for those interested.


Topics: feature selection, feature engineering