Suppose you are building a multiple linear regression model with many candidate factors. A useful first step is to reduce the number of factors, or predictors. This process goes by several names, including feature selection and dimension reduction. Two key points need to be kept in mind:
- Including predictors that are uncorrelated with the target (response) variable increases the variance of the predictions (reduces precision)
- Omitting predictors that are correlated with the response variable increases error (reduces accuracy)
With these in mind, you want to narrow your predictor list down to a handful that are not only strongly correlated with the target variable but also mutually uncorrelated, in order to reduce multicollinearity effects.
There are two basic methods to do this, and we will show how RapidMiner can be used to accomplish the task both ways.
First Method: Build a correlation matrix
Step 1: Load the data set into RapidMiner without specifying the label or target variable
Step 2: Connect the data to "Correlation Matrix" operator and run the analysis
Step 3: Shortlist all variables whose correlation with the target has an absolute value greater than 0.5 (a strong negative correlation is just as useful as a strong positive one)
Step 4: Among the shortlisted variables, check which ones are correlated with one another; if two are, keep the one with the stronger correlation to the target
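The same correlation-matrix workflow can be sketched outside RapidMiner with pandas. The data and column names (`y`, `x1`, `x2`, `x3`) below are invented for illustration, with `x2` deliberately made redundant with `x1`:

```python
import numpy as np
import pandas as pd

# Synthetic data: x1 drives the target, x2 is nearly a copy of x1, x3 is noise
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=n)   # strongly correlated with x1
x3 = rng.normal(size=n)                          # uncorrelated noise predictor
y = 2 * x1 + rng.normal(scale=0.5, size=n)

df = pd.DataFrame({"y": y, "x1": x1, "x2": x2, "x3": x3})

# Steps 2-3: build the correlation matrix and shortlist |r with target| > 0.5
corr = df.corr()
shortlist = corr["y"].drop("y").abs()
shortlist = shortlist[shortlist > 0.5].index.tolist()

# Step 4: among shortlisted predictors, flag mutually correlated pairs and
# keep the one with the stronger correlation to the target
for i, a in enumerate(shortlist):
    for b in shortlist[i + 1:]:
        if abs(corr.loc[a, b]) > 0.5:
            keep = a if abs(corr.loc["y", a]) >= abs(corr.loc["y", b]) else b
            print(f"{a} and {b} are correlated; keep {keep}")
```

Here `x1` and `x2` both pass the shortlist cut, but because they are correlated with each other, only `x1` (the stronger predictor of `y`) survives step 4.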
Second Method: Build a decision tree
An important fact to keep in mind: because we are interested in building a regression model for prediction, our target variable (or label, in RapidMiner terms) is numeric. However, decision trees require a categorical label, not a numeric one. To overcome this limitation, use the "Discretize" operator.
Step 1: Discretize the numeric label into several categories using the "Discretize by User Specification" option. For example, if the target variable ranges from 0 to 50, you can split it into two categories such as 0 to 25 and 26 to 50.
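As a quick sketch of what this discretization step does, here is the same 0-to-50 split expressed with pandas (the sample values are made up):

```python
import pandas as pd

# A numeric target in the 0-50 range, split into two user-specified bins:
# (-0.01, 25] -> "low" and (25, 50] -> "high"
target = pd.Series([3, 12, 25, 26, 40, 50])
labels = pd.cut(target, bins=[-0.01, 25, 50], labels=["low", "high"])
print(labels.tolist())  # ['low', 'low', 'low', 'high', 'high', 'high']
```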
Step 2: Once discretization is done, use the Set Role operator to create a label variable and connect the results to the standard decision tree learner. Set the minimum leaf size to 5 or 10, depending on the sample size. Run the analysis.
Step 3: Select the attributes appearing in the top 3 or 4 nodes of the tree as the predictors to use in the final regression model.
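The decision-tree method can be approximated with scikit-learn instead of RapidMiner; a tree's feature importances play the role of "which attributes sit near the top of the tree". The data below is synthetic, invented for the sketch:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Four candidate predictors; only the first two actually drive the target
rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 4))
y_numeric = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n)

# Step 1 equivalent: discretize the numeric target into two classes
y_class = (y_numeric > np.median(y_numeric)).astype(int)

# Step 2 equivalent: fit a tree with a minimum leaf size of 10
tree = DecisionTreeClassifier(min_samples_leaf=10, random_state=0)
tree.fit(X, y_class)

# Step 3 equivalent: rank predictors by how heavily the tree relies on them
ranking = np.argsort(tree.feature_importances_)[::-1]
print("Predictors ranked by importance:", ranking)
```

The strongest driver of the target (column 0) ends up at the top of the importance ranking, mirroring how it would appear near the root of the RapidMiner tree.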
Finally, compare the results from the two methods. With a clean data set, at least 80% of the shortlisted predictors should match between the two methods.
Want to learn how to use RapidMiner to build a regression model? Check out our free e-book below.