Logistic regression is a powerful tool for classification problems, but it is not the easiest concept for beginners to grasp. Understanding how it works, and how the predictor data is used to generate a classification, comes down to three related steps.
Logistic regression is used to predict the binomial (Yes/No, 1/0, etc.) outcome of a response (dependent) variable using one or several predictor (independent) variables, as seen above. The predictors can be binomial, categorical, or numerical. As in multiple linear regression, we need a function that connects the independent variables to the dependent variable. The main difference here is that the dependent variable can take on only two values: Yes/No or 1/0. So we need a means to map a continuous function of the independent variables to a binary outcome. We also need a way to score the binary outcome, that is, to determine the probability of a "Yes" or "No" result.
First Step: Using the logit function (the inverse of the logistic response function), express a function of the response variable, the logit, as a linear function of the predictors. Unlike the response itself, the logit is continuous. Using the above example, the predictors can form a linear function such as
Logit(Customer?) = b + b1*Income + b2*Education + b3*Mortgage + b4*Experience
this can take on any value: from -Infinity to +Infinity
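As a minimal sketch of this first step, the Python snippet below evaluates the linear function for a single record. The coefficient values and the example record are made up purely for illustration; they are not fitted to any data, and the units in the comments are assumptions.

```python
# Hypothetical coefficients (illustrative only, not fitted values).
COEFFICIENTS = {
    "intercept": -4.0,   # b
    "Income": 0.03,      # b1
    "Education": 0.5,    # b2
    "Mortgage": 0.001,   # b3
    "Experience": 0.02,  # b4
}


def logit_score(record, coefficients=COEFFICIENTS):
    """Return b + b1*Income + b2*Education + b3*Mortgage + b4*Experience."""
    score = coefficients["intercept"]
    for name, value in record.items():
        score += coefficients[name] * value
    return score


# A made-up record; units (e.g. income in thousands) are assumed for the example.
record = {"Income": 100, "Education": 2, "Mortgage": 0, "Experience": 10}
print(logit_score(record))
```

Nothing constrains the result: depending on the coefficients and the record, the logit can be any real number, negative or positive.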
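Second Step: Convert the logit into odds. This is easy because the logit is nothing but the (natural) logarithm of the odds of the response variable, so the odds are recovered by exponentiating the logit.
log (odds(Customer?)) = Logit(Customer?)
odds(Customer?) = exp(Logit(Customer?))
this can only be positive valued: from 0 to +Infinity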
Third Step: Once we know the odds, we know the probability score, p:
p = odds/(1+odds)
this can only be valued from 0 to 1
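The second and third steps can be sketched in a few lines of Python. This assumes the logarithm in the logit is the natural logarithm, so that the odds are exp(logit); the function names and the sample logit values are chosen here for illustration only.

```python
import math


def logit_to_odds(logit):
    """Step 2: undo the logarithm, so odds = exp(logit)."""
    return math.exp(logit)


def odds_to_probability(odds):
    """Step 3: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)


# Walk a few sample logit values through both steps.
for logit in (-2.0, 0.0, 2.0):
    odds = logit_to_odds(logit)
    p = odds_to_probability(odds)
    print(f"logit={logit:+.1f}  odds={odds:.3f}  p={p:.3f}")
```

A logit of 0 corresponds to odds of 1 and a probability of 0.5. Chaining the two functions is algebraically the same as the logistic response function, p = 1/(1 + exp(-logit)), which is why negative logits give probabilities below 0.5 and positive logits give probabilities above it.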
If you set a cutoff value such as 0.5, then a response is classified as a "Yes" for all scores above the cutoff and as a "No" otherwise. In practice, with most analytics tools, you don't have to worry about this math behind the scenes. But knowing what is going on is important for building good-quality predictive models.
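Applying a cutoff to the probability scores is a one-line comparison. The sketch below uses made-up scores; treating a score exactly equal to the cutoff as a "No" is an assumption of this example, since the text only specifies what happens above the cutoff.

```python
def classify(probability, cutoff=0.5):
    """Return "Yes" when the score is above the cutoff, otherwise "No"."""
    return "Yes" if probability > cutoff else "No"


# Made-up probability scores for three records.
scores = [0.12, 0.50, 0.73]
print([classify(p) for p in scores])              # default cutoff of 0.5
print([classify(p, cutoff=0.1) for p in scores])  # a more lenient cutoff
```

Lowering the cutoff turns more records into "Yes" predictions, so the choice of cutoff is a modelling decision rather than a fixed part of the method.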
RapidMiner Issues on Logistic Regression
- RapidMiner does not use the logit model to make the binomial predictions, but instead uses a Support Vector Machine (SVM) algorithm. Therefore, the coefficients (the b's above) cannot be interpreted in the conventional manner. However, it is easy to "deploy" the model and see the scores for each classified record.
- To see a traditional logit function developed using your data and also to see the odds function, use the Weka extension, or W-logistic operator within RapidMiner.