Logistic regression for business analytics using RapidMiner: Part 2
In Part 1 we gave a brief introduction to logistic regression and indicated when it might be appropriate to use it in business analytics settings. Perhaps the best definition of logistic regression is this: "... a mathematical modeling approach in which the best-fitting, yet least-restrictive model is desired to describe the relationship between several independent explanatory variables and a dependent dichotomous response variable".
In this article we get into the details of how the model equation is developed and then show how to set up a simple analysis using RapidMiner.
How does logistic regression find the sigmoid curve?
A straight line can be described by only two parameters: the slope (m) and the intercept (c). The way in which the X's and Y's are related to each other is fully specified by m and c. An S-shaped curve, however, is a more complex shape, and representing it parametrically is not as easy. So how does one find a mathematical means to relate the X's to the Y's?
It turns out that if we transform the Y's to the logarithm of the odds of Y, then the transformed target variable is linearly related to the X's. In most cases where we need logistic regression, Y is a YES-NO type of response. We therefore model the probability of the event happening (Y=1) rather than not happening (Y=0).
- If Y is an event (response, pass etc),
- and p is the probability of the event happening (Y=1),
- then (1-p) is the probability of the event not happening (Y=0),
- and p/(1-p) is the odds of the event happening
- It turns out that log(p/(1-p)) is linear in the predictors, X
We can write the model as
- log[p/(1-p)] = mX + c        (Eq. 1)
From the data given, we know the X's and can compute p for each value of X. Once the p's are converted to log-odds, the problem is essentially similar to linear regression. (To see the sigmoid curve, solve Eq. 1 for p: p = 1/(1 + e^-(mX + c)), which traces an S-shape when plotted against X.)
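The idea in the paragraph above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are made up, the function name fit_logit_line is hypothetical, and the fit is an ordinary least-squares line through the observed log-odds at each X. Production tools typically estimate the coefficients by maximum likelihood instead, so treat this as a way to see the transformation at work, not as how RapidMiner fits the model.

```python
import math
from collections import defaultdict

def fit_logit_line(xs, ys):
    """Least-squares fit of log-odds against x, using the proportion of Y=1 at each x."""
    counts = defaultdict(lambda: [0, 0])  # x -> [events, trials]
    for x, y in zip(xs, ys):
        counts[x][0] += y
        counts[x][1] += 1

    points = []
    for x, (events, trials) in sorted(counts.items()):
        p = events / trials
        if 0.0 < p < 1.0:  # log-odds is undefined when p is exactly 0 or 1
            points.append((x, math.log(p / (1.0 - p))))

    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_z = sum(z for _, z in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in points)
    m = sxz / sxx            # slope of the log-odds line
    c = mean_z - m * mean_x  # intercept of the log-odds line
    return m, c

# Made-up data: 10 observations at each of x = 1, 2, 3,
# with 2, 5 and 8 events (Y=1) respectively.
xs = [1] * 10 + [2] * 10 + [3] * 10
ys = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2

m, c = fit_logit_line(xs, ys)
print(f"slope m = {m:.4f}, intercept c = {c:.4f}")
```

Note that this grouped approach needs several observations at each X and breaks down when a group contains only events or only non-events, which is one reason maximum likelihood is preferred in practice.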
The logistic regression model from Eq. 1 ultimately delivers the probability of Y happening (i.e. Y=1), given specific value(s) of X.
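To make the last point concrete, here is a minimal sketch of how Eq. 1 turns a value of X into a probability. The slope and intercept below are hypothetical values picked for illustration, not coefficients from any fitted model, and the function names are our own.

```python
import math

def logit(p):
    """Log-odds of a probability p, where 0 < p < 1."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of logit: maps a log-odds value z back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, m, c):
    """Probability that Y=1 under Eq. 1, log(p/(1-p)) = m*x + c."""
    return sigmoid(m * x + c)

# Hypothetical slope and intercept, for illustration only.
m, c = 1.5, -3.0
for x in [0, 1, 2, 3, 4]:
    p = predict_probability(x, m, c)
    print(f"x={x}  p={p:.3f}  log-odds={logit(p):.3f}")
```

The printed log-odds rise by a constant amount for each unit step in x (a straight line), while the probabilities flatten out toward 0 and 1 at the extremes, which is the S-shaped curve.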
7 steps to a simple logistic regression model in RapidMiner
The data we used comes from an example here for a credit scoring exercise. The objective is to predict DEFAULT (Y or N) based on two predictors: Loan age (business usage) and number of days of delinquency. There are 100 samples.
Step 1: Load the spreadsheet into RapidMiner. Use the process described here. Remember to set the DEFAULT column as "Label".
Step 2: Split the data into train and test samples using the Split Validation operator as shown here.
Step 3: Add the Logistic Regression operator in the "training" window of the Split Validation operator.
Step 4: Add the Apply Model operator in the "testing" window of the Split Validation operator in a similar manner as discussed here. Just use the default parameter values.
Step 5: Add the Performance evaluation operator in the "testing" window of the Split Validation operator as discussed here.
Step 6: Connect all ports as shown below
Step 7: Run the model and view the results. In particular, check the Kernel Model, which shows the coefficients for the two predictors and the intercept. Also check the confusion matrix for Accuracy, Sensitivity, and Specificity, and finally view the ROC curve and check the AUC.
The accuracy of the model based on the 30% testing sample is 83%. The ROC curve has an AUC of 0.863, which is quite acceptable. The next step would be to review the kernel model and prepare for deploying this model.
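For readers who want to see how the confusion-matrix metrics named in Step 7 are computed, here is a small sketch. The four counts are invented for illustration: they add up to a 30-row test set and give an accuracy of about 83%, but the split between the cells is an assumption and is not the confusion matrix from this analysis.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "sensitivity": tp / (tp + fn),   # share of actual defaults that were caught
        "specificity": tn / (tn + fp),   # share of actual non-defaults that were cleared
    }

# Hypothetical counts for a 30-row test set; NOT the article's actual matrix.
metrics = confusion_metrics(tp=10, fn=3, fp=2, tn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a credit scoring setting the two error types rarely cost the same, so it is worth reading sensitivity and specificity separately and not relying on accuracy alone.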