Logistic regression for business analytics using RapidMiner: Part 1
It can be argued that the most important step in a business analytics process is establishing a clear business objective. Once this is done, selecting the right technique becomes a matter of simple logic. At a very high level, there are two main classes of techniques: those that evolved purely from statistics (such as regression) and those that emerged from a blend of statistics, computer science and mathematics (such as classification trees).
This article is about logistic regression: how it compares to its twin, linear regression, and when it makes sense to use it. In the second part, we discuss the mechanics of logistic regression and its implementation using RapidMiner for a simple business analytics application.
A simple explanation of Logistic Regression
Recall that linear regression is the process of finding the straight line that best fits a set of points, with the objective of using the equation of that line as a model for prediction. In the simplest case, both the predictor and target variables are continuous, as seen in the chart below. Intuitively, one can state that when X increases, Y increases along the slope of the line.
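To make the idea concrete, here is a minimal Python sketch of fitting a straight line by ordinary least squares and using it for prediction. The data points and the helper name `fit_line` are made up for illustration; they are not taken from the article's chart.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that lies exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6.0 + intercept  # predict Y for a new X value

print(slope, intercept, prediction)  # 2.0 1.0 13.0
```

Once the slope and intercept are known, predicting Y for any new X is a single multiplication and addition.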
What happens if the target variable is not continuous? When the target (Y) variable is discrete, the straight line is no longer a good fit, as seen in this chart. Intuitively, we can still state that when X (say, advertising spend) increases, Y (say, response or no response to a mailing campaign) also increases. However, there is no gradual transition: the Y value jumps abruptly from one binary outcome to the other. Thus the straight line is a poor fit for this data.
On the other hand, take a look at the S-shaped curve below. This is certainly a better fit for the data shown. If we know the equation of this "sigmoid" curve, we can use it as effectively as we used the straight line in the case of linear regression.
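One way to see the problem is to fit a straight line to a binary target anyway. The sketch below uses made-up campaign data (spend levels 1 to 8, with a response of 0 or 1) and a small least squares helper; all names and numbers are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: advertising spend vs. response (1) or no response (0)
spend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
responded = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]

slope, intercept = fit_line(spend, responded)

# "Predicted response" at a very low and a very high spend level
low = slope * 0.0 + intercept    # falls below 0
high = slope * 10.0 + intercept  # rises above 1

print(low, high)
```

The line predicts values below 0 and above 1, which cannot be read as a response or as a probability of response.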
Logistic regression is thus the process of obtaining an appropriate sigmoid curve to fit the data when the target variable is discrete.
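As a rough sketch of what "obtaining a sigmoid curve" means, the Python below fits p = 1 / (1 + exp(-(b0 + b1 * x))) to made-up binary data using plain gradient ascent on the log-likelihood. The data, learning rate and iteration count are assumptions chosen for illustration, and this is not how RapidMiner implements the fit.

```python
import math


def sigmoid(z):
    """The S-shaped curve: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.05, n_iter=20000):
    """Fit p(y=1 | x) = sigmoid(b0 + b1 * x) by gradient ascent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = y - sigmoid(b0 + b1 * x)  # observed minus predicted
            grad0 += error
            grad1 += error * x
        b0 += learning_rate * grad0 / n
        b1 += learning_rate * grad1 / n
    return b0, b1


# Hypothetical data: spend vs. response, with some overlap between classes
spend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
responded = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(spend, responded)
p_low = sigmoid(b0 + b1 * 1.0)   # estimated response probability at low spend
p_high = sigmoid(b0 + b1 * 8.0)  # estimated response probability at high spend

print(b0, b1, p_low, p_high)
```

Unlike the straight line, the fitted curve always returns a value between 0 and 1, so it can be read as the estimated probability of a response at a given spend level.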
Key facts to keep in mind
- Logistic Regression is the counterpart of linear regression for cases where the target (or dependent) variable is discrete, i.e., not continuous
- Logistic Regression is ideally suited for business analytics applications where the target variable is a binary decision (fail-pass, response-no response, etc.)
- The predictors can be either continuous or categorical
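On the last point, a categorical predictor is commonly turned into 0/1 indicator ("dummy") columns before it enters the regression equation. The sketch below shows one such encoding in Python; the `region` values and the helper name `dummy_encode` are hypothetical, and tools such as RapidMiner may handle this conversion differently.

```python
def dummy_encode(values):
    """Encode a categorical column as 0/1 indicator columns.

    The first category (in sorted order) is treated as the baseline
    and gets no column of its own.
    """
    categories = sorted(set(values))
    indicator_names = categories[1:]  # drop the baseline category
    rows = [[1.0 if v == c else 0.0 for c in indicator_names] for v in values]
    return indicator_names, rows


# Hypothetical categorical predictor
region = ["north", "south", "east", "north"]

names, encoded = dummy_encode(region)
print(names)    # ['north', 'south']
print(encoded)  # [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
```

Each indicator column can then be used like any numeric predictor, alongside continuous ones such as advertising spend.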
In the second part of this article, we discuss the mechanics of logistic regression and also the process of implementing a simple analysis using RapidMiner.