“Truth is a Pathless Land”

...but finding an effective solution to your business problem does not have to be. The business analytics landscape can certainly look like one, with myriad techniques and vendor tools on the market.

Simafore provides tools and expertise to:

  • Integrate data
  • Select and deploy appropriate analytics
  • Institutionalize processes

About this Blog

The Analytics Compass Blog is aimed at two types of readers:

  • individuals who want to build analytics expertise and 

  • small businesses who want to understand how analytics can help them improve their business performance. 

Blog - The Analytics Compass


Understand 3 critical steps in developing logistic regression models

  
  
  

Logistic regression is a powerful tool for addressing classification problems. However, it is not the easiest concept for beginners to grasp. Three related steps need to be understood to see how logistic regression works and how the predictor data is used to generate the classification.

[Figure: a typical business objective for logistic regression]

Logistic regression is used to predict the binomial (Yes/No, 1/0, etc.) outcome of a response (dependent) variable using one or several predictor (independent) variables, as seen above. The predictors can be binomial, categorical, or numerical. As in multiple linear regression, we need a function that connects the independent variables to the dependent variable. The main difference here is that the dependent variable can take on only two values: Yes/No or 1/0. So we need a means to map a continuous function of the independent variables to a binary outcome. We also need a way to score the binary outcome, that is, to determine the probability of a "Yes" or "No" result.

First Step: Use the logit function to link the predictors to the response variable. The predictors are combined into a linear function, and this function is set equal to the logit of the response variable, which is continuous. Using the above example, the linear function could be

Logit(Customer?) = b + b1*Income + b2*Education + b3*Mortgage + b4*Experience

The logit can take on any value, from -Infinity to +Infinity.
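The first step can be sketched in a few lines of Python. This is only an illustration: the coefficient values and the sample record below are made up (they are not fitted to any data), and the name `logit_score` is hypothetical.

```python
# Hypothetical coefficients, chosen for illustration only (not fitted to data).
COEFFS = {
    "intercept": -4.0,   # b
    "Income": 0.03,      # b1
    "Education": 0.5,    # b2
    "Mortgage": 0.001,   # b3
    "Experience": 0.02,  # b4
}


def logit_score(record, coeffs=COEFFS):
    """Return b + b1*Income + b2*Education + b3*Mortgage + b4*Experience.

    The result is an unbounded real number: it can be negative, zero, or positive.
    """
    return coeffs["intercept"] + sum(
        coeffs[name] * value for name, value in record.items()
    )


# A made-up customer record.
record = {"Income": 80, "Education": 2, "Mortgage": 150, "Experience": 10}
print(logit_score(record))
```

With these made-up numbers the logit comes out slightly negative, which the later steps would translate into a probability a little below 0.5.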

Second Step: Convert the logit into odds. This is easy because logit is nothing but the logarithm of odds of the response variable. 

log (odds(Customer?)) = Logit(Customer?)

Exponentiating the logit gives the odds, which can only be positive valued: from 0 to +Infinity.
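As a small sketch of the second step, the logit can be turned into odds by exponentiating it, assuming the natural logarithm is the one used in the logit. The function name `odds_from_logit` is hypothetical.

```python
import math


def odds_from_logit(logit):
    """Invert log(odds) = logit, so odds = exp(logit). Always positive."""
    return math.exp(logit)


# Negative logits give odds below 1, a logit of 0 gives even odds,
# and positive logits give odds above 1.
for logit in (-2.0, 0.0, 2.0):
    print(logit, odds_from_logit(logit))
```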

Third Step: Once we know the odds, we know the probability score, p:

p = odds/(1+odds)

The probability can only take values from 0 to 1.
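The third step can be sketched the same way. The snippet below also shows, as a check, that chaining steps two and three is the same as applying the familiar formula 1/(1 + exp(-logit)) directly to the logit. The function names are hypothetical.

```python
import math


def probability_from_odds(odds):
    """p = odds / (1 + odds); maps odds in [0, +inf) to p in [0, 1)."""
    return odds / (1.0 + odds)


def probability_from_logit(logit):
    """Steps two and three combined: p = 1 / (1 + exp(-logit))."""
    return 1.0 / (1.0 + math.exp(-logit))


# Odds of 3 (three to one in favor) correspond to a probability of 0.75.
print(probability_from_odds(3.0))

# Going through the odds or using the combined formula agrees up to rounding.
logit = 0.7
print(probability_from_odds(math.exp(logit)), probability_from_logit(logit))
```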

If you set a cutoff value such as 0.5, then a response is classified as "Yes" for all scores above the cutoff and "No" for all scores below it. In practice, with most analytics tools, you don't have to worry about the math behind the scenes. But knowing what is going on is important for building good quality predictive models.
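The cutoff rule itself is a one-line comparison. A minimal sketch follows, with a hypothetical function name `classify`; how a score exactly equal to the cutoff is handled is an assumption here (it is labeled "No"), since the text does not specify it.

```python
def classify(probability, cutoff=0.5):
    """Label a probability score as "Yes" or "No" using a cutoff.

    Assumption: a score exactly equal to the cutoff is labeled "No".
    """
    return "Yes" if probability > cutoff else "No"


print(classify(0.8))              # above the default cutoff of 0.5
print(classify(0.2))              # below the default cutoff
print(classify(0.6, cutoff=0.7))  # a stricter cutoff changes the label
```

Raising the cutoff makes the model more conservative about predicting "Yes"; lowering it does the opposite.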

RapidMiner Issues on Logistic Regression

  1. RapidMiner does not use the logit model to make the binomial predictions, but instead uses a Support Vector Machine (SVM) algorithm. Therefore, the coefficients (the b's above) cannot be interpreted in the conventional manner. However, it is easy to "deploy" the model and see the scores for each classified record.
  2. To see a traditional logit function developed using your data, and also to see the odds function, use the Weka extension's W-Logistic operator within RapidMiner.

Download our FREE eBook for info on how to set up a logistic regression model.


Comments

good article
Posted @ Thursday, August 15, 2013 10:28 AM by S. affias
In Logistic Regression: 
Do we always have to normalize our data set (input variables) when multiple variables with different ranges are used? 
There are different ways to normalize (that is, to make the data non-dimensional). 
 
Say,  
"(Xi - Xmean) / Std-dev" (Std-dev = 1 sigma) or  
Xi / |Xmin - Xmax| for a continuous variable (giving, say, a -1 to +1 range).  
For a discrete variable, say, Gender (male=1 and female=2), do we have to normalize it? If so, how? 
 
Like to have your kind advice  
With regards, 
 
Chinmoy PAL @NATC 
 
PS: 
Whenever I do PCA (principal component analysis), I always normalize the data by (Xi - Xmean) / Std-dev (or 1 sigma). 
PCA is based on eigenvalue analysis.  
It is usual practice to normalize the dataset so that the eigenvalues from the PCA are not influenced by the order of magnitude of the data. 
 
Posted @ Thursday, January 30, 2014 7:56 AM by chinmoy PAL
@Chinmoy, we don't *always* need to normalize data for logistic regression. If you have binomial variables such as "male" or "female", you don't need to normalize. But sometimes it may be useful to convert variables to dummy variables, especially if you have polynominal (multi-category) variables, for example months of the year. In this case you can set up 11 dummy variables, each of which can take a value of 1 or 0. For example, if the month is December, your raw data may have "12" or the string "December". In this case all your dummy variables d1 to d11 will have a value of 0. For November, d1 through d10 will be 0 and d11 will be 1, and so on.
Posted @ Sunday, February 02, 2014 10:14 AM by Bala Deshpande
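
The dummy-variable scheme described in the reply above can be sketched in Python. This is a minimal illustration, assuming months are given as integers 1 to 12 and that d1 through d11 correspond to January through November (the reply only spells out the December and November cases). The function name `month_dummies` is hypothetical.

```python
def month_dummies(month):
    """Return the list [d1, ..., d11] for a month number from 1 to 12.

    December (12) is the baseline and maps to all zeros.
    Assumption: month k, for k from 1 to 11, sets dk to 1 and the rest to 0.
    """
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    return [1 if month == k else 0 for k in range(1, 12)]


print(month_dummies(12))  # December: every dummy is 0
print(month_dummies(11))  # November: only d11 is 1
print(month_dummies(1))   # January: only d1 is 1 (assumed mapping)
```

Using 11 dummies for 12 months avoids a redundant column, because the twelfth month is already implied when all the others are 0.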