### Understand 3 critical steps in developing logistic regression models

Logistic regression is a powerful tool for addressing classification problems, but it is not the easiest concept for beginners to grasp. There are **three related steps** you need to understand to see how logistic regression works and how the predictor data is used to generate the classification.

Logistic regression is used to predict the binomial (Yes/No, 1/0, etc.) outcome of a response (dependent) variable using one or several predictor (independent) variables as seen above. The predictors can be binomial, categorical, or numerical. As in multiple linear regression, we need a function that connects the independent variables to the dependent variable. The main difference here is that the dependent variable can take on only two values: Yes/No or 1/0. So we need a means to map a *continuous function of the independent variables* to a binary outcome. We also need a way to *score* **the binary outcome: to determine the probability of a "Yes" or "No" result.**

**First Step:** With the logistic response function or logit function, map the continuous predictors to *a function* (the logit) of the response variable, which is also continuous. Using the above example, the predictors can form a linear function such as

**Logit(Customer?) = b + b1 × Income + b2 × Education + b3 × Mortgage + b4 × Experience**

*this can take on any value: from -Infinity to +Infinity*
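As a minimal sketch of this first step in Python, the linear function can be evaluated for a single record. The coefficient values and the sample customer below are made up purely for illustration; they are not fitted to any data.

```python
# Hypothetical coefficients (NOT fitted to real data); "intercept" plays the role of b.
coefficients = {
    "intercept": -4.0,
    "Income": 0.03,
    "Education": 0.5,
    "Mortgage": 0.001,
    "Experience": 0.02,
}

def logit_score(record, coef):
    """Linear combination of the predictors: b + b1*x1 + b2*x2 + ..."""
    total = coef["intercept"]
    for name, value in record.items():
        total += coef[name] * value
    return total

# A made-up customer record with one value per predictor.
customer = {"Income": 100, "Education": 2, "Mortgage": 0, "Experience": 10}
print(logit_score(customer, coefficients))
```

Depending on the coefficients and the record, the result can be any real number, negative or positive.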

**Second Step:** Convert the logit into odds. This is easy because logit is nothing but the logarithm of odds of the response variable.

**log(odds(Customer?)) = Logit(Customer?)**, so **odds(Customer?) = exp(Logit(Customer?))**

*the odds can only be positive valued: from 0 to +Infinity*
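Because the logit is the logarithm of the odds, the conversion is a single exponentiation. The short Python sketch below uses arbitrary example logit values to show that the result is positive whatever the sign of the logit.

```python
import math

def odds_from_logit(logit):
    """Odds are the exponential of the logit (the log-odds)."""
    return math.exp(logit)

# Arbitrary example logits: negative, zero, and positive.
for logit in (-2.0, 0.0, 2.0):
    print(logit, odds_from_logit(logit))
```

A logit of 0 corresponds to odds of exactly 1, that is, "Yes" and "No" are equally likely.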

**Third Step:** Once we know the odds, we know the probability score, p:

**p = odds/(1+odds)**

*this can only be valued from 0 to 1*
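The third step can be sketched in Python as follows. The second function is the standard logistic (sigmoid) function, which goes from the logit to the probability directly and gives the same result as passing through the odds. The logit value used is an arbitrary example.

```python
import math

def probability_from_odds(odds):
    """p = odds / (1 + odds)"""
    return odds / (1.0 + odds)

def probability_from_logit(logit):
    """Logistic (sigmoid) function: logit -> probability in one step."""
    return 1.0 / (1.0 + math.exp(-logit))

logit = 0.2              # arbitrary example value
odds = math.exp(logit)   # second step: logit -> odds
print(probability_from_odds(odds), probability_from_logit(logit))
```

Both printed values agree up to floating-point rounding, which is why many tools report only the sigmoid form.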

If you set a cutoff value such as 0.5, then a response is a "Yes" for all scores above the cutoff and a "No" otherwise. In practice, with most analytics tools, you don't have to worry about the math behind the scenes. But knowing what is going on is important for building good-quality predictive models.
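Applying the cutoff is a simple comparison. The sketch below uses made-up probability scores and a hypothetical `classify` helper; it also shows how raising the cutoff changes which records are labeled "Yes".

```python
def classify(probability, cutoff=0.5):
    """Label a record "Yes" when its score is above the cutoff, else "No"."""
    return "Yes" if probability > cutoff else "No"

scores = [0.10, 0.49, 0.51, 0.90]  # made-up probability scores

print([classify(p) for p in scores])              # ['No', 'No', 'Yes', 'Yes']
print([classify(p, cutoff=0.8) for p in scores])  # ['No', 'No', 'No', 'Yes']
```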

**RapidMiner Issues on Logistic Regression**

- RapidMiner does not use the logit model to make the binomial predictions, but instead uses a Support Vector Machine (SVM) algorithm. Therefore the coefficients (the b's above) cannot be interpreted in a conventional manner. However it is easy to "deploy" and see the scores for each classified record.
- To see a traditional logit function developed using your data, and also to see the odds function, use the Weka extension, or **W-logistic** operator, within RapidMiner.
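For reference, the "conventional" interpretation of a logit coefficient is that a one-unit increase in a predictor multiplies the odds by exp(b). The Python sketch below illustrates this with a hypothetical intercept and coefficient; the numbers do not come from RapidMiner, Weka, or any fitted model.

```python
import math

def odds_ratio(coefficient):
    """Factor by which the odds change per one-unit increase in a predictor."""
    return math.exp(coefficient)

def odds(x, intercept=-1.0, coefficient=0.5):
    """Odds from a one-predictor logit model with hypothetical parameters."""
    return math.exp(intercept + coefficient * x)

b_education = 0.5  # hypothetical logit coefficient
print(odds_ratio(b_education))  # multiplier on the odds per extra unit
print(odds(3) / odds(2))        # same multiplier, computed from the odds themselves
```

This interpretation holds only for coefficients of a genuine logit model, which is why it does not carry over to coefficients produced by a different algorithm.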


## Comments

Do we always have to normalize our data set (input variables) when multiple variables are used, where variables are of different ranges?

There are different ways to normalize (or to make the data non-dimensional).

Say,

"(Xi - Xmean) / Std-dev" (Std-dev = 1 sigma), or

Xi / |Xmin - Xmax| for a continuous variable (or, say, a -1 to +1 range).

For a discrete variable, say, gender (male=1 and female=2), do we have to normalize it? If so, how?

I would like to have your kind advice.

With regards,

Chinmoy PAL @NATC

PS:

Whenever I do PCA (principal component analysis), I always normalize the data by (Xi - Xmean) / Std-dev (or 1 sigma).

PCA is based on eigenvalue analysis.

It is the usual practice to normalize the dataset so that the eigenvalues of the PCA are not influenced by the order of magnitude of the data.