While models always carry some risk, it is important to know how to mitigate it. In this article, we will focus on 6 checkpoints to ensure that bivariate analyses used to develop models (such as simple regression models), or to verify whether two parameters are related, are valid. Finally, we will briefly mention some advantages of using mutual information over simple regression models for bivariate analysis.
Checkpoint 1: The first checkpoint to consider before accepting any simple regression model is, of course, to quantify the r-squared, also known as the "coefficient of determination". r-squared measures how much of the variability in the dependent parameter is explained by the independent parameter.
Addendum to 1: (by Sangeetha Krishnan)
In most cases of linear regression the r-squared value lies between 0 and 1. The ideal range for r-squared varies across applications; for example, in social and behavioral science models, relatively low values are typically acceptable. Generally, very low values (roughly < 0.2) indicate that the variables in your model do not explain the outcome satisfactorily. Conversely, very high values (> 0.8) can indicate too strong a dependency, which can reduce the model's predictive value on new data.
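Checkpoint 1 can be sketched in a few lines. The following is a minimal illustration, with made-up data, of computing r-squared from a least-squares fit as one minus the ratio of residual to total sum of squares:

```python
# Checkpoint 1 sketch: computing r-squared for a simple linear regression.
# The data arrays here are illustrative.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# Fit y = slope * x + intercept by least squares.
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

# r-squared = 1 - (residual sum of squares / total sum of squares),
# i.e. the fraction of variability in y explained by x.
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(round(r_squared, 3))
```

Because the example data lie nearly on a straight line, r-squared here comes out close to 1; with real data you would compare it against the ranges discussed above.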
Checkpoint 2: Once a regression model is fit through the sample data points, a t-statistic must be used to check whether the slope of the model is different from zero. But why not simply check the slope (even visually) of the model? The t-test ensures that the population slope (not just the sample slope) is different from zero. This, of course, requires assuming that the sampling distribution of the slope estimate is normal.
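This slope test is built into common statistics libraries. As a sketch, with simulated data, SciPy's `linregress` reports the two-sided p-value of exactly this t-test on the slope:

```python
# Checkpoint 2 sketch: testing whether the population slope differs from
# zero. scipy.stats.linregress returns the two-sided p-value for the
# t-test on the slope; the data here are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 1.5 * x + rng.normal(0, 1.0, size=x.size)  # clear linear signal + noise

result = stats.linregress(x, y)
print(f"slope = {result.slope:.3f}, p-value = {result.pvalue:.2e}")
# A small p-value (e.g. < 0.05) rejects the hypothesis that the
# population slope is zero.
```

With this strong a signal the p-value is effectively zero; for noisier real data the test is what tells you whether a visually plausible slope is statistically distinguishable from zero.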
Checkpoint 3: Which brings us to the next check: ensuring that the error terms in the model are normally distributed. Fortunately, most standard statistical packages provide diagnostics for this, but it is good to verify that the check has actually been performed.
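One common way to run this residual check yourself is a normality test on the fitted residuals. A minimal sketch, using simulated data and the Shapiro-Wilk test from SciPy (one of several normality tests you could choose):

```python
# Checkpoint 3 sketch: testing whether the residuals (error terms) of a
# fitted simple regression look normally distributed, using the
# Shapiro-Wilk test. The data are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)

# Fit the model, then compute the residuals.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

stat, p_value = stats.shapiro(residuals)
# A p-value above the chosen significance level (e.g. 0.05) means we
# cannot reject normality of the error terms.
print(f"Shapiro-Wilk p-value = {p_value:.3f}")
```

A histogram or Q-Q plot of `residuals` is a useful visual complement to the formal test.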
Checkpoint 4: If you are using the model to predict, make sure that the value of the predictor lies within the range of the sample data used to build the model; the model has not been validated for extrapolation beyond that range.
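This checkpoint can be enforced programmatically with a simple guard. The function below is a hypothetical helper (its name and signature are illustrative, not from any library) that refuses to predict outside the training domain:

```python
# Checkpoint 4 sketch: refuse to predict outside the range of the sample
# data used to build the model. Names here are illustrative.
import numpy as np

def predict_within_range(x_new, x_train, slope, intercept):
    """Predict with a simple regression model, but only inside the
    observed domain of the training predictor."""
    lo, hi = np.min(x_train), np.max(x_train)
    if not (lo <= x_new <= hi):
        raise ValueError(
            f"x_new={x_new} is outside the training range [{lo}, {hi}]; "
            "the model has not been validated for extrapolation."
        )
    return slope * x_new + intercept

x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(predict_within_range(3.5, x_train, slope=2.0, intercept=0.5))  # 7.5
```

Calling the same function with, say, `x_new=10.0` raises an error instead of silently extrapolating.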
Checkpoint 5: Passing checks 1 and 2 will ensure that the independent and dependent variables are related. However, this does not imply that the independent variable is the cause and the dependent variable the effect. Remember that correlation is not causation!
Checkpoint 6: Highly non-linear relationships will cause simple regression models to fail checks 1 through 3. However, this does not mean that the two variables are unrelated. In such cases it may become necessary to resort to somewhat more advanced bivariate analysis methods; the use of mutual information for testing whether two variables are related is highly effective here.
Mutual information will very simply tell you whether variable X is related to variable Y, by quantifying how much the uncertainty in predicting Y is reduced once X is known. Furthermore, mutual information can handle jumps or discontinuities within the sample data - for example, the X data need not be uniformly spaced. Such jumps in the data are well captured by mutual information, as are non-linearities.
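The contrast with simple regression can be seen on a deliberately non-linear example. The sketch below uses simulated quadratic data and a simple histogram-based ("plug-in") mutual information estimator written for illustration - real analyses often use more refined estimators - to show that the linear correlation is near zero while the mutual information is clearly positive:

```python
# Sketch for Checkpoint 6: on a strongly non-linear (quadratic) relation,
# Pearson correlation is near zero but mutual information clearly detects
# the dependence. The histogram-based estimator below is a simple
# illustrative plug-in estimate, reported in nats.
import numpy as np

def mutual_information(x, y, bins=10):
    """Plug-in (histogram) estimate of mutual information, in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y
    nz = pxy > 0                              # skip empty cells: avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=1000)
y = x ** 2 + rng.normal(0, 0.1, size=x.size)  # non-linear dependence

corr = np.corrcoef(x, y)[0, 1]   # near zero: a linear fit finds nothing
mi = mutual_information(x, y)    # clearly positive: X and Y are related

print(f"Pearson correlation = {corr:.3f}")
print(f"Mutual information  = {mi:.3f} nats")
```

The symmetric quadratic relation makes the linear correlation vanish, yet knowing X pins down Y almost exactly, which is precisely what the positive mutual information reflects.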
For many business analytics problems, the challenge is frequently knowing which technique best answers the question posed; often it is a combination or ensemble of techniques. Sign up for our FREE tool visTASC below, which will help you with this very important step.