3 basic concepts which underpin the chi-square test
In the last couple of articles we discussed what types of business analytics problems can be addressed by the chi-square test for independence, and how to implement the test with an actual example. In this article we will discuss the mechanics of the technique itself and try to understand why and how it works.
Remember that the chi-square test is needed when the data are categorical (or nominal) in nature: for example, if a variable is the type of financial investment, its range of values could be stocks, bonds and cash. The analysis therefore involves counting occurrences (of stocks, bonds and cash) and comparing variables (type of investment, customer demographic, etc.) based on those counts. In other words, the chi-square test works by keeping track of frequencies of occurrences.
The chi-square test checks whether the frequencies of occurrences across a pair of categorical variables (such as type of investment and customer demographic) are associated with each other. It can be thought of as a means of detecting "categorical correlations".
The underlying principle is based on probabilities of occurrences: if event A happens, does that change the probability that event B also happens? The multiplication law of probabilities states that if event A is independent of event B, then the probability of A and B happening together is simply (pa * pb), where pa and pb are the individual probabilities of A and B.
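For each cell in a contingency table, we first compute this joint probability, taking pa and pb to be the proportions of all occurrences that fall in the cell's row and in the cell's column respectively. The next step is to convert this joint probability into an "expected frequency", which is simply (pa*pb*N), where N is the total number of occurrences in the dataset.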
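To make the expected-frequency step concrete, here is a minimal sketch in plain Python. The counts, the age groups and the function name `expected_frequencies` are made up for illustration and do not come from the earlier articles.

```python
# Hypothetical observed counts (illustrative only).
# Rows: customer age group; columns: stocks, bonds, cash.
observed = [
    [30, 10, 10],   # under 40
    [20, 20, 10],   # 40 and over
]

def expected_frequencies(table):
    """Expected count for each cell if rows and columns were independent."""
    n = sum(sum(row) for row in table)               # N: total occurrences
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # pa = row_total / N, pb = col_total / N, expected = pa * pb * N
    return [
        [(r / n) * (c / n) * n for c in col_totals]
        for r in row_totals
    ]

expected = expected_frequencies(observed)
for row in expected:
    print([round(x, 2) for x in row])
```

With these counts both rows have the same total, so both rows get the same expected frequencies: 25 for stocks, 15 for bonds and 10 for cash.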
The test of independence between two variables is done by checking how close this expected frequency is to the actual observed frequency for each cell in the table. For each cell, the difference between the observed and expected frequency is squared and divided by the expected frequency, and these values are added up over all cells to give the chi-square statistic (hence the "square" in CHI-SQUARE). If all expected frequencies are equal (or very close) to the corresponding observed frequencies, this statistic will be very low. In such a case, we have no evidence against independence and treat the two variables as independent (or not related).
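The comparison of observed and expected frequencies can be sketched as follows, again in plain Python with made-up counts. The function name `chi_square_statistic` is hypothetical, and the cutoff of 5.991 is the standard table value for 2 degrees of freedom at the 5% significance level; a different table shape or significance level would need a different cutoff.

```python
# Hypothetical observed counts (illustrative only).
# Rows: customer age group; columns: stocks, bonds, cash.
observed = [
    [30, 10, 10],
    [20, 20, 10],
]

def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n   # same as pa * pb * N
            statistic += (obs - exp) ** 2 / exp
    # Degrees of freedom: (number of rows - 1) * (number of columns - 1)
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

statistic, dof = chi_square_statistic(observed)

# Standard table value for dof = 2 at the 5% level (assumed significance level).
CRITICAL_VALUE_DF2_5PCT = 5.991
reject_independence = statistic > CRITICAL_VALUE_DF2_5PCT

print(round(statistic, 3), dof, reject_independence)
```

For these counts the statistic is about 5.33 with 2 degrees of freedom, which falls below 5.991, so the sketch would not reject independence at the 5% level.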
- Use chi-square to test if two categorical variables are related or independent
- The chi-square test rests on the multiplication law of probabilities
- Exercise caution when the expected frequencies are "very small". In a future article we will talk about some special cases of application and explore this point in detail.
The chi-square test of independence is a very useful tool for any predictive analytics professional. What other types of business problems are best solved by using this tool?