### 2 key assumptions to be aware of before applying the chi-square test

**(contributed by Sangeetha Krishnan)** In previous articles we introduced several common business analytics applications of the chi-square test, described the steps for implementing it, and discussed how the technique works. In this article we touch upon two important assumptions behind the chi-square test that analysts must heed.

**1. Sample size assumption:**

The chi-square test can be used to determine differences in proportions using a two-by-two contingency table. It is important to understand, however, that the chi-square test yields only an approximate p-value, to which a continuity correction is sometimes applied for two-by-two tables. The approximation works well only when the dataset is large enough. When sample sizes are small, as indicated by more than 20% of the contingency cells having expected values below 5, **Fisher's exact test** may be more appropriate. It belongs to a class of “exact tests”, so called because the significance of the deviation from the null hypothesis can be calculated exactly rather than approximated.
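As a rough illustration of the sample-size check, the sketch below computes expected cell counts, applies the "more than 20% of cells below 5" rule, and falls back to a two-sided Fisher's exact p-value for a two-by-two table. The function names (`expected_counts`, `small_sample`, `fisher_exact_2x2`) and the example counts are hypothetical, and the two-sided p-value uses one common convention (summing all tables no more probable than the observed one); statistical packages may offer other options.

```python
from math import comb

def expected_counts(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def small_sample(table, threshold=5.0, max_share=0.20):
    """True when more than max_share of the cells have an expected count below threshold."""
    cells = [e for row in expected_counts(table) for e in row]
    low = sum(1 for e in cells if e < threshold)
    return low / len(cells) > max_share

def fisher_exact_2x2(table):
    """Two-sided Fisher's exact p-value for a 2x2 table (hypergeometric probabilities)."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        # Probability that the top-left cell equals x, with all margins fixed.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum every table that is no more probable than the observed one
    # (small tolerance guards against floating-point ties).
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))

# Hypothetical small table: every expected count is 2.0, so the rule is triggered.
table = [[3, 1], [1, 3]]
if small_sample(table):
    print("Fisher's exact p-value:", round(fisher_exact_2x2(table), 4))
```

With counts this small, every expected value falls below 5, so the exact test is the safer choice; a table such as `[[30, 20], [20, 30]]` would pass the check and the ordinary chi-square approximation would be reasonable.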

**2. Independence assumption:**

Secondly, the chi-square test cannot be used on correlated data. When you want to test differences in proportions among matched pairs in a before/after scenario, an appropriate choice is **McNemar's test**. In essence, it is a chi-square goodness-of-fit test on the two discordant cells (the pairs whose response changed between the two measurements), with a null hypothesis stating that 50% of the changes go in each direction. The test requires the same subjects to be included in the before and after measurements, i.e. the pairs must be matched one-to-one.
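To make the "goodness-of-fit on the two discordant cells" idea concrete, here is a minimal sketch of McNemar's test using only the standard library. The function name `mcnemar` and the counts `b = 15`, `c = 5` are hypothetical; the chi-square version is shown without a continuity correction, alongside an exact binomial version that is often preferred when the discordant counts are small.

```python
from math import comb, erfc, sqrt

def mcnemar(b, c):
    """McNemar's test from the two discordant counts.

    b: pairs that changed from 'yes' before to 'no' after
    c: pairs that changed from 'no' before to 'yes' after
    Returns (chi-square statistic, approximate p-value, exact p-value).
    """
    n = b + c
    if n == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    # Goodness-of-fit statistic against a 50/50 split of the changes.
    stat = (b - c) ** 2 / n
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_chi2 = erfc(sqrt(stat / 2))
    # Exact two-sided p-value from Binomial(n, 0.5), capped at 1.
    k = min(b, c)
    p_exact = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)
    return stat, p_chi2, p_exact

# Hypothetical before/after study: 15 subjects switched one way, 5 the other.
stat, p_chi2, p_exact = mcnemar(15, 5)
print(f"statistic={stat:.2f}, approx p={p_chi2:.4f}, exact p={p_exact:.4f}")
```

Note that the concordant pairs (subjects whose answer did not change) never enter the calculation; only the direction of the changes matters.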

The **chi-squared test of independence** is a very useful tool for any predictive analytics professional. What other types of business problems are best solved using these tools?

## Comments

However, about the ratio: how do we know if the sample is large enough?

Could quantitative data be analyzed using chi-square?

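Sample size is a perennial question in statistics. A common rule of thumb treats 30 observations as a bare minimum; for the chi-square test specifically, the expected cell counts described in the article are the more direct check.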

Chi-square can be used for numerical or quantitative data: you simply convert the numbers into ranges. For example, if you have a variable with values 2.5, 3.1, 1.0, 5.6, 7.0, you can create 3 bins that leave no gaps between them: less than 3.0, 3.0 up to (but not including) 6.0, and 6.0 or more.
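The binning idea can be sketched in a few lines. The cut points 3.0 and 6.0 and the label wording are assumptions chosen so that every value falls into exactly one bin; the resulting counts are what would then go into a contingency table.

```python
from bisect import bisect_right
from collections import Counter

values = [2.5, 3.1, 1.0, 5.6, 7.0]

# Assumed cut points: bins are (-inf, 3.0), [3.0, 6.0), [6.0, +inf).
edges = [3.0, 6.0]
labels = ["below 3.0", "3.0 to below 6.0", "6.0 and above"]

def bin_label(x):
    """Return the label of the bin that x falls into."""
    return labels[bisect_right(edges, x)]

counts = Counter(bin_label(v) for v in values)
for label in labels:
    print(label, counts[label])
```

In practice the bin edges should be chosen before looking at the outcome, and wide enough that the expected counts in each cell stay reasonably large.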