When you run a market basket analysis, the deliverable is a set of practical rules that a store can apply to maximize its cross-selling opportunities. The objective is to establish statistical confidence in the applicability of those rules. For example, when a rule states “IF X is purchased, THEN Y is also likely to be purchased”, we also need to know how likely it is that the rule reflects a real pattern rather than a random occurrence. Here “X” is called the antecedent and “Y” is called the consequent.
One problem with market basket analysis is that it often generates too many rules, so it becomes important to filter them and keep only the strongest or most relevant. What are the basic requirements for filtering these association rules?
Let us do a quick recap of the basic measures of interest which are important in this type of analysis. There are several interest measures, out of which the following three are used most frequently. We covered some of these in another article on association rules.
Support: The probability that an itemset appears in a transaction. An itemset can be a combination of goods (shoes + socks, nail polish + nail file, etc.) or a single item by itself.
Support = p(purchasing an itemset) = (# of transactions with the itemset) / (# of total transactions)
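As a minimal sketch of the support calculation, the snippet below counts how many baskets contain every item of an itemset and divides by the total number of baskets. The toy transactions and the function name `support` are hypothetical and chosen only for illustration.

```python
# Hypothetical toy data: each transaction is the set of items in one basket.
transactions = [
    {"shoes", "socks"},
    {"shoes", "socks", "nail polish"},
    {"nail polish", "nail file"},
    {"shoes"},
    {"socks"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


print(support({"shoes"}, transactions))           # shoes appear in 3 of 5 baskets
print(support({"shoes", "socks"}, transactions))  # both appear in 2 of 5 baskets
```

Note that a single item is treated as an itemset of size one, so the same function covers both cases.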
Confidence: Gives the conditional probability that a transaction will contain the consequent given that it includes the antecedent.
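Confidence = p(consequent | antecedent) = (# of transactions with both antecedent and consequent) / (# of transactions with the antecedent)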
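Confidence can be sketched the same way: restrict attention to the baskets that contain the antecedent, then ask what fraction of those also contain the consequent. The data and the function name `confidence` below are hypothetical, for illustration only.

```python
# Hypothetical toy data: each transaction is the set of items in one basket.
transactions = [
    {"shoes", "socks"},
    {"shoes", "socks", "nail polish"},
    {"nail polish", "nail file"},
    {"shoes"},
    {"socks"},
]


def confidence(antecedent, consequent, transactions):
    """Estimate p(consequent | antecedent) from transaction counts."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Keep only the baskets that contain the whole antecedent.
    with_antecedent = [b for b in transactions if antecedent <= b]
    if not with_antecedent:
        return 0.0  # the rule never fires, so there is nothing to estimate
    both = sum(1 for b in with_antecedent if consequent <= b)
    return both / len(with_antecedent)


# Shoes appear in 3 baskets, and 2 of those also contain socks.
print(confidence({"shoes"}, {"socks"}, transactions))
```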
Lift Ratio: Some books define lift as the ratio of confidence to a “benchmark” confidence. The benchmark confidence considers only the consequent of the rule and ignores the antecedent.
Benchmark confidence = (# of transactions with the consequent) / (# of total transactions)
This is simply the support of a consequent (see the earlier definition of Support). So if we define
Lift Ratio = Confidence/Benchmark Confidence, then
=> Lift Ratio = p(consequent|antecedent)/p(consequent)
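This implies that the lift ratio tells us how much more likely the consequent is to be purchased when the antecedent is purchased than it is in general. p(consequent) is the probability of purchasing the consequent across all transactions, regardless of the antecedent, and it can be less than p(consequent | antecedent). When it is, the lift ratio is greater than 1, and for strong associations it can be much greater than 1. A lift ratio close to 1 means the antecedent tells us essentially nothing about the consequent. A lift ratio >> 1 is what is of interest!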
We should be looking for those association rules which tell us that item Y is far more likely to be purchased in transactions that contain item X than it is across all transactions. This is what is captured by the Lift Ratio. This is one good filter to use in reducing the number of association rules.
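To make the lift filter concrete, here is a minimal sketch that computes lift as confidence divided by the support of the consequent and keeps only the rules above a cutoff. The toy transactions, the candidate rules, and the `MIN_LIFT` value of 1.5 are assumptions made for illustration; a suitable cutoff depends on the data.

```python
# Hypothetical toy data: each transaction is the set of items in one basket.
transactions = [
    {"shoes", "socks"},
    {"shoes", "socks", "nail polish"},
    {"nail polish", "nail file"},
    {"shoes"},
    {"socks"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for b in transactions if itemset <= b) / len(transactions)


def lift(antecedent, consequent, transactions):
    """Lift ratio = p(consequent | antecedent) / p(consequent)."""
    antecedent, consequent = set(antecedent), set(consequent)
    p_antecedent = support(antecedent, transactions)
    p_consequent = support(consequent, transactions)
    if p_antecedent == 0 or p_consequent == 0:
        return 0.0  # rule cannot be evaluated on this data
    conf = support(antecedent | consequent, transactions) / p_antecedent
    return conf / p_consequent


# Hypothetical candidate rules, written as (antecedent, consequent) pairs.
candidate_rules = [
    ({"shoes"}, {"socks"}),
    ({"nail polish"}, {"nail file"}),
    ({"socks"}, {"nail polish"}),
]

MIN_LIFT = 1.5  # assumed cutoff, not a recommended value
strong_rules = [
    (a, c) for a, c in candidate_rules if lift(a, c, transactions) >= MIN_LIFT
]

for a, c in strong_rules:
    print(sorted(a), "->", sorted(c), round(lift(a, c, transactions), 2))
```

In this toy data only the nail polish -> nail file rule survives the filter, because nail files are rare overall but appear in half of the baskets that contain nail polish.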
But there is one major caveat. We must be careful to start with choosing itemsets which have a certain minimum level of Support. In other words, we must choose itemsets which meet a minimum sales threshold. But what is this minimum support?
Sometimes a simple percentage may be sufficient. For example, if you have a database of 2000 transactions, you can set a 10% support threshold. This would filter out any itemsets which appear in fewer than 200 transactions. Although this is a good starting point, it may not reveal the whole story. In a database of millions of point of sale (POS) transactions, if we find that an itemset has fewer than a dozen entries, then its support is very low. But what if the dollar value of those transactions is very high? Then the itemset is still significant, so discarding it because of low support would not be a good idea.
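One way to sketch this idea is to keep an itemset if it either meets the support threshold or carries enough dollar value to matter on its own. The itemset counts, the revenue figures, the `min_revenue` cutoff, and the function name `keep_itemset` below are all hypothetical; using total revenue as the measure of dollar value is also an assumption.

```python
N_TRANSACTIONS = 2000  # size of the example database

# Hypothetical summary per itemset: (transaction count, total revenue in dollars).
itemset_stats = {
    ("shoes", "socks"): (260, 7800.0),
    ("watch", "watch strap"): (8, 12000.0),
    ("chewing gum",): (40, 60.0),
}


def keep_itemset(count, revenue, n_transactions,
                 min_support=0.10, min_revenue=5000.0):
    """Keep an itemset that is frequent enough OR valuable enough."""
    frequent = count / n_transactions >= min_support
    valuable = revenue >= min_revenue
    return frequent or valuable


kept = [
    itemset
    for itemset, (count, revenue) in itemset_stats.items()
    if keep_itemset(count, revenue, N_TRANSACTIONS)
]

for itemset in kept:
    print(itemset)
```

Here the watch itemset falls far below the 10% support threshold but is retained because of its revenue, while the low-support, low-value itemset is dropped.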
The bottom line is that simply applying the algorithm to the available data may result in a profusion of association rules. If we are not careful about how we cull these rules, we may lose some valuable information. The same caution applies to many other data science implementations of business use cases.