We have previously discussed how to apply a chi squared calculator to run a simple variant of a market basket analysis. In this article we will use RapidMiner to run a more sophisticated analysis.
A market basket database typically consists of a large number of transaction records. Each record lists all items purchased during a single customer transaction. The objective of this data mining exercise is to identify whether certain groups of items are usually purchased together. The result is a set of rules, called association rules, which summarize item associations as follows:
if [A] is purchased --> then [B] is also purchased, [x%] of the time.
These association rules can be applied in an old-fashioned brick and mortar setting as well as in an online setting for real-time cross-selling or ad placement. In this article, we will cover some main aspects of applying association rules for running a market basket analysis and show how this can be set up using RapidMiner.
Two essential concepts - Support and Confidence:
A key idea to get comfortable with is that of frequent item sets. An item set can consist of one or more items. In our earlier example, which consisted of customer transactions involving purchases of typical cosmetics items, one frequent item set could be [brushes, nail polish].
Frequent item sets are quantified by support, which is the ratio of the number of transactions in which [brushes, nail polish] appear together to the total number of transactions.
Support = occurrences of [brushes, nail polish]/total # of transactions
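As a quick illustration of the support ratio, here is a minimal Python sketch. The four transactions and the `support` helper are made up for this example and are not taken from the earlier article's data.

```python
# Hypothetical toy transactions; each set is one customer basket.
transactions = [
    {"brushes", "nail polish"},
    {"brushes", "nail polish", "blush"},
    {"blush", "concealer"},
    {"concealer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

pair_support = support({"brushes", "nail polish"}, transactions)
print(pair_support)  # 2 of the 4 baskets contain both items -> 0.5
```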
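The next important metric that you will need to run a market basket analysis is confidence. Unlike support, confidence is defined for a rule rather than for an item set. Extending the above example, the confidence of the rule [brushes] --> [nail polish] is the fraction of transactions containing brushes that also contain nail polish:
confidence ([brushes] --> [nail polish]) = occurrences of [brushes, nail polish]/occurrences of [brushes]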
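The confidence ratio can be sketched the same way. The transactions and the helper names `count` and `confidence` below are assumptions made for illustration only. Note that confidence depends on the direction of the rule.

```python
# Hypothetical toy transactions; each set is one customer basket.
transactions = [
    {"brushes", "nail polish"},
    {"brushes", "nail polish", "blush"},
    {"brushes", "blush"},
    {"blush", "concealer"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in transactions if itemset <= basket)

def confidence(antecedent, consequent, transactions):
    """Share of baskets with `antecedent` that also hold `consequent`."""
    both = count(set(antecedent) | set(consequent), transactions)
    base = count(antecedent, transactions)
    return both / base if base else 0.0

conf_value = confidence({"brushes"}, {"nail polish"}, transactions)
print(round(conf_value, 3))  # 2 of the 3 brushes baskets -> 0.667
```

Reversing the rule gives a different value: in this toy data every basket with nail polish also contains brushes, so `confidence({"nail polish"}, {"brushes"}, transactions)` is 1.0.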
Setting up a market basket analysis using RapidMiner
In RapidMiner, association rules are extracted using two operators in a sequence. The first operator, called FP Growth, is required to generate frequent item sets. The second operator, Create Association Rules, then produces the IF-THEN rules based on the confidence requirement.
Before these two operators can run, you may need some pre-processing steps to select the attributes you want and, more importantly, to convert the input data to the binomial (true/false) format required by the FP Growth operator.
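To make the binomial format concrete, here is a small Python sketch of the same idea outside RapidMiner: each basket becomes a row with one true/false value per item. The baskets and variable names are hypothetical, and this only illustrates the target layout, not RapidMiner's own conversion operators.

```python
# Hypothetical raw baskets, one list of purchased items per transaction.
baskets = [
    ["brushes", "nail polish"],
    ["blush", "concealer", "brushes"],
    ["concealer"],
]

# One column per distinct item, in a stable (alphabetical) order.
items = sorted({item for basket in baskets for item in basket})

# One row per basket: True if the item was purchased, False otherwise.
rows = [{item: (item in basket) for item in items} for basket in baskets]

for row in rows:
    print(row)
```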
Tip 1: When using the FP Growth operator, the important parameter is "min support". RapidMiner will find only those item sets which exceed this minimum support value. However, if you check the box for "find min number of item sets", then priority is given to the "min number of item sets" field: RapidMiner will keep reducing the support threshold until it finds at least the number of item sets specified there.
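The threshold-lowering behaviour described in Tip 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: it enumerates item sets by brute force rather than with the FP-Growth algorithm, it treats an item set as frequent when its support is at or above the threshold, and the starting threshold and step size (`start_pct`, `step_pct`) are invented for the example and are not RapidMiner's actual schedule.

```python
from itertools import combinations

# Hypothetical toy transactions; each set is one customer basket.
transactions = [
    {"brushes", "nail polish"},
    {"brushes", "nail polish", "blush"},
    {"blush", "concealer"},
    {"concealer"},
]

def frequent_itemsets(transactions, min_support):
    """Brute-force search (not FP-Growth): support of every item combination."""
    items = sorted(set().union(*transactions))
    total = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            hits = sum(1 for basket in transactions if set(combo) <= basket)
            if hits / total >= min_support:
                result[frozenset(combo)] = hits / total
    return result

def lower_until_enough(transactions, min_count, start_pct=95, step_pct=5):
    """Reduce the support threshold until at least `min_count` sets are found."""
    pct, found = start_pct, {}
    while pct > 0:
        found = frequent_itemsets(transactions, pct / 100)
        if len(found) >= min_count:
            return pct / 100, found
        pct -= step_pct
    # Target never reached: report the last threshold that was tried.
    return (pct + step_pct) / 100, found

threshold, found = lower_until_enough(transactions, min_count=5)
print(threshold, len(found))  # 0.5 5
```

In this toy data no item set has support above 0.5, so the search only succeeds once the threshold has dropped to 0.5, where the four single items and the pair [brushes, nail polish] qualify.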
Tip 2: After finding the frequent item sets, the next step in the process is to extract rules which meet the confidence requirement. You can provide this in the "min confidence" field under the parameter options for the Create Association Rules operator.
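The rule extraction step in Tip 2 can be sketched in Python like this: every way of splitting a frequent item set into an IF part and a THEN part is tried, and only splits that meet the minimum confidence are kept. The data and the function names `support_count` and `rules_from_itemset` are hypothetical, and the sketch does not claim to mirror how the Create Association Rules operator is implemented.

```python
from itertools import combinations

# Hypothetical toy transactions; each set is one customer basket.
transactions = [
    {"brushes", "nail polish"},
    {"brushes", "nail polish", "blush"},
    {"brushes", "blush"},
    {"blush", "concealer"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for basket in transactions if set(itemset) <= basket)

def rules_from_itemset(itemset, transactions, min_confidence):
    """Split one item set into IF/THEN parts and keep the confident rules."""
    itemset = frozenset(itemset)
    whole = support_count(itemset, transactions)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            base = support_count(antecedent, transactions)
            if base == 0:
                continue  # antecedent never occurs, so no rule can be formed
            conf = whole / base
            if conf >= min_confidence:
                consequent = sorted(itemset - set(antecedent))
                rules.append((list(antecedent), consequent, conf))
    return rules

rules = rules_from_itemset({"brushes", "nail polish"}, transactions, 0.8)
print(rules)  # [(['nail polish'], ['brushes'], 1.0)]
```

With a minimum confidence of 0.8, only [nail polish] --> [brushes] survives here; the reverse rule has a confidence of about 0.667 and is dropped.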
Tip 3: When the above process is run, RapidMiner will generate outputs for both the FP Growth and Create Association Rules operators. The FP Growth output is a table of the item sets found under the settings from Tip 1, along with their support values. The association rules output consists of a text view, a table view and graphical views of the extracted rules. Surprisingly, the simplest and most intuitive view is the text view, which will show rules such as these:
Association Rules
[Blush] --> [Concealer] (confidence: 0.738)
[Brushes] --> [Nail Polish] (confidence: 1.000)