Market basket analysis aims to detect relationships or associations between specific items in a large "catalog" of objects. A simple example would be the occurrence of diapers and baby formula in the same sales transaction. However, the real value of association rule analysis lies in finding connections between items that do not intuitively belong together: it would indeed be surprising (and valuable) to a retailer to find a strong association between beer and diapers! But how do you know whether such an association rule is statistically meaningful and not just a random coincidence? One way to test this is to calculate the lift ratio.
We discussed a couple of key measures that are needed to correctly apply association rules for market basket analysis: support and confidence. A third quantity of importance is lift. The lift ratio is defined as the ratio between the confidence of a rule and a benchmark confidence. The benchmark confidence is the confidence the rule would have if its two sides were independent, which is simply the support of the consequent (the fraction of all transactions that contain it). A lift ratio higher than 1.0 indicates that the items occur together more often than independence would predict, so there is some value to the rule.
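To make the three measures concrete, here is a minimal sketch in plain Python. The transactions and item names are hypothetical toy data, and the function names `support`, `confidence`, and `lift` are illustrative choices, not part of any RapidMiner API.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy transactions; item names are illustrative only.
transactions: List[FrozenSet[str]] = [
    frozenset({"beer", "diapers"}),
    frozenset({"beer", "diapers", "chips"}),
    frozenset({"diapers", "formula"}),
    frozenset({"beer", "chips"}),
    frozenset({"formula"}),
]


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """support(antecedent and consequent) / support(antecedent).

    Assumes the antecedent occurs at least once (otherwise this divides by zero).
    """
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent: Iterable[str], consequent: Iterable[str],
         transactions: List[FrozenSet[str]]) -> float:
    """Confidence divided by the benchmark confidence, i.e. support(consequent)."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))


beer_diapers_lift = lift({"beer"}, {"diapers"}, transactions)
print(f"lift(beer -> diapers) = {beer_diapers_lift:.3f}")
```

In this toy data, beer and diapers each appear in 3 of 5 transactions and together in 2 of 5, so the confidence is 0.4 / 0.6 and the lift is that value divided by 0.6, slightly above 1.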
There are three simple steps for generating usable recommender systems using a data mining tool such as RapidMiner: data preparation, rule generation, and rule application. In this article we will discuss the first two steps.
In retail, your data may consist of products sold, transaction details, and customer ids. For a website, it could be pages viewed and session ids. Below is one example of transaction data. The "p_id" is the product id and the "amount" is the number of units of that product sold in a given transaction.
This data has to be transformed into a format where each row is a separate transaction id (or session id) and each column is a product sold (or page viewed). Each cell of this table is 0 or 1, depending on whether that particular product was sold during a given transaction. The transformed data is shown below. The column headings ("purchased_1.0", "purchased_10.0", etc.) indicate whether the product with p_id 1.0 or p_id 10.0 was involved in the transaction designated by row id t_1, t_10, etc.
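The same reshaping can be sketched outside RapidMiner in a few lines of plain Python. The rows below are hypothetical sample data, and `pivot_to_binary` is an illustrative helper name; only the column names ("t_id", "p_id", "amount", "purchased_...") follow the example above.

```python
from typing import Dict, List, Tuple

# Hypothetical long-format rows: (t_id, p_id, amount).
rows: List[Tuple[str, float, int]] = [
    ("t_1", 1.0, 2),
    ("t_1", 10.0, 1),
    ("t_10", 1.0, 3),
    ("t_10", 3.0, 1),
]


def pivot_to_binary(rows: List[Tuple[str, float, int]]) -> Dict[str, Dict[str, int]]:
    """Return one row per transaction and one 0/1 column per product."""
    products = sorted({p_id for _, p_id, _ in rows})
    columns = [f"purchased_{p_id}" for p_id in products]
    table: Dict[str, Dict[str, int]] = {}
    for t_id, p_id, amount in rows:
        # Start every transaction with all products set to 0 (not purchased).
        row = table.setdefault(t_id, dict.fromkeys(columns, 0))
        if amount > 0:
            row[f"purchased_{p_id}"] = 1
    return table


binary_table = pivot_to_binary(rows)
for t_id, row in binary_table.items():
    print(t_id, row)
```

Products that never appear in a transaction stay at 0, which plays the same role as replacing missing values with 0 after a pivot.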
The following process steps will make it easy for you to set up a market basket analysis using your data.
- Insert operator “Select attributes” and select the attribute subset that you need. In the above example, we selected “amount”, “p_id”, and “t_id” (see the top table - transaction data)
- Using operator "Rename", change the name of "amount" to "purchased"
- Add the operator “Pivot” to transform the data by grouping by “t_id” and indexing by “p_id”. This step will generate the output shown in the second table. This article provides more information on how pivot tables work in RapidMiner.
- Add a new operator “Set role” and change the role of the attribute “t_id” to “id”.
- Add a new operator “Replace Missing Values” and replace all missing values with 0.
- Add a new operator “Numerical to Binominal”.
- Add a new operator “FP-Growth” to the process. Deliver the frequent item sets named “fre” to the result port “res”.
- Finally, add a new operator “Create Association Rules” and connect the “fre” output of FP-Growth with the “fre” input of this operator. Deliver both the frequent item sets “fre” and the association rules “rul” to the result ports “res” on the right side.
The entire process is shown below.
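For readers who want to see the logic of the last two operators outside RapidMiner, the following is a minimal sketch. Note that it uses brute-force enumeration of candidate item sets rather than the FP-Growth algorithm, so it is only suitable for tiny data; the transactions, thresholds, and function names are assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical toy transactions (not the data set used in this article).
transactions: List[FrozenSet[str]] = [frozenset(t) for t in (
    {"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"a", "b"}, {"a", "c"},
    {"b"}, {"a", "b", "c"}, {"a", "b"}, {"a", "b"}, {"a", "b", "d"},
)]


def frequent_itemsets(transactions: List[FrozenSet[str]],
                      min_support: float) -> Dict[FrozenSet[str], float]:
    """Brute-force stand-in for FP-Growth: map each frequent item set to its support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    frequent: Dict[FrozenSet[str], float] = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            s = sum(frozenset(candidate) <= t for t in transactions) / n
            if s >= min_support:
                frequent[frozenset(candidate)] = s
                found = True
        if not found:
            break  # no frequent set of this size, so no larger one can be frequent
    return frequent


Rule = Tuple[FrozenSet[str], FrozenSet[str], float, float, float]


def association_rules(frequent: Dict[FrozenSet[str], float],
                      min_confidence: float) -> List[Rule]:
    """Return (antecedent, consequent, support, confidence, lift) for each rule."""
    rules: List[Rule] = []
    for itemset, s in frequent.items():
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                a = frozenset(antecedent)
                c = itemset - a
                # Subsets of a frequent item set are frequent, so both lookups succeed.
                conf = s / frequent[a]
                if conf >= min_confidence:
                    rules.append((a, c, s, conf, conf / frequent[c]))
    return rules


frequent = frequent_itemsets(transactions, min_support=0.7)
rules = association_rules(frequent, min_confidence=0.8)
for a, c, s, conf, lift in rules:
    print(sorted(a), "->", sorted(c), f"support={s:.2f} conf={conf:.2f} lift={lift:.2f}")
```

In this toy data only "a", "b", and the pair {"a", "b"} pass the 0.7 support threshold, and both resulting rules have a lift just below 1, which illustrates that high confidence alone does not guarantee a useful rule.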
Analyze the results
In our example, the transaction data had a total of 20 different products. When we run the analysis with the "min support" threshold in the FP-Growth operator set to 0.7 and the "min confidence" threshold in the "Create Association Rules" operator set to 0.8, we find that only four products meet these criteria: those with p_ids 3, 6, 12 and 13. These are shown in the table below.
When you use "Lift" as the criterion (click on the "Create Association Rules" operator and select lift from the drop-down menu under the "Parameters" tab on the right), the results will be different. For example, in this dataset none of the rules derived from these item sets reaches a lift of 1.5.
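One possible explanation for this, offered here as an assumption rather than something read off the RapidMiner output, is that the support threshold itself caps the achievable lift. If the FP-Growth "min support" of 0.7 is still in force, every consequent has a support of at least 0.7, and since confidence cannot exceed 1, the lift cannot exceed 1 / 0.7. The helper name `max_possible_lift` below is hypothetical.

```python
def max_possible_lift(min_support: float) -> float:
    """Upper bound on lift for rules built from item sets with support >= min_support.

    lift = confidence / support(consequent), where confidence <= 1 and
    support(consequent) >= min_support, so lift <= 1 / min_support.
    """
    if not 0 < min_support <= 1:
        raise ValueError("min_support must be in the interval (0, 1]")
    return 1.0 / min_support


bound = max_possible_lift(0.7)
print(f"With min support 0.7, lift can be at most {bound:.3f}")
```

Under this assumption, finding rules with a lift of 1.5 or more would require lowering the minimum support below about 0.67.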