There are many reasons for an analyst to like RapidMiner. One of the most endearing is its drag-and-drop paradigm, which saves you, the analyst, from writing code or remembering syntax. That in itself is worth the "price of admission", a.k.a. the learning curve for the tool!
Consider customer segmentation, a task that is relevant to many businesses. The basic idea is to understand the relationships between purchase behavior and customer demographics. For example, do customers who spend more on fast food also spend more on our services? Do customers who drive minivans spend less on our products? These are very simple cases of separating prospects into appropriate buyer personas, and they can easily be extended to more complex behaviors.
Performing such an analysis can be very cumbersome, depending on how your data is stored. You will need transaction data and demographic data, which typically live in separate databases or flat files, and you will need to write query scripts to bring all the data into one place before you can run any meaningful analysis on it. This is where RapidMiner truly excels: it provides well-packaged operators for the data transformation and, of course, hundreds of algorithms to mine those nuggets of knowledge.
A simple 3-step process for customer segmentation
A basic customer segmentation analysis like the one described takes three steps. Let us suppose you have three datasets: a product dataset consisting of product ids and prices, a transaction dataset consisting of date of sale, customer id and amount of transaction, and finally a customer dataset consisting of demographic info keyed by customer id. (For the steps below to work, each transaction row also needs the product id and the number of units sold.) Typically the transaction data would run into millions of rows and the customer data into thousands (one row per customer), whereas the product data would be a few dozen rows.
The first step is to join the disparate datasets on a common attribute. For the product and transaction datasets you can use a "Join" operator and perform an inner join on product id, which matches each transaction to its product. At the end of the join you will have a new dataset that still runs into millions of rows, and each row will now carry the details of the matching product, such as its price.
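For readers curious about what the "Join" operator is doing conceptually, here is a minimal sketch of an inner join in plain Python. It is not RapidMiner's implementation. The sample rows, the column names (`product_id`, `price`, `quantity`) and the helper `inner_join` are all assumptions made up for illustration.

```python
# Hypothetical sample data; column names are assumptions, not RapidMiner's.
products = [
    {"product_id": "P1", "price": 2.50},
    {"product_id": "P2", "price": 10.00},
]
transactions = [
    {"customer_id": "C1", "product_id": "P1", "quantity": 4},
    {"customer_id": "C2", "product_id": "P2", "quantity": 1},
    {"customer_id": "C1", "product_id": "P9", "quantity": 2},  # no such product
]


def inner_join(left, right, key):
    """Merge each row of `left` with the row of `right` sharing the same `key`.

    Assumes `key` is unique in `right` (one row per product).
    Rows of `left` without a match are dropped, as in an inner join.
    """
    lookup = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = lookup.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


joined = inner_join(transactions, products, "product_id")
for row in joined:
    print(row)
```

Note that the transaction referring to the unknown product `P9` does not appear in the result. That is the defining behavior of an inner join, and it is worth checking for when your row count shrinks unexpectedly after the join.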
The second step is to aggregate the money spent, giving a total spend per customer id. But before doing this we need to calculate the money spent in each transaction (row). This is done by multiplying the price of the product by the number of units bought, which the "Generate Attributes" operator allows. Now we are ready to sum the money spent by each customer using the "Aggregate" operator, with the customer id as the grouping attribute.
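The arithmetic behind this step can be sketched in a few lines of plain Python. This only illustrates the logic and says nothing about how the RapidMiner operators are implemented. The rows and the attribute names `quantity` and `spend` are hypothetical.

```python
from collections import defaultdict

# Hypothetical joined rows (transaction plus product price).
joined = [
    {"customer_id": "C1", "price": 2.50, "quantity": 4},
    {"customer_id": "C2", "price": 10.00, "quantity": 1},
    {"customer_id": "C1", "price": 10.00, "quantity": 2},
]

# Analogue of "Generate Attributes": derive spend per transaction row.
for row in joined:
    row["spend"] = row["price"] * row["quantity"]

# Analogue of "Aggregate": sum spend, grouped by customer id.
total_spend = defaultdict(float)
for row in joined:
    total_spend[row["customer_id"]] += row["spend"]

for customer_id, spend in sorted(total_spend.items()):
    print(customer_id, spend)
```

The result has one row per customer, which is what lets it be matched against the customer demographics later on.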
The final step is to bin the spending into levels using the "Discretize by Frequency" operator, which creates bins that each hold roughly the same number of customers. These three steps are captured here in the graphic.
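To make the idea of frequency-based binning concrete, here is a rough sketch in plain Python. The function name `discretize_by_frequency` and the spend figures are assumptions for illustration. RapidMiner's operator may handle ties and bin boundaries differently from this simple rank-based version.

```python
def discretize_by_frequency(values, n_bins):
    """Map each key to a bin index in 0..n_bins-1, lowest values first.

    Keys are ranked by value and the ranking is cut into n_bins slices
    of (nearly) equal size. Ties are split by sort order, which is a
    simplification.
    """
    ranked = sorted(values, key=values.get)
    n = len(ranked)
    return {key: rank * n_bins // n for rank, key in enumerate(ranked)}


# Hypothetical total spend per customer.
total_spend = {"C1": 30.0, "C2": 10.0, "C3": 55.0, "C4": 5.0, "C5": 42.0, "C6": 18.0}

bins = discretize_by_frequency(total_spend, 3)
labels = ["low", "medium", "high"]
for customer_id in sorted(bins):
    print(customer_id, labels[bins[customer_id]])
```

With six customers and three bins, each level ends up with two customers. This is the difference from binning by fixed spend ranges, where a few big spenders could leave most customers crowded into the lowest bin.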
Insights from a basic plot
The results generated by such a process can reveal very useful patterns. In this case, we see that customers who favor fast food and drive SUVs are the highest revenue generators for the products sold by the business used in the example. We also see that customers who drive family cars and choose organic food spend the least. One can expand on these findings using more advanced techniques such as decision trees, which would then allow you to generate scores for new prospects and offer them the right combination of products to increase your top line.
This very basic customer segmentation analysis can be set up and run using RapidMiner in less than 20 minutes with no coding! How can you not love this tool?
Do you want to systematically learn how to set up analysis for similar problems using RapidMiner? Take our anonymous survey and tell us what you would like us to include in our upcoming book!