Thanks to Amazon, Netflix and the like, recommender systems have become synonymous with data science and in many ways are a benchmark indicator of predictive analytics maturity in an organization, particularly in retail and e-commerce. Their adoption is likely to spread to more non-tech companies and even smaller businesses, thanks to the open source movement in analytics and big data infrastructure. At the end of the day, all content providers (including this website, via Add This!) will make use of recommender engines.
In this article and the next few to follow, we will take a peek under the hood (or bonnet) of these very valuable and interesting algorithms. In particular, we will explore setting them up using RapidMiner. But before we get into that, let us expand a little bit more on the taxonomy of recommender algorithms.
Types of Recommender Systems
Recommender systems fall into 3 broad categories: user based, item based and content based (also called feature based or attribute based). Let us suppose that you are a frequent consumer at a retail/entertainment website and the provider needs to decide whether a new (to you) item or movie should be recommended. User based recommenders try to identify a short list of other users who are “similar” to you and who have rated this new item. They then take a weighted average of the ratings these users gave the item and predict that number as your likely rating for it.
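The weighted average described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ratings, the similarity scores and the function name `predict_user_based` are all made up for the example, and the similarity scores are taken as given rather than computed.

```python
def predict_user_based(target, item, ratings, similarity, k=2):
    """Predict target's rating for item as a similarity-weighted average
    over the k most similar users who have rated that item."""
    neighbours = [
        (similarity[target].get(user, 0.0), user_ratings[item])
        for user, user_ratings in ratings.items()
        if user != target
        and item in user_ratings
        and similarity[target].get(user, 0.0) > 0
    ]
    # Keep only the k most similar users (the "short list").
    neighbours.sort(key=lambda pair: pair[0], reverse=True)
    top = neighbours[:k]
    weight_sum = sum(sim for sim, _ in top)
    if weight_sum == 0:
        return None  # nobody similar has rated this item
    return sum(sim * rating for sim, rating in top) / weight_sum


# Hypothetical data: user -> {item: rating}, and similarity of A to the others.
ratings = {
    "A": {"HP1": 4},
    "B": {"HP1": 5, "HP3": 4},
    "C": {"HP3": 2},
    "D": {"HP3": 5},
}
similarity = {"A": {"B": 0.9, "C": 0.1, "D": 0.5}}

prediction = predict_user_based("A", "HP3", ratings, similarity, k=2)
print(round(prediction, 2))  # 4.36, from (0.9*4 + 0.5*5) / (0.9 + 0.5)
```

Restricting the average to the top `k` users keeps weakly similar users, such as C here, from diluting the prediction.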
Item based recommenders try to identify a short list of other items which are “similar” to the item in question, take a weighted average of the ratings you gave those top few items, and predict that number as your likely rating for the new item. Both user based and item based recommenders require some history of usage from you, i.e., you must have rated some items in the past for these to work. These two types are usually called Collaborative Filtering methods because we filter objects based on similarities in behavior, using “collaboration” between users or items.
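The item based version is the mirror image: the average runs over items you have already rated instead of over other users. Again, this is a sketch with invented numbers; `predict_item_based` and the item similarity scores are assumptions made for illustration.

```python
def predict_item_based(item, user_ratings, item_similarity, k=2):
    """Predict one user's rating for item from that user's own ratings
    of the k items most similar to it."""
    neighbours = [
        (item_similarity[item].get(other, 0.0), rating)
        for other, rating in user_ratings.items()
        if other != item and item_similarity[item].get(other, 0.0) > 0
    ]
    neighbours.sort(key=lambda pair: pair[0], reverse=True)
    top = neighbours[:k]
    weight_sum = sum(sim for sim, _ in top)
    if weight_sum == 0:
        return None  # the user has rated nothing similar to this item
    return sum(sim * rating for sim, rating in top) / weight_sum


# Hypothetical data: your past ratings, and similarity of HP3 to other items.
my_ratings = {"HP1": 4, "HP2": 5, "TW1": 1}
item_similarity = {"HP3": {"HP1": 0.8, "HP2": 0.7, "TW1": 0.1}}

prediction = predict_item_based("HP3", my_ratings, item_similarity, k=2)
print(round(prediction, 2))  # 4.47, from (0.8*4 + 0.7*5) / (0.8 + 0.7)
```

Note that an empty `my_ratings` always yields `None`, which is the history requirement mentioned above.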
But what if you are a totally new user to the site? This calls for the third type of recommender system, the content based or attribute based recommender. Content based recommenders try to find intrinsic similarities between objects, so that a user’s past history of ratings is not really necessary. Content based systems create a profile for each user or item and then apply machine learning techniques such as clustering or classification to find similar items. Profile vectors are made up of tags or categories for items (movies: sci-fi, fantasy, A-list actors, directors etc.) or users (age, gender, race, income, other demographics). We will discuss these in more detail in a later article. For now, let us return to Collaborative Filtering (user based and item based) and establish the baseline data that it requires.
What data is needed for Collaborative Filtering?
The heart of the dataset is what is called a utility matrix. This is a data frame or cross tabulation (table) that shows ratings, with users along the rows and items along the columns.
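As a rough illustration, a utility matrix can be built from raw ratings with plain Python dictionaries. The users, items and ratings below are hypothetical, and `None` marks an item the user has not rated.

```python
# Hypothetical raw ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "A": {"HP1": 4, "HP2": 5, "TW1": 1},
    "B": {"HP1": 5, "HP2": 5, "SW1": 4},
    "C": {"TW1": 2, "SW1": 4, "SW2": 5},
}

# Column order: every item that anyone has rated, sorted by name.
items = sorted({item for user_ratings in ratings.values() for item in user_ratings})

# Utility matrix: one row per user, one column per item, None where unrated.
utility = {
    user: [user_ratings.get(item) for item in items]
    for user, user_ratings in ratings.items()
}


def column(item):
    """Return one column of the matrix: every user's rating for item."""
    j = items.index(item)
    return {user: row[j] for user, row in utility.items()}


print(items)          # ['HP1', 'HP2', 'SW1', 'SW2', 'TW1']
print(utility["A"])   # [4, 5, None, None, 1]
print(column("HP1"))  # {'A': 4, 'B': 5, 'C': None}
```

Most cells of a real utility matrix are empty, since each user rates only a small fraction of the items.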
A user based recommender computes the similarity between a given user A (for example) and other users by comparing rows of this matrix. The comparison could use Cosine similarity, which is the dot product of the two rows (for example: A·B = 4×2 + 5×0 + 5×0 + 1×3 … etc) divided by the product of their lengths, or Pearson similarity, which is the correlation coefficient between rows A and B.
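Both measures are short to write down. The sketch below makes two assumptions: missing ratings are stored as 0, and the two rows contain only the four terms that are spelled out in the dot product example, not the full rows of the matrix. The function names are our own.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero row has no direction to compare
    return dot / (norm_a * norm_b)


def pearson_similarity(a, b):
    """Correlation coefficient: cosine similarity of the mean-centered rows."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_similarity(centered_a, centered_b)


# Hypothetical partial rows for users A and B (0 = not rated).
row_a = [4, 5, 5, 1]
row_b = [2, 0, 0, 3]

print(round(cosine_similarity(row_a, row_b), 3))  # 0.373, i.e. 11 / (sqrt(67) * sqrt(13))
```

Pearson similarity subtracts each user's mean first, which compensates for users who rate everything high or everything low.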
An item based recommender computes the similarity between a given movie HP1 (for example) and other movies by comparing the columns of this matrix, in exactly the same way as the user based recommender does with rows. In either case, the output of this computation is a “Similarity Matrix” which looks like this:
Image courtesy: Raj Bandopadhyay of Damballa.
Now it is easy to pick out the top “n” most similar users to A (B and C) or the most similar item to HP1 (TW1). The similarity matrix for items is actually 9×9, but we are showing only a subset of it.
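Putting the pieces together, a similarity matrix and a top “n” lookup might be sketched as follows. The rating rows are hypothetical (0 means unrated) and were chosen so that B and C come out as the users most similar to A; `similarity_matrix` and `top_n` are illustrative names, not part of any library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def similarity_matrix(vectors):
    """Pairwise similarity between every pair of named vectors."""
    return {
        p: {q: cosine(vectors[p], vectors[q]) for q in vectors if q != p}
        for p in vectors
    }


def top_n(sim, name, n):
    """Names of the n entries most similar to name, best first."""
    return sorted(sim[name], key=sim[name].get, reverse=True)[:n]


# Hypothetical rows of a utility matrix: user -> ratings for four items.
rows = {
    "A": [4, 5, 0, 1],
    "B": [5, 4, 0, 0],
    "C": [2, 5, 1, 0],
    "D": [0, 0, 5, 4],
}

sim = similarity_matrix(rows)
print(top_n(sim, "A", 2))  # ['B', 'C']
```

Passing the columns of the utility matrix instead of the rows gives the item to item similarity matrix with the same two functions.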
Now that we have introduced the intuition behind collaborative filtering, in the next article we will dig into the implementation of such a system using a GUI based tool such as RapidMiner.
Originally posted on Wed, Aug 13, 2014 @ 08:15 AM