So I am currently taking a Coursera course called Introduction to Recommender Systems. As the course is approaching completion, I want to write down a few important things before I forget them.
Basically I am quite familiar with most of the algorithms, including both content-based filtering and collaborative filtering, but I struggle a bit with evaluating recommender system algorithms: it is not easy to judge whether an algorithm is the best, since there are so many measures to choose from, and we also need to consider the business goal rather than just the accuracy of our model. So I took some notes while re-watching the course videos, because I am sure (or I hope) that I will need them in the future :P
Instead of writing about traditional machine learning evaluation metrics, I am focusing on things I didn't know before I took this course.
There are some fallacies of hidden-data (offline) evaluation we should pay attention to when developing recommender system algorithms:
First, with offline evaluation, what a "good" recommender is often really doing is recovering the ratings we already have in our dataset. In other words, it tends to penalize an algorithm that recommends something we never rated before, even if that recommendation is new and good. The result is that the recommender system keeps recommending things we have already watched or listened to. You can think of it as an overfitting issue. This is especially true for unary data, where, for example, we only know that a user clicked on some items; if the user didn't click on an item, we don't know whether they disliked it or simply never saw it.
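To make this concrete, here is a tiny Python sketch of hidden-data evaluation. The toy ratings, the random split, and the dummy predictor are all made up; the point is just that the test set can only contain items users already rated, so a genuinely new and good recommendation earns no credit.

```python
# A minimal sketch of hidden-data (offline) evaluation on made-up ratings.
# The data, the split ratio, and the predict() stub are all hypothetical.
import random
from math import sqrt

# (user, item) -> rating; tiny toy dataset
ratings = {("u1", "A"): 5.0, ("u1", "B"): 3.0, ("u1", "C"): 4.0,
           ("u2", "A"): 2.0, ("u2", "B"): 5.0, ("u2", "D"): 4.0}

# Hide 30% of the known ratings as a test set.
pairs = list(ratings)
random.seed(0)
random.shuffle(pairs)
cut = int(len(pairs) * 0.7)
train = {p: ratings[p] for p in pairs[:cut]}
test = {p: ratings[p] for p in pairs[cut:]}

def predict(user, item):
    """Stand-in predictor: just the global mean of the training ratings."""
    return sum(train.values()) / len(train)

# RMSE is computed ONLY over held-out (user, item) pairs that were rated.
# An algorithm that surfaces a great item the user never rated gets no reward,
# and may even look worse than one that just echoes familiar items.
rmse = sqrt(sum((predict(u, i) - r) ** 2 for (u, i), r in test.items()) / len(test))
print(f"RMSE on hidden ratings: {rmse:.3f}")
```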
Second, users usually only notice the items at the top of the list, so getting the top 10 right is more important than getting items 10-20 right. But when we tune our algorithms to improve a metric like RMSE, we may end up spending our effort on predicting exactly how bad the bad items are. It matters if we predict a 3.5 rating for a movie whose actual rating is 4.5; it matters much less if we predict 1.5 for a movie whose actual rating is 2.5.
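Here is a tiny made-up example of that point: two predictors with exactly the same RMSE, where only one of them messes up the movie the user would actually see at the top of the list.

```python
# A small, hypothetical example: RMSE charges both mistakes equally,
# but only the mistake on a good movie can change what we recommend.
from math import sqrt

actual = {"A": 4.5, "B": 4.0, "C": 2.5, "D": 2.0}

# Underrates the best movie by 1.0 star.
pred_top_error = {"A": 3.5, "B": 4.0, "C": 2.5, "D": 2.0}
# Underrates a mediocre movie by 1.0 star.
pred_bottom_error = {"A": 4.5, "B": 4.0, "C": 1.5, "D": 2.0}

def rmse(pred):
    return sqrt(sum((pred[i] - actual[i]) ** 2 for i in actual) / len(actual))

def top1(pred):
    return max(pred, key=pred.get)

print(rmse(pred_top_error), rmse(pred_bottom_error))   # identical: 0.5 and 0.5
print(top1(pred_top_error))     # "B" -- the wrong movie reaches the top slot
print(top1(pred_bottom_error))  # "A" -- the list the user sees is unchanged
```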
Third, sometimes the data we collect to evaluate the recommender system is skewed. For example, suppose we already have a recommender system in production and we want to improve it: the data we collect consists of ratings for items that system recommended to users. We can still improve the recommender with that data, but it is hard to discover anything new for the users.
The solution is quite straightforward: test with users! But it requires a live system. The approach is:
We first use the traditional evaluation metrics to rule out useless models and select some promising candidates, then evaluate those algorithms on the user base with ongoing, live data. We also need to consider other measures that are more closely related to business goals and user experience. After all, what a recommender system does is help people find new stuff they will like!
So what are the metrics that focus on user experience and business goal?
First, coverage. It is the percentage of products for which the recommender can make a prediction. We generally want it to be high.
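A minimal sketch of how I would compute it, with a made-up catalog and a toy item-mean predictor that can only score items that have at least one rating:

```python
# A minimal sketch of prediction coverage: the share of catalog items the
# recommender can score at all. The catalog, ratings, and predictor are toy data.
catalog = ["A", "B", "C", "D", "E", "F"]

# Toy item-mean predictor: it can only score items with at least one rating.
item_ratings = {"A": [4.0, 5.0], "B": [3.0], "C": [2.0, 4.0], "D": [5.0]}

def predict(item):
    """Return a predicted rating, or None when the item cannot be scored."""
    scores = item_ratings.get(item)
    return sum(scores) / len(scores) if scores else None

covered = [item for item in catalog if predict(item) is not None]
coverage = len(covered) / len(catalog)
print(f"prediction coverage: {coverage:.0%}")  # 4 of 6 items -> 67%
```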
Second, diversity. It measures how different the items in the top-n list are from each other. Research shows that people like recommendation lists with high diversity. A common way to diversify is to penalize or remove items from the top-n list that are too similar to items already recommended above them (while never touching #1). Alternative approaches bundle similar items through clustering or a scatter/gather interface. Note that diversification usually limits how many substitutions are made in the top-n list.
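Here is a rough sketch of that greedy re-ranking idea. The similarity function, the penalty weight, and the toy movies are all my own assumptions, not anything from the course:

```python
# Greedy diversification sketch: leave the #1 item untouched and penalize
# candidates that are too similar to items already picked.
def diversify(ranked, scores, similarity, penalty=1.0, n=5):
    """Re-rank by (score - penalty * max similarity to already-chosen items)."""
    chosen = [ranked[0]]                      # never touch the top item
    candidates = list(ranked[1:])
    while candidates and len(chosen) < n:
        def adjusted(item):
            return scores[item] - penalty * max(similarity(item, c) for c in chosen)
        best = max(candidates, key=adjusted)
        chosen.append(best)
        candidates.remove(best)
    return chosen

# Toy example: items tagged by genre; similarity = Jaccard overlap of tag sets.
tags = {"m1": {"sci-fi"}, "m2": {"sci-fi"}, "m3": {"comedy"},
        "m4": {"sci-fi", "comedy"}, "m5": {"drama"}}
scores = {"m1": 4.8, "m2": 4.7, "m3": 4.2, "m4": 4.1, "m5": 4.0}

def jaccard(a, b):
    return len(tags[a] & tags[b]) / len(tags[a] | tags[b])

ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)                                   # ['m1', 'm2', 'm3', 'm4', 'm5']
print(diversify(ranked, scores, jaccard, n=3))  # ['m1', 'm3', 'm5'] -- the
                                                # near-duplicate sci-fi pick is demoted
```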
Third, serendipity. It measures how likely the recommender is to surface new and unexpected items the user will love. It is more about discovery, so the recommender doesn't waste its slots on things people would buy anyway, with or without the recommendation. There are dedicated metrics for it, but the easiest approach is simply to downgrade highly popular items. The diversification approach mentioned above may also increase serendipity.
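A quick sketch of the popularity-downgrading idea; the log discount, the weight, and the toy numbers are just one plausible choice I made up:

```python
# Downgrade highly popular items before ranking (illustrative formula only).
from math import log

# Predicted ratings and how many users have already interacted with each item.
predicted = {"blockbuster": 4.6, "cult_classic": 4.4, "hidden_gem": 4.3}
popularity = {"blockbuster": 90_000, "cult_classic": 5_000, "hidden_gem": 300}

def serendipity_rerank(predicted, popularity, weight=0.3):
    """Discount each score by the log of its popularity before ranking."""
    adjusted = {item: score - weight * log(popularity[item] + 1)
                for item, score in predicted.items()}
    return sorted(adjusted, key=adjusted.get, reverse=True)

print(serendipity_rerank(predicted, popularity))
# ['hidden_gem', 'cult_classic', 'blockbuster'] -- the obvious hit drops down
```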
Last and most importantly, we need users to tell us whether the metrics we use to pick the best recommender are the right ones. Some useful business- and user-related metrics for measuring a recommender system's performance are net lift, user satisfaction, customer lifetime value, retention rate, referrals, change in purchases, immediate user feedback, and so on.
Some techniques used to do user-centered testing are:
1. Usage logs
2. Polls and surveys
3. Lab experiments
4. A/B testing
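For the last one, here is a minimal sketch of how an A/B test might split users between the current recommender and a candidate; the hash-based bucketing, the toy click logs, and the CTR comparison are my own assumptions, not anything prescribed by the course:

```python
# Deterministically bucket users into control/treatment and compare CTR.
import hashlib

def assign_group(user_id, experiment="rec_algo_v2"):
    """Deterministically bucket a user into 'control' or 'treatment'."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"

# Hypothetical click logs: (user_id, clicked_recommendation)
logs = [("u1", True), ("u2", False), ("u3", True), ("u4", True),
        ("u5", False), ("u6", True), ("u7", False), ("u8", True)]

stats = {"control": [0, 0], "treatment": [0, 0]}   # [clicks, impressions]
for user_id, clicked in logs:
    group = assign_group(user_id)
    stats[group][0] += int(clicked)
    stats[group][1] += 1

for group, (clicks, users) in stats.items():
    print(f"{group}: CTR = {clicks / users:.2f} over {users} users")
```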
One more thing: sometimes we can do a better job of evaluating (and thus improving) our recommender system by using temporal evaluation, which takes the timestamp attribute into consideration when evaluating.
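A minimal sketch of such a temporal split, with toy (user, item, rating, timestamp) tuples:

```python
# Temporal (time-aware) evaluation sketch: instead of hiding a random subset of
# ratings, train on everything before a cutoff timestamp and test on what came
# after. All tuples and the cutoff are made-up toy data.
ratings = [
    ("u1", "A", 4.0, 100), ("u1", "B", 3.0, 180), ("u1", "C", 5.0, 260),
    ("u2", "A", 2.0, 120), ("u2", "D", 4.0, 210), ("u2", "B", 5.0, 300),
]

cutoff = 200  # train on the past, evaluate on the "future"
train = [r for r in ratings if r[3] <= cutoff]
test = [r for r in ratings if r[3] > cutoff]

print(len(train), "training ratings,", len(test), "test ratings")
# The model now has to predict ratings it could not have seen at training time,
# which is much closer to how the recommender is actually used in production.
```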