Recommender System Evaluation

Paper (to be written collaboratively using zwiki?)

  • Deadline: ?


  • To evaluate recommender systems (RSs), various metrics have been used. We have reviewed them and think that none of them represents user satisfaction; the only possible way to evaluate user satisfaction is with an online RS. The approach that requires the fewest assumptions is a "paired test" of two different systems (with the same interface and interaction mode), in which the user receives a combination of the two lists recommended by the two systems; assuming that a user's click on an item means "I like this item the most", this yields a relative measure of user satisfaction with respect to the two systems.
  • bla bla bla
  • Here we can make a review of the measures most used in the literature (something very similar to the list Paolo Massa provided in the presentation given at Trinity some days ago).
  • What we want to say here is that, essentially, standard metrics do not represent user satisfaction; for this reason it is not worth trying to increase the accuracy (measured with one of the standard metrics) from, say, 0.85 to 0.90, because such an improvement is probably not perceived by the user, whose satisfaction is not represented by these metrics (see the MAE sketch just below).
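  • The MAE sketch (in Python): a minimal illustration of what such a standard metric computes; the ratings and predictions below are made-up numbers, not data from any real system.

    def mean_absolute_error(predicted, actual):
        # MAE: average of |predicted rating - actual rating| over a test set
        assert len(predicted) == len(actual)
        return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

    # two hypothetical systems evaluated on the same made-up test set;
    # the difference between their MAEs is about 0.05, the kind of
    # "improvement" a user would probably never perceive
    actual      = [4, 3, 5, 2, 4, 1, 5, 3]
    predicted_a = [3.2, 3.4, 4.1, 2.9, 3.5, 1.8, 4.2, 3.6]
    predicted_b = [3.3, 3.4, 4.2, 2.9, 3.6, 1.8, 4.3, 3.6]
    print(mean_absolute_error(predicted_a, actual))
    print(mean_absolute_error(predicted_b, actual))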

  • Moreover, these metrics cannot easily be applied to compare systems that use different prediction methods (for example, content-based versus collaborative filtering), or to evaluate systems that do not explicitly predict the user's degree of liking at all (such as content-based systems!).
  • Online paired test of two different algorithms for the same RS:
  • Given that it is not possible to compute a metric that "represents" user satisfaction, the only way is to evaluate a real RS online through the interactions of real users; in this case, one evaluation problem is that the interface and the ease of use of an RS strongly influence the user's perceived goodness of the RS.
  • How is it possible to cope with that?
  • What is possible is only to compare two systems under the same conditions (interface, interaction mode) and say which one is better (a relative measure, not an absolute one!). Moreover, you cannot say why one algorithm is better than the other; if you want, for example, to invent a new one, you have to formulate some hypotheses and then verify whether the new algorithm is better than the previous best one!
  • Another advantage: because the two systems are compared at the same time, the evaluation does not suffer from changes in the user community or in the operating conditions (such as the number of items to recommend, the number of new items, ...).
  • EXAMPLE: it is not possible to say that "google" works well just by looking at the clicks made on the list it returns, but if you present the results of both "google" and "altavista", then a user's click tells you whether the user chose an item recommended by google or by altavista, and hence which of the two engines is better. In this way the two compared systems are in the same conditions (same interface), so the judgement is not influenced by the interface (google's is poorer, while altavista's has ..., etc.), and it is possible to compare the two systems only on the returned lists, and therefore only on the underlying algorithms! (A sketch of such an interleaved comparison follows below.)
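  • A sketch (in Python) of such an interleaved comparison; the item names and clicks are invented, and crediting items returned by both systems to neither of them is just one possible design choice.

    from itertools import zip_longest

    def interleave(list_a, list_b):
        # merge the two ranked lists by alternating items, remembering which
        # system contributed each one; items returned by both systems are
        # credited to neither, since a click on them says nothing about
        # which system is better
        seen, merged = set(), []
        both = set(list_a) & set(list_b)
        for a, b in zip_longest(list_a, list_b):
            for item, source in ((a, "A"), (b, "B")):
                if item is None or item in seen:
                    continue
                seen.add(item)
                merged.append((item, None if item in both else source))
        return merged

    def score_clicks(merged, clicked_items):
        # count clicks credited to each system: a relative (not absolute!) measure
        source_of = dict(merged)
        wins = {"A": 0, "B": 0}
        for item in clicked_items:
            if source_of.get(item) is not None:
                wins[source_of[item]] += 1
        return wins

    # hypothetical results of two engines for the same query
    google_results    = ["page1", "page2", "page3", "page4"]
    altavista_results = ["page9", "page2", "page7", "page5"]
    merged = interleave(google_results, altavista_results)
    print(score_clicks(merged, clicked_items=["page1", "page7"]))  # {'A': 1, 'B': 1}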

  • PROBLEM WITH THE ASSUMPTIONS: the assumptions made by this way of evaluating are the minimum possible. Essentially, a user is presented with a combination of the two lists returned by two different RSs (under the same conditions) and is asked to choose the one he prefers, so the very basic assumption is: "a user clicks on the item he likes the most". Problem: this assumption requires that the user be able to evaluate the goodness of an item simply by looking at the list! This is perhaps possible for a list of sites returned by google, but it is surely not possible, for example, for cocoa, which returns a list of compilations: the user has to browse the compilations to decide whether he likes them, so a click on an item of the returned list cannot be taken as a statement of the user's evaluation of that item! In cocoa, for example, you can measure how many tracks the user takes from the recommended compilations and puts into his partial one, but then you are adding a kind of measure (and some assumptions about user satisfaction), and you are back where you started!

Similarly, in a musical RS, if I get some songs I don't know, I probably have to listen to them before I can judge, so my click does not mean "I like this song" but only "I want to listen to it". Maybe we can assume that if the user listens to the song he likes it, and if he skips it he dislikes it, but then again we are making assumptions about user satisfaction (see the listen/skip sketch below)!
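  • The listen/skip sketch (in Python): how that extra assumption could be operationalized; the event format and the 20%/80% thresholds are arbitrary choices of mine, not part of the proposed methodology.

    def infer_preference(listened_seconds, track_seconds, skipped):
        # map a listening event to a like/dislike/unknown judgement
        if skipped and listened_seconds < 0.2 * track_seconds:
            return "dislike"   # skipped almost immediately
        if listened_seconds >= 0.8 * track_seconds:
            return "like"      # listened to (almost) the whole song
        return "unknown"       # anything in between tells us little

    # hypothetical listening log: (listened_seconds, track_seconds, skipped)
    events = [(180, 200, False), (10, 240, True), (120, 230, True)]
    print([infer_preference(*e) for e in events])  # ['like', 'dislike', 'unknown']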

  • PROBLEM: how can the "paired test evaluation methodology" be applied, for example, to movielens (a classical CF system)? Does it still work if the RS is invoked not with a query but with the whole user profile, as in CF? That is, if the system always returns the same list, you can compare the two systems only once, and a single comparison is not significant!

  • We should probably run some very simple experiments to give at least a little scientific support to our hypotheses.
  • Possible experiments are to demonstrate that:
  • MAE (mean absolute error) does not correspond to user satisfaction (see the MAE sketch above)
  • user stability is limited (as Hill et al. did, showing that a user correlates with himself at only 0.83 after 6 weeks!)
  • we can ask users to re-rate the same items (a sketch of such a test-retest correlation is at the end of these notes)
  • OPEN ISSUES WITH THE PROPOSED METHOD: we want to say that this evaluation methodology is viable for every kind of RS (but we have to think about whether this is true!):
  • Collaborative Filtering and Content-based
  • Query-driven (google) and profile-driven (amazon)
  • returning the best items (movielens: a movie with a predicted rating), returning the best sets of items (cocoa), or returning ordered lists of items (a radio)
  • Very similar to our idea: Swearingen and Sinha propose a Turing Test for music recommender systems (comparing systems, friends, and experts, and anonymizing the source of the recommendation).
  • Possible ways to present to the user the two lists returned by two different systems (we have to think about the pros and cons!):
  1. merge the two lists into one list (alternating items)
  2. present the two lists on the same page but clearly separated (alternating the position of the two lists)
  3. present one list (A) and a link to the other one (B), and next time vice versa
  4. present one list (A) and next time the other list (B) [alternating the two systems in recommending]?
  • For Swearingen and Sinha: "What convinces a user to sample a recommendation?"
  1. Judging recommendations: What is a good recommendation from the user’s perspective?
  2. Trust in a Recommender System: What factors lead to trust in a system?
  3. System Transparency: Do users need to know why an item was recommended?
  • In our "paired test", trust in recommender system is an invariant? System transparency is an invariant? what about diversity?
  • For trust in an RS, the balance between the following is very important:
  1. useful recommendations (not-yet-tried items)
  2. trust-generating recommendations (already-tried items)
  • What if system A returns more "already tried" items and system B returns more "not yet tried" items?
  • Positive experiences with recommended items lead to trust in the system
  • Negative experiences with recommended items lead to mistrust of the system
  • For this reason, I think it is important that the returned lists are kept separated, and that the user is aware of the system logic (i.e., the fact that there are two lists returned by two different algorithms)!
  • For Swearingen and Sinha, systems need to provide a mix of different kinds of items to cater to different users:
  • Trust Generating Items: A few very popular ones, which the system has high confidence in
  • Unexpected Items: Some unexpected items, whose purpose is to allow users to broaden horizons.
  • Transparent Items: At least some items for which the user can see the clear link between the items he/she rated and the recommendation.
  • New Items: Some items which are new.
  • Swearingen and Sinha ask themselves: should these be presented as a sorted list, an unsorted list, or different categories of recommendations?
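  • For the re-rating experiment mentioned above, a sketch (in Python) of the test-retest correlation; the two passes of ratings are invented and are not Hill's data.

    from math import sqrt

    def pearson(xs, ys):
        # Pearson correlation between two equally long rating lists
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    # hypothetical case: the same user rating the same ten items twice, some
    # weeks apart; a self-correlation well below 1.0 puts a ceiling on how
    # accurately any system can meaningfully predict the first-pass ratings
    first_pass  = [5, 3, 4, 2, 5, 1, 3, 4, 2, 4]
    second_pass = [4, 3, 5, 2, 4, 2, 3, 4, 1, 5]
    print(pearson(first_pass, second_pass))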