Recommender systems

yueyuan
Oct 28, 2020

Preference model:

Recommenders mine what users say and what they do to learn preferences

(1) Explicit:

rating:

continuous scales, pairwise preferences, hybrid (e.g. 1–100 plus "never again"), temporary (e.g. Pandora's 30-day suspend)

When ratings are provided: consumption (during or immediately after experiencing the item); review (some time after the experience); expectation (before the item has been experienced).

Difficulties with ratings: Are ratings reliable and accurate? Do user preferences change over time while the stored data is not updated? What does a rating mean? Different users rate with different behaviors.

vote

(2) Implicit:

Data collected from user actions. The key difference: the user performs the action for some other purpose, yet those actions say a lot!

How long did the user read? Track listening: listening to a song all the way through, skipping a song, starting to watch and not finishing.

Click on a link (an ad, or the 3rd search result instead of the 1st)

Don't click on a link

Purchase, follow

Ratings

  • it tends to be plentiful
  • implicit purchase ratings can be higher quality than explicit ratings
  • just model a click/purchase/whatever as a positive rating of some arbitrary (yet consistent) value
  • do not model the absence of a click/purchase as a negative rating: it's just missing data
  • Popularity is behavior-based data

Difficulties: What does the action mean? A purchase might still mean the user hates the item; not clicking might mean they expected it to be bad, or simply didn't see it. How do you scale and represent actions, and how do you combine multiple types of actions? It is possible to look at time spent browsing, clicks, purchases, review text, forwards, and lots of other things all together; the trick is putting those into a formula that provides a combined measure of preference.
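As a rough sketch of that last point: the signal names and weights below are invented for illustration, not taken from any particular system.

```python
import math

# Hypothetical example: fold several implicit signals into one preference
# score. Signal names and weights are illustrative only.
def implicit_preference(events):
    """events: dict of raw counts/values observed for one (user, item) pair."""
    score = 0.0
    score += 3.0 * events.get("purchases", 0)                      # strongest signal
    score += 1.0 * events.get("clicks", 0)                         # weaker signal
    score += 2.0 * events.get("forwards", 0)                       # sharing implies interest
    score += 0.5 * math.log1p(events.get("seconds_browsing", 0))   # damp long dwell times
    return score

print(implicit_preference({"clicks": 4, "seconds_browsing": 180, "purchases": 1}))
```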

Advantage: greater volume

Predictions

estimates of how much you’ll like an item

  • often scaled to match some rating scale
  • often tied to search or browsing for specific products
  • Pro: helps quantify items
  • Con: provides something falsifiable

Predictions estimate a person’s rating (or liking/consumption) of an item; recommendations select items for the person from a larger set.

Analytical Framework

Dimensions of Analysis

Domain/Purpose/Recommendation Context/Whose Opinions/Personalization Level/Privacy and Trustworthiness/Interfaces/Recommendation Algorithms

Recommendation Algorithms

Basic Model

  1. Users: user attributes (demographics)
  2. Items: item attributes (properties, genres, etc.)
  3. Ratings
  4. (Community)

Non-Personalized Summary Statistics:

  • External Community Data: best-seller, most popular; trending hot
  • Summary of Community ratings: best-liked

Content-based filtering:

  1. Information filtering: user ratings × item attributes => model; the model is then applied to new items via their attributes
  2. Knowledge-based: item attributes form a model of the item space; users navigate/browse that space
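A minimal sketch of the information-filtering idea, assuming items are described by simple binary attribute vectors; the attributes and ratings below are invented for illustration.

```python
import numpy as np

# Items described by binary attribute vectors (columns could be genres, etc.).
item_attrs = np.array([
    [1, 0, 1],   # item 0
    [0, 1, 1],   # item 1
    [1, 1, 0],   # item 2
])

# Mean-centered ratings the user gave to items 0 and 1.
rated_items = [0, 1]
centered_ratings = np.array([1.5, -1.0])

# Build a user profile: rating-weighted sum of the rated items' attributes.
profile = centered_ratings @ item_attrs[rated_items]

# Score an unseen item by how well its attributes match the profile.
new_item = item_attrs[2]
score = profile @ new_item / (np.linalg.norm(profile) * np.linalg.norm(new_item))
print(score)  # higher means a better content match
```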

Collaborative filtering

  1. User-user

User-based nearest-neighbor collaborative filtering:

Assumption: Our past agreement predicts our future agreement.

To predict how much a target user will like a target item, the system first builds a neighborhood of other users who have agreed with that user on other items in the past. It then computes a weighted, normalized average of what those neighbors rated the target item.

A big problem: two users may be similar in some domains but dissimilar in others, so a neighborhood built this way will inevitably contain noise.

In theory, the more neighbors the better, if you have a good similarity metric.

In practice, we don't want too many neighbors: low-similarity neighbors may add more noise than signal.

Normalization: subtract the user's mean rating, convert to a z-score, or subtract the item or item-user mean. The normalization must be reversed after computing the prediction.

Similarities: Pearson correlation, weighted similarity

Good baseline configuration: 1) top-N neighbors (~30); 2) weighted averaging; 3) user-mean or z-score normalization; 4) cosine similarity over normalized ratings.
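A minimal sketch of that baseline configuration (user-mean normalization, cosine similarity over the normalized ratings, top-N weighted average); the toy rating matrix is invented and NaN marks unrated items.

```python
import numpy as np

def predict_user_user(R, user, item, n_neighbors=30):
    """R: users x items rating matrix with np.nan for missing ratings."""
    means = np.nanmean(R, axis=1)
    C = R - means[:, None]                       # user-mean normalization
    C0 = np.nan_to_num(C)                        # treat missing as 0 for similarity
    norms = np.linalg.norm(C0, axis=1)
    sims = C0 @ C0[user] / (norms * norms[user] + 1e-9)   # cosine similarities
    sims[user] = 0.0

    rated = ~np.isnan(R[:, item])                # neighbors must have rated the item
    candidates = np.where(rated)[0]
    top = candidates[np.argsort(-sims[candidates])][:n_neighbors]
    if len(top) == 0 or np.abs(sims[top]).sum() == 0:
        return means[user]                       # fall back to the user's mean
    # Weighted average of neighbors' normalized ratings, then un-normalize.
    return means[user] + sims[top] @ C[top, item] / np.abs(sims[top]).sum()

R = np.array([[4, np.nan, 3, 5],
              [3, 4, np.nan, 4],
              [5, 5, 4, np.nan],
              [np.nan, 3, 4, 4.]])
print(predict_user_user(R, user=0, item=1))
```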

Cons: issues of sparsity. With large item sets and small numbers of ratings, too often there are points where no recommendation can be made: with few users and many items, a new user may have no rating overlap with anyone else, since with such a broad selection everyone may buy very different things.

Normalization accounts for the fact that different users will use different parts of the rating scale.

2. Item-item

Computational performance: with millions of users, computing all-pairs correlations is expensive; even incremental approaches were expensive; and user profiles change quickly, so predictions needed to be computed in real time to keep users happy.

item-item similarity is fairly stable

the benefits of item-item depend on having more users than items.

In general, item-item is a pretty good algorithm with better performance and stability than user-user. Under which circumstances is it likely to fail to outperform user-user? When there are few customers and many products: an application with a relatively small number of customers and many more products. E.g., a fixed customer base of 50,000 people but millions of products being sold to them.

Pros: it actually works quite well: good prediction accuracy; good performance on top-N recommendations; efficient implementation, at least in cases where |U| >> |I|; benefits of precomputability; broad applicability and flexibility (as easy to apply to a shopping cart as to a user profile).

Many domains have fewer items than users, so fewer potential neighbors need to be searched.

Cons/assumptions: item-item relationships need to be stable. Many of the issues are general temporal ones: calendar effects and short-lived items (a candidate's biography, Christmas trees, and other briefly popular products); lower serendipity.

Cosine Similarity:

Pros: 1) When we treat missing values as 0, it introduces a useful self-damping effect when items have few users in common (the vectors are mostly empty) but many ratings. 2) On mean-centered data it is equivalent to the Pearson correlation, so it is statistically meaningful.

Picking Neighbors

Neighbors are usually the k most similar items.

A good value of k is important:

k too small → inaccurate scores;

k too large → too much noise;

k = 20 often works well.

Building model:

Pre-compute similarities for all pairs of items; item stability makes similarity pre-computation feasible. Naively this is O(I²); since similarity is symmetric, only one direction needs to be computed.
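A minimal sketch of the pre-computation, assuming missing ratings are stored as NaN and ratings are mean-centered per item before taking cosines; a sparse implementation would exploit the symmetry explicitly, while the dense matrix product below simply produces both directions at once.

```python
import numpy as np

def item_item_similarities(R):
    """R: users x items matrix, np.nan for missing ratings.
    Returns an items x items cosine-similarity matrix computed on
    item-mean-centered ratings (missing values treated as 0)."""
    item_means = np.nanmean(R, axis=0)
    C = np.nan_to_num(R - item_means)            # center per item, fill gaps with 0
    norms = np.linalg.norm(C, axis=0) + 1e-9
    S = (C.T @ C) / np.outer(norms, norms)       # all item pairs in one product
    np.fill_diagonal(S, 0.0)                     # an item is not its own neighbor
    return S

R = np.array([[4, np.nan, 3, 5],
              [3, 4, np.nan, 4],
              [5, 5, 4, np.nan],
              [np.nan, 3, 4, 4.]])
print(np.round(item_item_similarities(R), 2))
```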

Item-item on unary data (implicit feedback)

clicks, plays, purchases

Data representation: 1) rating values: a user-item rating matrix; 2) a logical 1/0 user-item 'purchase' matrix, or a purchase-count matrix; 3) what does 0 mean? A user not clicking on an item does not mean they dislike it; they may simply never have seen it. We just ignore that for item-item.

Data normalization: standard mean-centering is not meaningful here, but we can normalize user vectors to unit vectors. Intuition: users who like many items provide less information about any particular pair. We could also consider taking the logarithm of counts to decrease the impact of large values.

Aggregating Scores:

For binary (0/1) data, just sum the neighbors' similarities. A fixed neighborhood size means the score is bounded; neighborhood selection is unchanged (most similar).

Computing Similarities:

cosine similarity: w_ij = cos(R̂_i, R̂_j)

conditional probability: w_ij = P(R_i | R_j)
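A minimal sketch of item-item scoring on unary data under the recipe above: normalize user vectors to unit length, take cosine similarities between item columns, then score a candidate item by summing its similarities to the items the user already has. The purchase matrix and function names are invented.

```python
import numpy as np

# Unary user-item 'purchase' matrix (1 = purchased/clicked, 0 = unknown).
P = np.array([[1, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 1, 1],
              [1, 1, 0, 0.]])

# Normalize each user's row to a unit vector: heavy users count less per item.
U = P / (np.linalg.norm(P, axis=1, keepdims=True) + 1e-9)

# Item-item cosine similarities on the normalized matrix.
norms = np.linalg.norm(U, axis=0) + 1e-9
S = (U.T @ U) / np.outer(norms, norms)
np.fill_diagonal(S, 0.0)

def score(user, item, k=20):
    owned = np.where(P[user] > 0)[0]                 # items the user already has
    neighbors = owned[np.argsort(-S[item, owned])][:k]
    return S[item, neighbors].sum()                  # unary case: just sum similarities

print(score(user=1, item=3))   # how strongly item 3 is recommended to user 1
```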

3. Dimensionality Reduction

Others

  1. Critique/interview-based recommendations

Context-aware filtering

What do users buy in the morning, at noon, or at night?

What do they buy after buying something else?

Time and location

Hybrid Techniques

Recommendation Systems

Recommend — Things/Content/Music/Dating/Search (Web Pages)

Not based only on user behavior

Explicit feedback — rating/thumb down/thumb up/star/review

  1. extra work from user
  2. sparse data
  3. cultural differences in rating

Implicit feedback — things you click on/you purchase/you consume

  1. fraud
  2. click: an indication of interest

purchase: a stronger indication of interest

video viewing

Top N Recommender

individual interests + item similarities → candidate generation

→ candidate ranking

→ filtering (remove items already seen, offensive items, items below some minimum quality score or rating threshold)
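A minimal sketch of that pipeline, assuming an item-item similarity matrix like the ones above and a per-item quality score; the filtering rules are simplified stand-ins.

```python
import numpy as np

def top_n(user_items, S, quality, n=10, min_quality=3.0):
    """user_items: indices of items the user has interacted with.
    S: item-item similarity matrix; quality: per-item average rating."""
    # Candidate generation: items similar to what the user already likes.
    scores = S[:, user_items].sum(axis=1)
    # Filtering: drop items already seen and items below the quality threshold.
    scores[user_items] = -np.inf
    scores[quality < min_quality] = -np.inf
    # Candidate ranking: highest combined score first.
    ranked = np.argsort(-scores)
    return [i for i in ranked[:n] if scores[i] > -np.inf]

S = np.array([[0, .9, .2, .1],
              [.9, 0, .3, .4],
              [.2, .3, 0, .8],
              [.1, .4, .8, 0.]])
quality = np.array([4.2, 3.9, 2.5, 4.0])
print(top_n(user_items=np.array([0]), S=S, quality=quality, n=2))
```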

Measure Accuracy:

train/test split of the full data set (movie ratings, etc.)

training set → machine learning → predictions

test set → measure accuracy

A recommender system is a machine learning system. You train it using prior user behavior and then use it to make predictions about items new users might like.

K-fold cross validation:

full data set (movie ratings etc)

fold 1 → ML → measure accuracy

fold 2 → ML → measure accuracy …

fold k-1 → ML → measure accuracy

(take the average of the above)

test set

Mean absolute error (MAE)

root mean square error (RMSE): penalizes you more when you are way off; it inflates the penalty for larger errors
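A minimal sketch tying the two ideas together: a k-fold split over held-out ratings, computing MAE and RMSE per fold and averaging. The predict function is a placeholder for whatever recommender is being evaluated, and the ratings are randomly generated.

```python
import numpy as np

rng = np.random.default_rng(0)
true_ratings = rng.uniform(1, 5, size=200)          # stand-in for real held-out ratings

def predict(idx):
    """Placeholder recommender: real code would train on the remaining folds."""
    return np.full(len(idx), true_ratings.mean())

# k-fold split: shuffle the indices, cut them into k roughly equal folds.
k = 5
folds = np.array_split(rng.permutation(len(true_ratings)), k)

maes, rmses = [], []
for test_idx in folds:
    err = predict(test_idx) - true_ratings[test_idx]
    maes.append(np.abs(err).mean())                 # MAE: average absolute error
    rmses.append(np.sqrt((err ** 2).mean()))        # RMSE: squaring inflates big errors

print(f"MAE = {np.mean(maes):.3f}, RMSE = {np.mean(rmses):.3f}")
```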

Ideally, you want to measure how real people react to recommendations for things they’ve never seen before, and you just can’t do that offline.

Evaluate top-n recommenders

“hit rate”: hits/users

How to measure “hit rate”:

leave-one-out cross-validation → only works well with a large data set

average reciprocal hit rank (ARHR): sum_i(1 / rank_i) / users

penalizes good recommendations that appear too low in the list, where the user has to work to find them
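A minimal sketch of leave-one-out hit rate and ARHR, assuming we already know, for each user, the rank at which their left-out item appeared in the top-N list (None if it did not appear); the ranks below are invented.

```python
# Rank of each user's left-out item in their top-10 list; None = not in the list.
leave_one_out_ranks = [1, None, 3, None, 7, 2, None, 10]
n_users = len(leave_one_out_ranks)

hits = sum(1 for r in leave_one_out_ranks if r is not None)
hit_rate = hits / n_users                              # hits / users

# ARHR: the reciprocal rank rewards hits near the top of the list.
arhr = sum(1.0 / r for r in leave_one_out_ranks if r is not None) / n_users

print(f"hit rate = {hit_rate:.3f}, ARHR = {arhr:.3f}")
```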

“cumulative hit rate” (CHR)

rating hit rate (rHR)

coverage: % of <user, item> pairs that can be predicted

diversity: (1 - S), where S is the average pairwise similarity between recommendation pairs

novelty: mean popularity rank of recommended items
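A minimal sketch of coverage (here, the share of users for whom a list could be produced), diversity, and novelty, assuming per-user top-N lists, an item-item similarity matrix, and a popularity ranking (1 = most popular); all numbers are invented.

```python
import numpy as np
from itertools import combinations

S = np.array([[0, .9, .2, .1],
              [.9, 0, .3, .4],
              [.2, .3, 0, .8],
              [.1, .4, .8, 0.]])              # item-item similarities (invented)
popularity_rank = {0: 1, 1: 2, 2: 3, 3: 4}    # 1 = most popular item

top_n_lists = {"alice": [0, 2], "bob": [1, 3], "carol": []}   # carol: nothing predictable

# Coverage: fraction of users we could produce recommendations for.
coverage = sum(1 for recs in top_n_lists.values() if recs) / len(top_n_lists)

# Diversity: 1 - average pairwise similarity within each user's list.
pair_sims = [S[i, j] for recs in top_n_lists.values()
             for i, j in combinations(recs, 2)]
diversity = 1 - np.mean(pair_sims)

# Novelty: mean popularity rank of all recommended items (higher = more obscure).
all_recs = [i for recs in top_n_lists.values() for i in recs]
novelty = np.mean([popularity_rank[i] for i in all_recs])

print(f"coverage={coverage:.2f}, diversity={diversity:.2f}, novelty={novelty:.2f}")
```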

  • → strike a balance between familiar, popular items (which build trust) and novel items

Churn: How often does recommendation for a user change?

  • → measures the sensitivity of your recommender system to new user behavior

If a user rates a new movie, does that change their recommendations substantially?

If so, then your churn score will be high. Maybe just showing the same recommendations to the same user too many times is a bad idea.

If you keep showing a user the same recommendations and they don't click them, you should try something else.

responsiveness: how quickly does new user behavior influence your recommendations?

online A/B testing

perceived quality.

RMSE: root mean squared error, lower values mean better accuracy

MAE: mean absolute error, lower values mean better accuracy

HR: hit rate, how often we are able to recommend a left-out rating. Higher is better.

cHR: cumulative hit rate, confined to ratings above a certain threshold. Higher is better.

ARHR: average reciprocal hit rank, a hit rate that takes the ranking into account. Higher is better.

Coverage: ratio of users for whom recommendations above a certain threshold exist. Higher is better.

Diversity: 1-s, where s is the average similarity score between every possible pair of recommendations for a given user. Higher means more diverse.

Novelty: average popularity rank of recommended items. Higher means more novel.
