Monday, 13 February 2012

Recommendations 101: Using Pearson Correlation in real systems


Hundreds of algorithms have been used in the design of recommender systems. In fact, a good recommendation system typically features 30-80 separate algorithms, which are configured either manually in older-generation systems or using sophisticated machine learning in more advanced systems. Here I'm covering some of the more interesting algorithms that form the basis of most systems.

The Pearson correlation is a measure of the correlation (linear dependence) between two variables X and Y, giving a value between +1 and −1 inclusive. In a social network, we can apply it to preference data in the social graph (explicit signals such as Likes and Craves, or implicit ones such as purchases or just views) to see how a particular user's preferences relate to those of other users in their neighborhood with similar taste or interest. By collecting the preference data of the top-N nearest neighbors of a particular user, weighted by similarity, the user's preference can be predicted, with the Pearson correlation coefficient serving as the similarity measure. It's pretty basic stuff, but it gives you a first-order approximation of these relationships.

In simple terms, Pearson's correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

The above formula defines the population correlation coefficient, commonly represented by the Greek letter ρ (rho). You can learn more at http://hsc.uwe.ac.uk/dataanalysis/quantInfAssPear.asp and http://www.earthwatch.org/europe/downloads/Get_Involved/Pearson.pdf
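
To make the definition concrete, here is a minimal Python sketch of the population coefficient (covariance divided by the product of the standard deviations), followed by a simple top-N neighbor prediction weighted by similarity. The function names (`pearson`, `user_similarity`, `predict`), the sample `ratings` data, and the choices to ignore neighbors with non-positive correlation and to treat a zero-variance case as 0.0 are all my own assumptions for illustration, not how any particular production system does it.

```python
from math import sqrt


def pearson(x, y):
    """Population Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and variances, all divided by n (population form).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        # The coefficient is undefined here; returning 0.0 is an assumption.
        return 0.0
    return cov / sqrt(var_x * var_y)


def user_similarity(ratings, u, v):
    """Pearson correlation over the items both users have rated."""
    shared = sorted(set(ratings[u]) & set(ratings[v]))
    if len(shared) < 2:
        return 0.0
    return pearson([ratings[u][i] for i in shared],
                   [ratings[v][i] for i in shared])


def predict(ratings, user, item, top_n=2):
    """Similarity-weighted average of the top-N neighbors' ratings for item."""
    sims = [(user_similarity(ratings, user, other), other)
            for other in ratings
            if other != user and item in ratings[other]]
    # Keep only positively correlated neighbors (a simplifying assumption).
    neighbours = sorted((s for s in sims if s[0] > 0), reverse=True)[:top_n]
    if not neighbours:
        return None
    total = sum(sim for sim, _ in neighbours)
    return sum(sim * ratings[other][item] for sim, other in neighbours) / total


# Hypothetical preference data: user -> {item: rating}.
ratings = {
    "alice": {"a": 5, "b": 3, "c": 4},
    "bob":   {"a": 4, "b": 2, "c": 3, "d": 5},
    "carol": {"a": 1, "b": 5, "c": 2, "d": 1},
}

print(user_similarity(ratings, "alice", "bob"))    # strongly positive
print(user_similarity(ratings, "alice", "carol"))  # negative
print(predict(ratings, "alice", "d"))              # driven by bob's rating
```

In this toy data, bob's tastes track alice's closely while carol's run the other way, so the prediction for item "d" ends up driven by bob alone. A real system would also have to think about rating-scale differences between users and about how few shared items is too few to trust the correlation.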