The Predictive Analytics Blog

Monday, 13 February 2012

Recommendations 101: Using Pearson Correlation in real systems

Hundreds of algorithms have been used in the design of recommender systems. In fact, a good recommendation system typically features 30-80 separate algorithms, which are configured either manually in older-generation systems or by machine learning in more advanced ones. Here I'm covering some of the more interesting algos that form the basis of most systems.

The Pearson correlation is a measure of the linear dependence between two variables X and Y, giving a value between +1 and −1 inclusive. In a social network, we can apply it to data in the social graph, such as explicit preference data (Likes, Craves, purchases or just views), to measure how a particular user's preferences relate to those of other users in their neighborhood with similar tastes or interests. By collecting the preference data of the top-N nearest neighbors of a particular user, weighted by similarity, the user's preference can be predicted, with the Pearson correlation coefficient serving as the similarity measure. It's pretty basic stuff, but it gives you a first-order approximation of these relationships.

In simple terms, Pearson's correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations:
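Written in symbols, the definition in words (covariance over the product of the standard deviations) reads as follows. The notation is an assumption on my part: μ and σ stand for the mean and standard deviation of each variable, and E for the expected value.

```latex
\[
\rho_{X,Y}
  = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
  = \frac{E\big[(X - \mu_X)(Y - \mu_Y)\big]}{\sigma_X \, \sigma_Y}
\]
```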

The above formula defines the population correlation coefficient, commonly represented by the Greek letter ρ (rho). You can learn more at http://hsc.uwe.ac.uk/dataanalysis/quantInfAssPear.asp and http://www.earthwatch.org/europe/downloads/Get_Involved/Pearson.pdf
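To make the neighborhood idea concrete, here is a minimal Python sketch, not a production implementation. It computes the Pearson coefficient between two users over the items they have both rated, then predicts a rating as the similarity-weighted average of the top-N positively correlated neighbors. The function names, the `ratings` dictionary and its values are all made up for illustration, and dropping negatively correlated neighbors is just one simple choice among several.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant rater is treated as uncorrelated
    # The 1/n factors of covariance and standard deviations cancel out.
    return cov / sqrt(var_x * var_y)

def user_similarity(ratings, u, v):
    """Pearson correlation over the items both users have rated."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if len(common) < 2:
        return 0.0
    return pearson([ratings[u][i] for i in common],
                   [ratings[v][i] for i in common])

def predict(ratings, user, item, top_n=2):
    """Similarity-weighted average of the top-N neighbors' ratings."""
    scored = [(user_similarity(ratings, user, other), other)
              for other in ratings
              if other != user and item in ratings[other]]
    # Keep positively correlated neighbors only, most similar first.
    neighbors = sorted((s, o) for s, o in scored if s > 0)
    neighbors = neighbors[::-1][:top_n]
    total = sum(s for s, _ in neighbors)
    if total == 0:
        return None  # no usable neighbors for this item
    return sum(s * ratings[o][item] for s, o in neighbors) / total

# Hypothetical explicit preference data: user -> {item: rating}.
ratings = {
    "ann": {"a": 5, "b": 3, "c": 4},
    "bob": {"a": 4, "b": 2, "c": 3, "d": 5},
    "cat": {"a": 1, "b": 5, "c": 2, "d": 1},
    "dan": {"a": 5, "b": 3, "c": 4, "d": 4},
}
print(predict(ratings, "ann", "d"))
```

In this toy data "bob" and "dan" rate the shared items in the same pattern as "ann", while "cat" rates them in the opposite pattern, so only the first two contribute to the prediction for item "d".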

Tuesday, 27 December 2011

How do current recommendation systems work?

In online recommendation systems, the simplest solutions are typically the most effective. Probably the simplest of all, and the one that generates significant lift for most providers of online recommendation technology, is the k-nearest neighbor approach.

K-nearest neighbor classification

One of the most commonly used algorithms in recommender systems is the k-nearest neighbor (k-NN) approach. The k-NN algorithm is a method for classifying an object based on the properties of its closest neighbors in the feature space. In k-NN, an object is classified by a majority vote of its neighbors: it is assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its nearest neighbor.
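The majority vote described above can be sketched in a few lines of Python. This is an illustrative version under some assumptions of mine: Euclidean distance as the notion of "closest", a brute-force scan of the training set, and invented feature vectors and labels. The name `knn_classify` is hypothetical, and `math.dist` needs Python 3.8 or later.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors.

    `train` is a list of (feature_vector, label) pairs.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    # Brute force: sort every training point by distance to the query.
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    # On a tied vote, the label whose first member is closest wins.
    return votes.most_common(1)[0][0]

# Made-up two-feature items with a genre label each.
train = [
    ((1.0, 1.0), "rock"),
    ((1.2, 0.8), "rock"),
    ((5.0, 5.0), "jazz"),
    ((5.2, 4.9), "jazz"),
    ((4.8, 5.1), "jazz"),
]
print(knn_classify(train, (1.1, 1.0), k=3))
print(knn_classify(train, (5.0, 5.0), k=1))
```

With k = 1 the function reduces to copying the label of the single closest item, as noted above. Sorting the whole training set is fine for a demo, but a real system would normally use an index structure or an approximate search instead.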

Predictive analytics forms the basis for a number of engineering-based commercial systems. In particular, I've spent a number of years developing and refining recommender systems (or recommendation systems), which are a subclass of information filtering systems that seek to predict a user's purchasing preferences for retail goods such as apparel, music, books, or movies. To be commercially effective, the systems that I've developed have to anticipate that which the prospective purchaser has not yet considered. Typically I've done this by using a model built from the characteristics of an item (content-based approaches) or the user's social environment (collaborative filtering approaches).

Recommender systems have become vital to online commerce in recent years. For example, when a shopper views a product on Amazon.com, the store recommends additional items based on a matrix of what other shoppers bought along with the currently selected item. Pandora Radio takes an initial input of a song or musician and plays music with similar characteristics (based on a series of keywords attributed to the inputted artist or piece of music). The stations created by Pandora can be refined through user feedback (emphasizing or deemphasizing certain characteristics). Netflix, the leading US provider of on-demand internet streaming media, uses predictive analytics to anticipate which movies a user might like to watch based on the user's previous ratings and watching habits (as compared to the behavior of other users), also taking into account the characteristics of the film (such as its genre).
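As a rough illustration of the "what other shoppers bought" matrix idea, the Python sketch below counts how often pairs of items appear in the same basket and returns the most frequent companions of a given item. It is a toy model under my own assumptions, not a description of how Amazon or any other store actually does it; the function names and basket data are invented.

```python
from collections import Counter, defaultdict
from itertools import permutations

def build_copurchase(baskets):
    """Count, for each item, how often every other item shares a basket."""
    co = defaultdict(Counter)
    for basket in baskets:
        # set() so that a duplicate item in one basket is counted once
        for a, b in permutations(set(basket), 2):
            co[a][b] += 1
    return co

def also_bought(co, item, n=3):
    """Top-n co-purchased items, ties broken alphabetically."""
    counts = co.get(item, Counter())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:n]]

# Hypothetical purchase history: one list per shopper's basket.
baskets = [
    ["book", "lamp"],
    ["book", "lamp", "pen"],
    ["book", "pen"],
    ["lamp", "desk"],
]
co = build_copurchase(baskets)
print(also_bought(co, "book", n=2))
print(also_bought(co, "lamp", n=1))
```

Raw counts favor items that are simply popular overall, so practical systems usually normalize them in some way before ranking.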

As this blog progresses, I'm going to look at new developments and approaches and compare their effectiveness and commercial value in the market.