A New Similarity Measure for User-based Collaborative Filtering in Recommender Systems

Collaborative filtering is a popular approach in recommender systems that helps users identify the items they may like among a large collection of items. Finding similarity among users from the available item ratings, so as to predict ratings for unseen items based on the preferences of like-minded users, is a challenging problem. Traditional measures such as Cosine similarity and Pearson correlation exhibit some drawbacks in similarity calculation. This paper presents a new similarity measure that improves the performance of recommender systems. Experimental results on the MovieLens dataset show that the proposed measure improves the quality of prediction. We also present clustering results as an extension to validate the effectiveness of the proposed method.


I Introduction
In the digital age, massive amounts of data are generated in every field of science and technology, owing to the availability of automated tools and techniques for data generation and collection. A central research goal today is to uncover hidden information in this raw data.
Recommender systems call for intelligent information retrieval techniques to address the problem of effective information search, applying knowledge discovery to the large volume of available data in order to provide personalized recommendations. Recommender systems can be described as services that suggest a list of products to people who are likely to be interested in them.
The extent to which a user likes a product is termed a rating in recommender systems; e.g., Bob gave a rating of 4 (out of 5) to the movie "Iron Man".
Recommender systems are usually classified into the following categories, based on how recommendations are made [1]:

Collaborative recommendation
The user is recommended items based on the recommendations of other people whose rating history is similar to that of the current user.

Content-based recommendation
The user is recommended items based on a comparison between the descriptions of the items and a profile of the user that assigns weights to the characteristics of the items.

Hybrid approaches
The user is recommended items based on a combination of content-based and collaborative methods, to overcome the limitations of each.
Collaborative filtering (CF) is a popular approach in recommender systems [6]. Collaborative filtering, also known as social filtering [18], makes automated predictions of an active user's preferences for items based on the user's earlier preferences and on the opinions of other users in the user's closest vicinity, referred to as nearest neighbors.
The similarity measure plays a central role in identifying the nearest neighbors when the k-NN approach is used in collaborative filtering [20]. Cosine similarity and Pearson correlation are widely used for similarity calculation in collaborative filtering [7].
We propose a new similarity measure that performs better than Cosine similarity and Pearson correlation. We tested the proposed method on the MovieLens 100K dataset and, as an extension, applied the proposed measure to test cluster purity on the well-known Iris and Wine datasets.
The remainder of this paper is organized as follows. Section II reviews previous work. Section III presents our proposed method. Section IV discusses the experiments and results, followed by conclusions in Section V.

II Related Work
Collaborative filtering algorithms can be grouped into two categories: model-based and memory-based [13].
Model-based techniques use data mining and machine learning algorithms to train a model on known data; the model is then used to make predictions for real data. Successful model-based CF techniques include clustering [4] and matrix factorization [21]. Dimensionality reduction techniques such as SVD are used in CF to deal effectively with the data sparsity and scalability problems [3]. Principal component analysis, which transforms an original set of variables into a smaller set of uncorrelated variables, is used along with clustering to construct a model [15]. A detailed survey can be found in [22].
Memory-based algorithms operate on the entire database. They find the neighborhood of the active user based on the agreement between that user's past ratings and those of other users, and use the neighbors' preferences to estimate ratings for new items. Memory-based algorithms can be further divided into user-based [11], item-based [2], and a unification of both [14].
The same set of similarity estimation techniques is applicable to both user-based and item-based CF systems for finding nearest neighbors. Memory-based techniques are more popular due to their simplicity and proven results.
In this paper, a new metric for estimating similarity is proposed, and its applicability to user-based CF is established by applying it to benchmark data.

Distance/Similarity Measures
A recommender system scenario with n users and m items is represented by an n × m rating matrix R with elements $r_{u,i}$, indicating the rating given by user u to item i. The similarity between two users is calculated using one of the traditional measures, such as Cosine similarity, Pearson correlation, Spearman's rank correlation, entropy, or mean squared difference.
June 27, 2015
Cosine similarity and Pearson correlation are extensively used for finding similarity between pairs of items and pairs of users, respectively, in recommender systems [7]. Pearson correlation performs better than Cosine similarity in measuring user-user affinity [13], and also better than other measures such as Spearman's rank correlation, entropy, and mean squared difference in collaborative filtering [12]. A detailed survey of different distance and similarity measures is given in [5].

Cosine similarity
Cosine similarity is the cosine of the angle between the rating vectors of two users u and u′, where I_u and I_u′ denote the sets of items rated by u and u′, respectively. Cosine similarity normalizes the data with reference to the origin.
Because ratings are non-negative, Cosine similarity ranges from 0 to 1.
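For reference, one standard way to write this measure is sketched below, with $r_{u,i}$ denoting the rating of user u for item i. Summing the numerator over the co-rated items and each norm over that user's own items is an assumption here; the authors' exact normalization may differ.

```latex
\[
\mathrm{sim}_{\cos}(u,u') =
\frac{\sum_{i \in I_u \cap I_{u'}} r_{u,i}\, r_{u',i}}
     {\sqrt{\sum_{i \in I_u} r_{u,i}^{2}}\;\sqrt{\sum_{i \in I_{u'}} r_{u',i}^{2}}}
\]
```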

Pearson correlation
Pearson correlation between two users u and u′ is computed over their common ratings, where I_u and I_u′ denote the sets of items rated by u and u′, respectively. Pearson correlation can also be viewed as Cosine similarity applied after each object has been centered on its own mean (its offset), which measures the degree of linearity. Pearson correlation ranges from -1 to 1.
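For reference, a common form of this measure is sketched below, with $r_{u,i}$ denoting the rating of user u for item i and $\bar{r}_u$ the mean rating of user u. Restricting all three sums to the co-rated items is an assumption; some formulations compute the means over all of a user's ratings instead.

```latex
\[
\mathrm{sim}_{P}(u,u') =
\frac{\sum_{i \in I_u \cap I_{u'}} (r_{u,i}-\bar{r}_u)(r_{u',i}-\bar{r}_{u'})}
     {\sqrt{\sum_{i \in I_u \cap I_{u'}} (r_{u,i}-\bar{r}_u)^{2}}\;
      \sqrt{\sum_{i \in I_u \cap I_{u'}} (r_{u',i}-\bar{r}_{u'})^{2}}}
\]
```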
Pearson correlation is widely used in recommender systems to identify users who exhibit a linear relationship, i.e., similar tastes. For example, consider Table 1, which displays a small hypothetical rating matrix similar to the MovieLens dataset.
The ratings are on a numeric 5-star scale, where 1 and 2 represent negative ratings, 3 represents satisfactory, 4 represents good, and 5 represents very good or excellent.
Using Pearson correlation, one would conclude that Alice, Bob, Carol, Dave and Eve all have similar tastes: because it considers only the linear offset of each user's ratings from that user's mean, Pearson correlation groups patterns irrespective of their range of expression. For example, Alice might belong to a group of users whose ratings span a narrow range (2 to 4), while Eve might belong to a group of users who use a wider range (1 to 5) to rank the items. We argue that Eve is more similar to Alice than the others are: Bob did not like any item and Carol liked all the items, so, based on their rating patterns, Eve is most similar to Alice, followed by Dave.
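The insensitivity of Pearson correlation to range and offset can be illustrated with a minimal sketch. The ratings below are hypothetical values chosen for illustration (they are not the entries of Table 1), the function name `pearson` is ours, and computing the means over co-rated items only is an assumption.

```python
import math

def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items rated by both users (dicts: item -> rating)."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0  # assumption: undefined cases are mapped to 0
    xs = [ratings_a[i] for i in common]
    ys = [ratings_b[i] for i in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a user who rates everything identically has no linear pattern
    return cov / math.sqrt(var_x * var_y)

# Hypothetical users: same rating pattern, different range or offset.
narrow = {"m1": 2, "m2": 3, "m3": 4, "m4": 3}   # uses only 2 to 4
wide   = {"m1": 1, "m2": 3, "m3": 5, "m4": 3}   # uses the full 1 to 5 scale
harsh  = {"m1": 1, "m2": 2, "m3": 3, "m4": 2}   # narrow pattern shifted down by one

print(pearson(narrow, wide), pearson(narrow, harsh))
```

Both pairs receive the maximum correlation even though `harsh` never rates an item above 3, which is the kind of case the argument above is concerned with.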

III Proposed Method
In the correlation coefficient, the dependence on the ranges of both X and Y is removed by dividing by their respective standard deviations, i.e., the values of X by its standard deviation (σ_X) and the values of Y by its standard deviation (σ_Y). Since this dependence is removed, Pearson correlation does not take magnitude into consideration, but it still exhibits the commutative property. We derive a measure that exhibits the commutative property only when the patterns are linear with respect to distance.
Let two data objects be represented by X = {x1, x2, ..., xn} and Y = {y1, y2, ..., yn}. Then  That is, they are similar with respect to both linear relationship and magnitude. Substituting x for y or y for x in the equation of the correlation coefficient does not have any effect on variance, but there will be a change in standard deviation, as it describes how the points are scattered with respect to the offset. The proposed similarity measure ranges from -1 to 1.

Prediction of Rating
The similarity estimate is used as given below to predict the rating of a user u for an additional item i, based on the ratings given by the other users u′ in the neighborhood N of u.
Where μ is the global mean of the ratings available in the training set over all items and users. We chose baseline predictors [16] because they adjust for the effect of a user giving systematically high ratings or an item receiving systematically high ratings, and because they also provide ratings for new users.
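A minimal sketch of neighborhood prediction on top of a baseline estimate is given below. The specific form used here, a baseline of μ plus a user bias plus an item bias (each an unregularized mean offset), corrected by a similarity-weighted average of the neighbors' residuals, is an assumption modeled on the baseline-predictor idea of [16] and may differ from the authors' exact equation. The function names and data layout are hypothetical.

```python
from collections import defaultdict

def fit_baseline(train):
    """train: list of (user, item, rating). Returns (mu, user_bias, item_bias)."""
    mu = sum(r for _, _, r in train) / len(train)
    # Item bias: average offset of an item's ratings from the global mean.
    item_dev = defaultdict(list)
    for _, i, r in train:
        item_dev[i].append(r - mu)
    item_bias = {i: sum(d) / len(d) for i, d in item_dev.items()}
    # User bias: average offset left after removing mu and the item bias.
    user_dev = defaultdict(list)
    for u, i, r in train:
        user_dev[u].append(r - mu - item_bias[i])
    user_bias = {u: sum(d) / len(d) for u, d in user_dev.items()}
    return mu, user_bias, item_bias

def baseline(mu, user_bias, item_bias, u, i):
    # Unknown users or items fall back to a zero bias, so a new user gets mu + item bias.
    return mu + user_bias.get(u, 0.0) + item_bias.get(i, 0.0)

def predict(u, i, neighbors, ratings, mu, user_bias, item_bias):
    """neighbors: list of (other_user, similarity); ratings: dict (user, item) -> rating."""
    b_ui = baseline(mu, user_bias, item_bias, u, i)
    num = den = 0.0
    for v, sim in neighbors:
        if (v, i) in ratings:
            residual = ratings[(v, i)] - baseline(mu, user_bias, item_bias, v, i)
            num += sim * residual
            den += abs(sim)
    # With no usable neighbor, the baseline alone is returned.
    return b_ui if den == 0 else b_ui + num / den
```

Falling back to the baseline when no neighbor has rated the item is one simple way to obtain a prediction for every user-item pair in a test set.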

IV Experimentation and Results
Experiments were conducted with Java and the open-source R environment.

MovieLens-100K
The MovieLens datasets [17] are maintained by the University of Minnesota as part of the GroupLens Research Project. Each rating record is stored as a triplet <UserID, MovieID, rating>.

Iris and Wine datasets
We also used the Iris [9] and Wine [19] datasets to test the effectiveness of our measure in grouping similar objects.

b. Evaluation Metrics
b.i Prediction Evaluation Metrics
The success of a recommender system is gauged by the quality of its predictions.
We used MAE (mean absolute error) to report prediction accuracy on the test set:
\[
\mathrm{MAE} = \frac{\sum_{u \in U} \sum_{i \in \mathrm{testset}_u} \left| \mathrm{pred}_{u,i} - r_{u,i} \right|}{\sum_{u \in U} \left| \mathrm{testset}_u \right|}
\]
MAE is a statistical accuracy metric that compares each prediction with the actual rating in the test set for a particular user.
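As a small illustration, MAE pooled over all user-item pairs in the test set can be computed as follows. The function name and the dictionary layout are assumptions made for this sketch.

```python
def mean_absolute_error(predictions, actuals):
    """Both arguments: dict (user, item) -> rating. MAE over every pair in `actuals`."""
    if not actuals:
        raise ValueError("test set is empty")
    total = sum(abs(predictions[key] - rating) for key, rating in actuals.items())
    return total / len(actuals)

# Hypothetical test-set ratings and predictions.
actual = {("u1", "i1"): 4, ("u1", "i2"): 2, ("u2", "i1"): 5}
predicted = {("u1", "i1"): 3.5, ("u1", "i2"): 3.0, ("u2", "i1"): 5.0}
print(mean_absolute_error(predicted, actual))
```

Lower values indicate better prediction accuracy.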

b.ii Recommendation evaluation metrics
To evaluate the top-N recommendations given to a user, we used Normalized Discounted Cumulative Gain (NDCG). NDCG measures the effectiveness of the top k retrieved items compared to the actual list.

Cluster purity is computed by finding the class label that belongs to the majority of objects in each cluster and counting the total number of labeled objects of that class in each cluster [8]. Cluster purity has a range of [0, 1]; 1 indicates that all objects in each cluster belong to one class.
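Both metrics can be sketched in a few lines. The NDCG below uses the common textbook definition, with a log2 position discount and normalization by the ideal ordering, which is an assumption and may differ in detail from the procedure used in the experiments. Purity follows the majority-count description above. Function names are hypothetical.

```python
import math
from collections import Counter

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevances (position 1 is undiscounted)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    """ranked_relevances: actual ratings listed in the order the system ranked the items."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

def cluster_purity(cluster_ids, class_labels):
    """Fraction of objects that carry the majority class label of their cluster."""
    if not class_labels:
        raise ValueError("no objects to evaluate")
    by_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(class_labels)
```

For example, `ndcg_at_k([5, 4, 3], 3)` corresponds to a perfectly ordered list, while `cluster_purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "b"])` counts two majority objects in the first cluster and three in the second.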

c. Experiments
c.i Prediction Experiment
First, we performed 5-fold cross-validation using the train/test splits u1 to u5. We calculated the bias of each user and of each item in the training set, and calculated a prediction for each user-item pair in the test set; this process was repeated separately for each of the five folds.
The minimum number of common items for similarity calculation was kept at two, and the prediction for the active user was calculated from the top-k nearest neighbors. We repeated the experiment with k from 5 to 100 in steps of 5. We calculated the MAE of each fold and took the average over all five folds.
The same process was repeated with the ua and ub datasets.
We conducted a second experiment on the ua dataset, shrinking the similarity coefficient as
\[
\mathrm{sim}(u,u') \leftarrow \mathrm{sim}(u,u') \times \frac{\text{number of common items}}{\text{damp factor}}
\]
when the number of common items is less than 10, with the damp factor set to 25.
Choosing a damp factor greater than 25 or increasing the number of common items did not have much effect.
The same process was repeated with the ub dataset.
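The shrinking step described above can be sketched as follows. The function name and the parameter names `threshold` and `damp_factor` are ours; the default values 10 and 25 are the ones reported in the text.

```python
def shrink_similarity(sim, n_common, threshold=10, damp_factor=25):
    """Scale sim by n_common / damp_factor when fewer than `threshold` items are shared."""
    if n_common < threshold:
        return sim * n_common / damp_factor
    return sim

# A similarity supported by only 5 co-rated items is reduced to one fifth.
print(shrink_similarity(0.8, 5))
# A similarity supported by 10 or more co-rated items is left unchanged.
print(shrink_similarity(0.8, 10))
```

The effect is to reduce the influence of neighbors whose similarity rests on very few co-rated items.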

Discussion on Prediction results
Fig: 1-6 show that our method consistently performed better than Cosine similarity and Pearson correlation. We used a minimum common-item list size of 2 for all validations unless otherwise stated.
This is because Cosine similarity always gives 1 when there is only one common item, while Pearson correlation and our proposed method require a minimum of 2 common items.
Cosine similarity performed better than Pearson correlation when the top-5 neighbors were considered on the ua dataset (Fig: 3 and Fig: 4), but Pearson correlation performed better as the neighborhood size increased. Our method consistently performed better than the other two measures.
Cosine similarity performed better than Pearson correlation when the top-5 and top-10 neighbors were considered for prediction on the ub dataset (Fig: 5), but Pearson correlation performed better as the neighborhood size increased. Our method consistently performed better than the other two measures.
It is clear from Fig: 2, Fig: 4 and Fig: 6 that when a damping factor is applied to shrink the similarity coefficient whenever the number of common items is less than a threshold (we used 10), MAE decreases considerably and the prediction stabilizes.
For the recommendation evaluation, we took the ua and ub datasets, as they contain 10 ratings for each user in the test set.
We sorted the actual ratings given by each user in the test set and, separately, the predicted ratings. As each user has ten ratings in the test set, we considered only seven ratings in the actual ratings list, compared them with the predicted list to check whether the items in the actual list appear in the same order, and calculated the discounted cumulative gain of their positions for the top-5 and top-7 lists.

Table 5: Cluster purity comparison of Cosine similarity, Pearson correlation and proposed similarity measure on Iris dataset

Discussion on Cluster purity
Table 4 and Table 5 show that our proposed measure forms higher-quality clusters than Cosine similarity and Pearson correlation on the normalized Iris and Wine datasets. Fig: 7 and Fig: 9 show that intra-class similarity is high with our proposed measure. Fig: 8 and Fig: 10 show that inter-class similarity is low with our proposed method.

V Conclusions
Existing distance and similarity measures are not sufficient to deal with all kinds of data analysis. In recommender systems, finding similarity among users or items to improve prediction quality is still an open research area. Our proposed similarity measure is consistent: it performed considerably better than Cosine similarity and marginally better than Pearson correlation. We have also shown that the proposed measure is effective in clustering similar objects.