To be honest, I haven't looked at the code of this similarity (have you?). As I see it, though, it ignores the other side, this time the popular items, and it also seems to ignore the value of the rating: it only uses 1 or 0.
N.

-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Thursday, December 11, 2014 12:00 PM
To: [email protected]
Subject: Re: Collaborative filtering item-based in mahout - without isolating users

> otherwise we recommend only very popular items

this is why you have loglikelihood ratio, right?

m

On Thu, Dec 11, 2014 at 11:51 AM, Gruszowska Natalia <[email protected]> wrote:

> Mario,
> I think in terms of correctness. In similarities like Euclidean,
> Pearson correlation or Cosine Similarity, results are better if we
> consider only common users (users who rated both compared items). This
> assumption lets us find similar items for those which are unpopular,
> otherwise we recommend only very popular items. For my data it is
> unacceptable.
>
> "But if you take, for example, the cosine similarity, you shouldn't
> throw away the data." - you should, it results in dimension reduction
> and it is good. Everything is still in the same space but for each
> pair the space is reduced.
>
> My question is why whoever wrote this code ignored such an
> important assumption? Was it by accident or due to some important
> reasons like effectiveness or computational complexity?
>
>
> Natalia
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> Sent: Wednesday, December 10, 2014 7:05 PM
> To: [email protected]
> Subject: Re: Collaborative filtering item-based in mahout - without
> isolating users
>
> Hi Natalia
>
> Regarding example 1, if you think in terms of likelihood that the two
> products have been bought together because they are similar (opposed
> to by chance), the similarity is undefined. As everyone buys 12, of
> course the person who bought 11 bought also 12, right?
>
> This is if you compute the similarity through a co-occurrence matrix
> (and loglikelihood ratio)
>
> But you say "In the theory, similarity between two items should be
> calculated only for users who ranked both items".
>
> I guess you mean: "Users [1,2,4] don't know about item 11, therefore
> they do not collaborate in building the similarity between the two
> items. User [3], on the contrary, does, and gives the same rating to
> the two products, therefore the similarity is 1".
>
> But if you take, for example, the cosine similarity, you shouldn't
> throw away the data. Here, you build a space with four dimensions - the
> ratings of four users. You can't say product 11 is in another space
> when it relates to users 1,2,4 because it hasn't been rated by those
> users. They all are there. They are dimensions, like in physics.
> Therefore you must use this information too. Items are in the
> user-space... all.
>
> Even intuitively, items 11 and 12 are not similar at all - one has been
> bought by every customer, the other by just one customer. How could
> you tell the next customer who buys 12 (everyone does...) that she
> would really like 11...?
>
> Mario
>
>
> On Wed, Dec 10, 2014 at 4:40 PM, Gruszowska Natalia <
> [email protected]> wrote:
>
> > Hi All,
> >
> > In mahout there is an implemented method for item-based Collaborative
> > filtering called itemsimilarity, which returns the "similarity"
> > between each two items.
> > In theory, similarity between two items should be calculated
> > only for users who ranked both items. During testing I realized that
> > in mahout it works differently.
> > Below are two examples.
> >
> > Example 1. items are 11-12
> > In the example below the similarity between items 11 and 12 should be
> > equal to 1, but mahout output is 0.36. It looks like mahout treats
> > null as 0.
> > Similarity between items:
> > 101 102 0.36602540378443865
> >
> > Matrix with preferences:
> >        11    12
> > 1            1
> > 2            1
> > 3      1     1
> > 4            1
> >
> > Example 2. items are 101-103.
> > Similarity between items 101 and 102 should be calculated using only
> > ranks for users 4 and 5, and the same for items 101 and 103 (that
> > should be based on theory). Here (101,103) is more similar than
> > (101,102), and it shouldn't be.
> > Similarity between items:
> > 101 102 0.2612038749637414
> > 101 103 0.4340578302732228
> > 102 103 0.2600070276638468
> >
> > Matrix with preferences:
> >        101   102   103
> > 1            1     0.1
> > 2            1     0.1
> > 3            1     0.1
> > 4      1     1     0.1
> > 5      1     1     0.1
> > 6            1     0.1
> > 7            1     0.1
> > 8            1     0.1
> > 9            1     0.1
> > 10           1     0.1
> >
> >
> > Both examples were run without any additional parameters.
> > Is this problem solved somewhere, somehow? Any ideas? Why is null
> > treated as 0?
> > Source: http://files.grouplens.org/papers/www10_sarwar.pdf
> >
> >
> > Kind regards,
> > Natalia Gruszowska
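
A rough sketch for checking the two examples above by hand. It assumes the similarity is 1 / (1 + d), where d is the Euclidean distance between the two item rating vectors. That formula is an assumption inferred from the reported numbers, not taken from the Mahout source, and the function name `euclidean_similarity` and the flag `missing_as_zero` are made up for this illustration. With missing ratings filled in as 0 it reproduces the values reported in the thread; restricted to users who rated both items it gives 1 for (11, 12) and ranks (101, 102) above (101, 103).

```python
import math


def euclidean_similarity(a, b, missing_as_zero=True):
    """Similarity 1 / (1 + d) between two items.

    a, b: dicts mapping user id -> rating for one item each.
    missing_as_zero=True  -> a user who did not rate an item counts as rating 0.
    missing_as_zero=False -> only users who rated both items are used.
    """
    if missing_as_zero:
        users = set(a) | set(b)
    else:
        users = set(a) & set(b)
    if not users:
        return float("nan")  # no data to compare on
    d = math.sqrt(sum((a.get(u, 0.0) - b.get(u, 0.0)) ** 2 for u in users))
    return 1.0 / (1.0 + d)


# Example 1: item 11 rated only by user 3, item 12 rated by users 1..4.
item11 = {3: 1.0}
item12 = {u: 1.0 for u in range(1, 5)}

# Example 2: item 101 rated only by users 4 and 5,
# items 102 and 103 rated by users 1..10.
item101 = {4: 1.0, 5: 1.0}
item102 = {u: 1.0 for u in range(1, 11)}
item103 = {u: 0.1 for u in range(1, 11)}

pairs = [
    ("11", "12", item11, item12),
    ("101", "102", item101, item102),
    ("101", "103", item101, item103),
    ("102", "103", item102, item103),
]
for name_a, name_b, a, b in pairs:
    zero_filled = euclidean_similarity(a, b, missing_as_zero=True)
    co_rated = euclidean_similarity(a, b, missing_as_zero=False)
    print(name_a, name_b, "zero-filled:", zero_filled, "co-rated only:", co_rated)
```

The only difference between the two variants is which users enter the sum: the union of raters (missing counted as 0) or the intersection (co-rating users only), which is exactly the point under discussion in the thread.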
