> otherwise we recommend only very popular items

This is why you have the log-likelihood ratio, right?

m
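To make the log-likelihood point concrete, here is a minimal sketch of the log-likelihood ratio for a 2x2 co-occurrence table, written in its entropy form. The function names are illustrative and the code is not taken from Mahout. The counts come from Example 1 quoted below: one user has both items, nobody has 11 without 12, three users have 12 without 11, and nobody has neither.

```python
import math

def x_log_x(x):
    # Convention: 0 * log(0) is taken as 0.
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 co-occurrence table.

    k11: users with both items     k12: users with the first item only
    k21: users with the second only k22: users with neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))

# Example 1: item 11 vs item 12.
print(llr(1, 0, 3, 0))

# Hypothetical contrast case: two items that always appear together
# and are both absent for the remaining users.
print(llr(2, 0, 0, 2))
```

For Example 1 the score is 0: since every user has item 12, seeing it next to item 11 carries no evidence of association. The contrast case gives a clearly positive score.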
On Thu, Dec 11, 2014 at 11:51 AM, Gruszowska Natalia <[email protected]> wrote:

> Mario,
>
> I am thinking in terms of correctness. Similarities such as Euclidean,
> Pearson correlation, or cosine give better results if we consider only
> common users (users who rated both compared items). This assumption lets
> us find similar items for those that are unpopular; otherwise we recommend
> only very popular items. For my data that is unacceptable.
>
> "But if you take, for example, the cosine similarity, you shouldn't throw
> away the data." - you should; it results in dimension reduction, and that
> is good. Everything is still in the same space, but for each pair the
> space is reduced.
>
> My question is: why did whoever wrote this code ignore such an important
> assumption? Was it by accident, or for some important reason such as
> effectiveness or computational complexity?
>
>
> Natalia
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> Sent: Wednesday, December 10, 2014 7:05 PM
> To: [email protected]
> Subject: Re: Collaborative filtering item-based in mahout - without
> isolating users
>
> Hi Natalia
>
> Regarding example 1, if you think in terms of the likelihood that the two
> products have been bought together because they are similar (as opposed to
> by chance), the similarity is undefined. As everyone buys 12, of course the
> person who bought 11 also bought 12, right?
>
> This is the case if you compute the similarity through a co-occurrence
> matrix (and log-likelihood ratio).
>
> But you say "In the theory, similarity between two items should be
> calculated only for users who ranked both items".
>
> I guess you mean: "Users [1,2,4] don't know about item 11, therefore they
> do not collaborate in building the similarity between the two items. User
> [3], on the contrary, does, and gives the same rating to the two products,
> therefore the similarity is 1".
>
> But if you take, for example, the cosine similarity, you shouldn't throw
> away the data. Here, you build a space with four dimensions: the ratings
> of four users. You can't say product 11 is in another space when it
> relates to users 1, 2, 4 because it hasn't been rated by those users. They
> are all there. They are dimensions, like in physics. Therefore you must
> use this information too. Items are in the user space... all of them.
>
> Even intuitively, items 11 and 12 are not similar at all: one has been
> bought by every customer, the other by just one customer. How could you
> tell the next customer who buys 12 (everyone does...) that she would really
> like 11...?
>
> Mario
>
>
> On Wed, Dec 10, 2014 at 4:40 PM, Gruszowska Natalia <[email protected]> wrote:
>
> > Hi All,
> >
> > Mahout implements a method for item-based collaborative filtering
> > called itemsimilarity, which returns the "similarity" between each
> > pair of items.
> > In the theory, similarity between two items should be calculated only
> > for users who ranked both items. During testing I realized that in
> > Mahout it works differently.
> > Below are two examples.
> >
> > Example 1. items are 11-12
> > In the example below, the similarity between items 11 and 12 should
> > equal 1, but the Mahout output is 0.36. It looks like Mahout treats
> > null as 0.
> > Similarity between items:
> > 101 102 0.36602540378443865
> >
> > Matrix with preferences:
> >       11   12
> > 1           1
> > 2           1
> > 3      1    1
> > 4           1
> >
> > Example 2. items are 101-103.
> > Similarity between items 101 and 102 should be calculated using only
> > ranks for users 4 and 5, and the same for items 101 and 103
> > (according to the theory). Here (101,103) is more similar than
> > (101,102), and it shouldn't be.
> > Similarity between items:
> > 101 102 0.2612038749637414
> > 101 103 0.4340578302732228
> > 102 103 0.2600070276638468
> >
> > Matrix with preferences:
> >       101  102  103
> > 1           1   0.1
> > 2           1   0.1
> > 3           1   0.1
> > 4      1    1   0.1
> > 5      1    1   0.1
> > 6           1   0.1
> > 7           1   0.1
> > 8           1   0.1
> > 9           1   0.1
> > 10          1   0.1
> >
> >
> > Both examples were run without any additional parameters.
> > Has this problem been solved somewhere, somehow? Any ideas? Why is
> > null treated as 0?
> > Source: http://files.grouplens.org/papers/www10_sarwar.pdf
> >
> >
> > Kind regards,
> > Natalia Gruszowska
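
The two readings discussed in this thread can be compared directly. The sketch below is not Mahout code. It assumes a similarity of the form 1 / (1 + d), where d is the Euclidean distance, and computes it once with missing ratings filled in as 0 and once over co-rated users only. The thread does not say which measure was run; Euclidean is an inference from the fact that the zero-filled values match the reported numbers. The function and variable names are illustrative.

```python
import math

def euclid_sim(a, b, users, co_rated_only=False):
    """1 / (1 + Euclidean distance) between two items.

    a, b: dicts mapping user id -> rating (a missing key means "not rated").
    With co_rated_only=False, missing ratings are treated as 0.
    """
    if co_rated_only:
        users = [u for u in users if u in a and u in b]
        if not users:
            return None  # no common users: similarity is undefined
    d2 = sum((a.get(u, 0.0) - b.get(u, 0.0)) ** 2 for u in users)
    return 1.0 / (1.0 + math.sqrt(d2))

# Example 1: item 11 rated by user 3 only, item 12 rated by users 1-4.
users1 = range(1, 5)
item11 = {3: 1.0}
item12 = {u: 1.0 for u in users1}

# Example 2: item 101 rated by users 4 and 5, items 102 and 103 by users 1-10.
users2 = range(1, 11)
item101 = {4: 1.0, 5: 1.0}
item102 = {u: 1.0 for u in users2}
item103 = {u: 0.1 for u in users2}

cases = [
    ("11-12", item11, item12, users1),
    ("101-102", item101, item102, users2),
    ("101-103", item101, item103, users2),
    ("102-103", item102, item103, users2),
]
for label, a, b, users in cases:
    zero_filled = euclid_sim(a, b, users)
    co_rated = euclid_sim(a, b, users, co_rated_only=True)
    print(label, "zero-filled:", zero_filled, "co-rated only:", co_rated)
```

Under these assumptions the zero-filled values agree with the numbers reported above, while the co-rated variant gives 1 for both (11, 12) and (101, 102) and ranks (101, 102) above (101, 103), which is the ordering Natalia expected.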
