Mario,
I am thinking in terms of correctness. With similarities such as Euclidean
distance, Pearson correlation or cosine similarity, results are better if we
consider only common users (users who rated both of the compared items). This
restriction lets us find similar items for unpopular items; otherwise we
recommend only very popular items. For my data that is unacceptable.
        
"But if you take, for example, the cosine similarity, you shouldn't throw away 
the data." - you should; it results in dimension reduction, which is good.
Everything is still in the same space, but for each pair the space is reduced
to the users who rated both items.

My question is: why did whoever wrote this code ignore such an important
assumption? Was it by accident, or for some important reason such as
efficiency or computational complexity?
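
As an illustration only, here is a minimal Python sketch (not Mahout code)
comparing the two conventions on Example 2 from the original message quoted
below. It assumes a Euclidean-distance similarity of the form 1 / (1 + d) and
models "null as 0" by zero-filling missing ratings. Both are assumptions made
for illustration, and the function name is hypothetical.

```python
import math

# Ratings from Example 2 in the quoted message: item -> {user: rating}.
ratings = {
    101: {4: 1.0, 5: 1.0},
    102: {u: 1.0 for u in range(1, 11)},
    103: {u: 0.1 for u in range(1, 11)},
}


def euclidean_similarity(a, b, common_only):
    """Return 1 / (1 + d), where d is the Euclidean distance between two items.

    common_only=True  uses only users who rated both items.
    common_only=False uses every user and treats a missing rating as 0
    (an assumption about what "null as 0" means, not a claim about Mahout).
    """
    users = set(a) & set(b) if common_only else set(a) | set(b)
    if not users:
        return float("nan")  # no users to compare on: similarity undefined
    d = math.sqrt(sum((a.get(u, 0.0) - b.get(u, 0.0)) ** 2 for u in users))
    return 1.0 / (1.0 + d)


for i, j in [(101, 102), (101, 103), (102, 103)]:
    zero_filled = euclidean_similarity(ratings[i], ratings[j], common_only=False)
    co_rated = euclidean_similarity(ratings[i], ratings[j], common_only=True)
    print(i, j, round(zero_filled, 4), round(co_rated, 4))
```

Under these assumptions the zero-filled values agree with the three figures
reported for Example 2 (about 0.2612, 0.4341 and 0.2600), while the
common-users variant gives 1.0 for (101,102) and roughly 0.44 for (101,103).
That is consistent with null being treated as 0 in the run, but it does not
prove it.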


Natalia


-----Original Message-----
From: [email protected] [mailto:[email protected]] 
Sent: Wednesday, December 10, 2014 7:05 PM
To: [email protected]
Subject: Re: Collaborative filtering item-based in mahout - without isolating 
users

Hi Natalia

Regarding example 1, if you think in terms of the likelihood that the two
products have been bought together because they are similar (as opposed to by
chance), the similarity is undefined. As everyone buys 12, of course the person
who bought 11 also bought 12, right?

That is what you get if you compute the similarity through a co-occurrence
matrix (and log-likelihood ratio).

But you say "In the theory, similarity between two items should be calculated 
only for users who ranked both items".

I guess you mean: "Users [1,2,4] don't know about item 11, therefore they do 
not collaborate in building the similarity between the two items. User [3], on 
the contrary, does, and gives the same rating to the two products, therefore 
the similarity is 1".

But if you take, for example, the cosine similarity, you shouldn't throw away
the data. Here you build a space with four dimensions: the ratings of four
users. You can't say product 11 is in another space when it relates to users
1, 2 and 4 because it hasn't been rated by those users. They are all there.
They are dimensions, like in physics. Therefore you must use this information
too. All items are in the user space.
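
A minimal sketch of the two views on Example 1, in Python (illustrative only,
not Mahout code). Writing a missing rating as 0 is exactly the assumption
under discussion, and the helper name is hypothetical.

```python
import math

# Example 1 as vectors in the four-dimensional user space (users 1..4).
# A missing rating is written as 0 here; that is the assumption being debated.
item_11 = [0.0, 0.0, 1.0, 0.0]
item_12 = [1.0, 1.0, 1.0, 1.0]


def cosine(x, y):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# All four user dimensions are kept.
full_space = cosine(item_11, item_12)

# Only user 3 (index 2), the single user who rated both items, is kept.
user3_only = cosine(item_11[2:3], item_12[2:3])

print(full_space, user3_only)
```

In the full four-dimensional space the cosine is 0.5; restricted to user 3
alone it is 1.0.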

Even intuitively, items 11 and 12 are not similar at all -one has been bought 
by every customer, the other by just one customer. How could you tell the next 
customer who buys 12 (everyone does...) that she would really like 11...?

Mario


On Wed, Dec 10, 2014 at 4:40 PM, Gruszowska Natalia < 
[email protected]> wrote:

> Hi All,
>
> In mahout there is a method for item-based Collaborative
> filtering called itemsimilarity, which returns the "similarity"
> between each pair of items.
> In the theory, similarity between two items should be calculated only
> for users who ranked both items. During testing I realized that in
> mahout it works differently.
> Below are two examples.
>
> Example 1. items are 11-12
> In the example below, the similarity between items 11 and 12 should be
> equal to 1, but the mahout output is 0.36. It looks like mahout treats
> null as 0.
> Similarity between items:
> 101     102     0.36602540378443865
>
> Matrix with preferences:
>             11       12
> 1                     1
> 2                     1
> 3           1         1
> 4                     1
>
> Example 2. items are 101-103.
> The similarity between items 101 and 102 should be calculated using only
> the ranks from users 4 and 5, and the same for items 101 and 103 (that
> is what the theory prescribes). Here (101,103) comes out more similar
> than (101,102), and it shouldn't.
> Similarity between items:
> 101     102     0.2612038749637414
> 101     103     0.4340578302732228
> 102     103     0.2600070276638468
>
> Matrix with preferences:
>             101      102        103
> 1                     1         0.1
> 2                     1         0.1
> 3                     1         0.1
> 4           1         1         0.1
> 5           1         1         0.1
> 6                     1         0.1
> 7                     1         0.1
> 8                     1         0.1
> 9                     1         0.1
> 10                    1         0.1
>
>
> Both examples were run without any additional parameters.
> Is this problem solved somewhere, somehow? Any ideas? Why is null
> treated as 0?
> Source: http://files.grouplens.org/papers/www10_sarwar.pdf
>
>
>
> Kind regards,
> Natalia Gruszowska
>
>
>
