[ 
https://issues.apache.org/jira/browse/MAHOUT-389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-389:
--------------------------------------

    Attachment: MAHOUT-389-2.patch

There might be cases where it makes sense to look beyond the co-ratings. For 
example, imagine you have three products: A, B and C.

Let's say the pairs A,B and A,C have the same co-ratings (the same users bought 
them), but B is a top seller that lots of people buy, while C is a niche 
product that sells only rarely.
A cosine that includes the zero assumption (a missing rating counts as 0) would 
decrease the value for the top seller and prefer the niche product, which might 
be a good thing depending on your use case.
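
To make the A/B/C example concrete, here is a minimal sketch in plain Java. It 
is not Mahout code: the class name, the method names and the purchase numbers 
(users 1 and 2 buy all three products, 98 further users buy only B) are made up 
for illustration, and every purchase is treated as a rating of 1.0.

```java
import java.util.HashMap;
import java.util.Map;

public class ZeroAssumptionExample {

  /** Cosine over the co-rated users only; ratings outside the overlap are ignored. */
  static double coRatedCosine(Map<Integer, Double> x, Map<Integer, Double> y) {
    double dot = 0.0;
    double normX = 0.0;
    double normY = 0.0;
    for (Map.Entry<Integer, Double> e : x.entrySet()) {
      Double other = y.get(e.getKey());
      if (other != null) {
        dot += e.getValue() * other;
        normX += e.getValue() * e.getValue();
        normY += other * other;
      }
    }
    return dot == 0.0 ? Double.NaN : dot / Math.sqrt(normX * normY);
  }

  /** Cosine over the full vectors; a missing rating is assumed to be 0. */
  static double zeroAssumingCosine(Map<Integer, Double> x, Map<Integer, Double> y) {
    double dot = 0.0;
    for (Map.Entry<Integer, Double> e : x.entrySet()) {
      Double other = y.get(e.getKey());
      if (other != null) {
        dot += e.getValue() * other;
      }
    }
    return dot == 0.0 ? Double.NaN : dot / Math.sqrt(squaredNorm(x) * squaredNorm(y));
  }

  static double squaredNorm(Map<Integer, Double> v) {
    double sum = 0.0;
    for (double value : v.values()) {
      sum += value * value;
    }
    return sum;
  }

  /** Builds an item vector in which users firstUser..lastUser each rated 1.0. */
  static Map<Integer, Double> boughtBy(int firstUser, int lastUser) {
    Map<Integer, Double> v = new HashMap<Integer, Double>();
    for (int user = firstUser; user <= lastUser; user++) {
      v.put(user, 1.0);
    }
    return v;
  }

  public static void main(String[] args) {
    Map<Integer, Double> a = boughtBy(1, 2);   // bought by users 1 and 2
    Map<Integer, Double> b = boughtBy(1, 100); // top seller: users 1..100
    Map<Integer, Double> c = boughtBy(1, 2);   // niche product: users 1 and 2

    System.out.println("co-rated      A,B: " + coRatedCosine(a, b));
    System.out.println("co-rated      A,C: " + coRatedCosine(a, c));
    System.out.println("zero-assuming A,B: " + zeroAssumingCosine(a, b));
    System.out.println("zero-assuming A,C: " + zeroAssumingCosine(a, c));
  }
}
```

With these numbers the co-rated cosine cannot tell the two pairs apart, while 
the zero-assuming cosine scores A,B at roughly 0.14 and A,C at 1.0, because the 
norm of B grows with every additional buyer.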

But I definitely see your point that the assumption does not generally hold, 
and I also think that the distributed version should be modified. 

I attached a patch with a first proposal of how this could be done.

I tried to refactor the similarity computation out of the map-reduce code and 
make it possible to implement different similarity functions that have to 
follow this scheme:

 * in an early stage of the process, the similarity implementation can compute 
a weight (a single double) for each item-vector
 * in the end, it is given all co-ratings and the previously computed weights 
for each item-pair that has at least one co-rating

That should be sufficient to compute the centered Pearson correlation as well 
as cosine or Tanimoto coefficients.
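
The two-step scheme above can be sketched roughly as follows. This is only an 
illustration of the idea, not the interface from the patch: the names 
TwoPhaseSimilarity, weight and similarity are hypothetical, and a co-rating is 
simplified to a double[2] holding the two ratings one user gave to the pair.

```java
import java.util.ArrayList;
import java.util.List;

public class TwoPhaseSimilaritySketch {

  /** Hypothetical interface; the names are assumptions, not the patch's API. */
  interface TwoPhaseSimilarity {
    /** Early stage: reduce one complete item-vector to a single double. */
    double weight(double[] itemRatings);

    /** Final stage: sees only the co-ratings of a pair plus both weights. */
    double similarity(List<double[]> coRatings, double weightX, double weightY);
  }

  /** Cosine with zero assumption: the weight is the norm of the full vector. */
  static final TwoPhaseSimilarity ZERO_ASSUMING_COSINE = new TwoPhaseSimilarity() {
    @Override
    public double weight(double[] itemRatings) {
      double sumOfSquares = 0.0;
      for (double rating : itemRatings) {
        sumOfSquares += rating * rating;
      }
      return Math.sqrt(sumOfSquares);
    }

    @Override
    public double similarity(List<double[]> coRatings, double weightX, double weightY) {
      double dot = 0.0;
      for (double[] coRating : coRatings) {
        dot += coRating[0] * coRating[1];
      }
      return dot / (weightX * weightY);
    }
  };

  /** Tanimoto coefficient: the weight is the number of users who rated the item. */
  static final TwoPhaseSimilarity TANIMOTO = new TwoPhaseSimilarity() {
    @Override
    public double weight(double[] itemRatings) {
      return itemRatings.length;
    }

    @Override
    public double similarity(List<double[]> coRatings, double weightX, double weightY) {
      double both = coRatings.size();
      return both / (weightX + weightY - both);
    }
  };

  public static void main(String[] args) {
    double[] itemX = {3.0, 4.0}; // ratings by users 1 and 2
    double[] itemY = {3.0, 4.0}; // ratings by users 2 and 3

    // Only user 2 rated both items: 4.0 for X and 3.0 for Y.
    List<double[]> coRatings = new ArrayList<double[]>();
    coRatings.add(new double[] {4.0, 3.0});

    for (TwoPhaseSimilarity s : new TwoPhaseSimilarity[] {ZERO_ASSUMING_COSINE, TANIMOTO}) {
      System.out.println(s.similarity(coRatings, s.weight(itemX), s.weight(itemY)));
    }
  }
}
```

A centered Pearson correlation would fit the same shape without using the 
weights at all, since it can be computed from the co-ratings alone.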

I hope it's clear what I'm trying to propose here. Taking a look at 
org.apache.mahout.cf.taste.hadoop.similarity.DistributedSimilarity together 
with DistributedPearsonCorrelationSimilarity and 
DistributedUncenteredZeroAssumingCosineSimilarity will hopefully give a 
clearer picture. These implementations are merely for demonstration purposes; 
they could be merged with the existing non-distributed implementations if you 
like the approach described here.

> UncenteredCosineSimilarity 
> ---------------------------
>
>                 Key: MAHOUT-389
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-389
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Priority: Minor
>         Attachments: MAHOUT-389-2.patch, MAHOUT-389.patch
>
>
> org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity only 
> computes the cosine distance between those components of the vectors where 
> both vectors have a value greater than zero.
> This is inconsistent with the definition of the cosine (correct me if I'm 
> wrong) and is inconsistent with the distributed cosine similarity computation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
