Oh cool... all-pairs brute force is also part of this PR? Let me pull it
in and test it on our dataset...

Thanks.
Deb


On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com> wrote:

> Hi Deb,
>
> We are adding all-pairs and thresholded all-pairs via dimsum in this PR:
> https://github.com/apache/spark/pull/1778
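>
> If it helps, here is a minimal sketch of the intended usage, assuming
> the RowMatrix.columnSimilarities API as it currently stands in the PR
> (details may still shift before merge):
>
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.mllib.linalg.distributed.RowMatrix
>
> // Rows are data points; similarities are computed between columns.
> // sc is the usual SparkContext, e.g. from the spark shell.
> val rows = sc.parallelize(Seq(
>   Vectors.dense(1.0, 0.0, 2.0),
>   Vectors.dense(0.0, 3.0, 4.0),
>   Vectors.dense(5.0, 6.0, 0.0)))
> val mat = new RowMatrix(rows)
>
> // Exact all-pairs cosine similarities between columns.
> val exact = mat.columnSimilarities()
>
> // Thresholded all-pairs via DIMSUM sampling: pairs with similarity
> // well above the threshold are estimated accurately; others may be
> // dropped or estimated loosely.
> val approx = mat.columnSimilarities(0.1)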
>
> Your question wasn't entirely clear - does this answer it?
>
> Best,
> Reza
>
>
> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
>> Hi Reza,
>>
>> Have you compared DIMSUM against a brute-force similarity computation
>> in Spark, something like the following?
>>
>> https://github.com/echen/scaldingale
>>
>> I am adding cosine similarity computation, but I do want to compute
>> all-pairs similarities...
>>
>> Note that my data is sparse (the data that goes into matrix
>> factorization), so I don't think the join and group-by on
>> (product, product) will be a big issue for me...
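>>
>> For concreteness, the brute force I have in mind is roughly this sketch
>> (ratings is a made-up name, and I assume each product's rating vector is
>> already L2-normalized so dot products are cosines):
>>
>> // ratings: RDD[(String, String, Double)] of (user, product, rating).
>> val byUser = ratings.map { case (u, p, r) => (u, (p, r)) }
>>
>> val cosines = byUser.join(byUser)
>>   .filter { case (_, ((p1, _), (p2, _))) => p1 < p2 } // each pair once
>>   .map { case (_, ((p1, r1), (p2, r2))) => ((p1, p2), r1 * r2) }
>>   .reduceByKey(_ + _) // sum over co-rating users = dot product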
>>
>> Does it make sense to add brute-force all-pairs similarities alongside
>> the DIMSUM-based similarity?
>>
>> Thanks.
>> Deb
>>
>>
>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com> wrote:
>>
>>> Hi Xiaoli,
>>>
>>> There is a PR currently in progress to allow this, via the sampling
>>> scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf
>>>
>>> The PR is at https://github.com/apache/spark/pull/336, though it will
>>> need refactoring given the recent changes to the matrix interface in
>>> MLlib. You may want to implement the sampling scheme in your own app,
>>> since it's not much code.
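>>>
>>> The core idea, roughly, is importance sampling: emit each co-occurring
>>> pair with some probability p and scale its contribution by 1/p so the
>>> expected sum is unchanged. A sketch of that idea only (the paper's
>>> exact per-column probabilities and scalings differ, so treat the
>>> sampling rule below as a simplified stand-in):
>>>
>>> import scala.util.Random
>>>
>>> // rows: RDD[Seq[(Int, Double)]] of sparse rows, colNorms: column L2
>>> // norms, gamma: oversampling knob. All of these names are made up.
>>> val dots = rows.flatMap { row =>
>>>   for {
>>>     (i, vi) <- row
>>>     (j, vj) <- row if i < j
>>>     p = math.min(1.0, gamma / (colNorms(i) * colNorms(j)))
>>>     if Random.nextDouble() < p
>>>   } yield ((i, j), vi * vj / p) // the 1/p scaling keeps it unbiased
>>> }.reduceByKey(_ + _)
>>>
>>> // cosine(i, j) is then approximately dots((i, j)) divided by
>>> // colNorms(i) * colNorms(j).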
>>>
>>> Best,
>>> Reza
>>>
>>>
>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com>
>>> wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>> Thanks for your suggestion. I have tried that method, using 8 nodes
>>>> with 8 GB of memory each. The program just stalled at one stage for
>>>> several hours without any further information. Maybe I need to find a
>>>> more efficient approach.
>>>>
>>>>
>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com>
>>>> wrote:
>>>>
>>>>> The naive way would be to put all the users and their attributes into
>>>>> an RDD, then take the cartesian product of that RDD with itself. Run
>>>>> the similarity score on every pair (1M * 1M => 1T scores), map to
>>>>> (user, (score, otherUser)), and take the .top(k) for each user.
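>>>>>
>>>>> A rough sketch of that, assuming a users RDD of (id, features) pairs
>>>>> and a small k (all names here are made up for illustration):
>>>>>
>>>>> // users: RDD[(Long, Array[Double])]
>>>>> def cosine(a: Array[Double], b: Array[Double]): Double = {
>>>>>   val dot = a.zip(b).map { case (x, y) => x * y }.sum
>>>>>   val na = math.sqrt(a.map(x => x * x).sum)
>>>>>   val nb = math.sqrt(b.map(x => x * x).sum)
>>>>>   dot / (na * nb)
>>>>> }
>>>>>
>>>>> val k = 10
>>>>> val topK = users.cartesian(users)
>>>>>   .filter { case ((u1, _), (u2, _)) => u1 != u2 }
>>>>>   .map { case ((u1, f1), (u2, f2)) => (u1, (cosine(f1, f2), u2)) }
>>>>>   .groupByKey()
>>>>>   .mapValues(_.toSeq.sortBy(-_._1).take(k)) // top-k per user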
>>>>>
>>>>> I doubt that you'll be able to take this approach with the 1T pairs
>>>>> though, so it might be worth looking at the literature for recommender
>>>>> systems to see what else is out there.
>>>>>
>>>>>
>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I am implementing an algorithm using Spark. I have one million users.
>>>>>> I need to compute the similarity between each pair of users using some
>>>>>> of the users' attributes. For each user, I need to get the top k most
>>>>>> similar users. What is the best way to implement this?
>>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
