OK... just to make sure: I have a RowMatrix[SparseVector] where rows are
~60M and columns are ~10M, with around a billion data points...

I have another version that's around 60M rows and ~10K columns...

I guess for the second one both all-pairs and DIMSUM will run fine...

But for the tall-and-wide one, what do you suggest? Can DIMSUM handle it?

I might need Jaccard as well... can I plug that into the PR?
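
To be concrete about what I mean by Jaccard, here's a quick sketch
(illustrative only; treats rows as binary sets and assumes the
SparseVector index arrays are sorted ascending, as MLlib requires):

  // Jaccard similarity of two sparse binary rows, via a linear merge
  // over their sorted nonzero-index arrays.
  def jaccard(a: Array[Int], b: Array[Int]): Double = {
    var i = 0; var j = 0; var inter = 0
    while (i < a.length && j < b.length) {
      if (a(i) == b(j)) { inter += 1; i += 1; j += 1 }
      else if (a(i) < b(j)) i += 1
      else j += 1
    }
    val union = a.length + b.length - inter
    if (union == 0) 0.0 else inter.toDouble / union
  }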



On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com> wrote:

> You might want to wait until Wednesday, since the interface in that PR
> will be changing before then, probably over the weekend, so that you
> don't have to redo your code. Your call if you need it sooner.
> Reza
>
>
> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
>> Oh cool... all-pairs brute force is also part of this PR? Let me pull
>> it in and test it on our dataset...
>>
>> Thanks.
>> Deb
>>
>>
>> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com> wrote:
>>
>>> Hi Deb,
>>>
>>> We are adding all-pairs and thresholded all-pairs via DIMSUM in this PR:
>>> https://github.com/apache/spark/pull/1778
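>>>
>>> Usage will look roughly like the following (the interface is still in
>>> flux, so treat the exact names as tentative; assumes an existing
>>> SparkContext sc):
>>>
>>>   import org.apache.spark.mllib.linalg.Vectors
>>>   import org.apache.spark.mllib.linalg.distributed.RowMatrix
>>>
>>>   // Tentative sketch -- method names may change before the PR merges.
>>>   val rows = sc.parallelize(Seq(
>>>     Vectors.sparse(3, Seq((0, 1.0), (2, 2.0))),
>>>     Vectors.sparse(3, Seq((1, 3.0), (2, 1.0)))))
>>>   val mat = new RowMatrix(rows)
>>>   val exact  = mat.columnSimilarities()     // brute-force all pairs
>>>   val approx = mat.columnSimilarities(0.1)  // DIMSUM, threshold 0.1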
>>>
>>> Your question wasn't entirely clear - does this answer it?
>>>
>>> Best,
>>> Reza
>>>
>>>
>>> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <debasish.da...@gmail.com>
>>> wrote:
>>>
>>>> Hi Reza,
>>>>
>>>> Have you compared this against a brute-force similarity computation
>>>> in Spark, something like the following?
>>>>
>>>> https://github.com/echen/scaldingale
>>>>
>>>> I am adding cosine similarity computation, but I do want to compute
>>>> all-pairs similarities...
>>>>
>>>> Note that my data (the data that goes into matrix factorization) is
>>>> sparse, so I don't think the join and group-by on (product, product)
>>>> pairs will be a big issue for me...
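>>>>
>>>> Concretely, I have something like this rough sketch in mind (names
>>>> are illustrative; Spark 1.x needs import org.apache.spark.SparkContext._
>>>> for the pair-RDD operations):
>>>>
>>>>   // ratings: RDD[(user, (product, rating))], e.g.
>>>>   // val ratings = sc.parallelize(Seq((1, (10, 4.0)), (1, (11, 5.0))))
>>>>   val norms = ratings
>>>>     .map { case (_, (p, r)) => (p, r * r) }
>>>>     .reduceByKey(_ + _)
>>>>     .mapValues(math.sqrt)          // L2 norm of each product column
>>>>
>>>>   val dots = ratings.join(ratings) // co-rated (product, product) pairs
>>>>     .flatMap { case (_, ((p1, r1), (p2, r2))) =>
>>>>       if (p1 < p2) Some(((p1, p2), r1 * r2)) else None
>>>>     }
>>>>     .reduceByKey(_ + _)            // dot product per product pair
>>>>
>>>>   val cosine = dots                // attach both norms, then divide
>>>>     .map { case ((p1, p2), d) => (p1, (p2, d)) }.join(norms)
>>>>     .map { case (p1, ((p2, d), n1)) => (p2, (p1, d, n1)) }.join(norms)
>>>>     .map { case (p2, ((p1, d, n1), n2)) => ((p1, p2), d / (n1 * n2)) }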
>>>>
>>>> Does it make sense to add all-pairs similarities as well, alongside
>>>> the DIMSUM-based similarity?
>>>>
>>>> Thanks.
>>>> Deb
>>>>
>>>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> Hi Xiaoli,
>>>>>
>>>>> There is a PR currently in progress to allow this, via the sampling
>>>>> scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf
>>>>>
>>>>> The PR is at https://github.com/apache/spark/pull/336 though it will
>>>>> need refactoring given the recent changes to the matrix interface in
>>>>> MLlib. You may implement the sampling scheme for your own app, since
>>>>> it's not much code.
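>>>>>
>>>>> The core of it is roughly the following (a rough sketch of the emit
>>>>> step from the paper; you'd precompute the column L2 norms and pick
>>>>> gamma yourself, on the order of 4 * log(n) / threshold):
>>>>>
>>>>>   // rows: RDD[(Array[Int], Array[Double])] -- sparse rows
>>>>>   // colNorms: Broadcast[Array[Double]]     -- column L2 norms
>>>>>   val sqrtGamma = math.sqrt(gamma)
>>>>>   val sims = rows.mapPartitions { it =>
>>>>>     val rng = new scala.util.Random()
>>>>>     it.flatMap { case (idx, vals) =>
>>>>>       for {
>>>>>         a <- idx.indices.iterator
>>>>>         b <- (a + 1 until idx.length).iterator
>>>>>         i = idx(a); j = idx(b)
>>>>>         // sample pairs of heavy columns with lower probability
>>>>>         if rng.nextDouble() <
>>>>>           math.min(1.0, sqrtGamma / colNorms.value(i)) *
>>>>>           math.min(1.0, sqrtGamma / colNorms.value(j))
>>>>>       } yield ((i, j), vals(a) * vals(b) /
>>>>>         (math.min(sqrtGamma, colNorms.value(i)) *
>>>>>          math.min(sqrtGamma, colNorms.value(j))))
>>>>>     }
>>>>>   }.reduceByKey(_ + _) // approximate cosine per column pair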
>>>>>
>>>>> Best,
>>>>> Reza
>>>>>
>>>>>
>>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>> Thanks for your suggestion. I have tried that method. I used 8
>>>>>> nodes, each with 8 GB of memory. The program just stalled at one
>>>>>> stage for several hours without any further information. Maybe I
>>>>>> need to find a more efficient approach.
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The naive way would be to put all the users and their attributes
>>>>>>> into an RDD, then take the cartesian product of that RDD with
>>>>>>> itself. Compute the similarity score for every pair (1M * 1M => 1T
>>>>>>> scores), map to (user, (score, otherUser)), and take the .top(k)
>>>>>>> for each user.
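>>>>>>>
>>>>>>> Sketch (similarity() and k are placeholders you'd supply):
>>>>>>>
>>>>>>>   // users: RDD[(Long, Array[Double])] -- (userId, attributes)
>>>>>>>   val scored = users.cartesian(users)
>>>>>>>     .filter { case ((u1, _), (u2, _)) => u1 != u2 }
>>>>>>>     .map { case ((u1, a1), (u2, a2)) =>
>>>>>>>       (u1, (similarity(a1, a2), u2))
>>>>>>>     }
>>>>>>>   // Simple version; a bounded priority queue via aggregateByKey
>>>>>>>   // would avoid materializing every score per user.
>>>>>>>   val topK = scored.groupByKey()
>>>>>>>     .mapValues(_.toSeq.sortBy(-_._1).take(k))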
>>>>>>>
>>>>>>> I doubt you'll be able to take this approach with the full 1T
>>>>>>> pairs, though, so it might be worth looking at the recommender
>>>>>>> systems literature to see what else is out there.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I am implementing an algorithm using Spark. I have one million
>>>>>>>> users, and I need to compute the similarity between each pair of
>>>>>>>> users using some of their attributes. For each user, I need to
>>>>>>>> get the top k most similar users. What is the best way to
>>>>>>>> implement this?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
