Also, for the tall and wide case (~60M rows, ~10M columns), I am considering
running a matrix factorization to reduce the dimensions to, say, ~60M x 50,
and then running all-pairs similarity on the reduced matrix...
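
To make it concrete, here is a minimal sketch of the idea, assuming one way
of doing the factorization (MLlib's RowMatrix SVD API; the val names are
illustrative, and Matrices.diag may not exist in older MLlib versions):

import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat: RowMatrix = ???  // the ~60M x 10M sparse matrix
// Reduce to rank 50: U is ~60M x 50, s holds the 50 singular values.
val svd = mat.computeSVD(50, computeU = true)
// Scale U by the singular values to get the reduced row representation;
// all-pairs similarity would then run over these dense 50-dim rows.
val reduced: RowMatrix = svd.U.multiply(Matrices.diag(svd.s))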

Have you also tried similar ideas and seen positive results?



On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das <debasish.da...@gmail.com>
wrote:

> Ok... just to make sure: I have a RowMatrix[SparseVector] with ~60M rows
> and ~10M columns, say with a billion data points...
>
> I have another version that's around 60M rows and ~10K columns...
>
> I guess for the second one both all-pairs and DIMSUM will run fine...
>
> But for the tall and wide case, what do you suggest? Can DIMSUM handle it?
>
> I might need Jaccard as well... can I plug that into the PR?
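>
> A rough sketch of the plain-RDD Jaccard I have in mind, independent of
> the PR (all names illustrative, treating each column as a set of rows):
>
> import org.apache.spark.SparkContext._
> import org.apache.spark.rdd.RDD
>
> val entries: RDD[(Long, Int)] = ???  // (row, column) of the nonzeros
>
> // Column sizes, i.e. |A| for each column.
> val sizes = entries.map { case (_, c) => (c, 1L) }.reduceByKey(_ + _)
>
> // Intersection counts via a self-join on the row key, then
> // jaccard = |A n B| / (|A| + |B| - |A n B|).
> val jaccard = entries.join(entries)
>   .flatMap { case (_, (c1, c2)) =>
>     if (c1 < c2) Some(((c1, c2), 1L)) else None
>   }
>   .reduceByKey(_ + _)
>   .map { case ((c1, c2), n) => (c1, (c2, n)) }.join(sizes)
>   .map { case (c1, ((c2, n), s1)) => (c2, (c1, n, s1)) }.join(sizes)
>   .map { case (c2, ((c1, n, s1), s2)) =>
>     ((c1, c2), n.toDouble / (s1 + s2 - n)) }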
>
>
>
> On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com> wrote:
>
>> You might want to wait until Wednesday, since the interface in that PR
>> will be changing before then (probably over the weekend), so that you
>> don't have to redo your code. Your call if you need it sooner.
>> Reza
>>
>>
>> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <debasish.da...@gmail.com>
>> wrote:
>>
>>> Ohh cool... all-pairs brute force is also part of this PR? Let me pull
>>> it in and test it on our dataset...
>>>
>>> Thanks.
>>> Deb
>>>
>>>
>>> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>
>>>> Hi Deb,
>>>>
>>>> We are adding all-pairs and thresholded all-pairs via DIMSUM in this
>>>> PR: https://github.com/apache/spark/pull/1778
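>>>>
>>>> Roughly, the PR adds something like this on RowMatrix (exact names may
>>>> still change):
>>>>
>>>> import org.apache.spark.mllib.linalg.distributed.RowMatrix
>>>>
>>>> val mat: RowMatrix = ???  // rows are your data points
>>>> // Exact brute-force all-pairs column similarities:
>>>> val exact = mat.columnSimilarities()
>>>> // Approximate, thresholded all-pairs via the DIMSUM sampling:
>>>> val approx = mat.columnSimilarities(threshold = 0.1)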
>>>>
>>>> Your question wasn't entirely clear - does this answer it?
>>>>
>>>> Best,
>>>> Reza
>>>>
>>>>
>>>> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <debasish.da...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Reza,
>>>>>
>>>>> Have you compared against a brute-force algorithm for similarity
>>>>> computation in Spark, something like the following?
>>>>>
>>>>> https://github.com/echen/scaldingale
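>>>>>
>>>>> That is, something like this in plain Spark RDDs (a rough sketch;
>>>>> all names are illustrative):
>>>>>
>>>>> import org.apache.spark.SparkContext._
>>>>> import org.apache.spark.rdd.RDD
>>>>>
>>>>> // (user, (product, rating)) triples of the sparse data
>>>>> val ratings: RDD[(Int, (Int, Double))] = ???
>>>>>
>>>>> // Co-rating dot products per (product, product) pair via a self-join.
>>>>> val dots = ratings.join(ratings)
>>>>>   .flatMap { case (_, ((p1, r1), (p2, r2))) =>
>>>>>     if (p1 < p2) Some(((p1, p2), r1 * r2)) else None
>>>>>   }
>>>>>   .reduceByKey(_ + _)
>>>>>
>>>>> // Full per-product norms, computed separately.
>>>>> val norms = ratings
>>>>>   .map { case (_, (p, r)) => (p, r * r) }
>>>>>   .reduceByKey(_ + _)
>>>>>   .mapValues(math.sqrt)
>>>>>
>>>>> // cosine(p1, p2) = dot / (norm(p1) * norm(p2))
>>>>> val cosine = dots
>>>>>   .map { case ((p1, p2), d) => (p1, (p2, d)) }.join(norms)
>>>>>   .map { case (p1, ((p2, d), n1)) => (p2, (p1, d, n1)) }.join(norms)
>>>>>   .map { case (p2, ((p1, d, n1), n2)) =>
>>>>>     ((p1, p2), d / (n1 * n2)) }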
>>>>>
>>>>> I am adding a cosine similarity computation, but I do want to compute
>>>>> all-pairs similarities...
>>>>>
>>>>> Note that my data is sparse (it's the data that goes into matrix
>>>>> factorization), so I don't think the join and group-by on
>>>>> (product, product) will be a big issue for me...
>>>>>
>>>>> Does it make sense to add all-pairs similarities alongside the
>>>>> DIMSUM-based similarity?
>>>>>
>>>>> Thanks.
>>>>> Deb
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Xiaoli,
>>>>>>
>>>>>> There is a PR currently in progress to allow this, via the sampling
>>>>>> scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf
>>>>>>
>>>>>> The PR is at https://github.com/apache/spark/pull/336 though it will
>>>>>> need refactoring given the recent changes to the matrix interface in
>>>>>> MLlib. You may implement the sampling scheme in your own app, since
>>>>>> it's not much code.
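>>>>>>
>>>>>> To give a flavor (a toy sketch, not the paper's exact constants):
>>>>>> each co-occurring pair in a row is kept with a probability that
>>>>>> shrinks for heavy columns, and its contribution is scaled by the
>>>>>> inverse probability so the dot-product estimate stays unbiased.
>>>>>>
>>>>>> import scala.util.Random
>>>>>>
>>>>>> def sampledContributions(
>>>>>>     row: Seq[(Int, Double)],  // (column, value) nonzeros of one row
>>>>>>     colNorms: Array[Double],  // precomputed column norms
>>>>>>     gamma: Double,            // oversampling parameter
>>>>>>     rng: Random): Seq[((Int, Int), Double)] =
>>>>>>   for {
>>>>>>     (j, aj) <- row
>>>>>>     (k, ak) <- row if j < k
>>>>>>     p = math.min(1.0, gamma / (colNorms(j) * colNorms(k)))
>>>>>>     if rng.nextDouble() < p
>>>>>>   } yield ((j, k), aj * ak / p)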
>>>>>>
>>>>>> Best,
>>>>>> Reza
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> Thanks for your suggestion. I have tried that method, using 8 nodes
>>>>>>> with 8 GB of memory each. The program just stalled at one stage for
>>>>>>> several hours without any further information. Maybe I need to find
>>>>>>> a more efficient way.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> The naive way would be to put all the users and their attributes
>>>>>>>> into an RDD, then cartesian product that with itself. Compute the
>>>>>>>> similarity score on every pair (1M * 1M => 1T scores), map to
>>>>>>>> (user, (score, otherUser)), and take the .top(k) for each user; a
>>>>>>>> rough sketch is below.
>>>>>>>>
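>>>>>>>> A rough sketch of that pipeline (the case class and the cosine
>>>>>>>> scoring function are just illustrative stand-ins for your own
>>>>>>>> attributes and similarity measure):
>>>>>>>>
>>>>>>>> import org.apache.spark.SparkContext._
>>>>>>>> import org.apache.spark.rdd.RDD
>>>>>>>>
>>>>>>>> case class User(id: Long, features: Array[Double])
>>>>>>>>
>>>>>>>> // Example score: cosine similarity over the attribute vectors.
>>>>>>>> def similarity(a: User, b: User): Double = {
>>>>>>>>   val dot = a.features.zip(b.features)
>>>>>>>>     .map { case (x, y) => x * y }.sum
>>>>>>>>   val na = math.sqrt(a.features.map(x => x * x).sum)
>>>>>>>>   val nb = math.sqrt(b.features.map(x => x * x).sum)
>>>>>>>>   dot / (na * nb)
>>>>>>>> }
>>>>>>>>
>>>>>>>> def topK(users: RDD[User], k: Int): RDD[(Long, Seq[(Double, Long)])] =
>>>>>>>>   users.cartesian(users)
>>>>>>>>     .filter { case (a, b) => a.id != b.id }
>>>>>>>>     .map { case (a, b) => (a.id, (similarity(a, b), b.id)) }
>>>>>>>>     .groupByKey()  // fine for a sketch; use a bounded heap at scale
>>>>>>>>     .mapValues(_.toSeq.sortBy(-_._1).take(k))
>>>>>>>>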
>>>>>>>> I doubt that you'll be able to take this approach with the 1T
>>>>>>>> pairs, though, so it might be worth looking at the
>>>>>>>> recommender-systems literature to see what else is out there.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li
>>>>>>>> <lixiaolima...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I am implementing an algorithm using Spark. I have one million
>>>>>>>>> users, and I need to compute the similarity between each pair of
>>>>>>>>> users using some of the users' attributes. For each user, I need to
>>>>>>>>> get the top k most similar users. What is the best way to implement
>>>>>>>>> this?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
