I looked at the code: similarColumns(Double.PositiveInfinity) generates the
brute force...

Basically, will dimsum with gamma set to PositiveInfinity produce exactly
the same result as taking the cartesian product of RDD[(product, vector)]
and computing the similarities, or will there still be some approximation?
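
For concreteness, this is the comparison I have in mind (just my own sketch:
the similarColumns name and signature are taken from the PR as I read it and
may still change, and the cartesian version assumes the columns are available
as RDD[(product, vector)]):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// plain cosine between two column vectors (dense arrays just for brevity)
def cosine(a: Vector, b: Vector): Double = {
  val (xs, ys) = (a.toArray, b.toArray)
  val dot = xs.zip(ys).map { case (x, y) => x * y }.sum
  val norms = math.sqrt(xs.map(x => x * x).sum) * math.sqrt(ys.map(y => y * y).sum)
  if (norms == 0) 0.0 else dot / norms
}

// brute force: cartesian product of the columns, one similarity per pair
def bruteForce(cols: RDD[(Long, Vector)]): RDD[((Long, Long), Double)] =
  cols.cartesian(cols)
    .filter { case ((i, _), (j, _)) => i < j }
    .map { case ((i, vi), (j, vj)) => ((i, j), cosine(vi, vj)) }

// versus dimsum with gamma = infinity, per the PR interface I looked at:
// val sims = rowMatrix.similarColumns(Double.PositiveInfinity)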

Sorry I have not read your paper yet. Will read it over the weekend.



On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <r...@databricks.com> wrote:

> For 60M x 10K, both brute force and dimsum thresholding should be fine.
>
> For 60M x 10M, brute force probably won't work (depending on the cluster's
> power), but dimsum thresholding should work with an appropriate threshold.
>
> Dimensionality reduction should help; how effective it is will depend on
> your application and domain. It's worth trying if the direct computation
> doesn't work.
>
> You can also try running KMeans clustering (perhaps after dimensionality
> reduction) if your goal is to find batches of similar points instead of all
> pairs above a threshold.
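>
> A minimal sketch of that route with MLlib's KMeans (the parameter values
> here are placeholders, not recommendations):
>
> import org.apache.spark.mllib.clustering.KMeans
> import org.apache.spark.mllib.linalg.Vector
> import org.apache.spark.rdd.RDD
>
> // features: the rows after dimensionality reduction
> def clusterCandidates(features: RDD[Vector]): RDD[(Int, Vector)] = {
>   val model = KMeans.train(features, 1000, 20)  // k and #iterations are placeholders
>   // points sharing a cluster id are candidate batches of similar points;
>   // exact similarities can then be computed within each cluster only
>   features.map(v => (model.predict(v), v))
> }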
>
>
>
>
> On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
>> Also, for the tall-and-wide case (rows ~60M, columns ~10M), I am considering
>> running a matrix factorization to reduce the dimension to, say, ~60M x 50 and
>> then running the all-pairs similarity...
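>>
>> Roughly what I am considering, as a sketch only (the ratings RDD, rank,
>> iteration count, and lambda are placeholders):
>>
>> import org.apache.spark.mllib.recommendation.{ALS, Rating}
>> import org.apache.spark.rdd.RDD
>>
>> // ratings: the nonzero entries of the sparse ~60M x 10M matrix,
>> // as Rating(user, product, value)
>> def reduceWithALS(ratings: RDD[Rating]): RDD[(Int, Array[Double])] = {
>>   val model = ALS.train(ratings, 50, 10, 0.01)
>>   // userFeatures is the ~60M x 50 dense factor matrix; the all-pairs /
>>   // dimsum similarity can then run on these 50-dim rows instead
>>   model.userFeatures
>> }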
>>
>> Did you also try similar ideas and see positive results?
>>
>>
>>
>> On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das <debasish.da...@gmail.com>
>> wrote:
>>
>>> Ok...just to make sure: I have a RowMatrix of SparseVectors with ~60M rows
>>> and ~10M columns, say a billion data points in total...
>>>
>>> I have another version that's around 60M x ~10K...
>>>
>>> I guess for the second one both all-pairs and dimsum will run fine...
>>>
>>> But for the tall and wide one, what do you suggest? Can dimsum handle it?
>>>
>>> I might need Jaccard as well...can I plug that into the PR?
>>>
>>>
>>>
>>> On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>
>>>> You might want to wait until Wednesday, since the interface in that PR
>>>> will be changing before then, probably over the weekend, so that you
>>>> don't have to redo your code. Your call if you need it sooner than that.
>>>> Reza
>>>>
>>>>
>>>> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <debasish.da...@gmail.com>
>>>> wrote:
>>>>
>>>>> Ohh cool....all-pairs brute force is also part of this PR ? Let me
>>>>> pull it in and test on our dataset...
>>>>>
>>>>> Thanks.
>>>>> Deb
>>>>>
>>>>>
>>>>> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Deb,
>>>>>>
>>>>>> We are adding all-pairs and thresholded all-pairs via dimsum in this
>>>>>> PR: https://github.com/apache/spark/pull/1778
>>>>>>
>>>>>> Your question wasn't entirely clear - does this answer it?
>>>>>>
>>>>>> Best,
>>>>>> Reza
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <
>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Reza,
>>>>>>>
>>>>>>> Have you compared against a brute-force algorithm for similarity
>>>>>>> computation in Spark, something like the following?
>>>>>>>
>>>>>>> https://github.com/echen/scaldingale
>>>>>>>
>>>>>>> I am adding a cosine similarity computation, but I do want to compute
>>>>>>> all-pairs similarities...
>>>>>>>
>>>>>>> Note that my data is sparse (it's the data that goes into matrix
>>>>>>> factorization), so I don't think the join and group-by on
>>>>>>> (product, product) will be a big issue for me...
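>>>>>>>
>>>>>>> Concretely, by the join and group-by version I mean roughly this
>>>>>>> sketch (my own sketch, not the scaldingale code; the
>>>>>>> (user, product, rating) triples RDD is hypothetical):
>>>>>>>
>>>>>>> import org.apache.spark.SparkContext._
>>>>>>> import org.apache.spark.rdd.RDD
>>>>>>>
>>>>>>> // ratings: RDD[(Int, Int, Double)] of (user, product, rating) entries
>>>>>>> def cosinePairs(ratings: RDD[(Int, Int, Double)]): RDD[((Int, Int), Double)] = {
>>>>>>>   // column norm per product
>>>>>>>   val norms = ratings.map { case (_, p, r) => (p, r * r) }
>>>>>>>     .reduceByKey(_ + _).mapValues(math.sqrt)
>>>>>>>   // self-join on user to get co-rated product pairs, then sum the products
>>>>>>>   val byUser = ratings.map { case (u, p, r) => (u, (p, r)) }
>>>>>>>   val dots = byUser.join(byUser)
>>>>>>>     .flatMap { case (_, ((p1, r1), (p2, r2))) =>
>>>>>>>       if (p1 < p2) Some(((p1, p2), r1 * r2)) else None }
>>>>>>>     .reduceByKey(_ + _)
>>>>>>>   // divide each dot product by the two column norms to get the cosine
>>>>>>>   dots.map { case ((p1, p2), d) => (p1, (p2, d)) }.join(norms)
>>>>>>>     .map { case (p1, ((p2, d), n1)) => (p2, (p1, d, n1)) }.join(norms)
>>>>>>>     .map { case (p2, ((p1, d, n1), n2)) => ((p1, p2), d / (n1 * n2)) }
>>>>>>> }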
>>>>>>>
>>>>>>> Does it make sense to add all-pairs similarities alongside the
>>>>>>> dimsum-based similarity?
>>>>>>>
>>>>>>> Thanks.
>>>>>>> Deb
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Xiaoli,
>>>>>>>>
>>>>>>>> There is a PR currently in progress to allow this, via the sampling
>>>>>>>> scheme described in this paper:
>>>>>>>> stanford.edu/~rezab/papers/dimsum.pdf
>>>>>>>>
>>>>>>>> The PR is at https://github.com/apache/spark/pull/336, though it
>>>>>>>> will need refactoring given the recent changes to the matrix
>>>>>>>> interface in MLlib. You may implement the sampling scheme in your
>>>>>>>> own app, since it's not much code.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Reza
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> Hi Andrew,
>>>>>>>>>
>>>>>>>>> Thanks for your suggestion. I have tried that method, using 8 nodes
>>>>>>>>> with 8G of memory each. The program just stalled at one stage for
>>>>>>>>> several hours without any further progress. Maybe I need to find a
>>>>>>>>> more efficient way.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> The naive way would be to put all the users and their attributes
>>>>>>>>>> into an RDD, then cartesian product that with itself.  Run the 
>>>>>>>>>> similarity
>>>>>>>>>> score on every pair (1M * 1M => 1T scores), map to (user, (score,
>>>>>>>>>> otherUser)) and take the .top(k) for each user.
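>>>>>>>>>>
>>>>>>>>>> A bare-bones sketch of that naive approach (the similarity function
>>>>>>>>>> and k are placeholders, just to illustrate the shape of it):
>>>>>>>>>>
>>>>>>>>>> import org.apache.spark.SparkContext._
>>>>>>>>>> import org.apache.spark.rdd.RDD
>>>>>>>>>>
>>>>>>>>>> // users: (userId, attributes); sim: any pairwise similarity score
>>>>>>>>>> def topKSimilar(users: RDD[(Long, Array[Double])], k: Int,
>>>>>>>>>>     sim: (Array[Double], Array[Double]) => Double) =
>>>>>>>>>>   users.cartesian(users)
>>>>>>>>>>     .filter { case ((u, _), (v, _)) => u != v }   // skip self-pairs
>>>>>>>>>>     .map { case ((u, a), (v, b)) => (u, (sim(a, b), v)) }
>>>>>>>>>>     // naive: groups all scores per user in memory before taking top k
>>>>>>>>>>     .groupByKey()
>>>>>>>>>>     .mapValues(scores => scores.toSeq.sortBy(-_._1).take(k))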
>>>>>>>>>>
>>>>>>>>>> I doubt that you'll be able to take this approach with the 1T
>>>>>>>>>> pairs though, so it might be worth looking at the literature for
>>>>>>>>>> recommender systems to see what else is out there.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <
>>>>>>>>>> lixiaolima...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I am implementing an algorithm using Spark. I have one million
>>>>>>>>>>> users, and I need to compute the similarity between each pair of
>>>>>>>>>>> users using some of their attributes. For each user, I need to get
>>>>>>>>>>> the top k most similar users. What is the best way to implement this?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
