Awesome... Let me try it out... Any plans to add other similarity measures in the future (Jaccard is one that would be useful)? I guess it makes sense to add some similarity measures to MLlib...
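For concreteness, a minimal sketch of the kind of Jaccard measure meant here, treating the non-zero indices of two binary sparse columns as sets (jaccard is a hypothetical helper written for illustration, not an existing MLlib API):

import org.apache.spark.mllib.linalg.SparseVector

// Hypothetical helper, not part of MLlib: Jaccard similarity of two
// binary sparse columns, i.e. |A intersect B| / |A union B| over the
// sets of non-zero indices.
def jaccard(a: SparseVector, b: SparseVector): Double = {
  val setA = a.indices.toSet
  val setB = b.indices.toSet
  val union = (setA union setB).size
  if (union == 0) 0.0 else (setA intersect setB).size.toDouble / union
}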
On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh <r...@databricks.com> wrote:

> Yes, you're right: calling dimsum with gamma as PositiveInfinity turns it
> into the usual brute-force algorithm for cosine similarity; there is no
> sampling. This is by design.
>
> On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
>> I looked at the code: similarColumns(Double.posInf) is generating the
>> brute force...
>>
>> Basically, will dimsum with gamma as PositiveInfinity produce exactly
>> the same result as taking the Cartesian product of RDD[(product, vector)]
>> and computing similarities, or will there be some approximation?
>>
>> Sorry, I have not read your paper yet. I will read it over the weekend.
>>
>> On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <r...@databricks.com> wrote:
>>
>>> For 60M x 10K, brute force and dimsum thresholding should be fine.
>>>
>>> For 60M x 10M, brute force probably won't work, depending on the
>>> cluster's power; dimsum thresholding should work with an appropriate
>>> threshold.
>>>
>>> Dimensionality reduction should help. How effective it is will depend
>>> on your application and domain; it's worth trying if the direct
>>> computation doesn't work.
>>>
>>> You can also try running KMeans clustering (perhaps after
>>> dimensionality reduction) if your goal is to find batches of similar
>>> points instead of all pairs above a threshold.
>>>
>>> On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das <debasish.da...@gmail.com>
>>> wrote:
>>>
>>>> Also, for tall and wide (rows ~60M, columns ~10M), I am considering
>>>> running a matrix factorization to reduce the dimension to, say,
>>>> ~60M x 50, and then running all-pairs similarity...
>>>>
>>>> Did you also try similar ideas and see positive results?
>>>>
>>>> On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das
>>>> <debasish.da...@gmail.com> wrote:
>>>>
>>>>> OK... just to make sure: I have a RowMatrix[SparseVector] where rows
>>>>> are ~60M and columns are ~10M, with a billion data points...
>>>>>
>>>>> I have another version that's around 60M x ~10K...
>>>>>
>>>>> I guess for the second one, both all-pairs and dimsum will run
>>>>> fine...
>>>>>
>>>>> But for tall and wide, what do you suggest? Can dimsum handle it?
>>>>>
>>>>> I might need Jaccard as well... can I plug that into the PR?
>>>>>
>>>>> On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> You might want to wait until Wednesday, since the interface in that
>>>>>> PR will be changing before then (probably over the weekend), so that
>>>>>> you don't have to redo your code. Your call if you need it sooner.
>>>>>> Reza
>>>>>>
>>>>>> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das
>>>>>> <debasish.da...@gmail.com> wrote:
>>>>>>
>>>>>>> Oh, cool... all-pairs brute force is also part of this PR? Let me
>>>>>>> pull it in and test it on our dataset...
>>>>>>>
>>>>>>> Thanks.
>>>>>>> Deb
>>>>>>>
>>>>>>> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Deb,
>>>>>>>>
>>>>>>>> We are adding all-pairs and thresholded all-pairs via dimsum in
>>>>>>>> this PR: https://github.com/apache/spark/pull/1778
>>>>>>>>
>>>>>>>> Your question wasn't entirely clear - does this answer it?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Reza
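For reference, a minimal sketch of the exact versus thresholded calls, written against the columnSimilarities interface this PR eventually settled on (at the time of this thread the method was still named similarColumns and the interface was in flux; sc is an assumed SparkContext):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Rows are data points; similarities are computed between columns.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(0.0, 3.0, 4.0),
  Vectors.dense(5.0, 6.0, 0.0)
))
val mat = new RowMatrix(rows)

// Exact brute-force cosine similarity between all column pairs: no sampling.
val exact = mat.columnSimilarities()

// DIMSUM sampling: column pairs whose cosine similarity is below the
// threshold may be missed, in exchange for far less shuffle.
val approx = mat.columnSimilarities(0.1)

Passing no threshold corresponds to the gamma = PositiveInfinity case discussed above: no sampling, and the result matches the brute-force computation exactly.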
>>>>>>>>
>>>>>>>> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das
>>>>>>>> <debasish.da...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Reza,
>>>>>>>>>
>>>>>>>>> Have you compared it with a brute-force algorithm for similarity
>>>>>>>>> computation, something like the following in Spark?
>>>>>>>>>
>>>>>>>>> https://github.com/echen/scaldingale
>>>>>>>>>
>>>>>>>>> I am adding cosine similarity computation, but I do want to
>>>>>>>>> compute all-pairs similarities...
>>>>>>>>>
>>>>>>>>> Note that the data is sparse for me (the data that goes into
>>>>>>>>> matrix factorization), so I don't think joining and grouping by
>>>>>>>>> (product, product) will be a big issue for me...
>>>>>>>>>
>>>>>>>>> Does it make sense to add all-pairs similarities alongside the
>>>>>>>>> dimsum-based similarity?
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>> Deb
>>>>>>>>>
>>>>>>>>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Xiaoli,
>>>>>>>>>>
>>>>>>>>>> There is a PR currently in progress to allow this, via the
>>>>>>>>>> sampling scheme described in this paper:
>>>>>>>>>> stanford.edu/~rezab/papers/dimsum.pdf
>>>>>>>>>>
>>>>>>>>>> The PR is at https://github.com/apache/spark/pull/336, though it
>>>>>>>>>> will need refactoring given the recent changes to the matrix
>>>>>>>>>> interface in MLlib. You may want to implement the sampling scheme
>>>>>>>>>> in your own app, since it isn't much code.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Reza
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li
>>>>>>>>>> <lixiaolima...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your suggestion. I have tried the method. I used 8
>>>>>>>>>>> nodes, and every node has 8G of memory. The program just stopped
>>>>>>>>>>> at one stage for several hours without any further information.
>>>>>>>>>>> Maybe I need to find a more efficient way.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash
>>>>>>>>>>> <and...@andrewash.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The naive way would be to put all the users and their
>>>>>>>>>>>> attributes into an RDD, then take the Cartesian product of it
>>>>>>>>>>>> with itself. Run the similarity score on every pair (1M * 1M =>
>>>>>>>>>>>> 1T scores), map to (user, (score, otherUser)), and take the
>>>>>>>>>>>> .top(k) for each user.
>>>>>>>>>>>>
>>>>>>>>>>>> I doubt you'll be able to take this approach with the 1T pairs,
>>>>>>>>>>>> though, so it might be worth looking at the recommender-systems
>>>>>>>>>>>> literature to see what else is out there.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li
>>>>>>>>>>>> <lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am implementing an algorithm using Spark. I have one million
>>>>>>>>>>>>> users, and I need to compute the similarity between each pair
>>>>>>>>>>>>> of users using some of the users' attributes. For each user, I
>>>>>>>>>>>>> need to get the top k most similar users. What is the best way
>>>>>>>>>>>>> to implement this?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks.
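For concreteness, a minimal sketch of the naive Cartesian approach Andrew describes, with a toy dataset and cosine as the similarity score (all names here are hypothetical). It materializes all n^2 pairs, so it is only workable at small scale - which is exactly the limitation the thread runs into:

import org.apache.spark.SparkContext._  // pair-RDD functions in older Spark

// Toy input: (userId, attribute vector); sc is an assumed SparkContext.
val users = sc.parallelize(Seq(
  (1L, Array(1.0, 0.0, 2.0)),
  (2L, Array(0.0, 1.0, 1.0)),
  (3L, Array(1.0, 1.0, 0.0))
))

// Cosine similarity of two dense attribute arrays.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  if (norms == 0.0) 0.0 else dot / norms
}

val k = 2
val topK = users.cartesian(users)
  .filter { case ((u, _), (v, _)) => u != v }               // drop self-pairs
  .map { case ((u, ua), (v, va)) => (u, (cosine(ua, va), v)) }
  .groupByKey()                                             // all scores per user
  .mapValues(_.toSeq.sortBy(-_._1).take(k))                 // keep top k per user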