Oh cool... is the all-pairs brute force also part of this PR? Let me pull it in and test it on our dataset.
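
For reference, here's roughly how I'd expect to drive it once the PR lands (a minimal sketch, not the merged API: the columnSimilarities names are taken from the PR discussion, an existing SparkContext sc is assumed, and the input path and dense row format are made up for illustration):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Rows are observations; columns are the entities being compared.
val rows = sc.textFile("features.txt").map { line =>
  Vectors.dense(line.split(' ').map(_.toDouble))
}
val mat = new RowMatrix(rows)

// Exact all-pairs cosine similarities between columns (brute force).
val exact = mat.columnSimilarities()

// DIMSUM sampling: pairs with cosine similarity below the threshold
// may be dropped, in exchange for much less shuffle.
val approx = mat.columnSimilarities(threshold = 0.1)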
Thanks.
Deb

On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com> wrote:

> Hi Deb,
>
> We are adding all-pairs and thresholded all-pairs via DIMSUM in this PR:
> https://github.com/apache/spark/pull/1778
>
> Your question wasn't entirely clear - does this answer it?
>
> Best,
> Reza
>
> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>
>> Hi Reza,
>>
>> Have you compared it against a brute-force similarity computation in
>> Spark, something like the following?
>>
>> https://github.com/echen/scaldingale
>>
>> I am adding cosine similarity computation, but I do want to compute
>> all-pairs similarities...
>>
>> Note that the data is sparse for me (the data that goes to matrix
>> factorization), so I don't think joining and grouping by (product, product)
>> will be a big issue for me...
>>
>> Does it make sense to add all-pairs similarity alongside the DIMSUM-based
>> similarity?
>>
>> Thanks.
>> Deb
>>
>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com> wrote:
>>
>>> Hi Xiaoli,
>>>
>>> There is a PR currently in progress to allow this, via the sampling
>>> scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf
>>>
>>> The PR is at https://github.com/apache/spark/pull/336, though it will
>>> need refactoring given the recent changes to the matrix interface in
>>> MLlib. You may implement the sampling scheme for your own app, since
>>> it's not much code.
>>>
>>> Best,
>>> Reza
>>>
>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>> Thanks for your suggestion. I have tried the method. I used 8 nodes,
>>>> and every node has 8 GB of memory. The program just stopped at a stage
>>>> for several hours without any further information. Maybe I need to
>>>> find a more efficient way.
>>>>
>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com> wrote:
>>>>
>>>>> The naive way would be to put all the users and their attributes into
>>>>> an RDD, then take the cartesian product of that RDD with itself. Run
>>>>> the similarity score on every pair (1M * 1M => 1T scores), map to
>>>>> (user, (score, otherUser)), and take the .top(k) for each user.
>>>>>
>>>>> I doubt you'll be able to take this approach with the 1T pairs,
>>>>> though, so it might be worth looking at the recommender-systems
>>>>> literature to see what else is out there.
>>>>>
>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I am implementing an algorithm using Spark. I have one million
>>>>>> users. I need to compute the similarity between each pair of users
>>>>>> using some user attributes. For each user, I need to get the top k
>>>>>> most similar users. What is the best way to implement this?
>>>>>>
>>>>>> Thanks.
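
For the sparse join-and-group-by route Deb mentions above (the scaldingale pattern), here is a sketch of item-item cosine similarity over (user, item, rating) triples. This is an illustration only: the input path, CSV format, and output path are made up, and collecting the per-item norms assumes the item count fits in driver memory.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object SparseItemCosine {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sparse-cosine"))

    // Sparse (user, (item, rating)) pairs, keyed by user.
    val ratings = sc.textFile("ratings.csv").map { line =>
      val Array(u, i, r) = line.split(',')
      (u.toLong, (i.toLong, r.toDouble))
    }

    // Self-join on user to enumerate only co-rated item pairs;
    // sparsity keeps this far below the full item-by-item product.
    val dots = ratings.join(ratings)
      .filter { case (_, ((i, _), (j, _))) => i < j }   // dedupe (i, j)/(j, i)
      .map { case (_, ((i, ri), (j, rj))) => ((i, j), ri * rj) }
      .reduceByKey(_ + _)                               // dot products

    // Per-item squared norms for the cosine denominator.
    val norms = ratings
      .map { case (_, (i, r)) => (i, r * r) }
      .reduceByKey(_ + _)
      .collectAsMap()
    val bNorms = sc.broadcast(norms)

    val cosines = dots.map { case ((i, j), dot) =>
      ((i, j), dot / math.sqrt(bNorms.value(i) * bNorms.value(j)))
    }

    cosines.saveAsTextFile("item-sims")
    sc.stop()
  }
}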
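And to make Andrew's naive cartesian approach above concrete, a self-contained sketch with the same shape he describes: cartesian product, score every pair, keep the top k per user. The cosine helper, file paths, and k are placeholders, and the groupByKey/sort is the simplest (not the cheapest) way to express per-user top-k.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object NaiveTopK {
  // Cosine similarity between two dense attribute vectors.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot   = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) *
                math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("naive-topk"))
    val k  = 10

    // (userId, attributes) pairs; whitespace-separated, id first.
    val users = sc.textFile("users.txt").map { line =>
      val fields = line.split(' ')
      (fields.head.toLong, fields.tail.map(_.toDouble))
    }

    // Cartesian product: O(n^2) pairs, ~1T for a million users,
    // which is exactly why this does not scale.
    val topK = users.cartesian(users)
      .filter { case ((u, _), (v, _)) => u != v }
      .map { case ((u, ua), (v, va)) => (u, (cosine(ua, va), v)) }
      .groupByKey()
      .mapValues(_.toSeq.sortBy(-_._1).take(k))

    topK.saveAsTextFile("topk-out")
    sc.stop()
  }
}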