Also, for the tall-and-wide case (~60M rows, ~10M columns), I am considering running a matrix factorization to reduce the dimension to, say, ~60M x 50 and then running all-pairs similarity...

Did you also try similar ideas and see positive results?
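For concreteness, a minimal Scala sketch of that reduction with the MLlib of that era follows. It assumes `mat` is the ~60M x 10M RowMatrix and takes 50 as the target rank, and it uses SVD; whether a distributed SVD is tractable at ~10M columns is a separate question, and ALS factors would feed the same downstream all-pairs step equally well:

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  val mat: RowMatrix = ???   // the ~60M x 10M matrix of sparse rows (assumed)

  // Rank-50 SVD: A ~= U * S * V^T. The rows of U scaled by the singular
  // values are the coordinates A*V, i.e. a ~60M x 50 dense representation.
  val svd = mat.computeSVD(50, computeU = true)
  val s = svd.s.toArray
  val reduced = svd.U.rows.map { row =>
    val u = row.toArray   // 50 entries per row of U
    Vectors.dense(Array.tabulate(u.length)(i => u(i) * s(i)))
  }
  // `reduced` is the ~60M x 50 matrix to run all-pairs similarity on.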
On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das <debasish.da...@gmail.com> wrote:

> Ok... just to make sure: I have a RowMatrix[SparseVector] where rows are
> ~60M and columns are ~10M, with about a billion data points...
>
> I have another version that's around 60M and ~10K...
>
> I guess for the second one both all-pairs and dimsum will run fine...
>
> But for tall and wide, what do you suggest? Can dimsum handle it?
>
> I might need jaccard as well... can I plug that into the PR?
>
> On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com> wrote:
>
>> You might want to wait until Wednesday, since the interface in that PR
>> will be changing before then (probably over the weekend), so that you
>> don't have to redo your code. Your call if you need it within the week.
>> Reza
>>
>> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <debasish.da...@gmail.com>
>> wrote:
>>
>>> Ohh cool... all-pairs brute force is also part of this PR? Let me pull
>>> it in and test on our dataset...
>>>
>>> Thanks.
>>> Deb
>>>
>>> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>
>>>> Hi Deb,
>>>>
>>>> We are adding all-pairs and thresholded all-pairs via dimsum in this
>>>> PR: https://github.com/apache/spark/pull/1778
>>>>
>>>> Your question wasn't entirely clear - does this answer it?
>>>>
>>>> Best,
>>>> Reza
>>>>
>>>> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das
>>>> <debasish.da...@gmail.com> wrote:
>>>>
>>>>> Hi Reza,
>>>>>
>>>>> Have you compared it against a brute-force similarity computation in
>>>>> Spark, something like the following?
>>>>>
>>>>> https://github.com/echen/scaldingale
>>>>>
>>>>> I am adding cosine similarity computation, but I do want to compute
>>>>> all-pairs similarities...
>>>>>
>>>>> Note that the data is sparse for me (the data that goes to matrix
>>>>> factorization), so I don't think joining and group-by on
>>>>> (product, product) will be a big issue for me...
>>>>>
>>>>> Does it make sense to add all-pairs similarity as well, alongside the
>>>>> dimsum-based similarity?
>>>>>
>>>>> Thanks.
>>>>> Deb
>>>>>
>>>>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Xiaoli,
>>>>>>
>>>>>> There is a PR currently in progress to allow this, via the sampling
>>>>>> scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf
>>>>>>
>>>>>> The PR is at https://github.com/apache/spark/pull/336, though it
>>>>>> will need refactoring given the recent changes to the matrix
>>>>>> interface in MLlib. You may also implement the sampling scheme
>>>>>> yourself for your own app, since it's not much code.
>>>>>>
>>>>>> Best,
>>>>>> Reza
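For reference, a minimal sketch of the brute-force and thresholded all-pairs calls that the two PRs above propose, assuming the interface lands roughly as columnSimilarities on RowMatrix (the thread notes it was still changing at the time, so treat the signature as illustrative):

  import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, RowMatrix}

  val mat: RowMatrix = ???   // note: dimsum compares COLUMNS, so the
                             // entities being compared go in the columns
  val exact: CoordinateMatrix = mat.columnSimilarities()    // brute-force all pairs
  // DIMSUM sampling: pairs whose cosine similarity is at least the threshold
  // are estimated reliably; higher thresholds sample less and run faster.
  val approx: CoordinateMatrix = mat.columnSimilarities(0.1)

For the 60M x ~10K matrix mentioned above that is only ~10K columns, so both the exact and the sampled versions should be comfortable; the ~10M-column case is where the threshold matters.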
>>>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li
>>>>>> <lixiaolima...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> Thanks for your suggestion. I have tried the method. I used 8
>>>>>>> nodes, and every node has 8G of memory. The program just stopped at
>>>>>>> a stage for several hours without any further information. Maybe I
>>>>>>> need to find a more efficient way.
>>>>>>>
>>>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> The naive way would be to put all the users and their attributes
>>>>>>>> into an RDD, then cartesian product that with itself. Run the
>>>>>>>> similarity score on every pair (1M * 1M => 1T scores), map to
>>>>>>>> (user, (score, otherUser)), and take the top k for each user.
>>>>>>>>
>>>>>>>> I doubt that you'll be able to take this approach with the 1T
>>>>>>>> pairs, though, so it might be worth looking at the literature on
>>>>>>>> recommender systems to see what else is out there.
>>>>>>>>
>>>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li
>>>>>>>> <lixiaolima...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I am implementing an algorithm using Spark. I have one million
>>>>>>>>> users, and I need to compute the similarity between each pair of
>>>>>>>>> users using some of their attributes. For each user, I need to
>>>>>>>>> get the top k most similar users. What is the best way to
>>>>>>>>> implement this?
>>>>>>>>>
>>>>>>>>> Thanks.
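To make the naive approach Andrew describes concrete, here is a rough Scala sketch with illustrative types (users as RDD[(Long, Array[Double])]) and cosine as the score. As he says, at 1M users the cartesian product materializes ~1T pairs, so treat this as a baseline for much smaller inputs:

  import org.apache.spark.SparkContext._   // pair-RDD functions (pre-1.3 Spark)
  import org.apache.spark.rdd.RDD

  def cosine(a: Array[Double], b: Array[Double]): Double = {
    var dot = 0.0; var na = 0.0; var nb = 0.0; var i = 0
    while (i < a.length) {
      dot += a(i) * b(i); na += a(i) * a(i); nb += b(i) * b(i); i += 1
    }
    dot / (math.sqrt(na) * math.sqrt(nb))
  }

  def topKSimilar(users: RDD[(Long, Array[Double])],
                  k: Int): RDD[(Long, Seq[(Double, Long)])] =
    users.cartesian(users)                                  // all user pairs, ~n^2 of them
      .filter { case ((u, _), (v, _)) => u != v }           // drop self-pairs
      .map { case ((u, ua), (v, va)) => (u, (cosine(ua, va), v)) }
      .groupByKey()                                         // ~n scores per user; the bottleneck
      .mapValues(_.toSeq.sortBy(-_._1).take(k))             // keep the k best per user

A bounded priority queue inside aggregateByKey would avoid materializing all ~n scores per user, but the quadratic pair generation remains, which is exactly the cost the dimsum sampling above is designed to cut.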