Nice to hear that! In order to get into the code, I suggest you first read the papers regarding ALS for Collaborative Filtering:
Large-scale Parallel Collaborative Filtering for the Netflix Prize
http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf

Collaborative Filtering for Implicit Feedback Datasets
http://research.yahoo.com/pub/2433

There's also a slide set from a university lecture at my department that has a few slides about ALS:
http://de.slideshare.net/sscdotopen/latent-factor-models-for-collaborative-filtering

After that, you should try to work through org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob.

On 21.03.2013 15:58, Han JU wrote:
> Hi Sebastian,
>
> It runs much faster! On the same recommendation task it now terminates in
> 45min rather than ~2h yesterday. I think with further tuning it can be even
> faster.
>
> I'm trying to read the code, any hints for a good starting point?
>
> Thanks a lot!
>
>
> 2013/3/20 Sebastian Schelter <[email protected]>
>
>> Hi JU,
>>
>> I reworked the RecommenderJob in a similar way as the ALS job. Can you
>> give it a try?
>>
>> You have to apply the patch from
>> https://issues.apache.org/jira/browse/MAHOUT-1169
>>
>> It introduces a new parameter to RecommenderJob called --numThreads. The
>> configuration of the job should be done similarly to the ALS job.
>>
>> /s
>>
>>
>> On 20.03.2013 12:38, Han JU wrote:
>>> Thanks again Sebastian and Sean. I set -Xmx4000m for mapred.child.java.opts
>>> and 8 threads for each mapper. Now the job runs smoothly and the whole
>>> factorization finishes in 45min. With your settings I think it could be
>>> even faster.
>>>
>>> One more thing: the RecommenderJob is kind of slow (for all users).
>>> For example, I want a list of the top 500 items to recommend. Any
>>> pointers on how to modify the job code so that it can consult a file
>>> and then compute recommendations only for the user ids in that file?
>>>
>>>
>>> 2013/3/20 Han JU <[email protected]>
>>>
>>>> Hi Sebastian,
>>>>
>>>> I've tried the svn trunk.
>>>> Hadoop constantly complains about memory with
>>>> "out of memory" errors.
>>>> On the datanode there are 4 physical cores, and with hyper-threading it has 16
>>>> logical cores, so I set --numThreadsPerSolver to 16, and that seems to have
>>>> a problem with memory.
>>>> How do you set your mapred.child.java.opts? Given that we allow only one
>>>> mapper, should it be nearly the whole size of system memory?
>>>>
>>>> Thanks!
>>>>
>>>>
>>>> 2013/3/19 Sebastian Schelter <[email protected]>
>>>>
>>>>> Hi JU,
>>>>>
>>>>> We recently rewrote the factorization code; it should be much faster
>>>>> now. You should use the current trunk, make Hadoop schedule only one
>>>>> mapper per machine (with -Dmapred.tasktracker.map.tasks.maximum=1), make
>>>>> it reuse the JVMs, and add the parameter --numThreadsPerSolver with the
>>>>> number of cores that you want to use per machine (use all if you can).
>>>>>
>>>>> I got astonishing results running the code like this on a 26-machine
>>>>> cluster on the Netflix dataset (100M datapoints) and the Yahoo Songs
>>>>> dataset (700M datapoints).
>>>>>
>>>>> Let me know if you need more information.
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>> On 19.03.2013 15:31, Han JU wrote:
>>>>>> Thanks Sebastian and Sean, I will dig more into the paper.
>>>>>> With a simple try on a small part of the data, it seems a larger alpha
>>>>>> (~40) gets me a better result.
>>>>>> Do you have an idea how long ParallelALS will take for the 700MB
>>>>>> complete dataset? It contains ~48 million triples. The Hadoop cluster I
>>>>>> have at my disposal has 5 nodes and can factorize the MovieLens 10M in
>>>>>> about 13min.
>>>>>>
>>>>>>
>>>>>> 2013/3/18 Sebastian Schelter <[email protected]>
>>>>>>
>>>>>>> You should also be aware that the alpha parameter comes from a formula
>>>>>>> the authors introduce to measure the "confidence" in the observed
>>>>>>> values:
>>>>>>>
>>>>>>> confidence = 1 + alpha * observed_value
>>>>>>>
>>>>>>> You can also change that formula in the code to something that you see
>>>>>>> as a better fit; the paper even suggests alternative variants.
>>>>>>>
>>>>>>> Best,
>>>>>>> Sebastian
>>>>>>>
>>>>>>>
>>>>>>> On 18.03.2013 18:06, Han JU wrote:
>>>>>>>> Thanks for the quick responses.
>>>>>>>>
>>>>>>>> Yes, it's that dataset. What I'm using is triplets of "user_id song_id
>>>>>>>> play_times", of ~1m users. No audio features, just plain text triples.
>>>>>>>>
>>>>>>>> It seems to me that the paper about "implicit feedback" matches this
>>>>>>>> dataset well: no explicit ratings, but the number of times a song was
>>>>>>>> listened to.
>>>>>>>>
>>>>>>>> Thank you Sean for the alpha value. I think they use big numbers
>>>>>>>> because the values in their R matrix are big.
>>>>>>>>
>>>>>>>>
>>>>>>>> 2013/3/18 Sebastian Schelter <[email protected]>
>>>>>>>>
>>>>>>>>> JU,
>>>>>>>>>
>>>>>>>>> are you referring to this dataset?
>>>>>>>>>
>>>>>>>>> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
>>>>>>>>>
>>>>>>>>> On 18.03.2013 17:47, Sean Owen wrote:
>>>>>>>>>> One word of caution: there are at least two papers on ALS and they
>>>>>>>>>> define lambda differently. I think you are talking about
>>>>>>>>>> "Collaborative Filtering for Implicit Feedback Datasets".
>>>>>>>>>>
>>>>>>>>>> I've been working with some folks who point out that alpha=40 seems
>>>>>>>>>> to be too high for most data sets. After running some tests on
>>>>>>>>>> common data sets, alpha=1 looks much better. YMMV.
>>>>>>>>>>
>>>>>>>>>> In the end you have to evaluate these two parameters, and the # of
>>>>>>>>>> features, across a range to determine what's best.
>>>>>>>>>>
>>>>>>>>>> Is this data set not a bunch of audio features? I am not sure it
>>>>>>>>>> works for ALS, not naturally at least.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I'm wondering, has someone tried the ParallelALS with implicit
>>>>>>>>>>> feedback job on the Million Song Dataset? Some pointers on alpha
>>>>>>>>>>> and lambda?
>>>>>>>>>>>
>>>>>>>>>>> In the paper alpha is 40 and lambda is 150, but I don't know what
>>>>>>>>>>> their r values in the matrix are. They said it's based on the time
>>>>>>>>>>> units that users have watched the show, so maybe they're big.
>>>>>>>>>>>
>>>>>>>>>>> Many thanks!
>>>>>>>>>>> --
>>>>>>>>>>> *JU Han*
>>>>>>>>>>>
>>>>>>>>>>> UTC - Université de Technologie de Compiègne
>>>>>>>>>>> *GI06 - Fouille de Données et Décisionnel*
>>>>>>>>>>>
>>>>>>>>>>> +33 0619608888
>>>>
>>>> --
>>>> *JU Han*
>>>>
>>>> Software Engineer Intern @ KXEN Inc.
>>>> UTC - Université de Technologie de Compiègne
>>>> *GI06 - Fouille de Données et Décisionnel*
>>>>
>>>> +33 0619608888
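PS: if it helps while reading the code, the confidence weighting from the implicit-feedback paper quoted in this thread boils down to a small regularized least-squares solve per user. Here is a standalone numpy sketch of that step (my own variable names and toy values, not Mahout's actual classes):

```python
import numpy as np

def confidence(r, alpha=40.0):
    # c_ui = 1 + alpha * r_ui, the confidence formula quoted above
    return 1.0 + alpha * r

def solve_user(Y, r_u, alpha=40.0, lam=0.1):
    """One ALS half-step: solve for a single user's factor vector.

    Y: item-factor matrix (num_items x k); r_u: the user's observed
    values (e.g. play counts), with 0 for unobserved items."""
    k = Y.shape[1]
    c_u = confidence(r_u, alpha)       # per-item confidence
    p_u = (r_u > 0).astype(float)      # binary preference
    # Solve (Y^T C_u Y + lam*I) x_u = Y^T C_u p_u, where C_u = diag(c_u)
    A = Y.T @ (Y * c_u[:, None]) + lam * np.eye(k)
    b = Y.T @ (c_u * p_u)
    return np.linalg.solve(A, b)

# Toy example: 3 items, 2 latent features
Y = np.array([[0.1, 0.3], [0.4, 0.2], [0.2, 0.5]])
r_u = np.array([3.0, 0.0, 1.0])        # play counts for one user
x_u = solve_user(Y, r_u)
```

Note how alpha only rescales the observed entries' confidence; with alpha=1 (the value Sean found to work better on common datasets) the observed and unobserved entries are weighted much more evenly than with alpha=40.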

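For completeness, the settings discussed in this thread combine into a single invocation roughly like the following. The paths and parameter values are placeholders, and the option names reflect the trunk at the time of this thread, so double-check them against your Mahout version:

```shell
# Implicit-feedback ALS factorization: one mapper per machine, JVM reuse,
# a large child heap, and one solver thread per logical core.
mahout parallelALS \
  -Dmapred.tasktracker.map.tasks.maximum=1 \
  -Dmapred.job.reuse.jvm.num.tasks=-1 \
  -Dmapred.child.java.opts=-Xmx4000m \
  --input /path/to/triples \
  --output /path/to/factorization \
  --numFeatures 20 \
  --numIterations 10 \
  --lambda 0.065 \
  --implicitFeedback true \
  --alpha 40 \
  --numThreadsPerSolver 16
```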