Hi Sebastian,

It runs much faster! On the same recommendation task it now finishes in
45 minutes rather than the ~2 hours it took yesterday. I think that with
further tuning it can be even faster.
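For reference, here is roughly how I invoke the patched job now. Apart
from --numThreads, which the patch adds, the option names, jar name and
paths below are only my own sketch and should be checked against the
MAHOUT-1169 patch itself:

    # sketch only -- check option names against the MAHOUT-1169 patch
    hadoop jar mahout-core-0.8-SNAPSHOT-job.jar \
      org.apache.mahout.cf.taste.hadoop.als.RecommenderJob \
      --input /als/out/userRatings \
      --userFeatures /als/out/U \
      --itemFeatures /als/out/M \
      --numRecommendations 500 \
      --maxRating 1 \
      --numThreads 8 \
      --output /recommendations

(8 threads because I schedule only one mapper per machine, as for the
factorization.)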
I'm trying to read the code; any hints on a good starting point?

Thanks a lot!

2013/3/20 Sebastian Schelter <[email protected]>

> Hi JU,
>
> I reworked the RecommenderJob in a similar way as the ALS job. Can you
> give it a try?
>
> You have to try the patch from
> https://issues.apache.org/jira/browse/MAHOUT-1169
>
> It introduces a new param to RecommenderJob called --numThreads. The
> configuration of the job should be done similarly to the ALS job.
>
> /s
>
>
> On 20.03.2013 12:38, Han JU wrote:
> > Thanks again Sebastian and Sean, I set -Xmx4000m for
> > mapred.child.java.opts and 8 threads for each mapper. Now the job
> > runs smoothly and the whole factorization ends in 45 min. With your
> > settings I think it should be even faster.
> >
> > One more thing is that the RecommenderJob is kind of slow (for all
> > users). For example, I want a list of the top 500 items to recommend.
> > Any pointers on how to modify the job code so that it can consult a
> > file and then calculate recommendations only for the user IDs in
> > that file?
> >
> >
> > 2013/3/20 Han JU <[email protected]>
> >
> >> Hi Sebastian,
> >>
> >> I've tried the svn trunk. Hadoop constantly complains about memory
> >> with "out of memory" errors.
> >> The datanode has 4 physical cores and, with hyper-threading, 16
> >> logical cores, so I set --numThreadsPerSolver to 16 and that seems
> >> to have a problem with memory.
> >> How do you set your mapred.child.java.opts? Given that we allow
> >> only one mapper, should that be nearly the whole size of the system
> >> memory?
> >>
> >> Thanks!
> >>
> >>
> >> 2013/3/19 Sebastian Schelter <[email protected]>
> >>
> >>> Hi JU,
> >>>
> >>> We recently rewrote the factorization code; it should be much
> >>> faster now. You should use the current trunk, make Hadoop schedule
> >>> only one mapper per machine (with
> >>> -Dmapred.tasktracker.map.tasks.maximum=1), make it reuse the JVMs
> >>> and add the parameter --numThreadsPerSolver with the number of
> >>> cores that you want to use per machine (use all if you can).
> >>>
> >>> I got astonishing results running the code like this on a
> >>> 26-machine cluster on the Netflix dataset (100M datapoints) and
> >>> the Yahoo Songs dataset (700M datapoints).
> >>>
> >>> Let me know if you need more information.
> >>>
> >>> Best,
> >>> Sebastian
> >>>
> >>> On 19.03.2013 15:31, Han JU wrote:
> >>>> Thanks Sebastian and Sean, I will dig more into the paper.
> >>>> With a simple try on a small part of the data, it seems a larger
> >>>> alpha (~40) gets me a better result.
> >>>> Do you have an idea how long ParallelALS will take for the
> >>>> complete 700 MB dataset? It contains ~48 million triples. The
> >>>> Hadoop cluster at my disposal has 5 nodes and can factorize the
> >>>> MovieLens 10M dataset in about 13 min.
> >>>>
> >>>>
> >>>> 2013/3/18 Sebastian Schelter <[email protected]>
> >>>>
> >>>>> You should also be aware that the alpha parameter comes from a
> >>>>> formula the authors introduce to measure the "confidence" in the
> >>>>> observed values:
> >>>>>
> >>>>> confidence = 1 + alpha * observed_value
> >>>>>
> >>>>> You can also change that formula in the code to something that
> >>>>> you find more fitting; the paper even suggests alternative
> >>>>> variants.
> >>>>>
> >>>>> Best,
> >>>>> Sebastian
> >>>>>
> >>>>>
> >>>>> On 18.03.2013 18:06, Han JU wrote:
> >>>>>> Thanks for the quick responses.
> >>>>>>
> >>>>>> Yes, it's that dataset. What I'm using is triplets of "user_id
> >>>>>> song_id play_times", of ~1M users.
> >>>>>> No audio features, just plain text triples.
> >>>>>>
> >>>>>> It seems to me that the paper about "implicit feedback" matches
> >>>>>> this dataset well: no explicit ratings, but the number of times
> >>>>>> a song was listened to.
> >>>>>>
> >>>>>> Thank you Sean for the alpha value; I think they use big
> >>>>>> numbers because their values in the R matrix are big.
> >>>>>>
> >>>>>>
> >>>>>> 2013/3/18 Sebastian Schelter <[email protected]>
> >>>>>>
> >>>>>>> JU,
> >>>>>>>
> >>>>>>> are you referring to this dataset?
> >>>>>>>
> >>>>>>> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
> >>>>>>>
> >>>>>>> On 18.03.2013 17:47, Sean Owen wrote:
> >>>>>>>> One word of caution is that there are at least two papers on
> >>>>>>>> ALS and they define lambda differently. I think you are
> >>>>>>>> talking about "Collaborative Filtering for Implicit Feedback
> >>>>>>>> Datasets".
> >>>>>>>>
> >>>>>>>> I've been working with some folks who point out that alpha=40
> >>>>>>>> seems to be too high for most data sets. After running some
> >>>>>>>> tests on common data sets, alpha=1 looks much better. YMMV.
> >>>>>>>>
> >>>>>>>> In the end you have to evaluate these two parameters, and the
> >>>>>>>> number of features, across a range to determine what's best.
> >>>>>>>>
> >>>>>>>> Is this data set not a bunch of audio features? I am not sure
> >>>>>>>> it works for ALS, not naturally at least.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU
> >>>>>>>> <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I'm wondering whether someone has tried the ParallelALS
> >>>>>>>>> implicit feedback job on the Million Song Dataset? Any
> >>>>>>>>> pointers on alpha and lambda?
> >>>>>>>>>
> >>>>>>>>> In the paper alpha is 40 and lambda is 150, but I don't know
> >>>>>>>>> what their r values in the matrix are. They said r is based
> >>>>>>>>> on the time units for which users have watched a show, so
> >>>>>>>>> maybe it's big.
> >>>>>>>>>
> >>>>>>>>> Many thanks!
> >>>>>>>>> --
> >>>>>>>>> JU Han
> >>>>>>>>>
> >>>>>>>>> UTC - Université de Technologie de Compiègne
> >>>>>>>>> GI06 - Fouille de Données et Décisionnel
> >>>>>>>>>
> >>>>>>>>> +33 0619608888
> >>
> >> --
> >> JU Han
> >>
> >> Software Engineer Intern @ KXEN Inc.
> >> UTC - Université de Technologie de Compiègne
> >> GI06 - Fouille de Données et Décisionnel
> >>
> >> +33 0619608888


--
JU Han

Software Engineer Intern @ KXEN Inc.
UTC - Université de Technologie de Compiègne
GI06 - Fouille de Données et Décisionnel

+33 0619608888
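Putting the settings from this thread together, a factorization run on
the play-count triples would look roughly like the following. Only
--numThreadsPerSolver, the one-mapper-per-machine setting, the -Xmx
value and the alpha discussion come from the messages above; the jar
name, paths, numFeatures, numIterations and lambda are placeholders
that still need tuning (and alpha=40 vs. alpha=1 has to be evaluated,
as Sean suggests):

    # one mapper per machine, reuse JVMs, give the mapper most of the RAM
    hadoop jar mahout-core-0.8-SNAPSHOT-job.jar \
      org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob \
      -Dmapred.tasktracker.map.tasks.maximum=1 \
      -Dmapred.job.reuse.jvm.num.tasks=-1 \
      -Dmapred.child.java.opts=-Xmx4000m \
      --input /msd/triples \
      --output /als/out \
      --implicitFeedback true \
      --alpha 40 \
      --lambda 0.1 \
      --numFeatures 20 \
      --numIterations 10 \
      --numThreadsPerSolver 8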

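A quick worked example of the confidence formula discussed above, for
play-count data: with confidence = 1 + alpha * r, a song played 3 times
gets confidence 1 + 40 * 3 = 121 with alpha = 40, but only
1 + 1 * 3 = 4 with alpha = 1. Since play counts are already small
integers (unlike the time-unit values in the paper), this is one way to
see why alpha = 1 may be the better starting point here, as Sean
suggests; it still has to be verified on held-out data.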