Hi Sebastian, I've tried the svn trunk. Hadoop constantly complains about memory with errors like "out of memory error". The datanode has 4 physical cores, which hyper-threading presents as 16 logical cores, so I set --numThreadsPerSolver to 16, and that seems to cause the memory problem. How do you set your mapred.child.java.opts? Given that we allow only one mapper, should it be nearly the whole system memory?
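[Editor's note: for reference, an invocation matching Sebastian's earlier advice might look like the sketch below. The paths, heap size, and parameter values are made-up placeholders; the option names are from the Mahout 0.7-era `parallelALS` driver and `ParallelALSFactorizationJob`, and may differ between versions.]

```shell
# Sketch: one mapper per node, JVM reuse, and a large mapper heap.
# All paths and values below are illustrative placeholders.
bin/mahout parallelALS \
  -Dmapred.tasktracker.map.tasks.maximum=1 \
  -Dmapred.job.reuse.jvm.num.tasks=-1 \
  -Dmapred.child.java.opts=-Xmx12g \
  --input /user/hanju/ratings.csv \
  --output /user/hanju/als-out \
  --numFeatures 20 \
  --numIterations 10 \
  --lambda 0.065 \
  --implicitFeedback true \
  --alpha 40 \
  --numThreadsPerSolver 16
```

With mapred.tasktracker.map.tasks.maximum=1 only one mapper runs per node, so -Xmx can indeed approach the node's physical RAM, minus what the OS and the DataNode/TaskTracker daemons themselves need.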
Thanks!

2013/3/19 Sebastian Schelter <[email protected]>

> Hi JU,
>
> We recently rewrote the factorization code; it should be much faster
> now. You should use the current trunk, make Hadoop schedule only one
> mapper per machine (with -Dmapred.tasktracker.map.tasks.maximum=1), make
> it reuse the JVMs, and add the parameter --numThreadsPerSolver with the
> number of cores that you want to use per machine (use all if you can).
>
> I got astonishing results running the code like this on a 26-machine
> cluster on the Netflix dataset (100M datapoints) and the Yahoo Songs
> dataset (700M datapoints).
>
> Let me know if you need more information.
>
> Best,
> Sebastian
>
> On 19.03.2013 15:31, Han JU wrote:
> > Thanks Sebastian and Sean, I will dig more into the paper.
> > With a simple try on a small part of the data, it seems a larger alpha
> > (~40) gets me a better result.
> > Do you have an idea how long ParallelALS will take for the 700 MB
> > complete dataset? It contains ~48 million triples. The Hadoop cluster
> > at my disposal has 5 nodes and can factorize MovieLens 10M in about
> > 13 minutes.
> >
> > 2013/3/18 Sebastian Schelter <[email protected]>
> >
> >> You should also be aware that the alpha parameter comes from a
> >> formula the authors introduce to measure the "confidence" in the
> >> observed values:
> >>
> >> confidence = 1 + alpha * observed_value
> >>
> >> You can also change that formula in the code to something that you
> >> find more fitting; the paper even suggests alternative variants.
> >>
> >> Best,
> >> Sebastian
> >>
> >> On 18.03.2013 18:06, Han JU wrote:
> >>> Thanks for the quick responses.
> >>>
> >>> Yes, it's that dataset. What I'm using is triples of "user_id
> >>> song_id play_times", for ~1M users. No audio features, just
> >>> plain-text triples.
> >>>
> >>> It seems to me that the "implicit feedback" paper matches this
> >>> dataset well: no explicit ratings, but the number of times a user
> >>> listened to a song.
> >>>
> >>> Thank you, Sean, for the alpha value. I think they use big numbers
> >>> because the values in their R matrix are big.
> >>>
> >>> 2013/3/18 Sebastian Schelter <[email protected]>
> >>>
> >>>> JU,
> >>>>
> >>>> are you referring to this dataset?
> >>>>
> >>>> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
> >>>>
> >>>> On 18.03.2013 17:47, Sean Owen wrote:
> >>>>> One word of caution: there are at least two papers on ALS, and
> >>>>> they define lambda differently. I think you are talking about
> >>>>> "Collaborative Filtering for Implicit Feedback Datasets".
> >>>>>
> >>>>> I've been working with some folks who point out that alpha=40
> >>>>> seems to be too high for most datasets. After running some tests
> >>>>> on common datasets, alpha=1 looks much better. YMMV.
> >>>>>
> >>>>> In the end you have to evaluate these two parameters, and the
> >>>>> number of features, across a range to determine what's best.
> >>>>>
> >>>>> Is this dataset not a bunch of audio features? I am not sure it
> >>>>> works for ALS, not naturally at least.
> >>>>>
> >>>>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU <[email protected]> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I'm wondering, has someone tried the ParallelALS implicit-feedback
> >>>>>> job on the Million Song Dataset? Some pointers on alpha and lambda?
> >>>>>>
> >>>>>> In the paper alpha is 40 and lambda is 150, but I don't know what
> >>>>>> the r values in their matrix are. They said it is based on the
> >>>>>> time units for which users have watched a show, so it may be big.
> >>>>>>
> >>>>>> Many thanks!
> >>>>>> --
> >>>>>> *JU Han*
> >>>>>>
> >>>>>> UTC - Université de Technologie de Compiègne
> >>>>>> *GI06 - Fouille de Données et Décisionnel*
> >>>>>>
> >>>>>> +33 0619608888

--
*JU Han*
Software Engineer Intern @ KXEN Inc.
UTC - Université de Technologie de Compiègne
*GI06 - Fouille de Données et Décisionnel*
+33 0619608888
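[Editor's note: the confidence weighting Sebastian quotes, confidence = 1 + alpha * observed_value, can be sanity-checked numerically. A quick sketch with the paper's alpha=40 and made-up play counts:]

```shell
# confidence = 1 + alpha * r, the weighting from "Collaborative Filtering
# for Implicit Feedback Datasets"; alpha=40 as in the paper, r values made up
for r in 0 1 5 40; do
  awk -v alpha=40 -v r="$r" 'BEGIN { printf "r=%d -> confidence=%d\n", r, 1 + alpha * r }'
done
# prints: r=0 -> confidence=1, r=1 -> 41, r=5 -> 201, r=40 -> 1601
```

Note how a single play already jumps the confidence from 1 to 41 at alpha=40, which illustrates why Sean's suggestion of alpha=1 behaves so differently on datasets with small observed values.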

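[Editor's note: the triples JU describes are whitespace-separated "user_id song_id play_times" lines, while Mahout's ALS job reads comma-separated userID,itemID,value text. A one-line conversion sketch (file names are placeholders; also note that Mahout expects numeric IDs, so the dataset's string user/song hashes would first have to be mapped to integers):]

```shell
# Turn "user_id song_id play_times" triples into userID,itemID,value CSV.
# train_triplets.txt and ratings.csv are placeholder file names.
awk '{ print $1 "," $2 "," $3 }' train_triplets.txt > ratings.csv
```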