Hi JU,

We recently rewrote the factorization code; it should be much faster
now. You should use the current trunk, make Hadoop schedule only one
mapper per machine (with -Dmapred.tasktracker.map.tasks.maximum=1), make
it reuse the JVMs, and add the parameter --numThreadsPerSolver with the
number of cores that you want to use per machine (use all if you can).
I got astonishing results running the code like this on a 26-machine
cluster on the Netflix dataset (100M data points) and the Yahoo Songs
dataset (700M data points).

Let me know if you need more information.

Best,
Sebastian

On 19.03.2013 15:31, Han JU wrote:
> Thanks Sebastian and Sean, I will dig more into the paper.
> With a quick try on a small part of the data, it seems a larger alpha
> (~40) gets me a better result.
> Do you have an idea how long ParallelALS would take on the complete
> 700 MB dataset? It contains ~48 million triples. The Hadoop cluster I
> have at my disposal has 5 nodes and can factorize the MovieLens 10M
> dataset in about 13 min.
>
>
> 2013/3/18 Sebastian Schelter wrote:
>
>> You should also be aware that the alpha parameter comes from a formula
>> the authors introduce to measure the "confidence" in the observed
>> values:
>>
>> confidence = 1 + alpha * observed_value
>>
>> You can also change that formula in the code to something that you
>> see as a better fit; the paper even suggests alternative variants.
>>
>> Best,
>> Sebastian
>>
>>
>> On 18.03.2013 18:06, Han JU wrote:
>>> Thanks for the quick responses.
>>>
>>> Yes, it's that dataset. What I'm using is triples of "user_id
>>> song_id play_times", for ~1M users. No audio features, just plain
>>> text triples.
>>>
>>> It seems to me that the paper about "implicit feedback" matches this
>>> dataset well: no explicit ratings, but the number of times a user
>>> listened to a song.
>>>
>>> Thank you Sean for the alpha value. I think they use big numbers
>>> because the values in their R matrix are big.
>>>
>>>
>>> 2013/3/18 Sebastian Schelter wrote:
>>>
>>>> JU,
>>>>
>>>> are you referring to this dataset?
>>>>
>>>> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
>>>>
>>>> On 18.03.2013 17:47, Sean Owen wrote:
>>>>> One word of caution: there are at least two papers on ALS and
>>>>> they define lambda differently. I think you are talking about
>>>>> "Collaborative Filtering for Implicit Feedback Datasets".
>>>>>
>>>>> I've been working with some folks who point out that alpha=40
>>>>> seems to be too high for most data sets. After running some tests
>>>>> on common data sets, alpha=1 looks much better. YMMV.
>>>>>
>>>>> In the end you have to evaluate these two parameters, and the # of
>>>>> features, across a range to determine what's best.
>>>>>
>>>>> Is this data set not a bunch of audio features? I am not sure it
>>>>> works for ALS, not naturally at least.
>>>>>
>>>>>
>>>>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm wondering whether someone has tried the ParallelALS job with
>>>>>> implicit feedback on the Million Song Dataset? Any pointers on
>>>>>> alpha and lambda?
>>>>>>
>>>>>> In the paper alpha is 40 and lambda is 150, but I don't know what
>>>>>> the r values in their matrix are. They said it is based on the
>>>>>> time units that users have watched a show, so maybe the values
>>>>>> are big.
>>>>>>
>>>>>> Many thanks!
>>>>>> --
>>>>>> *JU Han*
>>>>>>
>>>>>> UTC - Université de Technologie de Compiègne
>>>>>> *GI06 - Fouille de Données et Décisionnel*
>>>>>>
>>>>>> +33 0619608888
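
To make the confidence formula quoted in the thread concrete
(confidence = 1 + alpha * observed_value), here is a minimal Python
sketch. It is not Mahout code: the function names, the sample play
counts, and the epsilon default are assumptions chosen for illustration.
The logarithmic variant is one of the alternatives suggested in the
"Collaborative Filtering for Implicit Feedback Datasets" paper; it grows
much more slowly when the observed values (e.g. play counts) are large.

```python
import math


def linear_confidence(observed_value, alpha=40.0):
    """Confidence as quoted in the thread: 1 + alpha * observed_value."""
    return 1.0 + alpha * observed_value


def log_confidence(observed_value, alpha=40.0, epsilon=1.0):
    """Logarithmic variant: 1 + alpha * log(1 + observed_value / epsilon).

    epsilon is a scaling constant; its default here is an assumption.
    """
    return 1.0 + alpha * math.log(1.0 + observed_value / epsilon)


if __name__ == "__main__":
    # Hypothetical play counts, only to show how the two formulas scale.
    for plays in (0, 1, 5, 100):
        for alpha in (1.0, 40.0):
            print(
                f"plays={plays:>3} alpha={alpha:>4}: "
                f"linear={linear_confidence(plays, alpha):8.1f} "
                f"log={log_confidence(plays, alpha):8.1f}"
            )
```

With large raw counts the linear form with alpha=40 produces very large
confidences, which may be one reason a smaller alpha or the log variant
works better on some datasets; as noted in the thread, this has to be
evaluated per dataset.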
