Hi Sebastian,

It runs much faster! On the same recommendation task it now finishes in
45 minutes rather than the ~2 hours it took yesterday. I think that with
further tuning it can be even faster.
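For reference, here is roughly how I invoke the patched job now. Apart
from --numThreads, which the patch adds, the option names, jar name and
paths below are only my own sketch and should be checked against the
MAHOUT-1169 patch itself:

    # sketch only -- check option names against the MAHOUT-1169 patch
    hadoop jar mahout-core-0.8-SNAPSHOT-job.jar \
      org.apache.mahout.cf.taste.hadoop.als.RecommenderJob \
      --input /als/out/userRatings \
      --userFeatures /als/out/U \
      --itemFeatures /als/out/M \
      --numRecommendations 500 \
      --maxRating 1 \
      --numThreads 8 \
      --output /recommendations

(8 threads because I schedule only one mapper per machine, as for the
factorization.)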
I'm trying to read the code; any hints on a good starting point?

Thanks a lot!

2013/3/20 Sebastian Schelter <[email protected]>

> Hi JU,
>
> I reworked the RecommenderJob in a similar way as the ALS job. Can you
> give it a try?
>
> You have to try the patch from
> https://issues.apache.org/jira/browse/MAHOUT-1169
>
> It introduces a new param to RecommenderJob called --numThreads. The
> configuration of the job should be done similarly to the ALS job.
>
> /s
>
>
> On 20.03.2013 12:38, Han JU wrote:
> > Thanks again Sebastian and Sean, I set -Xmx4000m for
> > mapred.child.java.opts and 8 threads for each mapper. Now the job
> > runs smoothly and the whole factorization ends in 45 min. With your
> > settings I think it should be even faster.
> >
> > One more thing is that the RecommenderJob is kind of slow (for all
> > users). For example, I want a list of the top 500 items to recommend.
> > Any pointers on how to modify the job code so that it can consult a
> > file and then calculate recommendations only for the user IDs in
> > that file?
> >
> >
> > 2013/3/20 Han JU <[email protected]>
> >
> >> Hi Sebastian,
> >>
> >> I've tried the svn trunk. Hadoop constantly complains about memory
> >> with "out of memory" errors.
> >> The datanode has 4 physical cores and, with hyper-threading, 16
> >> logical cores, so I set --numThreadsPerSolver to 16 and that seems
> >> to have a problem with memory.
> >> How do you set your mapred.child.java.opts? Given that we allow
> >> only one mapper, should that be nearly the whole size of the system
> >> memory?
> >>
> >> Thanks!
> >>
> >>
> >> 2013/3/19 Sebastian Schelter <[email protected]>
> >>
> >>> Hi JU,
> >>>
> >>> We recently rewrote the factorization code; it should be much
> >>> faster now. You should use the current trunk, make Hadoop schedule
> >>> only one mapper per machine (with
> >>> -Dmapred.tasktracker.map.tasks.maximum=1), make it reuse the JVMs
> >>> and add the parameter --numThreadsPerSolver with the number of
> >>> cores that you want to use per machine (use all if you can).
> >>>
> >>> I got astonishing results running the code like this on a
> >>> 26-machine cluster on the Netflix dataset (100M datapoints) and
> >>> the Yahoo Songs dataset (700M datapoints).
> >>>
> >>> Let me know if you need more information.
> >>>
> >>> Best,
> >>> Sebastian
> >>>
> >>> On 19.03.2013 15:31, Han JU wrote:
> >>>> Thanks Sebastian and Sean, I will dig more into the paper.
> >>>> With a simple try on a small part of the data, it seems a larger
> >>>> alpha (~40) gets me a better result.
> >>>> Do you have an idea how long ParallelALS will take for the
> >>>> complete 700 MB dataset? It contains ~48 million triples. The
> >>>> Hadoop cluster at my disposal has 5 nodes and can factorize the
> >>>> MovieLens 10M dataset in about 13 min.
> >>>>
> >>>>
> >>>> 2013/3/18 Sebastian Schelter <[email protected]>
> >>>>
> >>>>> You should also be aware that the alpha parameter comes from a
> >>>>> formula the authors introduce to measure the "confidence" in the
> >>>>> observed values:
> >>>>>
> >>>>> confidence = 1 + alpha * observed_value
> >>>>>
> >>>>> You can also change that formula in the code to something that
> >>>>> you find more fitting; the paper even suggests alternative
> >>>>> variants.
> >>>>>
> >>>>> Best,
> >>>>> Sebastian
> >>>>>
> >>>>>
> >>>>> On 18.03.2013 18:06, Han JU wrote:
> >>>>>> Thanks for the quick responses.
> >>>>>>
> >>>>>> Yes, it's that dataset. What I'm using is triplets of "user_id
> >>>>>> song_id play_times", of ~1M users.
> >>>>>> No audio features, just plain text triples.
> >>>>>>
> >>>>>> It seems to me that the paper about "implicit feedback" matches
> >>>>>> this dataset well: no explicit ratings, but the number of times
> >>>>>> a song was listened to.
> >>>>>>
> >>>>>> Thank you Sean for the alpha value; I think they use big
> >>>>>> numbers because their values in the R matrix are big.
> >>>>>>
> >>>>>>
> >>>>>> 2013/3/18 Sebastian Schelter <[email protected]>
> >>>>>>
> >>>>>>> JU,
> >>>>>>>
> >>>>>>> are you referring to this dataset?
> >>>>>>>
> >>>>>>> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
> >>>>>>>
> >>>>>>> On 18.03.2013 17:47, Sean Owen wrote:
> >>>>>>>> One word of caution is that there are at least two papers on
> >>>>>>>> ALS and they define lambda differently. I think you are
> >>>>>>>> talking about "Collaborative Filtering for Implicit Feedback
> >>>>>>>> Datasets".
> >>>>>>>>
> >>>>>>>> I've been working with some folks who point out that alpha=40
> >>>>>>>> seems to be too high for most data sets. After running some
> >>>>>>>> tests on common data sets, alpha=1 looks much better. YMMV.
> >>>>>>>>
> >>>>>>>> In the end you have to evaluate these two parameters, and the
> >>>>>>>> number of features, across a range to determine what's best.
> >>>>>>>>
> >>>>>>>> Is this data set not a bunch of audio features? I am not sure
> >>>>>>>> it works for ALS, not naturally at least.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU
> >>>>>>>> <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I'm wondering whether someone has tried the ParallelALS
> >>>>>>>>> implicit feedback job on the Million Song Dataset? Any
> >>>>>>>>> pointers on alpha and lambda?
> >>>>>>>>>
> >>>>>>>>> In the paper alpha is 40 and lambda is 150, but I don't know
> >>>>>>>>> what their r values in the matrix are. They said r is based
> >>>>>>>>> on the time units for which users have watched a show, so
> >>>>>>>>> maybe it's big.
> >>>>>>>>>
> >>>>>>>>> Many thanks!
> >>>>>>>>> --
> >>>>>>>>> JU Han
> >>>>>>>>>
> >>>>>>>>> UTC - Université de Technologie de Compiègne
> >>>>>>>>> GI06 - Fouille de Données et Décisionnel
> >>>>>>>>>
> >>>>>>>>> +33 0619608888
> >>
> >> --
> >> JU Han
> >>
> >> Software Engineer Intern @ KXEN Inc.
> >> UTC - Université de Technologie de Compiègne
> >> GI06 - Fouille de Données et Décisionnel
> >>
> >> +33 0619608888


--
JU Han

Software Engineer Intern @ KXEN Inc.
UTC - Université de Technologie de Compiègne
GI06 - Fouille de Données et Décisionnel

+33 0619608888
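Putting the settings from this thread together, a factorization run on
the play-count triples would look roughly like the following. Only
--numThreadsPerSolver, the one-mapper-per-machine setting, the -Xmx
value and the alpha discussion come from the messages above; the jar
name, paths, numFeatures, numIterations and lambda are placeholders
that still need tuning (and alpha=40 vs. alpha=1 has to be evaluated,
as Sean suggests):

    # one mapper per machine, reuse JVMs, give the mapper most of the RAM
    hadoop jar mahout-core-0.8-SNAPSHOT-job.jar \
      org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob \
      -Dmapred.tasktracker.map.tasks.maximum=1 \
      -Dmapred.job.reuse.jvm.num.tasks=-1 \
      -Dmapred.child.java.opts=-Xmx4000m \
      --input /msd/triples \
      --output /als/out \
      --implicitFeedback true \
      --alpha 40 \
      --lambda 0.1 \
      --numFeatures 20 \
      --numIterations 10 \
      --numThreadsPerSolver 8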

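A quick worked example of the confidence formula discussed above, for
play-count data: with confidence = 1 + alpha * r, a song played 3 times
gets confidence 1 + 40 * 3 = 121 with alpha = 40, but only
1 + 1 * 3 = 4 with alpha = 1. Since play counts are already small
integers (unlike the time-unit values in the paper), this is one way to
see why alpha = 1 may be the better starting point here, as Sean
suggests; it still has to be verified on held-out data.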