Hi Sebastian, I've tried the svn trunk. Hadoop constantly complains about memory with errors like "out of memory error". The datanode has 4 physical cores, which hyper-threading presents as 16 logical cores, so I set --numThreadsPerSolver to 16, and that seems to cause the memory problem. How do you set your mapred.child.java.opts? Given that we allow only one mapper, should it be nearly the whole system memory?
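[Editor's note: for reference, an invocation matching Sebastian's earlier advice might look like the sketch below. The paths, heap size, and parameter values are made-up placeholders; the option names are from the Mahout 0.7-era `parallelALS` driver and `ParallelALSFactorizationJob`, and may differ between versions.]

```shell
# Sketch: one mapper per node, JVM reuse, and a large mapper heap.
# All paths and values below are illustrative placeholders.
bin/mahout parallelALS \
  -Dmapred.tasktracker.map.tasks.maximum=1 \
  -Dmapred.job.reuse.jvm.num.tasks=-1 \
  -Dmapred.child.java.opts=-Xmx12g \
  --input /user/hanju/ratings.csv \
  --output /user/hanju/als-out \
  --numFeatures 20 \
  --numIterations 10 \
  --lambda 0.065 \
  --implicitFeedback true \
  --alpha 40 \
  --numThreadsPerSolver 16
```

With mapred.tasktracker.map.tasks.maximum=1 only one mapper runs per node, so -Xmx can indeed approach the node's physical RAM, minus what the OS and the DataNode/TaskTracker daemons themselves need.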
Thanks!

2013/3/19 Sebastian Schelter <[email protected]>

> Hi JU,
>
> We recently rewrote the factorization code; it should be much faster
> now. You should use the current trunk, make Hadoop schedule only one
> mapper per machine (with -Dmapred.tasktracker.map.tasks.maximum=1), make
> it reuse the JVMs, and add the parameter --numThreadsPerSolver with the
> number of cores that you want to use per machine (use all if you can).
>
> I got astonishing results running the code like this on a 26-machine
> cluster on the Netflix dataset (100M datapoints) and the Yahoo Songs
> dataset (700M datapoints).
>
> Let me know if you need more information.
>
> Best,
> Sebastian
>
> On 19.03.2013 15:31, Han JU wrote:
> > Thanks Sebastian and Sean, I will dig more into the paper.
> > With a simple try on a small part of the data, it seems a larger alpha
> > (~40) gets me a better result.
> > Do you have an idea how long ParallelALS will take for the 700 MB
> > complete dataset? It contains ~48 million triples. The Hadoop cluster
> > at my disposal has 5 nodes and can factorize MovieLens 10M in about
> > 13 minutes.
> >
> > 2013/3/18 Sebastian Schelter <[email protected]>
> >
> >> You should also be aware that the alpha parameter comes from a
> >> formula the authors introduce to measure the "confidence" in the
> >> observed values:
> >>
> >> confidence = 1 + alpha * observed_value
> >>
> >> You can also change that formula in the code to something that you
> >> find more fitting; the paper even suggests alternative variants.
> >>
> >> Best,
> >> Sebastian
> >>
> >> On 18.03.2013 18:06, Han JU wrote:
> >>> Thanks for the quick responses.
> >>>
> >>> Yes, it's that dataset. What I'm using is triples of "user_id
> >>> song_id play_times", for ~1M users. No audio features, just
> >>> plain-text triples.
> >>>
> >>> It seems to me that the "implicit feedback" paper matches this
> >>> dataset well: no explicit ratings, but the number of times a user
> >>> listened to a song.
> >>>
> >>> Thank you, Sean, for the alpha value. I think they use big numbers
> >>> because the values in their R matrix are big.
> >>>
> >>> 2013/3/18 Sebastian Schelter <[email protected]>
> >>>
> >>>> JU,
> >>>>
> >>>> are you referring to this dataset?
> >>>>
> >>>> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
> >>>>
> >>>> On 18.03.2013 17:47, Sean Owen wrote:
> >>>>> One word of caution: there are at least two papers on ALS, and
> >>>>> they define lambda differently. I think you are talking about
> >>>>> "Collaborative Filtering for Implicit Feedback Datasets".
> >>>>>
> >>>>> I've been working with some folks who point out that alpha=40
> >>>>> seems to be too high for most datasets. After running some tests
> >>>>> on common datasets, alpha=1 looks much better. YMMV.
> >>>>>
> >>>>> In the end you have to evaluate these two parameters, and the
> >>>>> number of features, across a range to determine what's best.
> >>>>>
> >>>>> Is this dataset not a bunch of audio features? I am not sure it
> >>>>> works for ALS, not naturally at least.
> >>>>>
> >>>>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU <[email protected]> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I'm wondering, has someone tried the ParallelALS implicit-feedback
> >>>>>> job on the Million Song Dataset? Some pointers on alpha and lambda?
> >>>>>>
> >>>>>> In the paper alpha is 40 and lambda is 150, but I don't know what
> >>>>>> the r values in their matrix are. They said it is based on the
> >>>>>> time units for which users have watched a show, so it may be big.
> >>>>>>
> >>>>>> Many thanks!
> >>>>>> --
> >>>>>> *JU Han*
> >>>>>>
> >>>>>> UTC - Université de Technologie de Compiègne
> >>>>>> *GI06 - Fouille de Données et Décisionnel*
> >>>>>>
> >>>>>> +33 0619608888

--
*JU Han*
Software Engineer Intern @ KXEN Inc.
UTC - Université de Technologie de Compiègne
*GI06 - Fouille de Données et Décisionnel*
+33 0619608888
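[Editor's note: the confidence weighting Sebastian quotes, confidence = 1 + alpha * observed_value, can be sanity-checked numerically. A quick sketch with the paper's alpha=40 and made-up play counts:]

```shell
# confidence = 1 + alpha * r, the weighting from "Collaborative Filtering
# for Implicit Feedback Datasets"; alpha=40 as in the paper, r values made up
for r in 0 1 5 40; do
  awk -v alpha=40 -v r="$r" 'BEGIN { printf "r=%d -> confidence=%d\n", r, 1 + alpha * r }'
done
# prints: r=0 -> confidence=1, r=1 -> 41, r=5 -> 201, r=40 -> 1601
```

Note how a single play already jumps the confidence from 1 to 41 at alpha=40, which illustrates why Sean's suggestion of alpha=1 behaves so differently on datasets with small observed values.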

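[Editor's note: the triples JU describes are whitespace-separated "user_id song_id play_times" lines, while Mahout's ALS job reads comma-separated userID,itemID,value text. A one-line conversion sketch (file names are placeholders; also note that Mahout expects numeric IDs, so the dataset's string user/song hashes would first have to be mapped to integers):]

```shell
# Turn "user_id song_id play_times" triples into userID,itemID,value CSV.
# train_triplets.txt and ratings.csv are placeholder file names.
awk '{ print $1 "," $2 "," $3 }' train_triplets.txt > ratings.csv
```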