Nice to hear that! In order to get into the code, I suggest you first read the papers regarding ALS for Collaborative Filtering:
Large-scale Parallel Collaborative Filtering for the Netflix Prize
http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf

Collaborative Filtering for Implicit Feedback Datasets
http://research.yahoo.com/pub/2433

There's also a slide set from a university lecture at my department that has a few slides about ALS:
http://de.slideshare.net/sscdotopen/latent-factor-models-for-collaborative-filtering

After that, you should try to work through org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob.

On 21.03.2013 15:58, Han JU wrote:
> Hi Sebastian,
>
> It runs much faster! On the same recommendation task it now terminates in
> 45min rather than ~2h yesterday. I think with further tuning it can be even
> faster.
>
> I'm trying to read the code, any hints for a good starting point?
>
> Thanks a lot!
>
>
> 2013/3/20 Sebastian Schelter <[email protected]>
>
>> Hi JU,
>>
>> I reworked the RecommenderJob in a similar way as the ALS job. Can you
>> give it a try?
>>
>> You have to apply the patch from
>> https://issues.apache.org/jira/browse/MAHOUT-1169
>>
>> It introduces a new parameter to RecommenderJob called --numThreads. The
>> configuration of the job should be done similarly to the ALS job.
>>
>> /s
>>
>>
>> On 20.03.2013 12:38, Han JU wrote:
>>> Thanks again Sebastian and Sean. I set -Xmx4000m for mapred.child.java.opts
>>> and 8 threads for each mapper. Now the job runs smoothly and the whole
>>> factorization finishes in 45min. With your settings I think it could be
>>> even faster.
>>>
>>> One more thing: the RecommenderJob is kind of slow (for all users).
>>> For example, I want a list of the top 500 items to recommend. Any
>>> pointers on how to modify the job code so that it can consult a file
>>> and then compute recommendations only for the user ids in that file?
>>>
>>>
>>> 2013/3/20 Han JU <[email protected]>
>>>
>>>> Hi Sebastian,
>>>>
>>>> I've tried the svn trunk.
>>>> Hadoop constantly complains about memory with
>>>> "out of memory" errors.
>>>> On the datanode there are 4 physical cores, and with hyper-threading it has 16
>>>> logical cores, so I set --numThreadsPerSolver to 16, and that seems to have
>>>> a problem with memory.
>>>> How do you set your mapred.child.java.opts? Given that we allow only one
>>>> mapper, should it be nearly the whole size of system memory?
>>>>
>>>> Thanks!
>>>>
>>>>
>>>> 2013/3/19 Sebastian Schelter <[email protected]>
>>>>
>>>>> Hi JU,
>>>>>
>>>>> We recently rewrote the factorization code; it should be much faster
>>>>> now. You should use the current trunk, make Hadoop schedule only one
>>>>> mapper per machine (with -Dmapred.tasktracker.map.tasks.maximum=1), make
>>>>> it reuse the JVMs, and add the parameter --numThreadsPerSolver with the
>>>>> number of cores that you want to use per machine (use all if you can).
>>>>>
>>>>> I got astonishing results running the code like this on a 26-machine
>>>>> cluster on the Netflix dataset (100M datapoints) and the Yahoo Songs
>>>>> dataset (700M datapoints).
>>>>>
>>>>> Let me know if you need more information.
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>> On 19.03.2013 15:31, Han JU wrote:
>>>>>> Thanks Sebastian and Sean, I will dig more into the paper.
>>>>>> With a simple try on a small part of the data, it seems a larger alpha
>>>>>> (~40) gets me a better result.
>>>>>> Do you have an idea how long ParallelALS will take for the 700MB
>>>>>> complete dataset? It contains ~48 million triples. The Hadoop cluster I
>>>>>> have at my disposal has 5 nodes and can factorize the MovieLens 10M in
>>>>>> about 13min.
>>>>>>
>>>>>>
>>>>>> 2013/3/18 Sebastian Schelter <[email protected]>
>>>>>>
>>>>>>> You should also be aware that the alpha parameter comes from a formula
>>>>>>> the authors introduce to measure the "confidence" in the observed
>>>>>>> values:
>>>>>>>
>>>>>>> confidence = 1 + alpha * observed_value
>>>>>>>
>>>>>>> You can also change that formula in the code to something that you see
>>>>>>> as a better fit; the paper even suggests alternative variants.
>>>>>>>
>>>>>>> Best,
>>>>>>> Sebastian
>>>>>>>
>>>>>>>
>>>>>>> On 18.03.2013 18:06, Han JU wrote:
>>>>>>>> Thanks for the quick responses.
>>>>>>>>
>>>>>>>> Yes, it's that dataset. What I'm using is triplets of "user_id song_id
>>>>>>>> play_times", of ~1m users. No audio features, just plain text triples.
>>>>>>>>
>>>>>>>> It seems to me that the paper about "implicit feedback" matches this
>>>>>>>> dataset well: no explicit ratings, but the number of times a song was
>>>>>>>> listened to.
>>>>>>>>
>>>>>>>> Thank you Sean for the alpha value. I think they use big numbers
>>>>>>>> because the values in their R matrix are big.
>>>>>>>>
>>>>>>>>
>>>>>>>> 2013/3/18 Sebastian Schelter <[email protected]>
>>>>>>>>
>>>>>>>>> JU,
>>>>>>>>>
>>>>>>>>> are you referring to this dataset?
>>>>>>>>>
>>>>>>>>> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
>>>>>>>>>
>>>>>>>>> On 18.03.2013 17:47, Sean Owen wrote:
>>>>>>>>>> One word of caution: there are at least two papers on ALS and they
>>>>>>>>>> define lambda differently. I think you are talking about
>>>>>>>>>> "Collaborative Filtering for Implicit Feedback Datasets".
>>>>>>>>>>
>>>>>>>>>> I've been working with some folks who point out that alpha=40 seems
>>>>>>>>>> to be too high for most data sets. After running some tests on
>>>>>>>>>> common data sets, alpha=1 looks much better. YMMV.
>>>>>>>>>>
>>>>>>>>>> In the end you have to evaluate these two parameters, and the # of
>>>>>>>>>> features, across a range to determine what's best.
>>>>>>>>>>
>>>>>>>>>> Is this data set not a bunch of audio features? I am not sure it
>>>>>>>>>> works for ALS, not naturally at least.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I'm wondering, has someone tried the ParallelALS with implicit
>>>>>>>>>>> feedback job on the Million Song Dataset? Some pointers on alpha
>>>>>>>>>>> and lambda?
>>>>>>>>>>>
>>>>>>>>>>> In the paper alpha is 40 and lambda is 150, but I don't know what
>>>>>>>>>>> their r values in the matrix are. They said it's based on the time
>>>>>>>>>>> units that users have watched the show, so maybe they're big.
>>>>>>>>>>>
>>>>>>>>>>> Many thanks!
>>>>>>>>>>> --
>>>>>>>>>>> *JU Han*
>>>>>>>>>>>
>>>>>>>>>>> UTC - Université de Technologie de Compiègne
>>>>>>>>>>> *GI06 - Fouille de Données et Décisionnel*
>>>>>>>>>>>
>>>>>>>>>>> +33 0619608888
>>>>
>>>> --
>>>> *JU Han*
>>>>
>>>> Software Engineer Intern @ KXEN Inc.
>>>> UTC - Université de Technologie de Compiègne
>>>> *GI06 - Fouille de Données et Décisionnel*
>>>>
>>>> +33 0619608888
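PS: if it helps while reading the code, the confidence weighting from the implicit-feedback paper quoted in this thread boils down to a small regularized least-squares solve per user. Here is a standalone numpy sketch of that step (my own variable names and toy values, not Mahout's actual classes):

```python
import numpy as np

def confidence(r, alpha=40.0):
    # c_ui = 1 + alpha * r_ui, the confidence formula quoted above
    return 1.0 + alpha * r

def solve_user(Y, r_u, alpha=40.0, lam=0.1):
    """One ALS half-step: solve for a single user's factor vector.

    Y: item-factor matrix (num_items x k); r_u: the user's observed
    values (e.g. play counts), with 0 for unobserved items."""
    k = Y.shape[1]
    c_u = confidence(r_u, alpha)       # per-item confidence
    p_u = (r_u > 0).astype(float)      # binary preference
    # Solve (Y^T C_u Y + lam*I) x_u = Y^T C_u p_u, where C_u = diag(c_u)
    A = Y.T @ (Y * c_u[:, None]) + lam * np.eye(k)
    b = Y.T @ (c_u * p_u)
    return np.linalg.solve(A, b)

# Toy example: 3 items, 2 latent features
Y = np.array([[0.1, 0.3], [0.4, 0.2], [0.2, 0.5]])
r_u = np.array([3.0, 0.0, 1.0])        # play counts for one user
x_u = solve_user(Y, r_u)
```

Note how alpha only rescales the observed entries' confidence; with alpha=1 (the value Sean found to work better on common datasets) the observed and unobserved entries are weighted much more evenly than with alpha=40.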

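For completeness, the settings discussed in this thread combine into a single invocation roughly like the following. The paths and parameter values are placeholders, and the option names reflect the trunk at the time of this thread, so double-check them against your Mahout version:

```shell
# Implicit-feedback ALS factorization: one mapper per machine, JVM reuse,
# a large child heap, and one solver thread per logical core.
mahout parallelALS \
  -Dmapred.tasktracker.map.tasks.maximum=1 \
  -Dmapred.job.reuse.jvm.num.tasks=-1 \
  -Dmapred.child.java.opts=-Xmx4000m \
  --input /path/to/triples \
  --output /path/to/factorization \
  --numFeatures 20 \
  --numIterations 10 \
  --lambda 0.065 \
  --implicitFeedback true \
  --alpha 40 \
  --numThreadsPerSolver 16
```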