This highlights the 'cheating' going on here to make this run so fast, and
I completely endorse it. The join proceeds by side-loading one half of the
join into memory. This introduces a new potential bottleneck: the amount
of memory allocated to the mapper (or reducer).
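
To make the trick concrete, here is a minimal sketch of the side-loaded
join idea (in Python with made-up names for illustration — the real code
is Mahout's Java, and it side-loads the feature vectors into a hash map):

```python
# Sketch of the side-loading trick: the smaller half of the join (e.g.
# the item-feature vectors) is loaded fully into memory in each mapper,
# so joining a streamed record is a hash lookup instead of a shuffle.
# Names are illustrative, not Mahout's actual API.

def load_side(records):
    """Load the small side of the join into an in-memory dict."""
    return {item_id: vector for item_id, vector in records}

def map_side_join(streamed, side):
    """Join each streamed (user_id, item_id, value) record against the
    in-memory side; unmatched records are dropped, as in an inner join."""
    for user_id, item_id, value in streamed:
        vector = side.get(item_id)
        if vector is not None:
            yield user_id, item_id, value, vector

side = load_side([(1, [0.1, 0.2]), (2, [0.3, 0.4])])
joined = list(map_side_join([(7, 1, 5.0), (7, 3, 2.0)], side))
# only item 1 matches; item 3 is not in the side-loaded half
```

The memory cost is exactly what you'd expect: the whole side-loaded half
has to fit in each mapper's heap, which is why the JVM settings below
matter so much.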

There are a number of things that could be done later to optimize this,
though they might be painful to implement. One is using floats instead of
doubles. Another is precomputing which items are needed in which mappers
and then loading only that subset into memory.
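
As a rough back-of-the-envelope for the float-vs-double trade-off (the
1.8M-user / 20-feature numbers are from Sebastian's mail below; this
counts only the raw payload and ignores JVM object and hash-map overhead,
which adds a large constant factor in practice):

```python
# Raw payload of the side-loaded feature vectors for 1.8M users with
# 20 features each, comparing 64-bit doubles against 32-bit floats.
users = 1_800_000
features = 20

doubles_bytes = users * features * 8  # 8 bytes per double
floats_bytes = users * features * 4   # 4 bytes per float

print(doubles_bytes // 2**20, "MB with doubles")  # 274 MB
print(floats_bytes // 2**20, "MB with floats")    # 137 MB
```

Switching to floats halves the payload, but the JVM overhead per boxed
object and hash-map entry often dwarfs it, which is part of why the 4GB
heap below is so much bigger than the raw numbers suggest.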

In some of the development work I have done, these and other tricks make
these things run fast. The downside is that it takes effort to optimize
that far, and much of the optimization doesn't generalize to other
algorithms.

I still quite buy the argument that there will be better frameworks for
this, where you don't have to fight and work around the paradigm to get
things done. But it's quite possible to make these things hum on the
commodity distributed paradigm of today.

(PS: anyone at Hadoop Summit Europe today, let me know)


On Wed, Mar 20, 2013 at 11:39 AM, Sebastian Schelter <[email protected]> wrote:

> Hi JU,
>
> the job creates an OpenIntObjectHashMap<Vector> holding the feature
> vectors as DenseVectors. In one map-job, it is filled with the
> user-feature vectors, in the next one with the item feature vectors.
>
> I used 4 gigabytes for a dataset with 1.8M users (using 20 features),
> so I guess that 2-3 gigabytes should be enough for your dataset.
>
> I used these settings:
>
> mapred.job.reuse.jvm.num.tasks=-1
> mapred.tasktracker.map.tasks.maximum=1
> mapred.child.java.opts=-Xmx4096m
>
> On 20.03.2013 10:01, Han JU wrote:
> > Hi Sebastian,
> >
> > I've tried the svn trunk. Hadoop constantly complains about memory,
> > with "out of memory" errors.
> > On the datanode there are 4 physical cores and, with hyper-threading,
> > 16 logical cores, so I set --numThreadsPerSolver to 16 and that seems
> > to have a problem with memory.
> > How did you set your mapred.child.java.opts? Given that we allow only
> > one mapper, should that be nearly the whole system memory?
> >
> > Thanks!
> >
> >
> > 2013/3/19 Sebastian Schelter <[email protected]>
> >
> >> Hi JU,
> >>
> >> We recently rewrote the factorization code; it should be much faster
> >> now. You should use the current trunk, make Hadoop schedule only one
> >> mapper per machine (with -Dmapred.tasktracker.map.tasks.maximum=1), make
> >> it reuse the JVMs and add the parameter --numThreadsPerSolver with the
> >> number of cores that you want to use per machine (use all if you can).
> >>
> >> I got astonishing results running the code like this on a 26-machine
> >> cluster with the Netflix dataset (100M datapoints) and the Yahoo Songs
> >> dataset (700M datapoints).
> >>
> >> Let me know if you need more information.
> >>
> >> Best,
> >> Sebastian
> >>
> >> On 19.03.2013 15:31, Han JU wrote:
> >>> Thanks Sebastian and Sean, I will dig more into the paper.
> >>> With a simple try on a small part of the data, it seems a larger
> >>> alpha (~40) gets me a better result.
> >>> Do you have an idea how long ParallelALS will take for the 700MB
> >>> complete dataset? It contains ~48 million triples. The Hadoop cluster
> >>> I have at my disposal has 5 nodes and can factorize the MovieLens 10M
> >>> dataset in about 13 minutes.
> >>>
> >>>
> >>> 2013/3/18 Sebastian Schelter <[email protected]>
> >>>
> >>>> You should also be aware that the alpha parameter comes from a
> >>>> formula the authors introduce to measure the "confidence" in the
> >>>> observed values:
> >>>>
> >>>> confidence = 1 + alpha * observed_value
> >>>>
> >>>> You can also change that formula in the code to something that you
> >>>> see fit; the paper even suggests alternative variants.
> >>>>
> >>>> Best,
> >>>> Sebastian
> >>>>
> >>>>
> >>>> On 18.03.2013 18:06, Han JU wrote:
> >>>>> Thanks for quick responses.
> >>>>>
> >>>>> Yes, it's that dataset. What I'm using is triplets of "user_id
> >>>>> song_id play_times", for ~1M users. No audio features, just plain
> >>>>> text triples.
> >>>>>
> >>>>> It seems to me that the "implicit feedback" paper matches this
> >>>>> dataset well: no explicit ratings, but the number of times a song
> >>>>> was listened to.
> >>>>>
> >>>>> Thank you Sean for the alpha value. I think they use big numbers
> >>>>> because the values in their R matrix are big.
> >>>>>
> >>>>>
> >>>>> 2013/3/18 Sebastian Schelter <[email protected]>
> >>>>>
> >>>>>> JU,
> >>>>>>
> >>>>>> are you refering to this dataset?
> >>>>>>
> >>>>>> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
> >>>>>>
> >>>>>> On 18.03.2013 17:47, Sean Owen wrote:
> >>>>>>> One word of caution: there are at least two papers on ALS, and
> >>>>>>> they define lambda differently. I think you are talking about
> >>>>>>> "Collaborative Filtering for Implicit Feedback Datasets".
> >>>>>>>
> >>>>>>> I've been working with some folks who point out that alpha=40
> >>>>>>> seems to be too high for most data sets. After running some tests
> >>>>>>> on common data sets, alpha=1 looks much better. YMMV.
> >>>>>>>
> >>>>>>> In the end you have to evaluate these two parameters, and the # of
> >>>>>>> features, across a range to determine what's best.
> >>>>>>>
> >>>>>>> Is this data set not a bunch of audio features? I am not sure it
> >>>>>>> works for ALS, not naturally at least.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU <[email protected]>
> >>>> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I'm wondering whether anyone has tried the ParallelALS with
> >>>>>>>> implicit feedback job on the Million Song Dataset? Some pointers
> >>>>>>>> on alpha and lambda?
> >>>>>>>>
> >>>>>>>> In the paper alpha is 40 and lambda is 150, but I don't know
> >>>>>>>> what their r values in the matrix are. They said it is based on
> >>>>>>>> the time units that users have watched the show, so maybe it's
> >>>>>>>> big.
> >>>>>>>>
> >>>>>>>> Many thanks!
> >>>>>>>> --
> >>>>>>>> *JU Han*
> >>>>>>>>
> >>>>>>>> UTC   -  Université de Technologie de Compiègne
> >>>>>>>> *     **GI06 - Fouille de Données et Décisionnel*
> >>>>>>>>
> >>>>>>>> +33 0619608888
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >
> >
>
>
