Looking forward to your debugging results. Could you have a look at the GC behavior? Maybe we should remove the per-tuple object instantiation in line 153:
topKItems.offer(new GenericRecommendedItem(itemID, (float) predictedRating));

/s

On 06.03.2013 11:54, Josh Devins wrote:
> The factorization at 2-hours is kind of a non-issue (certainly fast
> enough). It was run with (if I recall correctly) 30 reducers across a 35
> node cluster, with 10 iterations.
>
> I was a bit shocked at how long the recommendation step took and will throw
> some timing debug in to see where the problem lies exactly. There were no
> other jobs running on the cluster during these attempts, but it's certainly
> possible that something is swapping or the like. I'll be looking more
> closely today before I start to consider other options for calculating the
> recommendations.
>
> On 6 March 2013 11:41, Sean Owen <[email protected]> wrote:
>
>> Yeah that's right, he said 20 features, oops. And yes he says he's talking
>> about the recs only too, so that's not right either. That seems way too
>> long relative to factorization. And the factorization seems quite fast; how
>> many machines, and how many iterations?
>>
>> I thought the shape of the computation was to cache B' (yes whose columns
>> are B rows) and multiply against the rows of A. There again probably wrong
>> given the latest timing info.
>>
>> On Wed, Mar 6, 2013 at 10:25 AM, Josh Devins <[email protected]> wrote:
>>
>>> So the 80 hour estimate is _only_ for the U*M', top-n calculation and not
>>> the factorization. Factorization is on the order of 2-hours. For the
>>> interested, here's the pertinent code from the ALS `RecommenderJob`:
>>>
>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.7/org/apache/mahout/cf/taste/hadoop/als/RecommenderJob.java?av=f#148
>>>
>>> I'm sure this can be optimised, but by an order of magnitude? Something to
>>> try out, I'll report back if I find anything concrete.
>>>
>>> On 6 March 2013 11:13, Ted Dunning <[email protected]> wrote:
>>>
>>>> Well, it would definitely not be the first time I counted incorrectly.
>>>> Anytime I do arithmetic the result should be considered suspect. I do
>>>> think my numbers are correct, but then again, I always do.
>>>>
>>>> But the OP did say 20 dimensions which gives me back 5x.
>>>>
>>>> Inclusion of learning time is a good suspect. On the other side of the
>>>> ledger, if the multiply is doing any column-wise access it is a likely
>>>> performance bug. The computation is AB'. Perhaps you refer to rows of B
>>>> which are the columns of B'.
>>>>
>>>> Sent from my sleepy thumbs set to typing on my iPhone.
>>>>
>>>> On Mar 6, 2013, at 4:16 AM, Sean Owen <[email protected]> wrote:
>>>>
>>>>> If there are 100 features, it's more like 2.6M * 2.8M * 100 = 728 Tflops --
>>>>> I think you're missing an "M", and the features by an order of magnitude.
>>>>> That's still 1 day on an 8-core machine by this rule of thumb.
>>>>>
>>>>> The 80 hours is the model building time too (right?), not the time to
>>>>> multiply U*M'. This is dominated by iterations when building from scratch,
>>>>> and I expect took 75% of that 80 hours. So if the multiply was 20 hours --
>>>>> on 10 machines -- on Hadoop, then that's still slow but not out of the
>>>>> question for Hadoop, given it's usually a 3-6x slowdown over a parallel
>>>>> in-core implementation.
>>>>>
>>>>> I'm pretty sure what exists in Mahout here can be optimized further at the
>>>>> Hadoop level; I don't know that it's doing the multiply badly though. In
>>>>> fact I'm pretty sure it's caching cols in memory, which is a bit of
>>>>> 'cheating' to speed up by taking a lot of memory.
>>>>>
>>>>> On Wed, Mar 6, 2013 at 3:47 AM, Ted Dunning <[email protected]> wrote:
>>>>>
>>>>>> Hmm... each user's recommendations seem to be about 2.8 x 20M Flops = 60M
>>>>>> Flops.
>>>>>> You should get about a Gflop per core in Java so this should take about
>>>>>> 60 ms. You can make this faster with more cores or by using ATLAS.
>>>>>>
>>>>>> Are you expecting 3 million unique people every 80 hours? If no, then it
>>>>>> is probably more efficient to compute the recommendations on the fly.
>>>>>>
>>>>>> How many recommendations per second are you expecting? If you have 1
>>>>>> million uniques per day (just for grins) and we assume 20,000 s/day to
>>>>>> allow for peak loading, you have to do 50 queries per second peak. This
>>>>>> seems to require 3 cores. Use 16 to be safe.
>>>>>>
>>>>>> Regarding the 80 hours, 3 million x 60ms = 180,000 seconds = 50 hours. I
>>>>>> think that your map-reduce is underperforming by about a factor of 10.
>>>>>> This is quite plausible with bad arrangement of the inner loops. I think
>>>>>> that you would have highest performance computing the recommendations for a
>>>>>> few thousand items by a few thousand users at a time. It might be just
>>>>>> about as fast to do all items against a few users at a time. The reason
>>>>>> for this is that dense matrix multiply requires c n x k + m x k memory ops,
>>>>>> but n x k x m arithmetic ops. If you can re-use data many times, you can
>>>>>> balance memory channel bandwidth against CPU speed. Typically you need 20
>>>>>> or more re-uses to really make this fly.
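
To make the suggestion about line 153 a bit more concrete, here is a minimal sketch of one way to avoid the per-tuple instantiation: only allocate an item once its score actually beats the weakest entry in the bounded heap. This is not the Mahout code. `TopKSketch`, `ScoredItem` and `topK` are hypothetical names, `ScoredItem` is only a stand-in for `GenericRecommendedItem`, and I am assuming the candidate IDs and predicted ratings are available as plain arrays.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopKSketch {

    /** Hypothetical stand-in for GenericRecommendedItem; not the Mahout class. */
    public static final class ScoredItem {
        public final long itemID;
        public final float score;

        public ScoredItem(long itemID, float score) {
            this.itemID = itemID;
            this.score = score;
        }
    }

    /** Returns the k highest-scoring items, best first. */
    public static List<ScoredItem> topK(long[] itemIDs, float[] scores, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Min-heap on score: the head is the weakest item currently kept.
        PriorityQueue<ScoredItem> heap = new PriorityQueue<>(
            k, Comparator.comparingDouble((ScoredItem s) -> s.score));

        for (int i = 0; i < itemIDs.length; i++) {
            float score = scores[i];
            if (heap.size() < k) {
                // Heap not full yet: every candidate is kept.
                heap.offer(new ScoredItem(itemIDs[i], score));
            } else if (score > heap.peek().score) {
                // Candidate beats the weakest kept item: replace it.
                heap.poll();
                heap.offer(new ScoredItem(itemIDs[i], score));
            }
            // Otherwise no object is created for this candidate at all.
        }

        List<ScoredItem> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble((ScoredItem s) -> s.score).reversed());
        return result;
    }
}
```

With k much smaller than the number of items, most candidates fail the `peek()` comparison, so the number of allocations per user should drop from one per item to something much closer to k. Whether that is really what hurts here is something only the GC logs and the timing debug can tell us.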
