So the 80-hour estimate is _only_ for the U*M' top-N calculation, not the factorization. The factorization itself is on the order of 2 hours. For those interested, here's the pertinent code from the ALS `RecommenderJob`:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.7/org/apache/mahout/cf/taste/hadoop/als/RecommenderJob.java?av=f#148

I'm sure this can be optimised, but by an order of magnitude? Something to
try out; I'll report back if I find anything concrete. I've put a couple of
rough sketches of the per-user multiply and the blocked variant below the
quoted thread.

On 6 March 2013 11:13, Ted Dunning <[email protected]> wrote:

> Well, it would definitely not be the first time I counted incorrectly.
> Anytime I do arithmetic, the result should be considered suspect. I do
> think my numbers are correct, but then again, I always do.
>
> But the OP did say 20 dimensions, which gives me back 5x.
>
> Inclusion of learning time is a good suspect. On the other side of the
> ledger, if the multiply is doing any column-wise access it is a likely
> performance bug. The computation is AB'. Perhaps you refer to rows of B,
> which are the columns of B'.
>
> Sent from my sleepy thumbs set to typing on my iPhone.
>
> On Mar 6, 2013, at 4:16 AM, Sean Owen <[email protected]> wrote:
>
> > If there are 100 features, it's more like 2.6M * 2.8M * 100 = 728 Tflops --
> > I think you're missing an "M", and the features by an order of magnitude.
> > That's still 1 day on an 8-core machine by this rule of thumb.
> >
> > The 80 hours is the model building time too (right?), not the time to
> > multiply U*M'. This is dominated by iterations when building from scratch,
> > and I expect took 75% of that 80 hours. So if the multiply was 20 hours --
> > on 10 machines -- on Hadoop, then that's still slow but not out of the
> > question for Hadoop, given it's usually a 3-6x slowdown over a parallel
> > in-core implementation.
> >
> > I'm pretty sure what exists in Mahout here can be optimized further at the
> > Hadoop level; I don't know that it's doing the multiply badly, though. In
> > fact I'm pretty sure it's caching cols in memory, which is a bit of
> > 'cheating' to speed up by taking a lot of memory.
> >
> >
> > On Wed, Mar 6, 2013 at 3:47 AM, Ted Dunning <[email protected]> wrote:
> >
> >> Hmm... each user's recommendations seem to be about 2.8M x 20 Flops = 60M
> >> Flops. You should get about a Gflop per core in Java, so this should take
> >> about 60 ms. You can make this faster with more cores or by using ATLAS.
> >>
> >> Are you expecting 3 million unique people every 80 hours? If not, then it
> >> is probably more efficient to compute the recommendations on the fly.
> >>
> >> How many recommendations per second are you expecting? If you have 1
> >> million uniques per day (just for grins) and we assume 20,000 s/day to
> >> allow for peak loading, you have to do 50 queries per second peak. This
> >> seems to require 3 cores. Use 16 to be safe.
> >>
> >> Regarding the 80 hours, 3 million x 60 ms = 180,000 seconds = 50 hours. I
> >> think that your map-reduce is underperforming by about a factor of 10.
> >> This is quite plausible with bad arrangement of the inner loops. I think
> >> that you would have the highest performance computing the recommendations
> >> for a few thousand items by a few thousand users at a time. It might be
> >> just about as fast to do all items against a few users at a time. The
> >> reason for this is that dense matrix multiply requires c(n x k + m x k)
> >> memory ops, but n x k x m arithmetic ops. If you can re-use data many
> >> times, you can balance memory channel bandwidth against CPU speed.
> >> Typically you need 20 or more re-uses to really make this fly.
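
For concreteness, here is roughly the per-user computation those flop counts
refer to: score every item by the dot product of the user's factor vector
with the item's factor vector, then keep the top N. This is a minimal plain-Java
sketch of my own (class and method names are mine, not Mahout's), not the actual
`RecommenderJob` code linked above:

```java
import java.util.PriorityQueue;

public class TopNScorer {

  /** Returns the indices of the numRecs highest-scoring items for one user. */
  static int[] topNForUser(double[] userFeatures, double[][] itemFeatures, int numRecs) {
    int k = userFeatures.length;
    // Min-heap ordered by score: the weakest of the current top-N is evicted first.
    PriorityQueue<double[]> heap =
        new PriorityQueue<>((a, b) -> Double.compare(a[1], b[1]));

    for (int item = 0; item < itemFeatures.length; item++) {
      double score = 0.0;
      for (int f = 0; f < k; f++) {            // k multiply-adds per item
        score += userFeatures[f] * itemFeatures[item][f];
      }
      if (heap.size() < numRecs) {
        heap.offer(new double[] {item, score});
      } else if (score > heap.peek()[1]) {
        heap.poll();
        heap.offer(new double[] {item, score});
      }
    }

    // Drain the min-heap from the back so the result is in descending score order.
    int[] result = new int[heap.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = (int) heap.poll()[0];
    }
    return result;
  }
}
```

With m ~ 2.8M items and k = 20 features, the inner loop does roughly 56M
multiply-adds per user, which is where the ~60 ms-per-user figure above comes
from.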

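And a sketch of the blocking Ted describes: compute scores for a block of
users against a block of items at a time, so each item-feature row loaded from
memory is reused many times before it falls out of cache. Again, the names and
block sizes here are illustrative assumptions, not anything taken from Mahout:

```java
public class BlockedScores {

  static final int USER_BLOCK = 64;    // tune so a block of user rows stays in cache
  static final int ITEM_BLOCK = 256;

  /** scores[u][i] += dot(userFeatures[u], itemFeatures[i]), computed tile by tile. */
  static void blockedMultiply(double[][] userFeatures, double[][] itemFeatures,
                              double[][] scores) {
    int k = userFeatures[0].length;
    for (int u0 = 0; u0 < userFeatures.length; u0 += USER_BLOCK) {
      int uEnd = Math.min(u0 + USER_BLOCK, userFeatures.length);
      for (int i0 = 0; i0 < itemFeatures.length; i0 += ITEM_BLOCK) {
        int iEnd = Math.min(i0 + ITEM_BLOCK, itemFeatures.length);
        // Every item row touched in this tile is reused (uEnd - u0) times.
        for (int u = u0; u < uEnd; u++) {
          for (int i = i0; i < iEnd; i++) {
            double s = 0.0;
            for (int f = 0; f < k; f++) {
              s += userFeatures[u][f] * itemFeatures[i][f];
            }
            scores[u][i] += s;
          }
        }
      }
    }
  }
}
```

Per tile this does (user block) x (item block) x k multiply-adds against
roughly (user block + item block) x k values loaded, so with blocks of 64 and
256 each loaded value is reused ~50 times, comfortably past the 20-or-more
re-uses Ted mentions.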