Looking forward to your debugging results. Could you have a look at the
GC behavior? Maybe we should remove the per-tuple object instantiation
in line 153:

topKItems.offer(new GenericRecommendedItem(itemID, (float) predictedRating));
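For illustration, here is a minimal sketch of what an allocation-free top-K could look like: IDs and scores kept in primitive arrays sorted by descending score, so no object is created per candidate tuple. The `TopK` class is hypothetical, not Mahout API.

```java
// Hypothetical sketch (not Mahout API): keep top-K candidates in primitive
// arrays, sorted by descending score, so no object is allocated per tuple.
final class TopK {
    private final long[] ids;
    private final float[] scores;
    private int size;

    TopK(int k) {
        ids = new long[k];
        scores = new float[k];
    }

    void offer(long id, float score) {
        if (size == ids.length && score <= scores[size - 1]) {
            return; // not better than the current worst entry
        }
        int pos = (size < ids.length) ? size++ : size - 1;
        // shift lower-scoring entries down to keep descending order
        while (pos > 0 && scores[pos - 1] < score) {
            ids[pos] = ids[pos - 1];
            scores[pos] = scores[pos - 1];
            pos--;
        }
        ids[pos] = id;
        scores[pos] = score;
    }

    int size() { return size; }
    long idAt(int i) { return ids[i]; }
    float scoreAt(int i) { return scores[i]; }
}
```

Whether this actually helps depends on what the GC profiling shows; the per-tuple objects may turn out to be cheap compared to the multiply itself.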

/s

On 06.03.2013 11:54, Josh Devins wrote:
> The factorization at 2-hours is kind of a non-issue (certainly fast
> enough). It was run with (if I recall correctly) 30 reducers across a 35
> node cluster, with 10 iterations.
> 
> I was a bit shocked at how long the recommendation step took and will throw
> some timing debug in to see where the problem lies exactly. There were no
> other jobs running on the cluster during these attempts, but it's certainly
> possible that something is swapping or the like. I'll be looking more
> closely today before I start to consider other options for calculating the
> recommendations.
> 
> 
> 
> On 6 March 2013 11:41, Sean Owen <[email protected]> wrote:
> 
>> Yeah that's right, he said 20 features, oops. And yes he says he's talking
>> about the recs only too, so that's not right either. That seems way too
>> long relative to factorization. And the factorization seems quite fast; how
>> many machines, and how many iterations?
>>
>> I thought the shape of the computation was to cache B' (yes whose columns
>> are B rows) and multiply against the rows of A. There again probably wrong
>> given the latest timing info.
>>
>>
>> On Wed, Mar 6, 2013 at 10:25 AM, Josh Devins <[email protected]> wrote:
>>
>>> So the 80 hour estimate is _only_ for the U*M', top-n calculation and not
>>> the factorization. Factorization is on the order of 2-hours. For the
>>> interested, here's the pertinent code from the ALS `RecommenderJob`:
>>>
>>>
>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.7/org/apache/mahout/cf/taste/hadoop/als/RecommenderJob.java?av=f#148
>>>
>>> I'm sure this can be optimised, but by an order of magnitude? Something
>>> to try out, I'll report back if I find anything concrete.
>>>
>>>
>>>
>>> On 6 March 2013 11:13, Ted Dunning <[email protected]> wrote:
>>>
>>>> Well, it would definitely not be the first time I counted incorrectly.
>>>>  Anytime I do arithmetic the result should be considered suspect.  I do
>>>> think my numbers are correct, but then again, I always do.
>>>>
>>>> But the OP did say 20 dimensions which gives me back 5x.
>>>>
>>>> Inclusion of learning time is a good suspect.  On the other side of the
>>>> ledger, if the multiply is doing any column-wise access it is a likely
>>>> performance bug.  The computation is AB'. Perhaps you refer to rows of
>>>> B, which are the columns of B'.
>>>>
>>>> Sent from my sleepy thumbs set to typing on my iPhone.
>>>>
>>>> On Mar 6, 2013, at 4:16 AM, Sean Owen <[email protected]> wrote:
>>>>
>>>>> If there are 100 features, it's more like 2.6M * 2.8M * 100 = 728
>>>>> Tflops -- I think you're missing an "M", and the features by an order
>>>>> of magnitude. That's still 1 day on an 8-core machine by this rule of
>>>>> thumb.
>>>>>
>>>>> The 80 hours is the model building time too (right?), not the time to
>>>>> multiply U*M'. This is dominated by iterations when building from
>>>>> scratch, and I expect took 75% of that 80 hours. So if the multiply
>>>>> was 20 hours -- on 10 machines -- on Hadoop, then that's still slow
>>>>> but not out of the question for Hadoop, given it's usually a 3-6x
>>>>> slowdown over a parallel in-core implementation.
>>>>>
>>>>> I'm pretty sure what exists in Mahout here can be optimized further at
>>>>> the Hadoop level; I don't know that it's doing the multiply badly
>>>>> though. In fact I'm pretty sure it's caching cols in memory, which is
>>>>> a bit of 'cheating' to speed up by taking a lot of memory.
>>>>>
>>>>> On Wed, Mar 6, 2013 at 3:47 AM, Ted Dunning <[email protected]> wrote:
>>>>>
>>>>>> Hmm... each user's recommendations seem to be about 2.8 x 20M Flops
>>>>>> = 60M Flops.  You should get about a Gflop per core in Java so this
>>>>>> should take about 60 ms.  You can make this faster with more cores
>>>>>> or by using ATLAS.
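Ted's per-user estimate can be sanity-checked with a tiny sketch (the 2.8M items, 20 features, and ~1 Gflop/s-per-core figures are this thread's assumptions, not measurements):

```java
// Back-of-the-envelope check of the per-user scoring cost discussed above.
// All inputs (item count, feature count, flops/sec) are assumptions.
public class EnvelopeCheck {
    static double msPerUser(long items, int features, double flopsPerSec) {
        // one multiply-add per item per feature, expressed in milliseconds
        return 1000.0 * items * features / flopsPerSec;
    }

    public static void main(String[] args) {
        double ms = msPerUser(2_800_000L, 20, 1e9); // ~56 ms, i.e. "about 60 ms"
        System.out.printf("per-user scoring: %.0f ms%n", ms);
    }
}
```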
>>>>>>
>>>>>> Are you expecting 3 million unique people every 80 hours?  If no,
>>>>>> then it is probably more efficient to compute the recommendations on
>>>>>> the fly.
>>>>>>
>>>>>> How many recommendations per second are you expecting?  If you have
>>>>>> 1 million uniques per day (just for grins) and we assume 20,000 s/day
>>>>>> to allow for peak loading, you have to do 50 queries per second peak.
>>>>>> This seems to require 3 cores.  Use 16 to be safe.
>>>>>>
>>>>>> Regarding the 80 hours, 3 million x 60ms = 180,000 seconds = 50
>>>>>> hours.  I think that your map-reduce is underperforming by about a
>>>>>> factor of 10.  This is quite plausible with bad arrangement of the
>>>>>> inner loops.  I think that you would have highest performance
>>>>>> computing the recommendations for a few thousand items by a few
>>>>>> thousand users at a time.  It might be just about as fast to do all
>>>>>> items against a few users at a time.  The reason for this is that
>>>>>> dense matrix multiply requires c n x k + m x k memory ops, but
>>>>>> n x k x m arithmetic ops.  If you can re-use data many times, you
>>>>>> can balance memory channel bandwidth against CPU speed.  Typically
>>>>>> you need 20 or more re-uses to really make this fly.
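A minimal sketch of the tiling Ted describes, assuming dense user/item feature arrays (illustrative code, not the Mahout implementation): each block of user rows is scored against a block of item rows, so rows stay hot in cache and each loaded row is reused many times.

```java
// Illustrative blocked scoring (not Mahout code). users is nUsers x k,
// items is nItems x k; out[u][i] = dot(users[u], items[i]).
public final class BlockedScores {
    static double[][] score(double[][] users, double[][] items, int block) {
        int nUsers = users.length, nItems = items.length, k = users[0].length;
        double[][] out = new double[nUsers][nItems];
        // walk the output in block x block tiles so each row is reused
        // ~block times while it is still in cache
        for (int u0 = 0; u0 < nUsers; u0 += block) {
            for (int i0 = 0; i0 < nItems; i0 += block) {
                int uMax = Math.min(u0 + block, nUsers);
                int iMax = Math.min(i0 + block, nItems);
                for (int u = u0; u < uMax; u++) {
                    double[] uRow = users[u];
                    for (int i = i0; i < iMax; i++) {
                        double[] iRow = items[i];
                        double s = 0.0;
                        for (int f = 0; f < k; f++) {
                            s += uRow[f] * iRow[f]; // row-major access only
                        }
                        out[u][i] = s;
                    }
                }
            }
        }
        return out;
    }
}
```

The block size would need tuning against cache size; the point is only that all access is row-major and each row participates in many dot products per load.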
>>>>>>
>>>>>>
>>>>
>>>
>>
> 