Hi-
I'm working on implementing a custom algorithm using the Mahout library. The
algorithm requires matrix multiplication, which I saw was available at the
object level (.times) as well as being implemented in the
MatrixMultiplicationJob. I am currently testing a step in the algorithm that
requires me to multiply a 10 x 2.4M matrix by a 2.4M x 2.4M matrix. The
performance has been awful, taking 11-12 hours to complete. That might be
acceptable if this were the only step, but the algorithm has multiple similarly
sized steps, all of which are repeated in a loop.
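As a back-of-envelope check on why this step is so expensive, here is a small Python sketch. The shapes come from the step above; the assumption that the product is fully dense is mine (with sparse operands the real work could be much smaller):

```python
# Cost of multiplying a 10 x 2.4M matrix by a 2.4M x 2.4M matrix,
# assuming both operands are fully dense.
rows, inner, cols = 10, 2_400_000, 2_400_000

# One multiply-add per (row, inner, col) triple for a dense product.
multiply_adds = rows * inner * cols
print(f"{multiply_adds:.2e} multiply-adds")  # 5.76e+13

# The 2.4M x 2.4M operand alone, stored dense as 8-byte doubles:
bytes_dense = inner * cols * 8
print(f"{bytes_dense / 1e12:.1f} TB")  # 46.1 TB
```

So even before considering how the work is distributed, this is on the order of tens of trillions of multiply-adds per step, which is why running it on a single mapper is untenable.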
I dug into this further, looking at the job running on my Hadoop cluster
(Google Cloud Compute, 3 nodes @ 16 GB each). I noticed that the job appeared
to only be running a single map and thus on a single node, as opposed to
previous steps such as TransposeJob that ran multiple mappers and finished in a
fraction of the time. Researching it a bit further, I found a handful of
concerning posts such as the two below:
Hadoop File Splits : CompositeInputFormat : Inner Join (stackoverflow.com)
"I am using CompositeInputFormat to provide input to a hadoop job. The number of
splits generated is the total number of files given as input to
CompositeInputFormat..."
MatrixMultiplicationJob runs with 1 mapper only ? (mail-archives.apache.org)
"Hi, I am trying to multiply a dense matrix of size [100 x 100k]. The size of
the file is 104 MB, and with the default block size of 64 MB only 2 blocks are
getting created."
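If the split behavior described in that second post applies, the mapper count follows directly from file size divided by HDFS block size. A quick sketch using the numbers from that post (104 MB file, 64 MB default block size):

```python
import math

file_size_mb = 104   # file size from the quoted post
block_size_mb = 64   # default HDFS block size mentioned in the post

# Each block becomes (at most) one input split, hence one mapper.
splits = math.ceil(file_size_mb / block_size_mb)
print(splits)  # 2
```

If CompositeInputFormat instead produces one split per input file regardless of size, as the first post suggests, a matrix stored as a single file would always get a single mapper.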
So, my questions are as follows. Is MatrixMultiplicationJob truly limited to
running on a single node? If so, it seems fairly useless for matrices of this
size. And if so, what is the recommended way to perform multiplications as
large as the ones I require?
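For what it's worth, one decomposition that does parallelize naturally is writing the product as a sum of outer products over the inner dimension, so that workers can each handle a range of inner indices and the partial results are combined by summation. A toy numpy sketch with tiny stand-in shapes (this is an illustration of the idea, not Mahout code):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 6))  # tiny stand-in for the 10 x 2.4M matrix
B = rng.standard_normal((6, 5))   # tiny stand-in for the 2.4M x 2.4M matrix

# Term k is the outer product of column k of A with row k of B.
# Summing the terms over k reproduces the full product, so the work
# can be partitioned across workers by ranges of k and reduced by addition.
partials = [np.outer(A[:, k], B[k, :]) for k in range(A.shape[1])]
C = sum(partials)

assert np.allclose(C, A @ B)
```

A MapReduce multiply built this way would map over the inner dimension (i.e., over rows of A-transposed joined with rows of B) and reduce by summing partial matrices, so the parallelism scales with the inner dimension rather than being pinned to one mapper.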