You may want to switch to the Scala DSL if you are planning more linear algebra. The DSL runs on Spark and so is much faster than the older Hadoop code, and it also sits on top of a linear algebra optimizer. The type of thing you mention below is a few lines that can be run interactively in the Mahout Scala shell, or can be put in your own driver just as easily.
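For example, a distributed multiply like the one you describe looks roughly like this in the Mahout Scala shell. This is only a sketch: the tiny dense matrices are placeholders for your real operands, and it assumes you are inside a Spark-backed Mahout shell where the distributed row matrix (DRM) bindings are already in scope.

```scala
// Inside the Mahout Spark shell (bin/mahout spark-shell):

// Toy in-core matrices standing in for your real data
val mxA = dense((1, 2, 3), (4, 5, 6))      // 2 x 3
val mxB = dense((1, 0), (0, 1), (1, 1))    // 3 x 2

// Promote them to distributed row matrices (DRMs)
val drmA = drmParallelize(mxA)
val drmB = drmParallelize(mxB)

// Distributed multiply; the optimizer chooses the physical
// plan and partitioning, so the work is spread across executors
val drmC = drmA %*% drmB

// Bring the (small) result back in-core to inspect it
val mxC = drmC.collect
```

For operands the size of yours you would read them from DFS with drmDfsRead(...) rather than drmParallelize, but the %*% expression itself stays the same.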
http://mahout.apache.org/users/sparkbindings/home.html
http://mahout.apache.org/users/sparkbindings/play-with-shell.html

On Nov 14, 2014, at 7:41 AM, optimusfan <[email protected]> wrote:

Thanks to Yahoo mail for messing up the links in my message above. Let's try this again:

http://stackoverflow.com/questions/8654200/hadoop-file-splits-compositeinputformat-inner-join
http://mail-archives.apache.org/mod_mbox/mahout-user/201301.mbox/%3c50cfd234cc7d3a4ea1e8910d3866f700095256f...@nda-hclc-evs02.hclc.corp.hcl.in%3E

On Friday, November 14, 2014 9:12 AM, optimusfan <[email protected]> wrote:

Hi-

I'm working on implementing a custom algorithm using the Mahout library. The algorithm requires matrix multiplication, which I saw was available at the object level (.times) as well as being implemented in MatrixMultiplicationJob. I am currently testing a step in the algorithm that requires me to multiply a 10x2.4m matrix by one that is 2.4mx2.4m. The performance has been awful, taking 11-12 hours to complete. This might be fine if it were the extent of the algorithm, but I will have multiple similarly sized steps, all of which will be repeated in a loop.

I dug into this further, looking at the job running on my Hadoop cluster (Google Cloud Compute, 3 nodes @ 16 GB each). I noticed that the job appeared to run only a single map task, and thus on a single node, as opposed to previous steps such as TransposeJob that ran multiple mappers and finished in a fraction of the time. Researching it a bit further, I found a handful of concerning posts, such as the two below:

"Hadoop File Splits : CompositeInputFormat : Inner Join" (stackoverflow.com): "I am using CompositeInputFormat to provide input to a hadoop job. The number of splits generated is the total number of files given as input to CompositeInputFormat..."

"MatrixMultiplicationJob runs with 1 mapper only ?" (mail-archives.apache.org): "Hi, I am trying to multiply dense matrices of size [100 x 100k]. The size of the file is 104MB and with the default block size of 64MB only 2 blocks are getting created."

So, my questions are as follows. Is MatrixMultiplicationJob truly limited to running on a single node? If so, it seems fairly useless. And in that case, what is the recommended way to do decently sized multiplications such as I require?
