Jeff,

Yeah, the mapper sitting on a DFS block is pretty cool. Also, yes, we are about to start crunching a lot of energy smart grid data. TVA is sort of the "Switzerland" for smart grid power generation and transmission data across the nation. Right now we have about 12TB, and this is slated to grow to around 30TB by the end of 2010 (possibly more, depending on how many more PMUs come online).

I am very interested in Mahout and have read up on it; it has many algorithms that I am familiar with from grad school. I will be doing some very simple MR jobs early on, like finding the average frequency for a range of data, and I've been selling various groups internally on what CAN be done with good data mining and tools like Hadoop/Mahout. Our production cluster won't be online for a few more weeks, but that part is already rolling, so I've moved on to designing the first jobs to find quality "results/benefits" that I can "sell" in order to campaign for the more ambitious projects I have drawn up.

I know time series data lends itself to many machine learning applications, so, yes, I would be very interested in talking to anyone who wants to talk or share notes on Hadoop and machine learning. I believe Mahout can be a tremendous resource for us, and I definitely plan on running and contributing to it.
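For anyone curious, the average-frequency job I have in mind is about as simple as MR gets. Here is a minimal sketch of just the map and reduce logic, with the Hadoop plumbing left out so it runs standalone; the (pmuId, timestamp, frequencyHz) CSV record layout is my illustration here, not our actual schema:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: average frequency per PMU from time-series readings.
// Hypothetical record format: pmuId,timestamp,frequencyHz
public class AvgFrequency {

    // Map phase: parse one input line, emit (pmuId -> frequency reading).
    static Map.Entry<String, Double> map(String line) {
        String[] fields = line.split(",");
        return Map.entry(fields[0], Double.parseDouble(fields[2]));
    }

    // Reduce phase: average all readings collected for one PMU.
    static double reduce(List<Double> values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return sum / values.size();
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "pmu1,1237370520,59.98",
            "pmu1,1237370521,60.02",
            "pmu2,1237370520,60.00");

        // Shuffle stand-in: group mapped values by key, as the
        // framework would between the map and reduce phases.
        Map<String, List<Double>> grouped = new HashMap<>();
        for (String line : lines) {
            Map.Entry<String, Double> kv = map(line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        for (Map.Entry<String, List<Double>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
        }
    }
}
```

In the real job the map and reduce methods would live in Hadoop Mapper/Reducer classes and the grouping would be done by the framework's shuffle; the point is just that averaging a range of readings is an embarrassingly parallel first win.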
Josh Patterson
TVA

-----Original Message-----
From: Jeff Eastman [mailto:j...@windwardsolutions.com]
Sent: Wednesday, March 18, 2009 12:02 PM
To: core-user@hadoop.apache.org
Subject: Re: RecordReader design heuristic

Hi Josh,

It seemed like you had a conceptual wire crossed and I'm glad to help out. The neat thing about Hadoop mappers is that, since they are given a replicated HDFS block to munch on, the job scheduler has <replication factor> node choices for where to run each mapper. This means mappers are always reading from local storage.

On another note, I notice you are processing what looks to be large quantities of vector data. If you have any interest in clustering this data, you might want to look at the Mahout project (http://lucene.apache.org/mahout/). We have a number of Hadoop-ready clustering algorithms, including a new non-parametric Dirichlet Process Clustering implementation that I committed recently. We are pulling it all together for a 0.1 release, and I would be very interested in helping you apply these algorithms if you have an interest.

Jeff

Patterson, Josh wrote:
> Jeff,
> OK, that makes more sense. I was under the mis-impression that it was creating and destroying mappers for each input record; I don't know why I had that in my head. My design suddenly became a lot clearer, and this provides a much cleaner abstraction. Thanks for your help!
>
> Josh Patterson
> TVA