Worked like a charm! I installed Hadoop on my Cassandra nodes and ran the MR job using the Hadoop job tracker.
A simple key count improved from ~2 hours to about 25 minutes (150M keys and ~100G on each node).

Thanks Jeremy.

-----Original Message-----
From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com]
Sent: Monday, March 14, 2011 8:42 PM
To: user@cassandra.apache.org
Subject: Re: Map-Reduce on top of cassandra

Just for the sake of updating this thread - Orr didn't yet have task trackers on the Cassandra nodes so most of the time was likely due to copying the ~100G of data to the hadoop cluster prior to processing. They're going to try after installing task trackers on the nodes.

On Mar 14, 2011, at 10:06 AM, Or Yanay wrote:

> Hi All,
>
> I am trying to write some map-reduce tasks so I can find out stuff like - how
> many records have X status?
> I am using 0.7.0 and have 5 nodes with ~100G of data on each node.
>
> I have written the code based on the word_count example and the map-reduce is
> running successfully BUT is extremely slow (about 2 hours for the simplest
> key count).
>
> I am now looking to track down the slowness and tune my process, or explore
> alternative ways to achieve the same goal.
>
> Can anyone point me to a way to tune my map-reduce job?
> Does anyone have any experience exploring Cassandra data with Hadoop cluster
> configuration? (As suggested
> in http://wiki.apache.org/cassandra/HadoopSupport#ClusterConfig)
>
> Thanks,
> Orr