I may be wrong about which nodes the task is sent to. Others here know more about the Hadoop integration.
Aaron

On 22 Oct 2010, at 21:30, Takayuki Tsunakawa <tsunakawa.ta...@jp.fujitsu.com> wrote:

> Hello, Aaron,
>
> Thank you for all the information (especially the pointers, which look
> interesting).
>
> > So you would not have 1,000 tasks sent to each of the 1,000 Cassandra
> > nodes.
>
> Yes, I meant one map task would be sent to each task tracker, resulting in
> 1,000 concurrent map tasks in the cluster. ColumnFamilyInputFormat cannot
> identify the nodes that actually hold the data, so the job tracker will
> send map tasks to all 1,000 nodes. This is wasteful and time-consuming if
> only 200 nodes hold data for a keyspace.
>
> > When the task runs on the Cassandra node it will iterate through all of
> > the rows in the specified ColumnFamily with keys in the Token range the
> > node is responsible for.
>
> I hope ColumnFamilyInputFormat will allow us to set a KeyRange to select
> the rows passed to map().
>
> I'll read the web pages you gave me. Thank you.
> All, any other advice and comments are appreciated.
>
> Regards,
> Takayuki Tsunakawa
>
> ----- Original Message -----
> From: aaron morton
> To: user@cassandra.apache.org
> Sent: Friday, October 22, 2010 4:05 PM
> Subject: Re: [Q] MapReduce behavior and Cassandra's scalability for
> petabytes of data
>
> For plain old log analysis the Cloudera Hadoop distribution may be a
> better match. Flume is designed to help with streaming data into HDFS, the
> LZO compression extensions would help with the data size, and Pig would
> make the analysis easier (IMHO).
> http://www.cloudera.com/hadoop/
> http://www.cloudera.com/blog/2010/09/using-flume-to-collect-apache-2-web-server-logs/
> http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
>
> I'll try to answer your questions; others, please jump in if I'm wrong.
>
> 1. Data in a keyspace will be distributed to all nodes in the Cassandra
> cluster. AFAIK the Job Tracker should only send one task to each task
> tracker, and normally you would have a task tracker running on each
> Cassandra node. The task tracker can then throttle how many concurrent
> tasks can run. So you would not have 1,000 tasks sent to each of the 1,000
> Cassandra nodes.
>
> When the task runs on a Cassandra node it will iterate through all of the
> rows in the specified ColumnFamily with keys in the Token range that node
> is responsible for. If Cassandra is using the RandomPartitioner, data will
> be spread around the cluster. So, for example, a Map-Reduce job that only
> wants to read the last week's data may have to read from every node.
> Obviously this depends on how the data is broken up between rows /
> columns.
>
> 2. Some of the other people from riptano.com or Rackspace may be able to
> help with Cassandra's outer limits. There is a 400-node cluster planned:
> http://www.riptano.com/blog/riptano-and-digital-reasoning-form-partnership
>
> Hope that helps.
> Aaron
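
A minimal sketch of the kind of job driver discussed above, wired up to
ColumnFamilyInputFormat. The keyspace and column family names ("Analytics"
/ "Logs"), the seed host, and the output path are hypothetical, and the
ConfigHelper method names follow one 0.7-era snapshot of the Hadoop
integration; they have been renamed across releases (e.g. setColumnFamily
vs. setInputColumnFamily), so check the ConfigHelper that ships with your
version.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.Arrays;
    import java.util.SortedMap;

    import org.apache.cassandra.db.IColumn;
    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RowCountJob {

        // Each map() call sees one row from this task's token range.
        public static class RowCounterMapper
                extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>,
                               Text, LongWritable> {
            @Override
            protected void map(ByteBuffer key,
                               SortedMap<ByteBuffer, IColumn> columns,
                               Context context)
                    throws IOException, InterruptedException {
                context.write(new Text("rows"), new LongWritable(1));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "row-count");
            job.setJarByClass(RowCountJob.class);
            job.setMapperClass(RowCounterMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);

            // One input split is created per token range, so the job
            // covers the whole ring.
            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
            // Hypothetical seed host used only to discover the ring.
            ConfigHelper.setInitialAddress(job.getConfiguration(),
                    "cassandra-seed");
            ConfigHelper.setPartitioner(job.getConfiguration(),
                    "org.apache.cassandra.dht.RandomPartitioner");
            // Hypothetical keyspace / column family names.
            ConfigHelper.setInputColumnFamily(job.getConfiguration(),
                    "Analytics", "Logs");

            // Restrict which columns each map task pulls back per row.
            SlicePredicate predicate = new SlicePredicate()
                    .setColumn_names(Arrays.asList(
                            ByteBuffer.wrap("body".getBytes())));
            ConfigHelper.setInputSlicePredicate(job.getConfiguration(),
                    predicate);

            FileOutputFormat.setOutputPath(job, new Path("/tmp/row-count"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Because one split is created per token range, running a task tracker next
to each Cassandra node lets most map tasks read rows the local node is
responsible for.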
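On the point that a RandomPartitioner job wanting only the last week's data
may have to read from every node: one common way to bound that is to bucket
time-series data into one row per day, so a weekly job fetches a handful of
rows directly by key instead of scanning the whole column family. A sketch,
with a made-up "events-" key prefix:

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class RowKeys {
        private static final SimpleDateFormat DAY =
                new SimpleDateFormat("yyyyMMdd");

        // One row per day; columns within the row hold the individual
        // events (e.g. keyed by TimeUUID). Under RandomPartitioner each
        // daily row lands on an arbitrary node, but a week of data is
        // still only seven rows, fetched directly by key.
        public static String dailyRowKey(Date eventTime) {
            return "events-" + DAY.format(eventTime);
        }

        public static void main(String[] args) {
            System.out.println(dailyRowKey(new Date())); // events-20101022
        }
    }

Seven direct key lookups then replace a full scan of the column family,
at the cost of concentrating each day's write load on the replicas that
own that day's row.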