Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data

Takayuki Tsunakawa Fri, 22 Oct 2010 01:21:08 -0700

Hello, Aaron,

Thank you for all the information, especially the pointers, which look
interesting.


> So you would not have 1,000 tasks sent to each of the 1,000 cassandra
nodes.

Yes, I meant one map task would be sent to each task tracker, resulting in
1,000 concurrent map tasks in the cluster. ColumnFamilyInputFormat cannot
identify the nodes that actually hold some data, so the job tracker will
send the map tasks to all of the 1,000 nodes. This is wasteful and
time-consuming if only 200 nodes hold some data for a keyspace.

> When the task runs on the cassandra node it will iterate through all of
the rows in the specified ColumnFamily with keys in the Token range the Node
is responsible for.

I hope the ColumnFamilyInputFormat will allow us to set KeyRange to select
rows passed to map.
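
Until such an option exists, one possible workaround is to filter by key
inside the map function itself. The sketch below is plain Python rather than
Hadoop code, and the names (in_key_range, map_rows) and the sample keys are
hypothetical. Note that it still reads every row and merely discards those
outside the range, so it saves no I/O.

```python
def in_key_range(key, start_key, end_key):
    """Return True if key lies in the inclusive range [start_key, end_key]."""
    return start_key <= key <= end_key


def map_rows(rows, start_key, end_key):
    """Simulate a map function that ignores rows outside the wanted key range.

    rows is an iterable of (key, columns) pairs, standing in for the rows
    an input format would hand to the mapper.
    """
    for key, columns in rows:
        if in_key_range(key, start_key, end_key):
            yield key, len(columns)  # emit (key, column count) as sample output


# Hypothetical rows keyed by date.
rows = [
    ("2010-10-01", {"a": 1}),
    ("2010-10-18", {"a": 1, "b": 2}),
    ("2010-10-21", {"a": 1, "b": 2, "c": 3}),
]
selected = list(map_rows(rows, "2010-10-15", "2010-10-22"))
```

Here only the two rows keyed within the requested dates reach the output;
the row for 2010-10-01 is read and then dropped.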

I'll read the web pages you gave me. Thank you.
Everyone, any other advice or comments would be appreciated.

Regards,
Takayuki Tsunakawa

----- Original Message ----- 
From: aaron morton
To: user@cassandra.apache.org
Sent: Friday, October 22, 2010 4:05 PM
Subject: Re: [Q] MapReduce behavior and Cassandra's scalability for
petabytes of data


For plain old log analysis the Cloudera Hadoop distribution may be a better
match. Flume is designed to help with streaming data into HDFS, the LZO
compression extensions would help with the data size, and Pig would make the
analysis easier (IMHO).
http://www.cloudera.com/hadoop/
http://www.cloudera.com/blog/2010/09/using-flume-to-collect-apache-2-web-server-logs/
http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/


I'll try to answer your questions, others please jump in if I'm wrong.


1. Data in a keyspace will be distributed to all nodes in the cassandra
cluster. AFAIK the Job Tracker should only send one task to each task
tracker, and normally you would have a task tracker running on each
cassandra node. The task tracker can then throttle how many concurrent tasks
can run. So you would not have 1,000 tasks sent to each of the 1,000
cassandra nodes.


When the task runs on the cassandra node it will iterate through all of the
rows in the specified ColumnFamily with keys in the Token range the Node is
responsible for. If cassandra is using the RandomPartitioner, data will be
spread around the cluster. So, for example, a Map-Reduce job that only wants
to read the last week's data may have to read from every node. Obviously this
depends on how the data is broken up between rows / columns.
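
To illustrate why a week of data can end up on every node, here is a
simplified model of hash partitioning in Python. It is a sketch, not
Cassandra's actual code: the cluster size, the evenly spaced node tokens and
the hourly key format are all assumptions made for the example.

```python
import bisect
import hashlib
from datetime import datetime, timedelta

NODE_COUNT = 8  # hypothetical cluster size
RING_SIZE = 2 ** 127

# Evenly spaced node tokens; each node owns the range ending at its token.
NODE_TOKENS = [i * RING_SIZE // NODE_COUNT for i in range(NODE_COUNT)]


def token_for(key):
    """Hash a row key to a ring position (MD5 as a signed integer, absolute value)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


def node_for(key):
    """Return the index of the node whose token range contains the key."""
    index = bisect.bisect_left(NODE_TOKENS, token_for(key))
    return index % NODE_COUNT  # wrap around past the last token


# One row key per hour for a week (hypothetical key format).
start = datetime(2010, 10, 15)
keys = [(start + timedelta(hours=h)).strftime("%Y-%m-%dT%H") for h in range(7 * 24)]
nodes_touched = {node_for(k) for k in keys}
print(len(keys), "keys land on", len(nodes_touched), "of", NODE_COUNT, "nodes")
```

Because the hash ignores the ordering of the keys, consecutive hours land on
unrelated nodes, so a job over this one week has to visit the whole ring.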




2. Some of the other people from riptano.com or rackspace may be able to
help with Cassandra's outer limits. A 400-node cluster is planned:
http://www.riptano.com/blog/riptano-and-digital-reasoning-form-partnership


Hope that helps.
Aaron
