I may be wrong about which nodes the task is sent to. Others here know more about the Hadoop integration.
Aaron

On 22 Oct 2010, at 21:30, Takayuki Tsunakawa <tsunakawa.ta...@jp.fujitsu.com> wrote:

> Hello, Aaron,
>
> Thank you for all the information (especially the pointers, which look
> interesting).
>
> > So you would not have 1,000 tasks sent to each of the 1,000 Cassandra
> > nodes.
>
> Yes, I meant one map task would be sent to each task tracker, resulting in
> 1,000 concurrent map tasks in the cluster. ColumnFamilyInputFormat cannot
> identify the nodes that actually hold the data, so the job tracker will
> send map tasks to all 1,000 nodes. This is wasteful and time-consuming if
> only 200 nodes hold data for a keyspace.
>
> > When the task runs on the Cassandra node it will iterate through all of
> > the rows in the specified ColumnFamily with keys in the Token range the
> > node is responsible for.
>
> I hope ColumnFamilyInputFormat will allow us to set a KeyRange to select
> the rows passed to map().
>
> I'll read the web pages you gave me. Thank you.
> All, any other advice and comments are appreciated.
>
> Regards,
> Takayuki Tsunakawa
>
> ----- Original Message -----
> From: aaron morton
> To: user@cassandra.apache.org
> Sent: Friday, October 22, 2010 4:05 PM
> Subject: Re: [Q] MapReduce behavior and Cassandra's scalability for
> petabytes of data
>
> For plain old log analysis the Cloudera Hadoop distribution may be a
> better match. Flume is designed to help with streaming data into HDFS, the
> LZO compression extensions would help with the data size, and Pig would
> make the analysis easier (IMHO).
> http://www.cloudera.com/hadoop/
> http://www.cloudera.com/blog/2010/09/using-flume-to-collect-apache-2-web-server-logs/
> http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
>
> I'll try to answer your questions; others, please jump in if I'm wrong.
>
> 1. Data in a keyspace will be distributed to all nodes in the Cassandra
> cluster. AFAIK the Job Tracker should only send one task to each task
> tracker, and normally you would have a task tracker running on each
> Cassandra node. The task tracker can then throttle how many concurrent
> tasks can run. So you would not have 1,000 tasks sent to each of the 1,000
> Cassandra nodes.
>
> When the task runs on a Cassandra node it will iterate through all of the
> rows in the specified ColumnFamily with keys in the Token range that node
> is responsible for. If Cassandra is using the RandomPartitioner, data will
> be spread around the cluster. So, for example, a Map-Reduce job that only
> wants to read the last week's data may have to read from every node.
> Obviously this depends on how the data is broken up between rows /
> columns.
>
> 2. Some of the other people from riptano.com or Rackspace may be able to
> help with Cassandra's outer limits. There is a 400-node cluster planned:
> http://www.riptano.com/blog/riptano-and-digital-reasoning-form-partnership
>
> Hope that helps.
> Aaron
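
A minimal sketch of the kind of job driver discussed above, wired up to
ColumnFamilyInputFormat. The keyspace and column family names ("Analytics"
/ "Logs"), the seed host, and the output path are hypothetical, and the
ConfigHelper method names follow one 0.7-era snapshot of the Hadoop
integration; they have been renamed across releases (e.g. setColumnFamily
vs. setInputColumnFamily), so check the ConfigHelper that ships with your
version.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.Arrays;
    import java.util.SortedMap;

    import org.apache.cassandra.db.IColumn;
    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RowCountJob {

        // Each map() call sees one row from this task's token range.
        public static class RowCounterMapper
                extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>,
                               Text, LongWritable> {
            @Override
            protected void map(ByteBuffer key,
                               SortedMap<ByteBuffer, IColumn> columns,
                               Context context)
                    throws IOException, InterruptedException {
                context.write(new Text("rows"), new LongWritable(1));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "row-count");
            job.setJarByClass(RowCountJob.class);
            job.setMapperClass(RowCounterMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);

            // One input split is created per token range, so the job
            // covers the whole ring.
            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
            // Hypothetical seed host used only to discover the ring.
            ConfigHelper.setInitialAddress(job.getConfiguration(),
                    "cassandra-seed");
            ConfigHelper.setPartitioner(job.getConfiguration(),
                    "org.apache.cassandra.dht.RandomPartitioner");
            // Hypothetical keyspace / column family names.
            ConfigHelper.setInputColumnFamily(job.getConfiguration(),
                    "Analytics", "Logs");

            // Restrict which columns each map task pulls back per row.
            SlicePredicate predicate = new SlicePredicate()
                    .setColumn_names(Arrays.asList(
                            ByteBuffer.wrap("body".getBytes())));
            ConfigHelper.setInputSlicePredicate(job.getConfiguration(),
                    predicate);

            FileOutputFormat.setOutputPath(job, new Path("/tmp/row-count"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Because one split is created per token range, running a task tracker next
to each Cassandra node lets most map tasks read rows the local node is
responsible for.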
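On the point that a RandomPartitioner job wanting only the last week's data
may have to read from every node: one common way to bound that is to bucket
time-series data into one row per day, so a weekly job fetches a handful of
rows directly by key instead of scanning the whole column family. A sketch,
with a made-up "events-" key prefix:

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class RowKeys {
        private static final SimpleDateFormat DAY =
                new SimpleDateFormat("yyyyMMdd");

        // One row per day; columns within the row hold the individual
        // events (e.g. keyed by TimeUUID). Under RandomPartitioner each
        // daily row lands on an arbitrary node, but a week of data is
        // still only seven rows, fetched directly by key.
        public static String dailyRowKey(Date eventTime) {
            return "events-" + DAY.format(eventTime);
        }

        public static void main(String[] args) {
            System.out.println(dailyRowKey(new Date())); // events-20101022
        }
    }

Seven direct key lookups then replace a full scan of the column family,
at the cost of concentrating each day's write load on the replicas that
own that day's row.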