For plain old log analysis, the Cloudera Hadoop distribution may be a better 
match. Flume is designed to help with streaming data into HDFS, the LZO 
compression extensions would help with the data size, and Pig would make the 
analysis easier (IMHO). 
http://www.cloudera.com/hadoop/
http://www.cloudera.com/blog/2010/09/using-flume-to-collect-apache-2-web-server-logs/
http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/

I'll try to answer your questions; others, please jump in if I'm wrong.

1. Data in a keyspace will be distributed to all nodes in the Cassandra 
cluster. AFAIK the JobTracker only sends one task at a time to each 
TaskTracker, and normally you would have a TaskTracker running on each 
Cassandra node. The TaskTracker can then throttle how many tasks run 
concurrently. So you would not have 1,000 tasks sent to each of the 1,000 
Cassandra nodes. 
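
As a concrete example of that throttle (a sketch only, assuming Hadoop 
0.20-style MapReduce; the class name is mine): the 
mapred.tasktracker.map.tasks.maximum property caps concurrent map tasks per 
TaskTracker. It is read from each node's mapred-site.xml when the TaskTracker 
starts, so the code below just demonstrates the knob, it won't reconfigure a 
live cluster:

import org.apache.hadoop.mapred.JobConf;

public class MapSlotDemo {
    public static void main(String[] args) {
        // mapred.tasktracker.map.tasks.maximum limits how many map
        // tasks one TaskTracker runs at once (default 2 in 0.20).
        // In practice you set it in mapred-site.xml on each node.
        JobConf conf = new JobConf();
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 2);
        System.out.println("map slots per node: "
                + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
    }
}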

When the task runs on a Cassandra node it will iterate through all of the 
rows in the specified ColumnFamily whose keys fall in the token range that 
node is responsible for. If Cassandra is using the RandomPartitioner, data 
will be spread around the cluster. So, for example, a MapReduce job that only 
wants to read the last week's data may have to read from every node. Obviously 
this depends on how the data is broken up between rows / columns. 
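
To make the per-node iteration concrete, here is a rough sketch modelled on 
the word_count example in Cassandra 0.6's contrib (the keyspace, column 
family, column name, and output path are all placeholders, and exact 
signatures vary between 0.6.x builds, so treat this as a sketch rather than 
gospel):

import java.util.Arrays;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogCount {
    // Placeholder names; substitute your own schema.
    static final String KEYSPACE = "Logs";
    static final String COLUMN_FAMILY = "Raw";

    // In 0.6.x the row key arrives as a String (0.7 changes this).
    // Each map() call sees one row from the token range owned by the
    // local node, so reads stay node-local.
    public static class LogMapper
            extends Mapper<String, SortedMap<byte[], IColumn>, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        public void map(String key, SortedMap<byte[], IColumn> columns,
                Context ctx) throws java.io.IOException, InterruptedException {
            ctx.write(new Text(key), ONE);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJobName("logcount");
        job.setJarByClass(LogCount.class);
        job.setMapperClass(LogMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Read from Cassandra instead of HDFS: pull the named column
        // from every row in the column family. Depending on the 0.6
        // build, the input format may also need the Cassandra config
        // (storage-conf.xml) on the classpath to locate the cluster.
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        ConfigHelper.setColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
        SlicePredicate predicate = new SlicePredicate()
                .setColumn_names(Arrays.asList("message".getBytes()));
        ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);

        // Output still goes to HDFS as with any Hadoop job.
        FileOutputFormat.setOutputPath(job, new Path("/tmp/logcount_out"));
        job.waitForCompletion(true);
    }
}

Each input split corresponds to a slice of a node's token range, which is why 
a time-bounded query can still touch every node under the RandomPartitioner.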


2. Some of the other people from Riptano or Rackspace may be able to help 
with Cassandra's outer limits. There is a 400-node cluster planned: 
http://www.riptano.com/blog/riptano-and-digital-reasoning-form-partnership

Hope that helps. 
Aaron

On 22 Oct 2010, at 15:45, Takayuki Tsunakawa wrote:

> Hello,
> 
> I'm evaluating whether Cassandra fits a certain customer well. The
> customer will collect petabytes of logs and analyze them. Could you
> tell me if my understanding is correct and/or give me your opinions?
> I'm sorry that the analysis requirement is not clear yet.
> 
> 1. MapReduce behavior
> I read the source code of Cassandra 0.6.x and understood that
> jobtracker submits the map tasks to all Cassandra nodes, regardless of
> whether the target keyspace's data reside there. That is, if there are
> 1,000 nodes in the Cassandra cluster, jobtracker sends more than 1,000
> map tasks to all of the 1,000 nodes in parallel. If this is correct,
> I'm afraid the startup time of a MapReduce job gets longer as more
> nodes join the Cassandra cluster.
> Is this correct?
> With HBase, jobtracker submits map tasks only to the region servers
> that hold the target data. This behavior is desirable because no
> wasteful task submission is done. Can you suggest the cases where
> Cassandra+MapReduce is better than HBase+MapReduce for log/sensor
> analysis? (Please excuse me for my not presenting the analysis
> requirement).
> 
> 2. Data capacity
> The excerpt from the paper about Amazon Dynamo says that the cluster
> can scale to hundreds of nodes, not thousands. I understand Cassandra
> is similar. Assuming that the recent commodity servers have 2 to 4 TB
> of disks, we need about 1,000 nodes or more to store petabytes of
> data.
> Is the present Cassandra suitable for petabytes of data? If not, is
> any development in progress to increase the scalability?
> 
> 
> --------------------------------------------------
> Finally, Dynamo adopts a full membership model where each node is
> aware of the data hosted by its peers. To do this, each node actively
> gossips the full routing table with other nodes in the system. This
> model works well for a system that contains couple of hundreds of
> nodes. However, scaling such a design to run with tens of thousands of
> nodes is not trivial because the overhead in maintaining the routing
> table increases with the system size. This limitation might be
> overcome by introducing hierarchical extensions to Dynamo. Also, note
> that this problem is actively addressed by O(1) DHT systems(e.g.,
> [14]).
> --------------------------------------------------
> 
> Regards,
> Takayuki Tsunakawa
> 
> 
