[Q] MapReduce behavior and Cassandra's scalability for petabytes of data

Takayuki Tsunakawa Thu, 21 Oct 2010 19:36:59 -0700

Hello,

I'm evaluating whether Cassandra fits a certain customer well. The
customer will collect petabytes of logs and analyze them. Could you
tell me if my understanding is correct and/or give me your opinions?
I'm sorry that the analysis requirement is not clear yet.


1. MapReduce behavior
I read the source code of Cassandra 0.6.x and understood that
jobtracker submits the map tasks to all Cassandra nodes, regardless of
whether the target keyspace's data reside there. That is, if there are
1,000 nodes in the Cassandra cluster, jobtracker sends more than 1,000
map tasks to all of the 1,000 nodes in parallel. If this is correct,
I'm afraid the startup time of a MapReduce job gets longer as more
nodes join the Cassandra cluster.
Is this correct?
With HBase, jobtracker submits map tasks only to the region servers
that hold the target data. This behavior is desirable because no
wasteful task submission is done. Can you suggest the cases where
Cassandra+MapReduce is better than HBase+MapReduce for log/sensor
analysis? (Please excuse me for my not presenting the analysis
requirement).

2. Data capacity
The excerpt from the paper about Amazon Dynamo says that the cluster
can scale to hundreds of nodes, not thousands. I understand Cassandra
is similar. Assuming that the recent commodity servers have 2 to 4 TB
of disks, we need about 1,000 nodes or more to store petabytes of
data.
Is the present Cassandra suitable for petabytes of data? If not, is
any development in progress to increase the scalability?


--------------------------------------------------
Finally, Dynamo adopts a full membership model where each node is
aware of the data hosted by its peers. To do this, each node actively
gossips the full routing table with other nodes in the system. This
model works well for a system that contains couple of hundreds of
nodes. However, scaling such a design to run with tens of thousands of
nodes is not trivial because the overhead in maintaining the routing
table increases with the system size. This limitation might be
overcome by introducing hierarchical extensions to Dynamo. Also, note
that this problem is actively addressed by O(1) DHT systems(e.g.,
[14]).
--------------------------------------------------

Regards,
Takayuki Tsunakawa

[Q] MapReduce behavior and Cassandra's scalability for petabytes of data

Reply via email to