Hello, I'm evaluating whether Cassandra fits a certain customer well. The customer will collect petabytes of logs and analyze them. Could you tell me if my understanding is correct and/or give me your opinions? I'm sorry that the analysis requirement is not clear yet.
1. MapReduce behavior I read the source code of Cassandra 0.6.x and understood that jobtracker submits the map tasks to all Cassandra nodes, regardless of whether the target keyspace's data reside there. That is, if there are 1,000 nodes in the Cassandra cluster, jobtracker sends more than 1,000 map tasks to all of the 1,000 nodes in parallel. If this is correct, I'm afraid the startup time of a MapReduce job gets longer as more nodes join the Cassandra cluster. Is this correct? With HBase, jobtracker submits map tasks only to the region servers that hold the target data. This behavior is desirable because no wasteful task submission is done. Can you suggest the cases where Cassandra+MapReduce is better than HBase+MapReduce for log/sensor analysis? (Please excuse me for my not presenting the analysis requirement). 2. Data capacity The excerpt from the paper about Amazon Dynamo says that the cluster can scale to hundreds of nodes, not thousands. I understand Cassandra is similar. Assuming that the recent commodity servers have 2 to 4 TB of disks, we need about 1,000 nodes or more to store petabytes of data. Is the present Cassandra suitable for petabytes of data? If not, is any development in progress to increase the scalability? -------------------------------------------------- Finally, Dynamo adopts a full membership model where each node is aware of the data hosted by its peers. To do this, each node actively gossips the full routing table with other nodes in the system. This model works well for a system that contains couple of hundreds of nodes. However, scaling such a design to run with tens of thousands of nodes is not trivial because the overhead in maintaining the routing table increases with the system size. This limitation might be overcome by introducing hierarchical extensions to Dynamo. Also, note that this problem is actively addressed by O(1) DHT systems(e.g., [14]). -------------------------------------------------- Regards, Takayuki Tsunakawa