On Mon, Oct 25, 2010 at 12:37 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> On Sun, Oct 24, 2010 at 9:09 PM, Takayuki Tsunakawa
> <tsunakawa.ta...@jp.fujitsu.com> wrote:
>> From: "Jonathan Ellis" <jbel...@gmail.com>
>>> (b) Cassandra generates input splits from the sampling of keys each
>>> node has in memory. So if a node does end up with no data for a
>>> keyspace (because of bad OOP balancing, for instance) it will have no
>>> splits generated or mapped.
>>
>> I understood you are referring to StorageService.getSplits(). This
>> seems to filter out the Cassandra nodes which have no data for the
>> target (keyspace, column family) pair.
>
> Right.
>
>> [Q1]
>> I understood that ColumnFamilyInputFormat requests the above node (or
>> split) filtering from all nodes in the cluster. Is this correct?
>
> Yes.
>
>> [Q2]
>> If Q1 is yes, do more nodes result in a higher cost of MapReduce job
>> startup (for executing InputFormat.getSplits())?
>
> Not really:
> 1) Each node has a sample of its keys in memory all the time, for the
> SSTable row indexes. So getSplits() is a purely in-JVM-memory
> operation.
> 2) Each node computes its own splits. Adding more nodes does not
> change this.
>
>> [Q3-1]
>> How much data is the 400-node cluster Riptano is planning aimed at?
>> If each node has 4 TB of disk and the replication factor is 3, a
>> simple calculation gives 4 TB * 400 / 3 = 533 TB (ignoring commit
>> log, OS areas, etc).
>
> We do not yet have permission to talk about details of this cluster, sorry.
>
>> [Q3-2]
>> Based on the current architecture, how many nodes is the limit, and
>> how much data (approximately) is the practical limit?
>
> There is no reason Cassandra cannot scale to 1000s or more nodes with
> the current architecture.
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>
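For anyone curious what "each node computes its own splits from its
in-memory key sample" looks like in practice, here is a rough standalone
Java sketch of the idea. The names (SplitSketch, TokenRange, localSplits)
are invented for illustration; this is not the actual
StorageService.getSplits() implementation, just the shape of the per-node,
in-memory computation Jonathan describes: walk the already-resident key
sample and cut it into ranges of roughly N sampled keys, with a node that
holds no data producing no splits at all.

import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of per-node split generation from an in-memory
// key sample. Names are made up for this sketch; this is NOT the real
// StorageService.getSplits() code.
public class SplitSketch
{
    // A split is just a (startToken, endToken] range owned by this node.
    static class TokenRange
    {
        final String startToken;
        final String endToken;

        TokenRange(String startToken, String endToken)
        {
            this.startToken = startToken;
            this.endToken = endToken;
        }

        public String toString()
        {
            return "(" + startToken + ", " + endToken + "]";
        }
    }

    // Cuts the node's in-memory key sample (already sorted by token) into
    // ranges of roughly keysPerSplit sampled keys each. Because the sample
    // is already resident in the JVM for the SSTable row indexes, this is a
    // pure in-memory operation -- no disk or network I/O.
    static List<TokenRange> localSplits(List<String> sampledTokens, int keysPerSplit)
    {
        List<TokenRange> splits = new ArrayList<TokenRange>();
        if (sampledTokens.isEmpty())
            return splits; // node owns no data for this CF: no splits generated

        String start = sampledTokens.get(0);
        for (int i = keysPerSplit; i < sampledTokens.size(); i += keysPerSplit)
        {
            String end = sampledTokens.get(i);
            splits.add(new TokenRange(start, end));
            start = end;
        }
        splits.add(new TokenRange(start, sampledTokens.get(sampledTokens.size() - 1)));
        return splits;
    }

    public static void main(String[] args)
    {
        List<String> sample = new ArrayList<String>();
        for (int i = 0; i < 1000; i++)
            sample.add(String.format("%06d", i * 128)); // pretend 1 of every 128 keys is sampled
        System.out.println(localSplits(sample, 256));   // a handful of in-memory splits
    }
}

Adding more nodes adds more of these independent, local computations, which
is why startup cost for getSplits() does not grow with cluster size.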
[Q3] There are some challenges with very large disk nodes. A caveat: I will
use words like "long", "slow", and "large" relatively. If you have great
equipment, e.g. 10G Ethernet between nodes, it will not take "long" to
transfer data; if you have an insane disk pack, it may not take "long" to
compact 200 GB of data. I am basing these statements on server-class
hardware: ~32 GB RAM, ~2 processors, ~6-disk SAS RAID.

Index sampling on startup: if you have very small rows, your indexes become
large. These have to be sampled on startup, and sampling our indexes for
300 GB of data can take 5 minutes (a rough estimate sketch is at the end of
this mail). This is going to be optimized soon.

Joining nodes: with larger systems, joining a new node involves a lot of
data transfer and can take a "long" time. The node join process is going to
be optimized in 0.7 and 0.8 (quite drastic changes in 0.7).

Compaction: major compactions and very large normal compactions can take a
"long" time. For example, while doing a 200 GB compaction that takes 30
minutes, other sstables build up, and more sstables mean "slower" reads.

RAM/disk ratio: achieving a high RAM-to-disk ratio may be easier with
several smaller nodes than with one big node with 128 GB of RAM ($$$).

As Jonathan pointed out, nothing technically is stopping larger disk nodes.
(Just wanted to note some of this as I am in the middle of joining a node
right now :)
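To make the "small rows -> large indexes" point concrete, here is a rough
back-of-envelope sketch in Java. The 1-in-128 sampling interval and the row
sizes are assumptions chosen purely for illustration (the interval is
configurable), not measurements from any cluster mentioned above.

// Back-of-envelope sketch of why small rows make startup index sampling
// expensive. The 1-in-128 sampling interval and the row sizes are
// assumptions for illustration, not measured values.
public class IndexSampleEstimate
{
    public static void main(String[] args)
    {
        long dataBytes = 300L * 1024 * 1024 * 1024; // ~300 GB on disk
        int indexInterval = 128;                    // assume 1 of every 128 keys is sampled

        for (long avgRowBytes : new long[] { 100, 1024, 100 * 1024 })
        {
            long rows = dataBytes / avgRowBytes;
            long sampledEntries = rows / indexInterval;
            System.out.printf("avg row %7d B -> %,15d rows -> %,12d sampled index entries%n",
                              avgRowBytes, rows, sampledEntries);
        }
        // With 100-byte rows that is ~25 million sampled entries to read and
        // hold in memory at startup, vs. ~25 thousand with 100 KB rows.
    }
}

The point is simply that for a fixed amount of data on disk, the number of
rows (and therefore the amount of index sampling work at startup) grows as
rows get smaller, which is why the same 300 GB can be cheap or "slow" to
start depending on row size.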