> > If I have 10B rows in my CF, and I can fit 10k rows per SStable,
> > and the SStables are spread across 5 nodes, and I have 1 bloom
> > filter false positive and 1 tombstone and ask the wrong node for
> > the key, then:
> >
> > Mv = (((2B/10k)+1+1)*3)+1 == ((200,000)+2)*3+1 == 300,007 iops to
> > read a key.
>
> This is a nonsensical arrangement. Assuming each SSTable is the size
> of the default Memtable threshold (128MB), then each row is (128MB /
> 10k) == 12.8k and 10B rows == 128TB of raw data. A typical RF of 3
> takes us to 384TB. The need for enough space for compactions takes
> us to 768TB. That's not 5 nodes, it's more like 100+, and almost 2
> orders of magnitude off your estimate,
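The capacity math in the quoted reply can be sketched in a few lines. This is just a sanity check of the numbers above, using decimal units (1MB = 10**6 bytes) to match the thread's rounding; the variable names are mine, not anything from Cassandra itself:

```python
# Sketch of the capacity math quoted above (decimal units, as in the thread).
MB, TB = 10**6, 10**12

rows = 10 * 10**9                 # 10B rows
rows_per_sstable = 10_000         # 10k rows per SSTable
sstable_size = 128 * MB           # default memtable threshold
rf = 3                            # typical replication factor
compaction_headroom = 2           # spare space needed for compactions

row_size = sstable_size // rows_per_sstable           # 12,800 bytes = 12.8KB
raw_tb = rows * row_size // TB                        # 128TB of raw data
replicated_tb = raw_tb * rf                           # 384TB at RF=3
provisioned_tb = replicated_tb * compaction_headroom  # 768TB total

print(row_size, raw_tb, replicated_tb, provisioned_tb)  # 12800 128 384 768
```

At ~3TB of usable disk per node, 768TB does indeed put the cluster in the hundreds of nodes, not 5.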
You started off so well. You laid out a couple of useful points:

(1) For a 10B-row dataset with 12.8KB rows and RF=3, a Cassandra
cluster requires 768TB. If you have less, you'll run into severe
administration problems. This is not obvious, but it is critical and
extremely useful.

(2) A 12.8KB row size wants a >128MB memtable threshold. Is there a
rule of thumb for this? memTableThreshold = rowsize * 100,000?

> without addressing shortcomings in the rest of it (which I leave to
> more capable folks on this list).

Then this. Obviously no one on this list is more capable, or there'd
already be a good answer. So stop your whinging and help lay the
foundation for one.

We forgot to include the replication factor in the calculation. If we
assume RF=3, that triples the number of SStables that might hold the
key, right? That makes the new formula:

readIops = ((numRowsOnNode * replicationFactor / rowsPerSStable)
            + bloomFalsePositives + tombstones) * 3

It seems bloomFalsePositives and tombstones get lost in the noise. Is
that a reasonable assumption? On a volatile dataset, do they grow to
significant sizes? I'm going to assume both are 0 for now.

Let's change a couple of assumptions to make a more realistic
scenario: 1KB row size, 10B rows, RF=3. That's a ~9TB dataset,
requiring 27TB * 2 = 54TB of storage. Our intrepid admin chooses 20
nodes of 3TB each.

numRowsOnNode = 10B / 20 = 500M
replicationFactor = 3
rowsPerSStable = 128MB / 1KB = 131k

Therefore the worst-case iops per read on this cluster are:

(500M * 3 / 131k) * 3 = (1.5B / 131k) * 3 ≈ 11,450 * 3 ≈ 34,350

Feel free to point out problems with the methodology, but I humbly
suggest that if you do, you also propose a more precise formula.

-- 
timeless(ness)
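For anyone who wants to poke at the methodology, here is the formula above as a small sketch with the thread's numbers plugged in. The function and variable names are mine (this is not any Cassandra API), and rowsPerSStable is rounded to 131k as in the post:

```python
# Sketch of the readIops formula from the post, worst case:
# every SStable that might hold the key costs ~3 iops to check.
def read_iops(num_rows_on_node, replication_factor, rows_per_sstable,
              bloom_false_positives=0, tombstones=0):
    sstables = num_rows_on_node * replication_factor // rows_per_sstable
    return (sstables + bloom_false_positives + tombstones) * 3

num_rows_on_node = 10 * 10**9 // 20   # 10B rows over 20 nodes = 500M
rows_per_sstable = 131_000            # 128MB / 1KB rows, rounded as in the post

print(read_iops(num_rows_on_node, 3, rows_per_sstable))  # -> 34350
```

Plugging in nonzero bloomFalsePositives and tombstones shows why they get lost in the noise: each one adds only 3 iops against tens of thousands.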