> > If I have 10B rows in my CF, and I can fit 10k rows per SStable,
> > and the SStables are spread across 5 nodes, and I have 1 bloom
> > filter false positive and 1 tombstone and ask the wrong node for
> > the key, then:
> >
> > Mv = (((2B/10k)+1+1)*3)+1 == ((200,000)+2)*3+1 == 300,007 iops to
> > read a key.
>
> This is a nonsensical arrangement. Assuming each SSTable is the size
> of the default Memtable threshold (128MB), then each row is (128MB /
> 10k) == 12.8k and 10B rows == 128TB of raw data. A typical RF of 3
> takes us to 384TB. The need for enough space for compactions takes
> us to 768TB. That's not 5 nodes, it's more like 100+, and almost 2
> orders of magnitude off your estimate,
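The capacity math in the quoted reply can be sketched in a few lines. This is just a sanity check of the numbers above, using decimal units (1MB = 10**6 bytes) to match the thread's rounding; the variable names are mine, not anything from Cassandra itself:

```python
# Sketch of the capacity math quoted above (decimal units, as in the thread).
MB, TB = 10**6, 10**12

rows = 10 * 10**9                 # 10B rows
rows_per_sstable = 10_000         # 10k rows per SSTable
sstable_size = 128 * MB           # default memtable threshold
rf = 3                            # typical replication factor
compaction_headroom = 2           # spare space needed for compactions

row_size = sstable_size // rows_per_sstable           # 12,800 bytes = 12.8KB
raw_tb = rows * row_size // TB                        # 128TB of raw data
replicated_tb = raw_tb * rf                           # 384TB at RF=3
provisioned_tb = replicated_tb * compaction_headroom  # 768TB total

print(row_size, raw_tb, replicated_tb, provisioned_tb)  # 12800 128 384 768
```

At ~3TB of usable disk per node, 768TB does indeed put the cluster in the hundreds of nodes, not 5.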
You started off so well. You laid out a couple of useful points:

(1) For a 10B-row dataset with 12.8KB rows and RF=3, a Cassandra
cluster requires 768TB. If you have less, you'll run into severe
administration problems. This is not obvious, but it is critical and
extremely useful.

(2) A 12.8KB row size wants a >128MB memtable threshold. Is there a
rule of thumb for this? memTableThreshold = rowsize * 100,000?

> without addressing shortcomings in the rest of it (which I leave to
> more capable folks on this list).

Then this. Obviously no one on this list is more capable, or there'd
already be a good answer. So stop your whinging and help lay the
foundation for one.

We forgot to include the replication factor in the calculation. If we
assume RF=3, that triples the number of SStables that might hold the
key, right? That makes the new formula:

readIops = ((numRowsOnNode * replicationFactor / rowsPerSStable)
            + bloomFalsePositives + tombstones) * 3

It seems bloomFalsePositives and tombstones get lost in the noise. Is
that a reasonable assumption? On a volatile dataset, do they grow to
significant sizes? I'm going to assume both are 0 for now.

Let's change a couple of assumptions to make a more realistic
scenario: 1KB row size, 10B rows, RF=3. That's a ~9TB dataset,
requiring 27TB * 2 = 54TB of storage. Our intrepid admin chooses 20
nodes of 3TB each.

numRowsOnNode = 10B / 20 = 500M
replicationFactor = 3
rowsPerSStable = 128MB / 1KB = 131k

Therefore the worst-case iops per read on this cluster are:

(500M * 3 / 131k) * 3 = (1.5B / 131k) * 3 ≈ 11,450 * 3 ≈ 34,350

Feel free to point out problems with the methodology, but I humbly
suggest that if you do, you also propose a more precise formula.

-- 
timeless(ness)
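For anyone who wants to poke at the methodology, here is the formula above as a small sketch with the thread's numbers plugged in. The function and variable names are mine (this is not any Cassandra API), and rowsPerSStable is rounded to 131k as in the post:

```python
# Sketch of the readIops formula from the post, worst case:
# every SStable that might hold the key costs ~3 iops to check.
def read_iops(num_rows_on_node, replication_factor, rows_per_sstable,
              bloom_false_positives=0, tombstones=0):
    sstables = num_rows_on_node * replication_factor // rows_per_sstable
    return (sstables + bloom_false_positives + tombstones) * 3

num_rows_on_node = 10 * 10**9 // 20   # 10B rows over 20 nodes = 500M
rows_per_sstable = 131_000            # 128MB / 1KB rows, rounded as in the post

print(read_iops(num_rows_on_node, 3, rows_per_sstable))  # -> 34350
```

Plugging in nonzero bloomFalsePositives and tombstones shows why they get lost in the noise: each one adds only 3 iops against tens of thousands.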