Hi Carl,

I can’t speak to all of the internal mechanics or what the committers factored 
in.  I have no doubt that intelligent decisions were the goal given the context 
of the time.  Where I’m coming from is that, at least in our case, we see nodes 
with a fair hunk of file data sitting in buffer cache, and the benefit of that 
gets throttled when the buffered data isn’t consumable until it has been 
decompressed.  Unfortunately, with Java you aren’t in the tidy situation where 
you can just claim a page of memory from buffer cache with no real overhead, so 
it’s plausible that the memory copy and the memory traversal for decompression 
don’t differ terribly.  Definitely something I want to look into.

Most of the I/O stalling I’ve seen relates to flushing of dirty pages from the 
writes of sstable compaction.  What to do about it depends a lot on the 
specifics of your situation.  The TL;DR is that short bursts of write ops can 
choke out everything else once the I/O queue is filled.  It doesn’t really show 
up in mean/median/95th-percentile numbers; it starts to show at the 99th 
percentile, and definitely at the 99.9th.
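As a rough illustration of why it only shows in the tail (numbers invented, not 
from our metrics): a handful of reads caught behind a flush burst barely move 
the mean, but they own the upper tail.

    import java.util.Arrays;

    public class TailLatencyDemo {
        // naive nearest-rank percentile over a sorted array of microsecond latencies
        static double percentileMs(long[] sortedMicros, double p) {
            int idx = (int) Math.ceil(p / 100.0 * sortedMicros.length) - 1;
            return sortedMicros[Math.max(0, idx)] / 1000.0;
        }

        public static void main(String[] args) {
            long[] micros = new long[10_000];
            Arrays.fill(micros, 2_000);          // 2ms baseline reads
            for (int i = 0; i < 20; i++) {
                micros[i] = 200_000;             // 20 reads stuck behind a flush burst: 200ms
            }
            Arrays.sort(micros);
            double meanMs = Arrays.stream(micros).average().orElse(0) / 1000.0;
            System.out.printf("mean=%.2fms p50=%.1fms p95=%.1fms p99=%.1fms p99.9=%.1fms%n",
                    meanMs, percentileMs(micros, 50), percentileMs(micros, 95),
                    percentileMs(micros, 99), percentileMs(micros, 99.9));
            // prints roughly: mean=2.40ms p50=2.0ms p95=2.0ms p99=2.0ms p99.9=200.0ms
        }
    }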

I don’t know whether the interleaving with I/O wait makes some of the 
decompression effectively free; it’s entirely plausible that this has been 
observed and the current approach tuned accordingly.  It’s pretty reasonable 
CPU scheduling behavior unless the cores are otherwise being kept busy, e.g. 
with memory copies or with stalls pulling data from memory into CPU cache.  Jon 
Haddad recently pointed me to some resources that might explain getting less 
out of the CPUs than the raw CPU numbers suggest, but I haven’t yet wrapped my 
head around the details enough to decide how I would investigate reducing the 
number of CPU instructions executed.
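If I do poke at it, it will probably start with a toy along these lines - a 
made-up file format of length-prefixed, independently deflated chunks, where 
the next chunk’s read is kicked off while the current one inflates - just to 
see how much of the inflate cost hides under the I/O wait:

    import java.io.*;
    import java.util.concurrent.*;
    import java.util.zip.InflaterInputStream;

    public class OverlappedDecompress {
        public static void main(String[] args) throws Exception {
            ExecutorService io = Executors.newSingleThreadExecutor();
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(args[0])))) {
                Future<byte[]> next = io.submit(() -> readChunk(in));
                byte[] compressed;
                while ((compressed = next.get()) != null) {
                    next = io.submit(() -> readChunk(in));   // next chunk's I/O wait starts now
                    byte[] plain = inflate(compressed);      // CPU work overlaps that wait
                    // ... hand `plain` off to whatever wanted the data ...
                }
            } finally {
                io.shutdownNow();
            }
        }

        // made-up chunk framing: 4-byte length prefix, then that many deflated bytes
        static byte[] readChunk(DataInputStream in) throws IOException {
            int len;
            try {
                len = in.readInt();
            } catch (EOFException eof) {
                return null;
            }
            byte[] buf = new byte[len];
            in.readFully(buf);
            return buf;
        }

        static byte[] inflate(byte[] compressed) throws IOException {
            try (InflaterInputStream iis =
                    new InflaterInputStream(new ByteArrayInputStream(compressed))) {
                return iis.readAllBytes();
            }
        }
    }

Whether that overlap actually shows up as lower read latency is exactly the 
thing I’d want to measure rather than assume.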

I do know that our latency metrics definitely showed read times being impacted 
when writes flush in a spike, so we tuned to mitigate it.  It probably doesn’t 
take much to trigger a read stall: anything that stats a filesystem entry (via 
a cache miss on the directory inodes, or if you haven’t disabled atime) may be 
just as subject to stalling as anything that reads content from the file 
itself.
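A crude way to watch for that (illustration only; point it at whatever file you 
care about) is to time a bare metadata stat in a loop during a flush spike - if 
even that stalls, reads touching file contents are in worse shape:

    import java.nio.file.*;
    import java.nio.file.attribute.BasicFileAttributes;

    public class StatProbe {
        public static void main(String[] args) throws Exception {
            Path p = Paths.get(args[0]);   // e.g. some table's Data.db file
            for (int i = 0; i < 60; i++) {
                long t0 = System.nanoTime();
                BasicFileAttributes attrs = Files.readAttributes(p, BasicFileAttributes.class);
                long micros = (System.nanoTime() - t0) / 1_000;
                System.out.println("stat took " + micros + "us  size=" + attrs.size());
                Thread.sleep(1_000);       // once a second, watch for outliers during a flush
            }
        }
    }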

No opinion on 3.11.x handling of column metadata.  I’ve read that it is a great 
deal more complicated and a factor in various performance headaches, but like 
you, I haven’t gotten into the source around that so I don’t have a mental 
model for the details.

From: Carl Mueller <carl.muel...@smartthings.com.INVALID>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, December 10, 2019 at 3:19 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Dynamo autoscaling: does it beat cassandra?

Dor and Reid: thanks, that was very helpful.

Is the large amount of compression an artifact of pre-3.11 Cassandra, where the 
column names were stored per cell (combined with the clustering key, for 
extreme verbosity, I think), so compression would at least be effective against 
those portions of the sstable data? IIRC the Cassandra committers figured that 
as long as you can shrink the data, the reduced size drops the time to read it 
off disk, maybe even the time to get it from memory into CPU cache, and the CPU 
cost to decompress is somewhat "free" at that point since everything else is 
stalled on I/O or memory reads?

But I don't know how the 3.11.x format works to avoid spamming those column 
names; I haven't torn into that part of the code.

On Tue, Dec 10, 2019 at 10:15 AM Reid Pinchback 
<rpinchb...@tripadvisor.com> wrote:
Note that DynamoDB I/O throughput scaling doesn’t work well with brief spikes.  
Unless you write your own machinery to manage the provisioning, by the time AWS 
scales the I/O bandwidth, your incident has long since passed.  It’s not 
something to rely on if you have a latency SLA.  It really only works for 
situations like a sustained change in load, e.g. a sinusoidal daily traffic 
pattern, or periodic large batch operations that run for an hour or two where 
you need the I/O adjustment while they take place.
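If you do build your own machinery, it tends to boil down to scheduling 
capacity bumps ahead of the peaks you already know about rather than reacting 
to them.  Roughly (the scaleTo() interface is a placeholder, not an AWS SDK 
method; the real work is a DynamoDB UpdateTable call with new provisioned 
throughput, and the table name and numbers are invented):

    import java.time.LocalTime;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class DailyPreScaler {
        interface ThroughputClient {
            void scaleTo(String table, long readUnits, long writeUnits);  // placeholder for UpdateTable
        }

        public static void schedule(ThroughputClient client, String table) {
            ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
            ses.scheduleAtFixedRate(() -> {
                int hour = LocalTime.now().getHour();
                if (hour == 17) {
                    client.scaleTo(table, 20_000, 40_000);   // bump ahead of the 18:00 peak
                } else if (hour == 23) {
                    client.scaleTo(table, 5_000, 10_000);    // drop back overnight
                }
            }, 0, 30, TimeUnit.MINUTES);
        }

        public static void main(String[] args) {
            schedule((table, r, w) ->
                    System.out.println("would UpdateTable " + table + " to " + r + "r/" + w + "w"),
                    "device_events");
        }
    }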

Also note that DynamoDB routinely chokes on write contention, which C* would 
rarely do.  About the only benefit DynamoDB has over C* is that more of its 
operations function as atomic mutations of an existing row.

One thing to also factor into the comparison is developer effort.  The DynamoDB 
API isn’t exactly tuned to making developers productive.  Most of the AWS APIs 
aren’t, really, once you use them for non-toy projects. AWS scales in many 
dimensions, but total developer effort is not one of them when you are talking 
about high-volume tier one production systems.

To respond to one of the other original points/questions: yes, key and row 
caches don’t seem to be a win, but that will vary with your specific usage 
pattern.  Caches need a good enough hit rate to offset the GC impact.  Even 
when C* lets you move things off heap, you’ll see a fair number of GC-able 
artifacts associated with data in the caches.  The chunk cache somewhat wins by 
being off-heap, and because it isn’t just I/O avoidance with that cache; you 
also benefit from the data already being decompressed.  However, I’ve started 
to wonder how often sstable compression is worth the performance drag and the 
internal C* complexity.  If you compare to where a more traditional RDBMS like 
Postgres uses compression, the use of compression is more selective; you only 
bear the cost in the places already determined to win from the tradeoff.
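The selective equivalent in C* terms is just per-table settings, e.g. leave 
compression on for big cold tables and turn it off where decompression hurts 
reads.  A sketch with the 4.x Java driver (table names invented; double-check 
the compression map syntax for your Cassandra version):

    import com.datastax.oss.driver.api.core.CqlSession;

    public class SelectiveCompression {
        public static void main(String[] args) {
            try (CqlSession session = CqlSession.builder().build()) {
                // hot, latency-sensitive table: stop compressing its sstables
                session.execute("ALTER TABLE prod.device_events WITH compression = {'enabled': 'false'}");
                // large, rarely-read table: keep (or re-enable) compression
                session.execute("ALTER TABLE prod.audit_log WITH compression = "
                        + "{'class': 'LZ4Compressor', 'chunk_length_in_kb': 64}");
            }
        }
    }

The point being that, as with Postgres, you opt in table by table rather than 
paying for it everywhere.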

From: Dor Laor <d...@scylladb.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Monday, December 9, 2019 at 5:58 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Dynamo autoscaling: does it beat cassandra?

The DynamoDB model has several key benefits over Cassandra's.
The most notable one is the tablet concept: data is partitioned into 10GB
chunks, so scaling happens when a tablet reaches maximum capacity and it is
automatically split in two. This can happen in parallel across the entire
data set, so there is no notion of growing the number of nodes or vnodes.
Since the actual hardware is multi-tenant, the average server should have
plenty of spare capacity to receive these streams.
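A toy model of that split behavior (nothing to do with DynamoDB's real 
internals): each tablet splits on its own once it crosses the size cap, with no 
cluster-wide topology change involved.

    import java.util.ArrayList;
    import java.util.List;

    public class TabletSplitDemo {
        static final long MAX_BYTES = 10L * 1024 * 1024 * 1024;   // ~10GB cap per tablet

        static class Tablet {
            final long startToken, endToken, bytes;
            Tablet(long startToken, long endToken, long bytes) {
                this.startToken = startToken; this.endToken = endToken; this.bytes = bytes;
            }
            public String toString() {
                return "[" + startToken + ".." + endToken + ") ~" + (bytes >> 30) + "GB";
            }
        }

        // a full tablet splits in two on its own; nothing else in the data set is involved
        static List<Tablet> maybeSplit(Tablet t) {
            List<Tablet> out = new ArrayList<>();
            if (t.bytes < MAX_BYTES) {
                out.add(t);
            } else {
                long mid = t.startToken + (t.endToken - t.startToken) / 2;
                out.add(new Tablet(t.startToken, mid, t.bytes / 2));
                out.add(new Tablet(mid, t.endToken, t.bytes / 2));
            }
            return out;
        }

        public static void main(String[] args) {
            Tablet full = new Tablet(0, 1_000_000, 12L * 1024 * 1024 * 1024);   // 12GB, over the cap
            maybeSplit(full).forEach(System.out::println);
        }
    }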

That said, when we benchmarked DynamoDB and just hit it with an ingest workload,
even with reserved capacity, we had to slow down the pace because we received
many 'error 500' responses, i.e. internal server errors. Their hot partitions
do not behave well either.

So I believe a 10% growth in capacity with a good key distribution can be
handled well, but a 2x growth in a short time will fail. That's something you'd
expect from any database. Dynamo has an advantage with its tablets and
multitenancy, but it has issues with hot partitions and the accounting of hot
keys, which Cassandra handles better by caching them.

Dynamo allows you to detach compute from storage, which is a key benefit in a
serverless, spiky deployment.

On Mon, Dec 9, 2019 at 1:02 PM Jeff Jirsa <jji...@gmail.com> wrote:
Expansion probably much faster in 4.0 with complete sstable streaming (skips 
ser/deser), though that may have diminishing returns with vnodes unless you're 
using LCS.
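As an analogy for what skipping ser/deser buys (not Cassandra's actual 
streaming path; host, port and file path are invented), the sending side can 
push the file bytes as-is with transferTo() instead of deserializing partitions 
and re-serializing them onto the wire:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class WholeFileStream {
        public static void main(String[] args) throws IOException {
            Path sstable = Path.of("/var/lib/cassandra/data/ks/tbl/na-1-big-Data.db");
            try (FileChannel file = FileChannel.open(sstable, StandardOpenOption.READ);
                 SocketChannel socket = SocketChannel.open(new InetSocketAddress("10.0.0.12", 7000))) {
                long position = 0;
                long size = file.size();
                while (position < size) {
                    // kernel-level copy where supported; no per-partition ser/deser on the way out
                    position += file.transferTo(position, size - position, socket);
                }
            }
        }
    }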

Dynamo on demand / autoscaling isn't magic - they're overprovisioning to give 
you the burst, then expanding on demand. That overprovisioning comes with a 
cost. Unless you're actively and regularly scaling, you're probably going to 
pay more for it.

It'd be cool if someone focused on this - I think the faster streaming goes a 
long way. The way vnodes work today makes it difficult to add more than one 
node at a time without violating consistency, and that's unlikely to change, 
but if each individual node is much faster, that may mask it a bit.



On Mon, Dec 9, 2019 at 12:35 PM Carl Mueller 
<carl.muel...@smartthings.com.invalid> wrote:
Dynamo salespeople have been pushing autoscaling abilities that have been one 
of the key temptations to our management to switch off of cassandra.

Has anyone done any numbers on how well dynamo will autoscale demand spikes, 
and how we could architect cassandra to compete with such abilities?

We probably could overprovision and, with the presumably higher cost of dynamo, 
beat it, although the sales engineers claim they are closing the cost gap too. 
We could vertically scale to some degree, but node expansion seems close.

VNode expansion is still limited to one at a time?

We use VNodes, so we can't do Netflix's cluster doubling, correct? With cass 
4.0's alleged segregation of the data by token we could, though, and possibly 
also "prep" the node by having the necessary sstables already present ahead of 
time?

There's always "caching" too, but there isn't a lot of data on general fronting 
of cassandra with caches, and the row cache continues to be mostly useless?
