Yep, my approach is definitely naive with respect to hotspotting. If someone had that trouble, they could exhaust the iterator returned by getReplicas() and distribute their writes more evenly (which might result in better statement distribution, but wouldn't change the workload on the cluster). In the end they're going to get in trouble with hotspotting regardless of whether they use async single statements or batches. The single-statement async path prefers the first replica returned, so this logic is consistent with the default model.
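If someone did want to spread a hot partition's writes across its full replica set instead, a rough sketch might look like the following (this assumes the plain driver Session where our example below uses our CQLSession wrapper; the rotating counter is purely illustrative, not something the driver does for you):

import java.util.concurrent.atomic.AtomicLong
import scala.collection.JavaConverters._
import scala.util.control.NonFatal
import com.datastax.driver.core.{Host, Session, Statement}

object ReplicaRotation {
  // Illustrative process-wide counter used to rotate across replicas.
  private val counter = new AtomicLong(0)

  // Group statements by a rotating choice among all replicas for each
  // statement's routing key, rather than always taking the first replica
  // that getReplicas() returns.
  def groupByRotatingReplica(statements: Seq[Statement])
                            (implicit session: Session): Map[Host, Seq[Statement]] = {
    val meta = session.getCluster.getMetadata
    statements.groupBy { st =>
      try {
        val replicas = meta.getReplicas(st.getKeyspace, st.getRoutingKey).asScala.toIndexedSeq
        if (replicas.isEmpty) null
        else replicas((counter.getAndIncrement() % replicas.size).toInt)
      } catch { case NonFatal(_) => null }
    }
  }
}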
> Lots of folks are still stuck on maximum utilization, ironically these same people tend to focus on using spindles for storage and so will ultimately end up having to throttle ingest to allow compaction to catch up

Yeah, otherwise known as cost sensitivity, with the unfortunate side effect of making it easy to accidentally overwhelm a cluster as a new operator, since the warning signs look different than they do for most other data stores.

Straying a bit far afield here, but I actually think it would be a nice feature if *by default* Cassandra artificially throttled writes as compaction starts falling behind, as an early warning sign (a feature you could turn off with a flag). Cassandra does a great job of absorbing bursty writes, but unfortunately that masks (for the new operator) the warning signs that your sustained write rate is more than the cluster can handle. Writes are still fast, so you assume the cluster is healthy, and by the time there's backpressure to the client, you're possibly already past the point of simple recovery (e.g. you no longer have enough excess I/O to support bootstrapping new nodes). Throttling would also actually free up some I/O to keep the cluster from tipping over so hard.

On Fri, Sep 25, 2015 at 12:14 PM, Ryan Svihla <r...@foundev.pro> wrote:

> I think my main point is still, unlogged token aware batches are great, but if your writes are large enough, they may actually hurt rather than help, and likewise if your writes are too small, async only is likely only going to hurt. I'd say the average user I've had to help (with my selection bias) has individual writes already on the large side of optimal, so batching frequently hurts them. Also they tend not to do async in the first place.
>
> In summary, batch or not is IMO the wrong area to focus on; total write payload sizing for your cluster is the factor to focus on, and however you get there is fantastic. More replies inline:
>
> On Sep 25, 2015, at 1:24 PM, Eric Stevens <migh...@gmail.com> wrote:
>
> > compaction usually is the limiter for most clusters, so the difference between async versus unlogged batch ends up being minor or worse..non existent cause the hardware and data model combination result in compaction being the main throttle.
>
> If your number of records to load per second is predetermined (as would be the case in any production use case), then this doesn't make any difference on compaction whether loaded as batches vs as single statements; your cluster needs to support the same number and shape of mutates either way.
>
> Not everyone is as grown up about their cluster sizing. Lots of folks are still stuck on maximum utilization, ironically these same people tend to focus on using spindles for storage and so will ultimately end up having to throttle ingest to allow compaction to catch up. Anyway, in these admittedly awful situations, throttling of ingest is all too common as the commit log can easily outstrip compaction.
>
> > if you add in token awareness to your batch..you've basically eliminated the primary complaint of using unlogged batches so why not do that.
>
> This is absolutely the right idea if your driver supports it, but the gain is smaller than I would have expected based on the warnings of imminent doom when we've had this conversation before. If your driver supports token awareness, use that to group statements by primary replica and concurrently execute those that way.
> Here's the code we're using (in Scala using the Java driver):
>
> def groupByFirstReplica()(implicit session: CQLSession): Map[Host, CQLBatch] = {
>   val meta = session.getCluster.getMetadata
>   statements.groupBy { st =>
>     try {
>       meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().next
>     } catch { case NonFatal(e) =>
>       null
>     }
>   } mapValues { st => CQLBatch(st) }
> }
>
> We now have a map of primary host to sub-batch for all the statements in our logical batch. We can now do either of these (depending on how greedy we want to be in our client; Future.traverse is preferred and nicer, Future.sequence is greedier and more resource intensive):
>
> Future.sequence(groupByFirstReplica().values.map(_.execute())).map(_.flatten)
> Future.traverse(groupByFirstReplica().values) { _.execute() }.map(_.flatten)
>
> We get back Future[Iterable[ResultSet]] - this future completes when the logical batch's sub-batches have all completed.
>
> Note that with the DSE Java driver, for the above to succeed in its intent, the statements need to be prepared statements (for st.getRoutingKey to return non-null), and either the keyspace has to be fully defined in the CQL, or you have to have set the correct keyspace when you created the connection (for st.getKeyspace to return non-null). Otherwise the values given to meta.getReplicas will fail to resolve a primary host, which results in doing token-unaware batches (i.e. you'll get back a Map(null -> allStatements)). However, those same criteria are required for single statements to be token aware.
>
> This is excellent stuff, my only concern with primary replicas is for people with uneven partitions, and the occasionally stupidly fat one. I'd rather spread those writes around the other replicas instead of beating up the primary one. However, for a well modeled partition key the approach you outline is probably optimal.
>
> On Fri, Sep 25, 2015 at 7:30 AM, Ryan Svihla <r...@foundev.pro> wrote:
>
>> Generally this is all correct, but I cannot emphasize enough how much this "just depends", and today I generally move people to async inserts first before trying to micro-optimize. Some things to keep in mind:
>>
>> - compaction usually is the limiter for most clusters, so the difference between async versus unlogged batch ends up being minor or worse..non existent cause the hardware and data model combination result in compaction being the main throttle.
>> - if you add in token awareness to your batch..you've basically eliminated the primary complaint of using unlogged batches so why not do that. When I was at DataStax I made some similar suggestions for token aware batch after seeing the perf improvements with Spark writes using unlogged batch. Several others did as well so I'm not the first one with this idea.
>> - write size makes in my experience the largest difference BY FAR about which is faster, and the number of statements is largely irrelevant compared to the total payload size. Depending on the hardware etc., a good rule of thumb is writes below 1k bytes tend to get really inefficient and writes that are over 100k tend to slow down total throughput. I'll reemphasize this magic number has been different on almost every cluster I've tuned.
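As a rough illustration of that payload-sizing point, each replica's statements could be chunked into unlogged batches capped by approximate size before execution. This is only a sketch: estimateBytes is a placeholder the caller has to supply (e.g. by summing bound value sizes), and the 100 KB default just echoes the rule of thumb above, not a universal constant.

import scala.collection.mutable.ListBuffer
import com.datastax.driver.core.{BatchStatement, Statement}

// Split one replica's statements into unlogged batches whose estimated total
// payload stays under maxBatchBytes. estimateBytes is a caller-supplied
// placeholder for however you approximate a statement's size.
def chunkByPayload(statements: Seq[Statement],
                   estimateBytes: Statement => Long,
                   maxBatchBytes: Long = 100 * 1024L): Seq[BatchStatement] = {
  val batches = ListBuffer.empty[BatchStatement]
  var current = new BatchStatement(BatchStatement.Type.UNLOGGED)
  var currentBytes = 0L
  statements.foreach { st =>
    val size = estimateBytes(st)
    // Close out the current batch before it would exceed the cap.
    if (currentBytes > 0 && currentBytes + size > maxBatchBytes) {
      batches += current
      current = new BatchStatement(BatchStatement.Type.UNLOGGED)
      currentBytes = 0L
    }
    current.add(st)
    currentBytes += size
  }
  if (current.size() > 0) batches += current
  batches.toList
}

The resulting batches could then be executed asynchronously per replica in the same way as the sub-batches above.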
>> In summary all this means is, too small or too large of writes are slow, and unlogged batches may involve some extra hops, if you eliminate the extra hops by token awareness then it just comes down to write size optimization.
>>
>> On Sep 24, 2015, at 5:18 PM, Eric Stevens <migh...@gmail.com> wrote:
>>
>> > I side-tracked some punctual benchmarks and stumbled on the observations of unlogged inserts being *A LOT* faster than the async counterparts.
>>
>> My own testing agrees very strongly with this. When this topic came up on this list before, there was a concern that batch coordination produces GC pressure in your cluster because you're involving nodes which aren't *strictly speaking* necessary to be involved.
>>
>> Our own testing shows some small impact on this front, but really lightweight GC tuning mitigated the effects by putting a little more room in Xmn (if you're still on CMS garbage collector). On G1GC (which is what we run in production) we weren't able to measure a difference.
>>
>> Our testing shows data loads being as much as 5x to 8x faster when using small concurrent batches over using single statements concurrently. We tried three different concurrency models.
>>
>> To save on coordinator overhead, we group the statements in our "batch" by replica (using the functionality exposed by the DataStax Java driver), and do essentially token aware batching. This still has a *small* amount of additional coordinator overhead (since the data size of the unit of work is larger, and sits in memory in the coordinator longer). We've been running this way successfully for months with *sustained* rates north of 50,000 mutates per second. We burst *much* higher.
>>
>> Through trial and error we determined we got diminishing returns in the realm of 100 statements per token-aware batch. It looks like your own data bears that out as well. I'm sure that's workload dependent though.
>>
>> I've been disagreed with on this topic in this list in the past despite the numbers I was able to post. Nobody has shown me numbers (nor anything else concrete) that contradict my position though, so I stand by it. There's no question in my mind, if your mutates are of any significant volume and you care about the performance of them, token aware unlogged batching is the right strategy. When we reduce our batch sizes or switch to single async statements, we fall over immediately.
>>
>> On Tue, Sep 22, 2015 at 7:54 AM, Gerard Maas <gerard.m...@gmail.com> wrote:
>>
>>> General advice advocates for individual async inserts as the fastest way to insert data into Cassandra. Our insertion mechanism is based on that model and recently we have been evaluating performance, looking to measure and optimize our ingestion rate.
>>>
>>> I side-tracked some punctual benchmarks and stumbled on the observations of unlogged inserts being *A LOT* faster than the async counterparts.
>>>
>>> In our tests, unlogged batch shows increased throughput and lower cluster CPU usage, so I'm wondering where the tradeoff might be.
>>>
>>> I compiled those observations in this document that I'm sharing and opening up for comments. Are we observing some artifact or should we set the record straight for unlogged batches to achieve better insertion throughput?
>>>
>>> https://docs.google.com/document/d/1qSIJ46cmjKggxm1yxboI-KhYJh1gnA6RK-FkfUg6FrI
>>>
>>> Let me know.
>>>
>>> Kind regards,
>>>
>>> Gerard.
>>
>> Regards,
>>
>> Ryan Svihla