Due to a cut-and-paste error, those flamegraphs were a recording of the whole system, not just Cassandra. Throughput is approximately 30k rows/sec.
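(For reference, recording a single process rather than the whole box looks roughly like this with perf plus Brendan Gregg's FlameGraph scripts. The PID lookup, sampling rate, duration and FlameGraph checkout path below are illustrative, not the exact commands used for these graphs; Java frames also need a symbol map, e.g. via perf-map-agent, as described in the java-flame-graphs post linked further down the thread.)

```bash
# Find the Cassandra JVM and attach perf to that PID only,
# instead of sampling system-wide with -a.
CASS_PID=$(pgrep -f CassandraDaemon | head -1)

# Sample on-CPU stacks for that PID at 99 Hz for 60 seconds.
perf record -F 99 -g -p "$CASS_PID" -- sleep 60

# Fold the stacks and render the SVG (assumes the FlameGraph repo
# is checked out in ~/FlameGraph).
perf script | ~/FlameGraph/stackcollapse-perf.pl > cassandra.folded
~/FlameGraph/flamegraph.pl cassandra.folded > flamegraph_cassandra.svg
```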
Here are the graphs with just the Cassandra PID:

- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars2.svg
- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars2.svg
- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars2.svg

And here are graphs taken during a cqlsh COPY FROM into the same table, using real data and MAXBATCHSIZE=2. Throughput is good at approximately 110k rows/sec.

- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg

-- Eric

On Thu, May 25, 2017 at 6:44 PM, Eric Pederson <eric...@gmail.com> wrote:

> Totally understood :)
>
> I forgot to mention - I set the /proc/irq/*/smp_affinity mask to include all of the CPUs. Actually most of them were set that way already (for example, 0000ffff,ffffffff) - it might be because irqbalanced is running. But for some reason the interrupts are all being handled on CPU 0 anyway.
>
> I see this in /var/log/dmesg on the machines:
>
>> Your BIOS has requested that x2apic be disabled.
>> This will leave your machine vulnerable to irq-injection attacks.
>> Use 'intremap=no_x2apic_optout' to override BIOS request.
>> Enabled IRQ remapping in xapic mode
>> x2apic not enabled, IRQ remapping is in xapic mode
>
> In a reply to one of the comments on the smp-affinity article, the author says:
>
>> When IO-APIC configured to spread interrupts among all cores, it can handle up to eight cores. If you have more than eight cores, kernel will not configure IO-APIC to spread interrupts. Thus the trick I described in the article will not work. Otherwise it may be caused by buggy BIOS or even buggy hardware.
>
> I'm not sure if either of them is relevant to my situation.
>
> Thanks!
>
> -- Eric
>
> On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>
>> You shouldn't need a kernel recompile. Check out the section "Simple solution for the problem" in http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux. You can balance your requests across up to 8 CPUs.
>>
>> I'll check out the flame graphs in a little bit - in the middle of something and my brain doesn't multitask well :)
>>
>> On Thu, May 25, 2017 at 1:06 PM Eric Pederson <eric...@gmail.com> wrote:
>>
>>> Hi Jonathan -
>>>
>>> It looks like these machines are configured to use CPU 0 for all I/O interrupts. I don't think I'm going to get the OK to compile a new kernel for them to balance the interrupts across CPUs, but to mitigate the problem I taskset the Cassandra process to run on all CPUs except 0. It didn't change the performance though. Let me know if you think it's crucial that we balance the interrupts across CPUs and I can try to lobby for a new kernel.
>>>
>>> Here are flamegraphs from each node from a cassandra-stress ingest into a table representative of what we are going to be using. This table is also roughly 200 bytes, with 64 columns and the primary key (date, sequence_number). Cassandra-stress was run on 3 separate client machines. Using cassandra-stress to write to this table I see the same thing: neither disk, CPU, nor network is fully utilized.
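(To make the interrupt discussion in the quoted messages above concrete: checking where interrupts are landing, widening an smp_affinity mask, and keeping Cassandra off CPU 0 can be sketched as below. The IRQ number, mask and PID lookup are placeholders rather than the exact values on these machines, and the /proc/irq writes need root.)

```bash
# See which CPU is servicing each interrupt source
# (in this case everything was landing on CPU 0).
cat /proc/interrupts

# Inspect and widen the affinity mask for a single IRQ; "71" is a
# placeholder for a NIC or disk controller IRQ number.
cat /proc/irq/71/smp_affinity
echo 0000ffff,ffffffff > /proc/irq/71/smp_affinity

# Mitigation tried instead of a kernel rebuild: pin the Cassandra
# process to every CPU except 0 (48-core box, so CPUs 1-47).
taskset -cp 1-47 $(pgrep -f CassandraDaemon | head -1)
```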
>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars.svg
>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars.svg
>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars.svg
>>>
>>> Re: GC: In the stress run with the parameters above, two of the three nodes log zero or one GCInspectors. On the other hand, the 3rd machine logs a GCInspector every 5 seconds or so, 300-500 ms each time. I found out that the 3rd machine actually has different specs from the other two. It's an older box with the same RAM but fewer CPUs (32 instead of 48), a slower SSD and slower memory. The Cassandra configuration is exactly the same. I tried running Cassandra with only 32 CPUs on the newer boxes to see if that would cause them to GC pause more, but it didn't.
>>>
>>> On a separate topic - for this cassandra-stress run I reduced the batch size to 2 in order to keep the logs clean. That also reduced the throughput from around 100k rows/sec to 32k rows/sec. I've been doing ingestion tests using cassandra-stress, cqlsh COPY FROM and a custom C++ application. In most of the tests I've been using a batch size of around 20 (unlogged, all batch rows have the same partition key). However, that fills the logs with batch size warnings. I was going to raise the batch warning size but the docs scared me away from doing that. Given that we're using unlogged/same-partition batches, is it safe to raise the batch size warning limit? Actually, cqlsh COPY FROM has very good throughput using a small batch size, but I can't get that same throughput in cassandra-stress or my C++ app with a batch size of 2.
>>>
>>> Thanks!
>>>
>>> -- Eric
>>>
>>> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>
>>>> How many CPUs are you using for interrupts? http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux
>>>>
>>>> Have you tried making a flame graph to see where Cassandra is spending its time? http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
>>>>
>>>> Are you tracking GC pauses?
>>>>
>>>> Jon
>>>>
>>>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <eric...@gmail.com> wrote:
>>>>
>>>>> Hi all:
>>>>>
>>>>> I'm new to Cassandra and I'm doing some performance testing. One of the things that I'm testing is ingestion throughput. My server setup is:
>>>>>
>>>>> - 3 node cluster
>>>>> - SSD data (both commit log and sstables are on the same disk)
>>>>> - 64 GB RAM per server
>>>>> - 48 cores per server
>>>>> - Cassandra 3.0.11
>>>>> - 48 GB heap using G1GC
>>>>> - 1 Gbps NICs
>>>>>
>>>>> Since I'm using SSDs I've tried tuning the following (one at a time), but none seemed to make a lot of difference:
>>>>>
>>>>> - concurrent_writes=384
>>>>> - memtable_flush_writers=8
>>>>> - concurrent_compactors=8
>>>>>
>>>>> I am currently doing ingestion tests, sending data from 3 clients on the same subnet using cassandra-stress. The tests use CL=ONE and RF=2.
>>>>>
>>>>> Using cassandra-stress (3.10) I am able to saturate the disk using a large enough column size and the standard five-column cassandra-stress schema. For example, -col size=fixed(400) will saturate the disk and compactions will start falling behind.
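(As a concrete sketch of that kind of run: a plain cassandra-stress write against the standard five-column schema with 400-byte values looks roughly like the following. The node names, thread count and replication setting are illustrative, not the exact command used.)

```bash
# Standard five-column stress schema with 400-byte values - enough to
# saturate the SSDs. cl=ONE and factor=2 match the CL=ONE / RF=2 test setup.
cassandra-stress write n=10000000 cl=ONE \
  -col size=FIXED\(400\) \
  -schema "replication(factor=2)" \
  -mode native cql3 \
  -rate threads=200 \
  -node ultva01,ultva02,ultva03
```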
>>>>> One of our main tables has a row size of approximately 200 bytes, across 64 columns. When ingesting this table I don't see any resource saturation. Disk utilization is around 10-15% per iostat. Incoming network traffic on the servers is around 100-300 Mbps. CPU utilization is around 20-70%. nodetool tpstats shows mostly zeros with occasional spikes around 500 in MutationStage.
>>>>>
>>>>> The stress run does 10,000,000 inserts per client, each with a separate range of partition IDs. The run with 200-byte rows takes about 4 minutes, with a mean latency of 4.5 ms, total GC time of 21 secs, and average GC time of 173 ms.
>>>>>
>>>>> The overall performance is good - around 120k rows/sec ingested. But I'm curious to know where the bottleneck is. There's no resource saturation and nodetool tpstats shows only occasional brief queueing. Is the rest just expected latency inside of Cassandra?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -- Eric
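(The per-client partition ranges mentioned in that last message can be expressed with cassandra-stress's -pop option; a sketch for one of the three client machines follows, with the other two using the next 10M-row windows. Host names and thread count are placeholders.)

```bash
# Client 1 of 3: 10M inserts covering partition IDs 1..10,000,000.
# Clients 2 and 3 would use seq=10000001..20000000 and seq=20000001..30000000.
cassandra-stress write n=10000000 cl=ONE \
  -pop seq=1..10000000 \
  -rate threads=200 \
  -node ultva01,ultva02,ultva03
```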