Shoot - I didn't see that one. I subscribe to the digest but was focusing on the direct replies and accidentally missed Patrick and Jeff Jirsa's messages. Sorry about that...
I've been using a combination of cassandra-stress, cqlsh COPY FROM, and a custom C++ application for my ingestion testing. My default setting for my custom client application is 96 threads, and by default I run one client application process on each of 3 machines. I tried doubling/quadrupling the number of client threads (and doubling/tripling the number of client processes while keeping the threads per process the same) but didn't see any change. If I recall correctly, I started getting timeouts once I went much beyond concurrent_writes, which is 384 (for a 48-CPU box) - at around 500 threads per client machine I started seeing timeouts. I'll try again to be sure. For the purposes of this conversation I will try to always use cassandra-stress to keep the number of unknowns limited. I'll run more cassandra-stress clients tomorrow, in line with Patrick's recommendation of 3-5 per server.

Thanks!

-- Eric

On Wed, Jun 14, 2017 at 12:40 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:

> Did you try adding more client stress nodes as Patrick recommended?
>
> On Tue, Jun 13, 2017 at 9:31 PM Eric Pederson <eric...@gmail.com> wrote:
>
>> Scratch that theory - the flamegraphs show that GC is only 3-4% of the two newer machines' overall processing, compared to 18% on the slow machine.
>>
>> I took that machine out of the cluster completely and recreated the keyspaces. The ingest tests now run slightly faster (!). I would have expected a linear slowdown, since the load is fairly balanced across partitions. GC appears to be the bottleneck in the 3-server configuration. But even in the two-server configuration, CPU/disk/network is still not fully utilized (the closest is CPU at ~45% on one ingest test). nodetool tpstats shows only blips of queueing.
>>
>> -- Eric
>>
>> On Mon, Jun 12, 2017 at 9:50 PM, Eric Pederson <eric...@gmail.com> wrote:
>>
>>> Hi all - I wanted to follow up on this. I'm happy with the throughput we're getting, but I'm still curious about the bottleneck.
>>>
>>> The big thing that sticks out is that one of the nodes is logging frequent GCInspector messages: 350-500ms every 3-6 seconds. All three nodes in the cluster have identical Cassandra configuration, but the node that is logging frequent GCs is an older machine with a slower CPU and SSD. This node logs frequent GCInspector messages both under load and when compacting while otherwise unloaded.
>>>
>>> My theory is that the other two nodes have a similar GC frequency (because they are seeing the same basic load), but because they are faster machines, they don't spend as much time per GC and don't cross the GCInspector threshold. Does that sound plausible? nodetool tpstats doesn't show any queueing in the system.
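One way to test that theory directly, rather than inferring it from GCInspector output, would be to turn on full GC logging on all three nodes and compare pause counts and durations. A minimal sketch, assuming a stock install where JVM flags go in conf/jvm.options or cassandra-env.sh (the log path is just an example):

    # Standard HotSpot GC logging flags; '#' starts a comment, one flag per line
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime
    -Xloggc:/var/log/cassandra/gc.log

Comparing the resulting gc.log files across the nodes would show whether the two newer boxes really do collect about as often but stay under the GCInspector reporting threshold.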
>>> Here are flamegraphs from the system when running a cqlsh COPY FROM:
>>>
>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>>>
>>> The slow node (ultva03) spends a disproportionate amount of time in GC.
>>>
>>> Thanks,
>>>
>>> -- Eric
>>>
>>> On Thu, May 25, 2017 at 8:09 PM, Eric Pederson <eric...@gmail.com> wrote:
>>>
>>>> Due to a cut-and-paste error those flamegraphs were a recording of the whole system, not just Cassandra. Throughput is approximately 30k rows/sec.
>>>>
>>>> Here are the graphs with just the Cassandra PID:
>>>>
>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars2.svg
>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars2.svg
>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars2.svg
>>>>
>>>> And here are graphs during a cqlsh COPY FROM to the same table, using real data, MAXBATCHSIZE=2. Throughput is good at approximately 110k rows/sec.
>>>>
>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>>>>
>>>> -- Eric
>>>>
>>>> On Thu, May 25, 2017 at 6:44 PM, Eric Pederson <eric...@gmail.com> wrote:
>>>>
>>>>> Totally understood :)
>>>>>
>>>>> I forgot to mention - I set the /proc/irq/*/smp_affinity mask to include all of the CPUs. Actually, most of them were set that way already (for example, 0000ffff,ffffffff) - it might be because irqbalance is running. But for some reason the interrupts are all being handled on CPU 0 anyway.
>>>>>
>>>>> I see this in /var/log/dmesg on the machines:
>>>>>
>>>>>> Your BIOS has requested that x2apic be disabled.
>>>>>> This will leave your machine vulnerable to irq-injection attacks.
>>>>>> Use 'intremap=no_x2apic_optout' to override BIOS request.
>>>>>> Enabled IRQ remapping in xapic mode
>>>>>> x2apic not enabled, IRQ remapping is in xapic mode
>>>>>
>>>>> In a reply to one of the comments on the smp-affinity article, the author says:
>>>>>
>>>>>> When IO-APIC configured to spread interrupts among all cores, it can handle up to eight cores. If you have more than eight cores, kernel will not configure IO-APIC to spread interrupts. Thus the trick I described in the article will not work. Otherwise it may be caused by buggy BIOS or even buggy hardware.
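For what it's worth, confirming where the interrupts are actually landing only takes a couple of commands; a rough sketch (the IRQ number is made up - the real ones for the NIC and SSD come from /proc/interrupts):

    # Per-CPU interrupt counts so far; look for the NIC/SSD rows piling up on CPU0
    cat /proc/interrupts

    # Current affinity mask for one IRQ (53 is just an example)
    cat /proc/irq/53/smp_affinity

    # Try pinning that IRQ to CPUs 1-3 (mask 0xe); note irqbalance may rewrite this later
    echo e > /proc/irq/53/smp_affinity

If the IO-APIC limitation quoted above applies to these 32- and 48-core boxes, writing the mask may simply have no effect, which would be consistent with everything staying on CPU 0.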
>>>>> I'm not sure whether either of those (the x2apic warning or the IO-APIC limitation) is relevant to my situation.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -- Eric
>>>>>
>>>>> On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>>
>>>>>> You shouldn't need a kernel recompile. Check out the section "Simple solution for the problem" in http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux. You can balance your requests across up to 8 CPUs.
>>>>>>
>>>>>> I'll check out the flame graphs in a little bit - in the middle of something and my brain doesn't multitask well :)
>>>>>>
>>>>>> On Thu, May 25, 2017 at 1:06 PM Eric Pederson <eric...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Jonathan -
>>>>>>>
>>>>>>> It looks like these machines are configured to use CPU 0 for all I/O interrupts. I don't think I'm going to get the OK to compile a new kernel for them to balance the interrupts across CPUs, but to mitigate the problem I taskset the Cassandra process to run on all CPUs except 0. It didn't change the performance, though. Let me know if you think it's crucial that we balance the interrupts across CPUs and I can try to lobby for a new kernel.
>>>>>>>
>>>>>>> Here are flamegraphs from each node from a cassandra-stress ingest into a table representative of what we are going to be using. This table's rows are also roughly 200 bytes, with 64 columns and a primary key of (date, sequence_number). Cassandra-stress was run on 3 separate client machines. Using cassandra-stress to write to this table I see the same thing: neither disk, CPU, nor network is fully utilized.
>>>>>>>
>>>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars.svg
>>>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars.svg
>>>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars.svg
>>>>>>>
>>>>>>> Re: GC: In the stress run with the parameters above, two of the three nodes log zero or one GCInspector messages. On the other hand, the 3rd machine logs a GCInspector every 5 seconds or so, 300-500ms each time. I found out that the 3rd machine actually has different specs than the other two. It's an older box with the same RAM but fewer CPUs (32 instead of 48), a slower SSD, and slower memory. The Cassandra configuration is exactly the same. I tried running Cassandra with only 32 CPUs on the newer boxes to see if that would cause them to GC pause more, but it didn't.
>>>>>>>
>>>>>>> On a separate topic - for this cassandra-stress run I reduced the batch size to 2 in order to keep the logs clean. That also reduced the throughput from around 100k rows/sec to 32k rows/sec. I've been doing ingestion tests using cassandra-stress, cqlsh COPY FROM, and a custom C++ application. In most of these tests I've been using a batch size of around 20 (unlogged, with all batch rows having the same partition key). However, that fills the logs with batch size warnings. I was going to raise the batch size warning threshold, but the docs scared me away from doing that.
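For reference, the thresholds in question are set in cassandra.yaml; the values below are, as far as I know, the 3.0 defaults:

    # cassandra.yaml - per-batch size thresholds (believed 3.0 defaults)
    batch_size_warn_threshold_in_kb: 5
    batch_size_fail_threshold_in_kb: 50

The warn threshold only controls logging; the fail threshold is the one that actually rejects oversized batches.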
>>>>>>> Given that we're using unlogged, same-partition batches, is it safe to raise the batch size warning limit? Actually, cqlsh COPY FROM gets very good throughput using a small batch size, but I can't get that same throughput in cassandra-stress or my C++ app with a batch size of 2.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> -- Eric
>>>>>>>
>>>>>>> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>>>>
>>>>>>>> How many CPUs are you using for interrupts? http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux
>>>>>>>>
>>>>>>>> Have you tried making a flame graph to see where Cassandra is spending its time? http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
>>>>>>>>
>>>>>>>> Are you tracking GC pauses?
>>>>>>>>
>>>>>>>> Jon
>>>>>>>>
>>>>>>>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <eric...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi all:
>>>>>>>>>
>>>>>>>>> I'm new to Cassandra and I'm doing some performance testing. One of the things that I'm testing is ingestion throughput. My server setup is:
>>>>>>>>>
>>>>>>>>> - 3 node cluster
>>>>>>>>> - SSD data (both commit log and sstables are on the same disk)
>>>>>>>>> - 64 GB RAM per server
>>>>>>>>> - 48 cores per server
>>>>>>>>> - Cassandra 3.0.11
>>>>>>>>> - 48 GB heap using G1GC
>>>>>>>>> - 1 Gbps NICs
>>>>>>>>>
>>>>>>>>> Since I'm using SSDs I've tried tuning the following (one at a time), but none seemed to make a lot of difference:
>>>>>>>>>
>>>>>>>>> - concurrent_writes=384
>>>>>>>>> - memtable_flush_writers=8
>>>>>>>>> - concurrent_compactors=8
>>>>>>>>>
>>>>>>>>> I am currently doing ingestion tests sending data from 3 clients on the same subnet. I am using cassandra-stress to do some ingestion testing. The tests are using CL=ONE and RF=2.
>>>>>>>>>
>>>>>>>>> Using cassandra-stress (3.10) I am able to saturate the disk using a large enough column size and the standard five-column cassandra-stress schema. For example, -col size=fixed(400) will saturate the disk and compactions will start falling behind.
>>>>>>>>>
>>>>>>>>> One of our main tables has a row size of approximately 200 bytes, across 64 columns. When ingesting this table I don't see any resource saturation. Disk utilization is around 10-15% per iostat. Incoming network traffic on the servers is around 100-300 Mbps. CPU utilization is around 20-70%. nodetool tpstats shows mostly zeros with occasional spikes around 500 in MutationStage.
>>>>>>>>>
>>>>>>>>> The stress run does 10,000,000 inserts per client, each client using a separate range of partition IDs. The run with 200-byte rows takes about 4 minutes, with a mean latency of 4.5 ms, total GC time of 21 seconds, and average GC time of 173 ms.
>>>>>>>>>
>>>>>>>>> The overall performance is good - around 120k rows/sec ingested. But I'm curious to know where the bottleneck is. There's no resource saturation and nodetool tpstats shows only occasional brief queueing. Is the rest just expected latency inside of Cassandra?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> -- Eric
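For concreteness, the stress runs described above would look roughly like the following; the hostnames are placeholders, and each client machine would get its own non-overlapping -pop range:

    # Standard five-column schema, 400-byte columns, CL=ONE, RF=2; thread count is illustrative
    cassandra-stress write n=10000000 cl=ONE \
        -pop seq=1..10000000 \
        -col size=fixed(400) \
        -rate threads=96 \
        -schema "replication(factor=2)" \
        -node host1,host2,host3

The second and third clients would use -pop seq=10000001..20000000 and seq=20000001..30000000, respectively; ingesting the real 64-column table instead would go through cassandra-stress's user profile mode (user profile=<yaml> ops(insert=1)) with a YAML describing that schema.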