Shoot - I didn't see that one. I subscribe to the digest but was focusing on the direct replies and accidentally missed Patrick and Jeff Jirsa's messages. Sorry about that...
I've been using a combination of cassandra-stress, cqlsh COPY FROM, and a custom C++ application for my ingestion testing. My default setting for my custom client application is 96 threads, and by default I run one client application process on each of 3 machines. I tried doubling/quadrupling the number of client threads (and doubling/tripling the number of client processes while keeping the threads per process the same) but didn't see any change. If I recall correctly, I started getting timeouts once I went much beyond concurrent_writes, which is 384 (for a 48-CPU box) - at around 500 threads per client machine I started seeing timeouts. I'll try again to be sure. For the purposes of this conversation I will try to always use cassandra-stress to keep the number of unknowns limited. I'll run more cassandra-stress clients tomorrow, in line with Patrick's recommendation of 3-5 per server.

Thanks!

-- Eric

On Wed, Jun 14, 2017 at 12:40 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:

> Did you try adding more client stress nodes as Patrick recommended?
>
> On Tue, Jun 13, 2017 at 9:31 PM Eric Pederson <eric...@gmail.com> wrote:
>
>> Scratch that theory - the flamegraphs show that GC is only 3-4% of the two newer machines' overall processing, compared to 18% on the slow machine.
>>
>> I took that machine out of the cluster completely and recreated the keyspaces. The ingest tests now run slightly faster (!). I would have expected a linear slowdown, since the load is fairly balanced across partitions. GC appears to be the bottleneck in the 3-server configuration. But even in the two-server configuration, CPU/disk/network is still not fully utilized (the closest is CPU at ~45% on one ingest test). nodetool tpstats shows only blips of queueing.
>>
>> -- Eric
>>
>> On Mon, Jun 12, 2017 at 9:50 PM, Eric Pederson <eric...@gmail.com> wrote:
>>
>>> Hi all - I wanted to follow up on this. I'm happy with the throughput we're getting, but I'm still curious about the bottleneck.
>>>
>>> The big thing that sticks out is that one of the nodes is logging frequent GCInspector messages: 350-500ms every 3-6 seconds. All three nodes in the cluster have identical Cassandra configuration, but the node that is logging frequent GCs is an older machine with a slower CPU and SSD. This node logs frequent GCInspector messages both under load and when compacting while otherwise unloaded.
>>>
>>> My theory is that the other two nodes have a similar GC frequency (because they are seeing the same basic load), but because they are faster machines, they don't spend as much time per GC and don't cross the GCInspector threshold. Does that sound plausible? nodetool tpstats doesn't show any queueing in the system.
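One way to test that theory directly, rather than inferring it from GCInspector output, would be to turn on full GC logging on all three nodes and compare pause counts and durations. A minimal sketch, assuming a stock install where JVM flags go in conf/jvm.options or cassandra-env.sh (the log path is just an example):

    # Standard HotSpot GC logging flags; '#' starts a comment, one flag per line
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime
    -Xloggc:/var/log/cassandra/gc.log

Comparing the resulting gc.log files across the nodes would show whether the two newer boxes really do collect about as often but stay under the GCInspector reporting threshold.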
>>> Here are flamegraphs from the system when running a cqlsh COPY FROM:
>>>
>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>>>
>>> The slow node (ultva03) spends a disproportionate amount of time in GC.
>>>
>>> Thanks,
>>>
>>> -- Eric
>>>
>>> On Thu, May 25, 2017 at 8:09 PM, Eric Pederson <eric...@gmail.com> wrote:
>>>
>>>> Due to a cut-and-paste error those flamegraphs were a recording of the whole system, not just Cassandra. Throughput is approximately 30k rows/sec.
>>>>
>>>> Here are the graphs with just the Cassandra PID:
>>>>
>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars2.svg
>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars2.svg
>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars2.svg
>>>>
>>>> And here are graphs during a cqlsh COPY FROM to the same table, using real data, MAXBATCHSIZE=2. Throughput is good at approximately 110k rows/sec.
>>>>
>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>>>>
>>>> -- Eric
>>>>
>>>> On Thu, May 25, 2017 at 6:44 PM, Eric Pederson <eric...@gmail.com> wrote:
>>>>
>>>>> Totally understood :)
>>>>>
>>>>> I forgot to mention - I set the /proc/irq/*/smp_affinity mask to include all of the CPUs. Actually, most of them were set that way already (for example, 0000ffff,ffffffff) - it might be because irqbalance is running. But for some reason the interrupts are all being handled on CPU 0 anyway.
>>>>>
>>>>> I see this in /var/log/dmesg on the machines:
>>>>>
>>>>>> Your BIOS has requested that x2apic be disabled.
>>>>>> This will leave your machine vulnerable to irq-injection attacks.
>>>>>> Use 'intremap=no_x2apic_optout' to override BIOS request.
>>>>>> Enabled IRQ remapping in xapic mode
>>>>>> x2apic not enabled, IRQ remapping is in xapic mode
>>>>>
>>>>> In a reply to one of the comments on the smp-affinity article, the author says:
>>>>>
>>>>>> When IO-APIC configured to spread interrupts among all cores, it can handle up to eight cores. If you have more than eight cores, kernel will not configure IO-APIC to spread interrupts. Thus the trick I described in the article will not work. Otherwise it may be caused by buggy BIOS or even buggy hardware.
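For what it's worth, confirming where the interrupts are actually landing only takes a couple of commands; a rough sketch (the IRQ number is made up - the real ones for the NIC and SSD come from /proc/interrupts):

    # Per-CPU interrupt counts so far; look for the NIC/SSD rows piling up on CPU0
    cat /proc/interrupts

    # Current affinity mask for one IRQ (53 is just an example)
    cat /proc/irq/53/smp_affinity

    # Try pinning that IRQ to CPUs 1-3 (mask 0xe); note irqbalance may rewrite this later
    echo e > /proc/irq/53/smp_affinity

If the IO-APIC limitation quoted above applies to these 32- and 48-core boxes, writing the mask may simply have no effect, which would be consistent with everything staying on CPU 0.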
>>>>> I'm not sure whether either of those (the x2apic warning or the IO-APIC limitation) is relevant to my situation.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -- Eric
>>>>>
>>>>> On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>>
>>>>>> You shouldn't need a kernel recompile. Check out the section "Simple solution for the problem" in http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux. You can balance your requests across up to 8 CPUs.
>>>>>>
>>>>>> I'll check out the flame graphs in a little bit - in the middle of something and my brain doesn't multitask well :)
>>>>>>
>>>>>> On Thu, May 25, 2017 at 1:06 PM Eric Pederson <eric...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Jonathan -
>>>>>>>
>>>>>>> It looks like these machines are configured to use CPU 0 for all I/O interrupts. I don't think I'm going to get the OK to compile a new kernel for them to balance the interrupts across CPUs, but to mitigate the problem I taskset the Cassandra process to run on all CPUs except 0. It didn't change the performance, though. Let me know if you think it's crucial that we balance the interrupts across CPUs and I can try to lobby for a new kernel.
>>>>>>>
>>>>>>> Here are flamegraphs from each node from a cassandra-stress ingest into a table representative of what we are going to be using. This table's rows are also roughly 200 bytes, with 64 columns and a primary key of (date, sequence_number). Cassandra-stress was run on 3 separate client machines. Using cassandra-stress to write to this table I see the same thing: neither disk, CPU, nor network is fully utilized.
>>>>>>>
>>>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars.svg
>>>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars.svg
>>>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars.svg
>>>>>>>
>>>>>>> Re: GC: In the stress run with the parameters above, two of the three nodes log zero or one GCInspector messages. On the other hand, the 3rd machine logs a GCInspector every 5 seconds or so, 300-500ms each time. I found out that the 3rd machine actually has different specs than the other two. It's an older box with the same RAM but fewer CPUs (32 instead of 48), a slower SSD, and slower memory. The Cassandra configuration is exactly the same. I tried running Cassandra with only 32 CPUs on the newer boxes to see if that would cause them to GC pause more, but it didn't.
>>>>>>>
>>>>>>> On a separate topic - for this cassandra-stress run I reduced the batch size to 2 in order to keep the logs clean. That also reduced the throughput from around 100k rows/sec to 32k rows/sec. I've been doing ingestion tests using cassandra-stress, cqlsh COPY FROM, and a custom C++ application. In most of these tests I've been using a batch size of around 20 (unlogged, with all batch rows having the same partition key). However, that fills the logs with batch size warnings. I was going to raise the batch size warning threshold, but the docs scared me away from doing that.
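For reference, the thresholds in question are set in cassandra.yaml; the values below are, as far as I know, the 3.0 defaults:

    # cassandra.yaml - per-batch size thresholds (believed 3.0 defaults)
    batch_size_warn_threshold_in_kb: 5
    batch_size_fail_threshold_in_kb: 50

The warn threshold only controls logging; the fail threshold is the one that actually rejects oversized batches.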
>>>>>>> Given that we're using unlogged, same-partition batches, is it safe to raise the batch size warning limit? Actually, cqlsh COPY FROM gets very good throughput using a small batch size, but I can't get that same throughput in cassandra-stress or my C++ app with a batch size of 2.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> -- Eric
>>>>>>>
>>>>>>> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>>>>
>>>>>>>> How many CPUs are you using for interrupts? http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux
>>>>>>>>
>>>>>>>> Have you tried making a flame graph to see where Cassandra is spending its time? http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
>>>>>>>>
>>>>>>>> Are you tracking GC pauses?
>>>>>>>>
>>>>>>>> Jon
>>>>>>>>
>>>>>>>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <eric...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi all:
>>>>>>>>>
>>>>>>>>> I'm new to Cassandra and I'm doing some performance testing. One of the things that I'm testing is ingestion throughput. My server setup is:
>>>>>>>>>
>>>>>>>>> - 3 node cluster
>>>>>>>>> - SSD data (both commit log and sstables are on the same disk)
>>>>>>>>> - 64 GB RAM per server
>>>>>>>>> - 48 cores per server
>>>>>>>>> - Cassandra 3.0.11
>>>>>>>>> - 48 GB heap using G1GC
>>>>>>>>> - 1 Gbps NICs
>>>>>>>>>
>>>>>>>>> Since I'm using SSDs I've tried tuning the following (one at a time), but none seemed to make a lot of difference:
>>>>>>>>>
>>>>>>>>> - concurrent_writes=384
>>>>>>>>> - memtable_flush_writers=8
>>>>>>>>> - concurrent_compactors=8
>>>>>>>>>
>>>>>>>>> I am currently doing ingestion tests sending data from 3 clients on the same subnet. I am using cassandra-stress to do some ingestion testing. The tests are using CL=ONE and RF=2.
>>>>>>>>>
>>>>>>>>> Using cassandra-stress (3.10) I am able to saturate the disk using a large enough column size and the standard five-column cassandra-stress schema. For example, -col size=fixed(400) will saturate the disk and compactions will start falling behind.
>>>>>>>>>
>>>>>>>>> One of our main tables has a row size of approximately 200 bytes, across 64 columns. When ingesting this table I don't see any resource saturation. Disk utilization is around 10-15% per iostat. Incoming network traffic on the servers is around 100-300 Mbps. CPU utilization is around 20-70%. nodetool tpstats shows mostly zeros with occasional spikes around 500 in MutationStage.
>>>>>>>>>
>>>>>>>>> The stress run does 10,000,000 inserts per client, each client using a separate range of partition IDs. The run with 200-byte rows takes about 4 minutes, with a mean latency of 4.5 ms, total GC time of 21 seconds, and average GC time of 173 ms.
>>>>>>>>>
>>>>>>>>> The overall performance is good - around 120k rows/sec ingested. But I'm curious to know where the bottleneck is. There's no resource saturation and nodetool tpstats shows only occasional brief queueing. Is the rest just expected latency inside of Cassandra?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> -- Eric
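For concreteness, the stress runs described above would look roughly like the following; the hostnames are placeholders, and each client machine would get its own non-overlapping -pop range:

    # Standard five-column schema, 400-byte columns, CL=ONE, RF=2; thread count is illustrative
    cassandra-stress write n=10000000 cl=ONE \
        -pop seq=1..10000000 \
        -col size=fixed(400) \
        -rate threads=96 \
        -schema "replication(factor=2)" \
        -node host1,host2,host3

The second and third clients would use -pop seq=10000001..20000000 and seq=20000001..30000000, respectively; ingesting the real 64-column table instead would go through cassandra-stress's user profile mode (user profile=<yaml> ops(insert=1)) with a YAML describing that schema.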