> On Feb 12, 2016, at 9:24 AM, Skvazh Roman <r...@skvazh.com> wrote:
> 
> I have disabled autocompaction and stopped it on the high-load node.

Does the load decrease, and does the node answer requests “normally”, when you disable auto-compaction? You do actually see pending compactions on the nodes under high load, correct?

> Heap is 8 GB. gc_grace is 86400
> All sstables are about 200-300 MB.

All seems legit here. Using G1 GC?

> $ nodetool compactionstats
> pending tasks: 14

Try to increase the concurrent compactors from 4 to 6-8 on one node, disable gossip, let it finish compacting, then put it back in the ring by re-enabling gossip. See what happens.
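For example, something along these lines (just a sketch; assuming you are on 2.1/2.2, concurrent_compactors is a cassandra.yaml setting and needs a restart to take effect, unless your version also exposes it at runtime through the CompactionManager JMX MBean):

  # cassandra.yaml on the hot node
  concurrent_compactors: 8

  $ nodetool disablegossip      # take the node out of the ring
  $ nodetool disablebinary      # optionally also shut out native-protocol clients
  $ nodetool compactionstats    # repeat until pending tasks drain to 0
  $ nodetool enablebinary
  $ nodetool enablegossip       # rejoin the ring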
The growing tombstone count is because auto-compaction is disabled on these nodes. Probably not your issue.

J.
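P.S. Once the backlog is cleared, remember to turn auto-compaction back on. And if you do try lowering gc_grace_seconds as I suggested in my earlier mail quoted below, that is a per-table CQL change; the keyspace/table name and the value here are placeholders only, and keep it comfortably longer than your repair interval so deleted data is not resurrected:

  $ nodetool enableautocompaction

  cqlsh> ALTER TABLE my_keyspace.my_big_table WITH gc_grace_seconds = 43200;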
> 
> $ dstat -lvnr 10
> ---load-avg--- ---procs--- ------memory-usage----- ---paging-- -dsk/total- ---system-- ----total-cpu-usage---- -net/total- --io/total-
> 1m 5m 15m |run blk new| used buff cach free| in out | read writ| int csw |usr sys idl wai hiq siq| recv send| read writ
> 29.4 28.6 23.5|0.0 0 1.2|11.3G 190M 17.6G 407M| 0 0 |7507k 7330k| 13k 40k| 11 1 88 0 0 0| 0 0 |96.5 64.6
> 29.3 28.6 23.5| 29 0 0.9|11.3G 190M 17.6G 408M| 0 0 | 0 189k|9822 2319 | 99 0 0 0 0 0| 138k 120k| 0 4.30
> 29.4 28.6 23.6| 30 0 2.0|11.3G 190M 17.6G 408M| 0 0 | 0 26k|8689 2189 |100 0 0 0 0 0| 139k 120k| 0 2.70
> 29.4 28.7 23.6| 29 0 3.0|11.3G 190M 17.6G 408M| 0 0 | 0 20k|8722 1846 | 99 0 0 0 0 0| 136k 120k| 0 1.50 ^C
> 
> JvmTop 0.8.0 alpha - 15:20:37, amd64, 16 cpus, Linux 3.14.44-3, load avg 28.09
> http://code.google.com/p/jvmtop
> 
> PID 32505: org.apache.cassandra.service.CassandraDaemon
> ARGS:
> VMARGS: -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar -XX:+CMSCl[...]
> VM: Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 1.8.0_65
> UP: 8:31m #THR: 334 #THRPEAK: 437 #THRCREATED: 4694 USER: cassandra
> GC-Time: 0: 8m #GC-Runs: 6378 #TotalLoadedClasses: 5926
> CPU: 97.96% GC: 0.00% HEAP: 6049m / 7540m NONHEAP: 82m / n/a
> 
> TID NAME STATE CPU TOTALCPU BLOCKEDBY
> 447 SharedPool-Worker-45 RUNNABLE 60.47% 1.03%
> 343 SharedPool-Worker-2 RUNNABLE 56.46% 3.07%
> 349 SharedPool-Worker-8 RUNNABLE 56.43% 1.61%
> 456 SharedPool-Worker-25 RUNNABLE 55.25% 1.06%
> 483 SharedPool-Worker-40 RUNNABLE 53.06% 1.04%
> 475 SharedPool-Worker-53 RUNNABLE 52.31% 1.03%
> 464 SharedPool-Worker-20 RUNNABLE 52.00% 1.11%
> 577 SharedPool-Worker-71 RUNNABLE 51.73% 1.02%
> 404 SharedPool-Worker-10 RUNNABLE 51.10% 1.29%
> 486 SharedPool-Worker-34 RUNNABLE 51.06% 1.03%
> Note: Only top 10 threads (according cpu load) are shown!
> 
>> On 12 Feb 2016, at 18:14, Julien Anguenot <jul...@anguenot.org> wrote:
>> 
>> At the time when the load is high and you have to restart, do you see any pending compactions when using `nodetool compactionstats`?
>> 
>> Possible to see a `nodetool compactionstats` taken *when* the load is too high? Have you checked the size of your SSTables for that big table? Any large ones in there? What about the Java HEAP configuration on these nodes?
>> 
>> If you have too many tombstones I would try to decrease gc_grace_seconds so they get cleared out earlier during compactions.
>> 
>> J.
>> 
>>> On Feb 12, 2016, at 8:45 AM, Skvazh Roman <r...@skvazh.com> wrote:
>>> 
>>> There are 1-4 compactions at that moment.
>>> We have many tombstones which are not removed.
>>> DroppableTombstoneRatio is 5-6 (greater than 1).
>>> 
>>>> On 12 Feb 2016, at 15:53, Julien Anguenot <jul...@anguenot.org> wrote:
>>>> 
>>>> Hey,
>>>> 
>>>> What about compactions count when that is happening?
>>>> 
>>>> J.
>>>> 
>>>>> On Feb 12, 2016, at 3:06 AM, Skvazh Roman <r...@skvazh.com> wrote:
>>>>> 
>>>>> Hello!
>>>>> We have a cluster of 25 c3.4xlarge nodes (16 cores, 32 GiB) with an attached 1.5 TB 4000 PIOPS EBS drive.
>>>>> Sometimes user CPU on one or two nodes spikes to 100%, load average goes to 20-30, and read requests drop off.
>>>>> Only a restart of the Cassandra service helps.
>>>>> Please advise.
>>>>> 
>>>>> One big table with wide rows. 600 GB per node.
>>>>> LZ4Compressor
>>>>> LeveledCompaction
>>>>> 
>>>>> concurrent compactors: 4
>>>>> compaction throughput: tried from 16 to 128
>>>>> Concurrent_readers: from 16 to 32
>>>>> Concurrent_writers: 128
>>>>> 
>>>>> https://gist.github.com/rskvazh/de916327779b98a437a6
>>>>> 
>>>>> JvmTop 0.8.0 alpha - 06:51:10, amd64, 16 cpus, Linux 3.14.44-3, load avg 19.35
>>>>> http://code.google.com/p/jvmtop
>>>>> 
>>>>> Profiling PID 9256: org.apache.cassandra.service.CassandraDa
>>>>> 
>>>>> 95.73% ( 4.31s) ....google.common.collect.AbstractIterator.tryToComputeN()
>>>>>  1.39% ( 0.06s) com.google.common.base.Objects.hashCode()
>>>>>  1.26% ( 0.06s) io.netty.channel.epoll.Native.epollWait()
>>>>>  0.85% ( 0.04s) net.jpountz.lz4.LZ4JNI.LZ4_compress_limitedOutput()
>>>>>  0.46% ( 0.02s) net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast()
>>>>>  0.26% ( 0.01s) com.google.common.collect.Iterators$7.computeNext()
>>>>>  0.06% ( 0.00s) io.netty.channel.epoll.Native.eventFdWrite()
>>>>> 
>>>>> ttop:
>>>>> 
>>>>> 2016-02-12T08:20:25.605+0000 Process summary
>>>>> process cpu=1565.15%
>>>>> application cpu=1314.48% (user=1354.48% sys=-40.00%)
>>>>> other: cpu=250.67%
>>>>> heap allocation rate 146mb/s
>>>>> [000405] user=76.25% sys=-0.54% alloc= 0b/s - SharedPool-Worker-9
>>>>> [000457] user=75.54% sys=-1.26% alloc= 0b/s - SharedPool-Worker-14
>>>>> [000451] user=73.52% sys= 0.29% alloc= 0b/s - SharedPool-Worker-16
>>>>> [000311] user=76.45% sys=-2.99% alloc= 0b/s - SharedPool-Worker-4
>>>>> [000389] user=70.69% sys= 2.62% alloc= 0b/s - SharedPool-Worker-6
>>>>> [000388] user=86.95% sys=-14.28% alloc= 0b/s - SharedPool-Worker-5
>>>>> [000404] user=70.69% sys= 0.10% alloc= 0b/s - SharedPool-Worker-8
>>>>> [000390] user=72.61% sys=-1.82% alloc= 0b/s - SharedPool-Worker-7
>>>>> [000255] user=87.86% sys=-17.87% alloc= 0b/s - SharedPool-Worker-1
>>>>> [000444] user=72.21% sys=-2.30% alloc= 0b/s - SharedPool-Worker-12
>>>>> [000310] user=71.50% sys=-2.31% alloc= 0b/s - SharedPool-Worker-3
>>>>> [000445] user=69.68% sys=-0.83% alloc= 0b/s - SharedPool-Worker-13
>>>>> [000406] user=72.61% sys=-4.40% alloc= 0b/s - SharedPool-Worker-10
>>>>> [000446] user=69.78% sys=-1.65% alloc= 0b/s - SharedPool-Worker-11
>>>>> [000452] user=66.86% sys= 0.22% alloc= 0b/s - SharedPool-Worker-15
>>>>> [000256] user=69.08% sys=-2.42% alloc= 0b/s - SharedPool-Worker-2
>>>>> [004496] user=29.99% sys= 0.59% alloc= 30mb/s - CompactionExecutor:15
>>>>> [004906] user=29.49% sys= 0.74% alloc= 39mb/s - CompactionExecutor:16
>>>>> [010143] user=28.58% sys= 0.25% alloc= 26mb/s - CompactionExecutor:17
>>>>> [000785] user=27.87% sys= 0.70% alloc= 38mb/s - CompactionExecutor:12
>>>>> [012723] user= 9.09% sys= 2.46% alloc= 2977kb/s - RMI TCP Connection(2673)-127.0.0.1
>>>>> [000555] user= 5.35% sys=-0.08% alloc= 474kb/s - SharedPool-Worker-24
>>>>> [000560] user= 3.94% sys= 0.07% alloc= 434kb/s - SharedPool-Worker-22
>>>>> [000557] user= 3.94% sys=-0.17% alloc= 339kb/s - SharedPool-Worker-25
>>>>> [000447] user= 2.73% sys= 0.60% alloc= 436kb/s - SharedPool-Worker-19
>>>>> [000563] user= 3.33% sys=-0.04% alloc= 460kb/s - SharedPool-Worker-20
>>>>> [000448] user= 2.73% sys= 0.27% alloc= 414kb/s - SharedPool-Worker-21
>>>>> [000554] user= 1.72% sys= 0.70% alloc= 232kb/s - SharedPool-Worker-26
>>>>> [000558] user= 1.41% sys= 0.39% alloc= 213kb/s - SharedPool-Worker-23
>>>>> [000450] user= 1.41% sys=-0.03% alloc= 158kb/s - SharedPool-Worker-17

-- 
Julien Anguenot (@anguenot)
USA +1.832.408.0344
FR +33.7.86.85.70.44