> Does the load decrease and the node answers requests “normally” when you do 
> disable auto-compaction? You actually see pending compactions on nodes having 
> high load correct?

Nope.

> All seems legit here. Using G1 GC?

Yes

Problems also occurred on nodes without pending compactions.



> On 12 Feb 2016, at 18:44, Julien Anguenot <jul...@anguenot.org> wrote:
> 
>> 
>> On Feb 12, 2016, at 9:24 AM, Skvazh Roman <r...@skvazh.com> wrote:
>> 
>> I have disabled autocompaction and stopped it on the high-load node.
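>> 
>> (A sketch of the commands that maps to; exact keyspace/table arguments omitted:)
>> 
>> $ nodetool disableautocompaction   # stop scheduling new minor compactions
>> $ nodetool stop COMPACTION         # abort compactions already in progress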
> 
> Does the load decrease and the node answers requests “normally” when you do 
> disable auto-compaction? You actually see pending compactions on nodes having 
> high load correct?
> 
>> Heap is 8 GB. gc_grace is 86400.
>> All SSTables are about 200-300 MB.
> 
> All seems legit here. Using G1 GC?
> 
>> $ nodetool compactionstats
>> pending tasks: 14
> 
> Try increasing the compactors from 4 to 6-8 on a node, disable gossip and 
> let it finish compacting, then put it back in the ring by re-enabling gossip. See 
> what happens.
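> 
> Roughly, that sequence on the node would look something like this (raising
> concurrent_compactors from 4 to 6-8 is assumed to happen in cassandra.yaml or
> over JMX and is not shown):
> 
> $ nodetool disablegossip      # take the node out of the ring
> $ nodetool disablebinary      # optionally stop serving native-protocol clients as well
> $ nodetool compactionstats    # repeat until "pending tasks: 0"
> $ nodetool enablebinary
> $ nodetool enablegossip       # rejoin the ring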
> 
> The growing tombstone count is because auto-compaction is disabled 
> on these nodes. Probably not your issue.
> 
>    J.
> 
> 
>> 
>> $ dstat -lvnr 10
>> ---load-avg--- ---procs--- ------memory-usage----- ---paging-- -dsk/total- ---system-- ----total-cpu-usage---- -net/total- --io/total-
>> 1m   5m  15m |run blk new| used  buff  cach  free|  in   out | read  writ| int   csw |usr sys idl wai hiq siq| recv  send| read  writ
>> 29.4 28.6 23.5|0.0   0 1.2|11.3G  190M 17.6G  407M|   0     0 |7507k 7330k|  13k   40k| 11   1  88   0   0   0|   0     0 |96.5  64.6
>> 29.3 28.6 23.5| 29   0 0.9|11.3G  190M 17.6G  408M|   0     0 |   0   189k|9822  2319 | 99   0   0   0   0   0| 138k  120k|   0  4.30
>> 29.4 28.6 23.6| 30   0 2.0|11.3G  190M 17.6G  408M|   0     0 |   0    26k|8689  2189 |100   0   0   0   0   0| 139k  120k|   0  2.70
>> 29.4 28.7 23.6| 29   0 3.0|11.3G  190M 17.6G  408M|   0     0 |   0    20k|8722  1846 | 99   0   0   0   0   0| 136k  120k|   0  1.50 ^C
>> 
>> 
>> JvmTop 0.8.0 alpha - 15:20:37,  amd64, 16 cpus, Linux 3.14.44-3, load avg 28.09
>> http://code.google.com/p/jvmtop
>> 
>> PID 32505: org.apache.cassandra.service.CassandraDaemon
>> ARGS:
>> VMARGS: -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar -XX:+CMSCl[...]
>> VM: Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 1.8.0_65
>> UP:  8:31m  #THR: 334  #THRPEAK: 437  #THRCREATED: 4694 USER: cassandra
>> GC-Time:  0: 8m   #GC-Runs: 6378      #TotalLoadedClasses: 5926
>> CPU: 97.96% GC:  0.00% HEAP:6049m /7540m NONHEAP:  82m /  n/a
>> 
>>  TID   NAME                                    STATE    CPU  TOTALCPU BLOCKEDBY
>>    447 SharedPool-Worker-45                 RUNNABLE 60.47%     1.03%
>>    343 SharedPool-Worker-2                  RUNNABLE 56.46%     3.07%
>>    349 SharedPool-Worker-8                  RUNNABLE 56.43%     1.61%
>>    456 SharedPool-Worker-25                 RUNNABLE 55.25%     1.06%
>>    483 SharedPool-Worker-40                 RUNNABLE 53.06%     1.04%
>>    475 SharedPool-Worker-53                 RUNNABLE 52.31%     1.03%
>>    464 SharedPool-Worker-20                 RUNNABLE 52.00%     1.11%
>>    577 SharedPool-Worker-71                 RUNNABLE 51.73%     1.02%
>>    404 SharedPool-Worker-10                 RUNNABLE 51.10%     1.29%
>>    486 SharedPool-Worker-34                 RUNNABLE 51.06%     1.03%
>> Note: Only top 10 threads (according cpu load) are shown!
>> 
>> 
>>> On 12 Feb 2016, at 18:14, Julien Anguenot <jul...@anguenot.org> wrote:
>>> 
>>> At the time when the load is high and you have to restart, do you see any 
>>> pending compactions when using `nodetool compactionstats`?
>>> 
>>> Possible to see a `nodetool compactionstats` taken *when* the load is too 
>>> high?  Have you checked the size of your SSTables for that big table? Any 
>>> large ones in there?  What about the Java HEAP configuration on these nodes?
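>>> 
>>> Something along these lines would show all of that (paths and names are
>>> placeholders, adjust to your layout):
>>> 
>>> $ nodetool compactionstats
>>> $ nodetool cfstats <keyspace>.<table>    # SSTable count and space used for the big table
>>> $ ls -lhS /var/lib/cassandra/data/<keyspace>/<table>-*/*-Data.db | head
>>> $ grep -E 'MAX_HEAP_SIZE|HEAP_NEWSIZE' /etc/cassandra/cassandra-env.sh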
>>> 
>>> If you have too many tombstones I would try to decrease gc_grace_seconds so 
>>> they get cleared out earlier during compactions.
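>>> 
>>> For example (keyspace, table and value are placeholders; keep it longer than
>>> your repair interval so deletes are not resurrected):
>>> 
>>> $ cqlsh -e "ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 43200;"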
>>> 
>>>  J.
>>> 
>>>> On Feb 12, 2016, at 8:45 AM, Skvazh Roman <r...@skvazh.com> wrote:
>>>> 
>>>> There are 1-4 compactions running at that moment.
>>>> We have many tombstones which are not removed.
>>>> DroppableTombstoneRatio is 5-6 (greater than 1).
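>>>> 
>>>> (The per-SSTable estimate can be checked with something like this; the data
>>>> path is an assumption:)
>>>> 
>>>> $ sstablemetadata /var/lib/cassandra/data/<keyspace>/<table>-*/*-Data.db | grep -i droppable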
>>>> 
>>>>> On 12 Feb 2016, at 15:53, Julien Anguenot <jul...@anguenot.org> wrote:
>>>>> 
>>>>> Hey, 
>>>>> 
>>>>> What about compactions count when that is happening?
>>>>> 
>>>>> J.
>>>>> 
>>>>> 
>>>>>> On Feb 12, 2016, at 3:06 AM, Skvazh Roman <r...@skvazh.com> wrote:
>>>>>> 
>>>>>> Hello!
>>>>>> We have a cluster of 25 c3.4xlarge nodes (16 cores, 32 GiB) with an 
>>>>>> attached 1.5 TB 4000 PIOPS EBS drive.
>>>>>> Sometimes user CPU on one or two nodes spikes to 100% and load average rises to 
>>>>>> 20-30 - read requests drop off.
>>>>>> Only a restart of the Cassandra service helps.
>>>>>> Please advise.
>>>>>> 
>>>>>> One big table with wide rows. 600 GB per node.
>>>>>> LZ4Compressor
>>>>>> LeveledCompaction
>>>>>> 
>>>>>> concurrent_compactors: 4
>>>>>> compaction throughput: tried from 16 to 128
>>>>>> concurrent_reads: from 16 to 32
>>>>>> concurrent_writes: 128
>>>>>> 
>>>>>> 
>>>>>> https://gist.github.com/rskvazh/de916327779b98a437a6
>>>>>> 
>>>>>> 
>>>>>> JvmTop 0.8.0 alpha - 06:51:10,  amd64, 16 cpus, Linux 3.14.44-3, load avg 19.35
>>>>>> http://code.google.com/p/jvmtop
>>>>>> 
>>>>>> Profiling PID 9256: org.apache.cassandra.service.CassandraDa
>>>>>> 
>>>>>> 95.73% (     4.31s) ....google.common.collect.AbstractIterator.tryToComputeN()
>>>>>> 1.39% (     0.06s) com.google.common.base.Objects.hashCode()
>>>>>> 1.26% (     0.06s) io.netty.channel.epoll.Native.epollWait()
>>>>>> 0.85% (     0.04s) net.jpountz.lz4.LZ4JNI.LZ4_compress_limitedOutput()
>>>>>> 0.46% (     0.02s) net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast()
>>>>>> 0.26% (     0.01s) com.google.common.collect.Iterators$7.computeNext()
>>>>>> 0.06% (     0.00s) io.netty.channel.epoll.Native.eventFdWrite()
>>>>>> 
>>>>>> 
>>>>>> ttop:
>>>>>> 
>>>>>> 2016-02-12T08:20:25.605+0000 Process summary
>>>>>> process cpu=1565.15%
>>>>>> application cpu=1314.48% (user=1354.48% sys=-40.00%)
>>>>>> other: cpu=250.67%
>>>>>> heap allocation rate 146mb/s
>>>>>> [000405] user=76.25% sys=-0.54% alloc=     0b/s - SharedPool-Worker-9
>>>>>> [000457] user=75.54% sys=-1.26% alloc=     0b/s - SharedPool-Worker-14
>>>>>> [000451] user=73.52% sys= 0.29% alloc=     0b/s - SharedPool-Worker-16
>>>>>> [000311] user=76.45% sys=-2.99% alloc=     0b/s - SharedPool-Worker-4
>>>>>> [000389] user=70.69% sys= 2.62% alloc=     0b/s - SharedPool-Worker-6
>>>>>> [000388] user=86.95% sys=-14.28% alloc=     0b/s - SharedPool-Worker-5
>>>>>> [000404] user=70.69% sys= 0.10% alloc=     0b/s - SharedPool-Worker-8
>>>>>> [000390] user=72.61% sys=-1.82% alloc=     0b/s - SharedPool-Worker-7
>>>>>> [000255] user=87.86% sys=-17.87% alloc=     0b/s - SharedPool-Worker-1
>>>>>> [000444] user=72.21% sys=-2.30% alloc=     0b/s - SharedPool-Worker-12
>>>>>> [000310] user=71.50% sys=-2.31% alloc=     0b/s - SharedPool-Worker-3
>>>>>> [000445] user=69.68% sys=-0.83% alloc=     0b/s - SharedPool-Worker-13
>>>>>> [000406] user=72.61% sys=-4.40% alloc=     0b/s - SharedPool-Worker-10
>>>>>> [000446] user=69.78% sys=-1.65% alloc=     0b/s - SharedPool-Worker-11
>>>>>> [000452] user=66.86% sys= 0.22% alloc=     0b/s - SharedPool-Worker-15
>>>>>> [000256] user=69.08% sys=-2.42% alloc=     0b/s - SharedPool-Worker-2
>>>>>> [004496] user=29.99% sys= 0.59% alloc=   30mb/s - CompactionExecutor:15
>>>>>> [004906] user=29.49% sys= 0.74% alloc=   39mb/s - CompactionExecutor:16
>>>>>> [010143] user=28.58% sys= 0.25% alloc=   26mb/s - CompactionExecutor:17
>>>>>> [000785] user=27.87% sys= 0.70% alloc=   38mb/s - CompactionExecutor:12
>>>>>> [012723] user= 9.09% sys= 2.46% alloc= 2977kb/s - RMI TCP Connection(2673)-127.0.0.1
>>>>>> [000555] user= 5.35% sys=-0.08% alloc=  474kb/s - SharedPool-Worker-24
>>>>>> [000560] user= 3.94% sys= 0.07% alloc=  434kb/s - SharedPool-Worker-22
>>>>>> [000557] user= 3.94% sys=-0.17% alloc=  339kb/s - SharedPool-Worker-25
>>>>>> [000447] user= 2.73% sys= 0.60% alloc=  436kb/s - SharedPool-Worker-19
>>>>>> [000563] user= 3.33% sys=-0.04% alloc=  460kb/s - SharedPool-Worker-20
>>>>>> [000448] user= 2.73% sys= 0.27% alloc=  414kb/s - SharedPool-Worker-21
>>>>>> [000554] user= 1.72% sys= 0.70% alloc=  232kb/s - SharedPool-Worker-26
>>>>>> [000558] user= 1.41% sys= 0.39% alloc=  213kb/s - SharedPool-Worker-23
>>>>>> [000450] user= 1.41% sys=-0.03% alloc=  158kb/s - SharedPool-Worker-17
>>>>> 
>>> 
>>> 
>>> 
>> 
> 
> --
> Julien Anguenot (@anguenot)
> USA +1.832.408.0344
> FR +33.7.86.85.70.44
