After disabling binary, gossip, and thrift, the node blocks on 16 read stages:

[iadmin@ip-10-0-25-46 ~]$ nodetool tpstats
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0       19587002         0                 0
ReadStage                        16    122722         825762         0                 0
RequestResponseStage              0         0       14281567         0                 0
ReadRepairStage                   0         0          37390         0                 0
CounterMutationStage              0         0              0         0                 0
MiscStage                         0         0              0         0                 0
HintedHandoff                     0         0            114         0                 0
GossipStage                       0         0          93775         0                 0
CacheCleanupExecutor              0         0              0         0                 0
InternalResponseStage             0         0              0         0                 0
CommitLogArchiver                 0         0              0         0                 0
CompactionExecutor                0         0          18523         0                 0
ValidationExecutor                0         0             18         0                 0
MigrationStage                    0         0              6         0                 0
AntiEntropyStage                  0         0             60         0                 0
PendingRangeCalculator            0         0             89         0                 0
Sampler                           0         0              0         0                 0
MemtableFlushWriter               0         0           2489         0                 0
MemtablePostFlush                 0         0           2562         0                 0
MemtableReclaimMemory             1        28           2461         0                 0

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
MUTATION                     0
COUNTER_MUTATION             0
BINARY                       0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0

> On 12 Feb 2016, at 17:45, Skvazh Roman <r...@skvazh.com> wrote:
>
> There are 1-4 compactions running at that moment.
> We have many tombstones that are not being removed.
> DroppableTombstoneRatio is 5-6 (greater than 1).
>
>> On 12 Feb 2016, at 15:53, Julien Anguenot <jul...@anguenot.org> wrote:
>>
>> Hey,
>>
>> What about the compaction count when that is happening?
>>
>> J.
>>
>>
>>> On Feb 12, 2016, at 3:06 AM, Skvazh Roman <r...@skvazh.com> wrote:
>>>
>>> Hello!
>>> We have a cluster of 25 c3.4xlarge nodes (16 cores, 32 GiB), each with an
>>> attached 1.5 TB, 4000 PIOPS EBS drive.
>>> Sometimes user CPU on one or two nodes spikes to 100% and the load average
>>> climbs to 20-30; read requests drop off.
>>> Only restarting Cassandra on those nodes helps.
>>> Please advise.
>>>
>>> One big table with wide rows, 600 GB per node.
>>> LZ4Compressor
>>> LeveledCompaction
>>>
>>> concurrent compactors: 4
>>> compaction throughput: tried from 16 to 128 MB/s
>>> concurrent readers: from 16 to 32
>>> concurrent writers: 128
>>>
>>> https://gist.github.com/rskvazh/de916327779b98a437a6
>>>
>>> JvmTop 0.8.0 alpha - 06:51:10, amd64, 16 cpus, Linux 3.14.44-3, load avg 19.35
>>> http://code.google.com/p/jvmtop
>>>
>>> Profiling PID 9256: org.apache.cassandra.service.CassandraDa
>>>
>>> 95.73% ( 4.31s) ....google.common.collect.AbstractIterator.tryToComputeN()
>>>  1.39% ( 0.06s) com.google.common.base.Objects.hashCode()
>>>  1.26% ( 0.06s) io.netty.channel.epoll.Native.epollWait()
>>>  0.85% ( 0.04s) net.jpountz.lz4.LZ4JNI.LZ4_compress_limitedOutput()
>>>  0.46% ( 0.02s) net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast()
>>>  0.26% ( 0.01s) com.google.common.collect.Iterators$7.computeNext()
>>>  0.06% ( 0.00s) io.netty.channel.epoll.Native.eventFdWrite()
>>>
>>> ttop:
>>>
>>> 2016-02-12T08:20:25.605+0000 Process summary
>>>   process cpu=1565.15%
>>>   application cpu=1314.48% (user=1354.48% sys=-40.00%)
>>>   other: cpu=250.67%
>>>   heap allocation rate 146mb/s
>>> [000405] user=76.25% sys= -0.54% alloc=    0b/s - SharedPool-Worker-9
>>> [000457] user=75.54% sys= -1.26% alloc=    0b/s - SharedPool-Worker-14
>>> [000451] user=73.52% sys=  0.29% alloc=    0b/s - SharedPool-Worker-16
>>> [000311] user=76.45% sys= -2.99% alloc=    0b/s - SharedPool-Worker-4
>>> [000389] user=70.69% sys=  2.62% alloc=    0b/s - SharedPool-Worker-6
>>> [000388] user=86.95% sys=-14.28% alloc=    0b/s - SharedPool-Worker-5
>>> [000404] user=70.69% sys=  0.10% alloc=    0b/s - SharedPool-Worker-8
>>> [000390] user=72.61% sys= -1.82% alloc=    0b/s - SharedPool-Worker-7
>>> [000255] user=87.86% sys=-17.87% alloc=    0b/s - SharedPool-Worker-1
>>> [000444] user=72.21% sys= -2.30% alloc=    0b/s - SharedPool-Worker-12
>>> [000310] user=71.50% sys= -2.31% alloc=    0b/s - SharedPool-Worker-3
>>> [000445] user=69.68% sys= -0.83% alloc=    0b/s - SharedPool-Worker-13
>>> [000406] user=72.61% sys= -4.40% alloc=    0b/s - SharedPool-Worker-10
>>> [000446] user=69.78% sys= -1.65% alloc=    0b/s - SharedPool-Worker-11
>>> [000452] user=66.86% sys=  0.22% alloc=    0b/s - SharedPool-Worker-15
>>> [000256] user=69.08% sys= -2.42% alloc=    0b/s - SharedPool-Worker-2
>>> [004496] user=29.99% sys=  0.59% alloc=  30mb/s - CompactionExecutor:15
>>> [004906] user=29.49% sys=  0.74% alloc=  39mb/s - CompactionExecutor:16
>>> [010143] user=28.58% sys=  0.25% alloc=  26mb/s - CompactionExecutor:17
>>> [000785] user=27.87% sys=  0.70% alloc=  38mb/s - CompactionExecutor:12
>>> [012723] user= 9.09% sys=  2.46% alloc=2977kb/s - RMI TCP Connection(2673)-127.0.0.1
>>> [000555] user= 5.35% sys= -0.08% alloc= 474kb/s - SharedPool-Worker-24
>>> [000560] user= 3.94% sys=  0.07% alloc= 434kb/s - SharedPool-Worker-22
>>> [000557] user= 3.94% sys= -0.17% alloc= 339kb/s - SharedPool-Worker-25
>>> [000447] user= 2.73% sys=  0.60% alloc= 436kb/s - SharedPool-Worker-19
>>> [000563] user= 3.33% sys= -0.04% alloc= 460kb/s - SharedPool-Worker-20
>>> [000448] user= 2.73% sys=  0.27% alloc= 414kb/s - SharedPool-Worker-21
>>> [000554] user= 1.72% sys=  0.70% alloc= 232kb/s - SharedPool-Worker-26
>>> [000558] user= 1.41% sys=  0.39% alloc= 213kb/s - SharedPool-Worker-23
>>> [000450] user= 1.41% sys= -0.03% alloc= 158kb/s - SharedPool-Worker-17
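
For anyone checking other nodes for the same symptom, here is a minimal Python sketch that flags backed-up pools in nodetool tpstats output. It assumes each pool row is a one-word pool name followed by five integer columns, as in the tpstats output at the top of this message. The names parse_pools and backlogged and the 1000-task threshold are hypothetical choices for illustration, not anything nodetool provides, and SAMPLE is a three-row excerpt of the output in this thread.

```python
from typing import List, NamedTuple


class PoolStats(NamedTuple):
    """One row of the 'Pool Name' section of `nodetool tpstats`."""
    name: str
    active: int
    pending: int
    completed: int
    blocked: int
    all_time_blocked: int


def parse_pools(text: str) -> List[PoolStats]:
    """Parse pool rows: a one-word name followed by five integer columns.

    Lines before the 'Pool Name' header (e.g. the shell prompt) are skipped,
    and parsing stops at the first row that does not match, such as the
    'Message type / Dropped' section that follows the pool table.
    """
    pools: List[PoolStats] = []
    in_pools = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # blank separator lines
        if fields[:2] == ["Pool", "Name"]:
            in_pools = True
            continue
        if not in_pools:
            continue
        if len(fields) != 6 or not all(f.isdigit() for f in fields[1:]):
            break  # end of the pool table
        pools.append(PoolStats(fields[0], *(int(f) for f in fields[1:])))
    return pools


def backlogged(pools: List[PoolStats], threshold: int = 1000) -> List[PoolStats]:
    """Pools whose pending queue is at or above a threshold.

    The default of 1000 is an arbitrary, hypothetical cut-off.
    """
    return [p for p in pools if p.pending >= threshold]


# Three-row excerpt of the tpstats output quoted in this thread.
SAMPLE = """\
[iadmin@ip-10-0-25-46 ~]$ nodetool tpstats
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0       19587002         0                 0
ReadStage                        16    122722         825762         0                 0
MemtableReclaimMemory             1        28           2461         0                 0

Message type           Dropped
READ                         0
"""

if __name__ == "__main__":
    for p in backlogged(parse_pools(SAMPLE)):
        print(f"{p.name}: active={p.active} pending={p.pending}")
```

On that excerpt it reports only ReadStage, with Active at 16 and 122722 tasks pending, which is consistent with the 16 SharedPool-Worker threads near 70-88% user CPU in the ttop output. The row-shape check is deliberately strict so the parser stops at the Message type / Dropped section instead of misreading it as pool rows.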