Here is what I can tell you from that trace, assuming this is the correct thread and it really hangs there:
The validation is stuck while reading from an SSTable. Unfortunately I am no Caffeine expert. It looks like the read goes through the chunk cache and, after the read, Caffeine tries to drain its read buffer (BoundedLocalCache.drainReadBuffer), and that is where the thread is stuck. I don't see the reason from that stack trace. Someone would have to dig deeper into Caffeine to find the root cause.

2017-04-13 9:27 GMT+02:00 Roland Otta <roland.o...@willhaben.at>:

> i had a closer look at the validation executor thread (i hope thats what
> you meant)
>
> it seems the thread is always repeating stuff in
> org.apache.cassandra.cache.ChunkCache$CachingRebufferer.rebuffer(ChunkCache.java:235)
>
> here is the full stack trace ...
>
> i am sorry .. but i have no clue whats happening there ..
>
> com.github.benmanes.caffeine.cache.BoundedLocalCache$$Lambda$64/2098345091.accept(Unknown Source)
> com.github.benmanes.caffeine.cache.BoundedBuffer$RingBuffer.drainTo(BoundedBuffer.java:104)
> com.github.benmanes.caffeine.cache.StripedBuffer.drainTo(StripedBuffer.java:160)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.drainReadBuffer(BoundedLocalCache.java:964)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.maintenance(BoundedLocalCache.java:918)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:903)
> com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:2680)
> com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleDrainBuffers(BoundedLocalCache.java:875)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.afterRead(BoundedLocalCache.java:748)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:1783)
> com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:97)
> com.github.benmanes.caffeine.cache.LocalLoadingCache.get(LocalLoadingCache.java:66)
> org.apache.cassandra.cache.ChunkCache$CachingRebufferer.rebuffer(ChunkCache.java:235)
> org.apache.cassandra.cache.ChunkCache$CachingRebufferer.rebuffer(ChunkCache.java:213)
> org.apache.cassandra.io.util.RandomAccessReader.reBufferAt(RandomAccessReader.java:65)
> org.apache.cassandra.io.util.RandomAccessReader.reBuffer(RandomAccessReader.java:59)
> org.apache.cassandra.io.util.RebufferingInputStream.read(RebufferingInputStream.java:88)
> org.apache.cassandra.io.util.RebufferingInputStream.readFully(RebufferingInputStream.java:66)
> org.apache.cassandra.io.util.RebufferingInputStream.readFully(RebufferingInputStream.java:60)
> org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:402)
> org.apache.cassandra.db.marshal.AbstractType.readValue(AbstractType.java:420)
> org.apache.cassandra.db.rows.Cell$Serializer.deserialize(Cell.java:245)
> org.apache.cassandra.db.rows.UnfilteredSerializer.readSimpleColumn(UnfilteredSerializer.java:610)
> org.apache.cassandra.db.rows.UnfilteredSerializer.lambda$deserializeRowBody$1(UnfilteredSerializer.java:575)
> org.apache.cassandra.db.rows.UnfilteredSerializer$$Lambda$84/898489541.accept(Unknown Source)
> org.apache.cassandra.utils.btree.BTree.applyForwards(BTree.java:1222)
> org.apache.cassandra.utils.btree.BTree.apply(BTree.java:1177)
> org.apache.cassandra.db.Columns.apply(Columns.java:377)
> org.apache.cassandra.db.rows.UnfilteredSerializer.deserializeRowBody(UnfilteredSerializer.java:571)
> org.apache.cassandra.db.rows.UnfilteredSerializer.deserialize(UnfilteredSerializer.java:440)
> org.apache.cassandra.io.sstable.SSTableSimpleIterator$CurrentFormatIterator.computeNext(SSTableSimpleIterator.java:95)
> org.apache.cassandra.io.sstable.SSTableSimpleIterator$CurrentFormatIterator.computeNext(SSTableSimpleIterator.java:73)
> org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
> org.apache.cassandra.io.sstable.SSTableIdentityIterator.hasNext(SSTableIdentityIterator.java:122)
> org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.computeNext(LazilyInitializedUnfilteredRowIterator.java:100)
> org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.computeNext(LazilyInitializedUnfilteredRowIterator.java:32)
> org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
> org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:374)
> org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:186)
> org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:155)
> org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
> org.apache.cassandra.db.rows.UnfilteredRowIterators$UnfilteredRowMergeIterator.computeNext(UnfilteredRowIterators.java:500)
> org.apache.cassandra.db.rows.UnfilteredRowIterators$UnfilteredRowMergeIterator.computeNext(UnfilteredRowIterators.java:360)
> org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
> org.apache.cassandra.db.transform.BaseRows.hasNext(BaseRows.java:133)
> org.apache.cassandra.db.rows.UnfilteredRowIterators.digest(UnfilteredRowIterators.java:178)
> org.apache.cassandra.repair.Validator.rowHash(Validator.java:221)
> org.apache.cassandra.repair.Validator.add(Validator.java:160)
> org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:1364)
> org.apache.cassandra.db.compaction.CompactionManager.access$700(CompactionManager.java:85)
> org.apache.cassandra.db.compaction.CompactionManager$13.call(CompactionManager.java:933)
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
> org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$5/1371495133.run(Unknown Source)
> java.lang.Thread.run(Thread.java:745)
>
> On Thu, 2017-04-13 at 08:47 +0200, benjamin roth wrote:
>
> You should connect to the node with JConsole and see where the compaction
> thread is stuck
>
> 2017-04-13 8:34 GMT+02:00 Roland Otta <roland.o...@willhaben.at>:
>
> hi,
>
> we have the following issue on our 3.10 development cluster.
>
> we are doing regular repairs with thelastpickle's fork of creaper.
> sometimes the repair (it is a full repair in that case) hangs because
> of a stuck validation compaction
>
> nodetool compactionstats gives me
> a1bb45c0-1fc6-11e7-81de-0fb0b3f5a345 Validation bds ad_event 805955242 841258085 bytes 95.80%
> we have here no more progress for hours
>
> nodetool tpstats shows
> ValidationExecutor 1 1 16186 0 0
>
> i checked the logs on the affected node and could not find any
> suspicious errors.
>
> anyone that already had this issue and knows how to cope with that?
>
> a restart of the node helps to finish the repair ... but i am not sure
> whether that somehow breaks the full repair
>
> bg,
> roland
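
A footnote on the JConsole suggestion quoted above: the same per-thread stack trace can also be read programmatically through the standard java.lang.management API. The following is only a minimal sketch, not something from this thread. The class name StuckThreadDump and the helper dump are made up for illustration, and the thread started in main is merely a stand-in for a stuck worker. It inspects the JVM it runs in, so looking at a live Cassandra node this way would additionally need a JMX connection, which is not shown here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Cassandra): prints the stack of threads by name.
public class StuckThreadDump {

    /** Returns one formatted stack trace per live thread whose name contains nameFragment. */
    public static List<String> dump(String nameFragment) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> out = new ArrayList<>();
        // (false, false): skip monitor and synchronizer details, keep the stack frames.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null || !info.getThreadName().contains(nameFragment)) {
                continue;
            }
            StringBuilder sb = new StringBuilder();
            sb.append('"').append(info.getThreadName()).append('"')
              .append(" state=").append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for a hanging worker; the thread name is an assumption for the demo.
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException ignored) {
                // woken up by main, just exit
            }
        }, "ValidationExecutor:1");
        worker.setDaemon(true);
        worker.start();
        Thread.sleep(200); // give the worker time to block

        dump("ValidationExecutor").forEach(System.out::println);
        worker.interrupt();
    }
}
```

Taking two such dumps a few seconds apart and comparing the top frames is one way to tell a thread that is blocked in one place from one that keeps looping through the same calls.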