It happened again today and I had a bit more time to probe. It seems all
non-periodic tasks execute on a single thread, so if that one thread were
to get stuck, work would simply pile up until the node runs out of memory.
I took a series of stack dumps and the thread always looked something like this:

"NonPeriodicTasks:1" #103 daemon prio=5 os_prio=0 tid=0x00007febe8342400
> nid=0x4103 runnable [0x00007febc78ed000]
>    java.lang.Thread.State: RUNNABLE
> at com.google.common.collect.Iterators$7.computeNext(Iterators.java:652)
> at
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
> at
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
> at
> com.github.benmanes.caffeine.cache.LocalCache.invalidateAll(LocalCache.java:108)
> at
> com.github.benmanes.caffeine.cache.LocalManualCache.invalidateAll(LocalManualCache.java:79)
> at
> org.apache.cassandra.cache.ChunkCache.invalidateFile(ChunkCache.java:197)
> at
> org.apache.cassandra.io.util.FileHandle$Cleanup.lambda$tidy$0(FileHandle.java:207)
> at
> org.apache.cassandra.io.util.FileHandle$Cleanup$$Lambda$217/794936631.accept(Unknown
> Source)
> at java.util.Optional.ifPresent(Optional.java:159)
> at
> org.apache.cassandra.io.util.FileHandle$Cleanup.tidy(FileHandle.java:207)
> at
> org.apache.cassandra.utils.concurrent.Ref$GlobalState.release(Ref.java:326)
> at
> org.apache.cassandra.utils.concurrent.Ref$State.ensureReleased(Ref.java:204)
> at org.apache.cassandra.utils.concurrent.Ref.ensureReleased(Ref.java:129)
> at
> org.apache.cassandra.utils.concurrent.SharedCloseableImpl.close(SharedCloseableImpl.java:45)
> at
> org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier$1.run(SSTableReader.java:2231)


And the thread executing these tasks was always at 100% CPU.
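To make the failure mode concrete, here is a minimal sketch (assumed names,
not Cassandra's actual executor code) of why a single worker thread with an
unbounded queue turns one spinning task into unbounded memory growth:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Minimal sketch, not Cassandra's code: one worker thread, unbounded queue.
    // If the currently running task spins forever, every later submission is
    // accepted but never drained, so the queue (and heap usage) only grows.
    public class SingleThreadPileUp {
        public static void main(String[] args) {
            ExecutorService nonPeriodicTasks = Executors.newSingleThreadExecutor();

            // Stand-in for the slow/looping cache invalidation: never returns.
            nonPeriodicTasks.execute(() -> { while (true) { /* busy loop, 100% CPU */ } });

            // Stand-ins for the SSTable tidier tasks: queued behind the stuck one.
            for (int i = 0; i < 1_000_000; i++) {
                nonPeriodicTasks.execute(() -> System.out.println("tidy"));
            }
        }
    }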

One would expect invalidating a local cache to be a cheap operation, yet it's
not. What could cause chunk cache invalidation to be slow? Cassandra does seem
to be using an old version of Caffeine, and there have been issues
<https://github.com/ben-manes/caffeine/issues/216> with it in the past where it
would go into an endless loop under the wrong set of circumstances.
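For reference, this is roughly the pattern the stack trace points at (type and
field names below are illustrative assumptions, not copied from ChunkCache):
invalidating one file's chunks filters the entire key set through a Guava
iterator and hands the filtered view to Caffeine's invalidateAll, so the cost
is proportional to the total number of cached chunks, not the number of chunks
belonging to that one file:

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;
    import com.google.common.collect.Iterables;

    // Rough sketch of the pattern suggested by the stack trace; names are
    // assumptions, not Cassandra's actual ChunkCache code.
    public class ChunkCacheSketch {
        // Illustrative key identifying one chunk of one file.
        record ChunkKey(String path, long position) {}

        private final Cache<ChunkKey, byte[]> cache = Caffeine.newBuilder()
                .maximumSize(100_000)
                .build();

        void invalidateFile(String path) {
            // Iterables.filter produces the AbstractIterator / Iterators$7 frames
            // seen in the dump: every key in the cache is visited to find the ones
            // for this file, so one file's invalidation walks the whole chunk cache.
            cache.invalidateAll(Iterables.filter(cache.asMap().keySet(),
                                                 key -> key.path().equals(path)));
        }
    }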




On Mon, 3 Aug 2020 at 13:52, jelmer <jkupe...@gmail.com> wrote:

> It did look like there were repairs running at the time. The
> LiveSSTableCount for the entire node is about 2200 tables; for the keyspace
> that was being repaired it's just 150
>
> We run Cassandra 3.11.6, so we should be unaffected by CASSANDRA-14096
>
> We use http://cassandra-reaper.io/ for the repairs
>
>
>
> On Sat, 1 Aug 2020 at 01:49, Erick Ramirez <erick.rami...@datastax.com>
> wrote:
>
>> I don't have specific experience relating to InstanceTidier but when I
>> saw this, I immediately thought of repairs blowing up the heap. 40K
>> instances indicates to me that you have thousands of SSTables -- are they
>> tiny (like 1MB or less)? Otherwise, are they dense nodes (~1TB or more)?
>>
>> How do you run repairs? I'm wondering if it's possible that there are
>> multiple repairs running in parallel like a cron job kicking in while the
>> previous repair is still running.
>>
>> You didn't specify your C* version but my guess is that it's pre-3.11.5.
>> FWIW the repair issue I'm referring to is CASSANDRA-14096 [1].
>>
>> [1] https://issues.apache.org/jira/browse/CASSANDRA-14096
>>
>
