I will give it a try and see what I can find. I plan to go down the
rabbit hole tomorrow. Will keep you updated.
On 05/11/2024 17:34, Jeff Jirsa wrote:
On Nov 5, 2024, at 4:12 AM, Bowen Song via user
<user@cassandra.apache.org> wrote:
Writes on this node start to time out and fail. If left untouched,
it only gets worse, and eventually leads to a JVM OOM and crash.
By inspecting the heap dump created at OOM, we can see that both of
the MemtableFlushWriter threads are stuck on line 1190
<https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1190>
in ColumnFamilyStore.java:
    // mark writes older than the barrier as blocking progress, permitting them to exceed our memory limit
    // if they are stuck waiting on it, then wait for them all to complete
    writeBarrier.markBlocking();
    writeBarrier.await(); // <----------- stuck here
And the MemtablePostFlush thread is stuck on line 1094
<https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1094>
in the same file.
    try
    {
        // we wait on the latch for the commitLogUpperBound to be set, and so that waiters
        // on this task can rely on all prior flushes being complete
        latch.await(); // <----------- stuck here
    }
Our top suspect is CDC interacting with repair, since this started to
happen shortly after we enabled CDC on the nodes, and repair was
running each time it happened. But we have not been able to reproduce
this in a testing cluster, and don't know what the next step to
troubleshoot this issue is. So I'm posting it to the mailing list,
hoping someone may know something about it or can point me in the
right direction.
I wouldn’t be completely surprised if CDC or repair somehow has a
barrier. I’ve also seen similar behavior pre-3.0 with “very long
running read commands” that have a barrier on the memtable that
prevents its release.
You’ve got the heap dump (great, way better than most people
debugging). Are you able to navigate through it and look for
references to that memtable or other things holding a barrier?
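
To make the suspected failure mode concrete, here is a minimal Java sketch of a flush waiting on a barrier that an unfinished operation never releases. It is a toy model only: ToyBarrier, StuckFlushDemo and flushCompletes are hypothetical names invented for illustration, and none of this is Cassandra's actual barrier implementation. Unlike a real ordering barrier, the toy counts every open operation rather than only those started before the barrier was issued, and it uses a timeout so the demo terminates instead of hanging the way the flush threads in the heap dump do.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified stand-in for an operation-ordering barrier.
// This is NOT Cassandra's implementation; it only models the waiting behavior.
final class ToyBarrier {
    private final AtomicInteger inFlight = new AtomicInteger();
    private final CountDownLatch drained = new CountDownLatch(1);
    private volatile boolean issued = false;

    // An operation (write, read, ...) registers before touching the memtable.
    void start() {
        inFlight.incrementAndGet();
    }

    // The operation deregisters when it finishes.
    void close() {
        if (inFlight.decrementAndGet() == 0 && issued) {
            drained.countDown();
        }
    }

    // The flush issues the barrier, then waits for open operations to drain.
    void issue() {
        issued = true;
        if (inFlight.get() == 0) {
            drained.countDown();
        }
    }

    // Returns true if all operations drained before the timeout expired.
    boolean await(long timeout, TimeUnit unit) throws InterruptedException {
        return drained.await(timeout, unit);
    }
}

final class StuckFlushDemo {
    // Returns true if the simulated flush got past the barrier in time.
    static boolean flushCompletes(boolean operationFinishes, long timeoutMillis) {
        ToyBarrier barrier = new ToyBarrier();
        barrier.start(); // an operation began before the flush was requested

        Thread operation = new Thread(() -> {
            if (operationFinishes) {
                barrier.close();
            }
            // Otherwise the operation never closes, like a long-running
            // command that keeps holding its place in front of the barrier.
        });
        operation.start();

        try {
            operation.join();
            barrier.issue();
            return barrier.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("operation finishes:     " + flushCompletes(true, 200));
        System.out.println("operation never closes: " + flushCompletes(false, 200));
    }
}
```

In this model the flush only proceeds once every registered operation has closed, so a single operation that never closes blocks it indefinitely. That is why looking in the heap dump for whatever still references the memtable or holds a place ahead of the barrier is a reasonable next step.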