Copilot commented on code in PR #4691:
URL: https://github.com/apache/cassandra/pull/4691#discussion_r2998997253
##########
src/java/org/apache/cassandra/utils/JVMStabilityInspector.java:
##########
@@ -124,7 +124,12 @@ public static void inspectThrowable(Throwable t,
Consumer<Throwable> fn, boolean
logger.error("OutOfMemory error letting the JVM handle the
error:", t);
- StorageService.instance.removeShutdownHook();
+ // Only remove the shutdown hook for fatal heap OOMs where the JVM
will exit
+ // and the drain hook could hang due to lack of heap memory. For
non-fatal OOMs
+ // (e.g. direct buffer, native thread), keep the hook for graceful
shutdown.
+ // See CASSANDRA-16939.
+ if (isHeapSpaceOom((OutOfMemoryError) t))
+ StorageService.instance.removeShutdownHook();
forceHeapSpaceOomMaybe((OutOfMemoryError) t);
Review Comment:
The shutdown-hook removal is now gated on the *original* OOM message, but
`forceHeapSpaceOomMaybe(...)` will intentionally exhaust the Java heap for any
non-heap OOM (e.g. "Direct buffer memory"). That means we can still end up in a
real heap OOM/VM exit path while keeping the drain shutdown hook installed,
reintroducing the hang this code is trying to avoid. Consider either (a) also
removing the hook when you are about to force a heap OOM, or (b) skipping
`forceHeapSpaceOomMaybe` for the OOM types where we want to preserve graceful
shutdown, so the behavior is consistent.
```suggestion
{
StorageService.instance.removeShutdownHook();
forceHeapSpaceOomMaybe((OutOfMemoryError) t);
}
```
##########
test/unit/org/apache/cassandra/utils/JVMStabilityInspectorTest.java:
##########
@@ -239,4 +240,54 @@ public void fileHandleTest()
JVMStabilityInspector.replaceKiller(originalKiller);
}
}
+
+ @Test
+ public void testShutdownHookNotRemovedForNonFatalOom() throws Exception
+ {
+ testShutdownHookRemovalOnOom("Direct buffer memory", false);
+ }
+
+ @Test
+ public void testShutdownHookRemovedForFatalOom() throws Exception
+ {
+ for (String message : JVMStabilityInspector.FORCE_HEAP_OOM_IGNORE_SET)
+ testShutdownHookRemovalOnOom(message, true);
+ }
+
+ private void testShutdownHookRemovalOnOom(String oomMessage, boolean
expectHookRemoved) throws Exception
+ {
+ Thread testHook = new Thread(() -> {}, "TestDrainHook"); //
checkstyle: permit this instantiation
+ Runtime.getRuntime().addShutdownHook(testHook);
+
+ Field drainField =
StorageService.class.getDeclaredField("drainOnShutdown");
+ drainField.setAccessible(true);
+ Thread originalHook = (Thread) drainField.get(StorageService.instance);
+
+ try
+ {
+ drainField.set(StorageService.instance, testHook);
+
+ Assertions.assertThatThrownBy(() ->
JVMStabilityInspector.inspectThrowable(new OutOfMemoryError(oomMessage)))
+ .isInstanceOf(OutOfMemoryError.class);
+
+ if (expectHookRemoved)
+ assertFalse("Shutdown hook should be removed for fatal OOM ("
+ oomMessage + ')',
+ Runtime.getRuntime().removeShutdownHook(testHook));
+ else
+ assertTrue("Shutdown hook should not be removed for non-fatal
OOM (" + oomMessage + ')',
+ Runtime.getRuntime().removeShutdownHook(testHook));
+ }
+ finally
+ {
+ drainField.set(StorageService.instance, originalHook);
Review Comment:
This test triggers `StorageService.instance.removeShutdownHook()` for the
fatal OOM cases, and that method unconditionally calls
`PathUtils.clearOnExitThreads()` (removing global delete-on-exit shutdown
hooks). That can interfere with other unit tests that rely on
`PathUtils.deleteRecursiveOnExit(...)`/`File.deleteOnExit()` cleanup later in
the same JVM. Consider snapshotting/restoring PathUtils' on-exit hook list via
reflection in the `finally`, or refactoring the production code to make the
shutdown-hook removal observable without clearing unrelated JVM shutdown hooks.
##########
src/java/org/apache/cassandra/utils/JVMStabilityInspector.java:
##########
@@ -186,7 +191,18 @@ else if (t instanceof UnrecoverableIllegalStateException)
inspectThrowable(t.getCause(), fn, isUncaughtException);
}
- private static final Set<String> FORCE_HEAP_OOM_IGNORE_SET =
ImmutableSet.of("Java heap space", "GC Overhead limit exceeded");
+ @VisibleForTesting
+ static final Set<String> FORCE_HEAP_OOM_IGNORE_SET = ImmutableSet.of("Java
heap space", "GC Overhead limit exceeded");
+
+ /**
+ * Returns true if the given OOM is a fatal heap-related error (heap space
exhaustion or GC overhead),
+ * as opposed to a non-fatal OOM like direct buffer or native thread
exhaustion.
+ */
+ @VisibleForTesting
+ static boolean isHeapSpaceOom(OutOfMemoryError oom)
+ {
+ return FORCE_HEAP_OOM_IGNORE_SET.contains(oom.getMessage());
+ }
Review Comment:
`isHeapSpaceOom(...)` (and `forceHeapSpaceOomMaybe`) rely on an exact match
of `OutOfMemoryError.getMessage()`. This is brittle across JVMs/versions and
even casing; e.g. HotSpot commonly uses "GC overhead limit exceeded" (lowercase
'overhead'), while the set currently contains "GC Overhead limit exceeded".
Consider normalizing (case-insensitive compare, or `contains`/prefix match) and
guarding against JVM-specific message variations so fatal heap OOMs are
consistently detected.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]