Hi,

Could the problem have occurred because the ForkJoinPool got an OOME when it tried to allocate a ForkJoinWorkerThread?
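One way to test that hypothesis is a worker-thread factory that logs any OutOfMemoryError raised while a worker is being created. The following is only a minimal sketch: the class name LoggingWorkerThreadFactory is hypothetical, it simply delegates to the JDK's default factory, and it only observes errors thrown inside newThread() itself. It is kept package-private so that it compiles in a single file; as far as I can tell, a factory loaded through the common-pool system property needs to be a public class with a public no-argument constructor.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinWorkerThread;

/**
 * Hypothetical factory (name is an assumption, not from the thread) that
 * delegates to the default factory and logs an OutOfMemoryError raised
 * while the worker thread object is being created.
 */
class LoggingWorkerThreadFactory implements ForkJoinPool.ForkJoinWorkerThreadFactory {

    @Override
    public ForkJoinWorkerThread newThread(ForkJoinPool pool) {
        try {
            // Reuse the JDK's default factory for the actual thread creation.
            return ForkJoinPool.defaultForkJoinWorkerThreadFactory.newThread(pool);
        } catch (OutOfMemoryError e) {
            // Log and rethrow so that the pool sees the same failure as before.
            System.err.println("OOME while creating a ForkJoinWorkerThread: " + e);
            throw e;
        }
    }
}
```

For a quick check outside the common pool, the factory can also be passed directly to a pool constructor, for example new ForkJoinPool(2, new LoggingWorkerThreadFactory(), null, false).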
To check for that, if you're using the commonPool(), you might be able to add a custom ForkJoinWorkerThreadFactory by passing -Djava.util.concurrent.ForkJoinPool.common.threadFactory=<insert fqcn of custom factory here> and implementing newThread() such that you try-catch the OOME and log it from there.

Cheers,
√

Viktor Klang
Software Architect, Java Platform Group
Oracle
________________________________
From: core-libs-dev <core-libs-dev-r...@openjdk.org> on behalf of Xiao Yu <cutefish...@gmail.com>
Sent: Saturday, 3 February 2024 19:54
To: Jaikiran Pai <jai.forums2...@gmail.com>
Cc: core-libs-dev@openjdk.org <core-libs-dev@openjdk.org>
Subject: Re: The common ForkJoinPool does not have any ForkJoinWorkerThread while tasks are submitted to the queue

Hi Jaikiran,

Thanks a lot for replying.

Our application is a client that communicates with the server using request/response. The client creates a secure (TLS) connection to the server; that is, on top of the SocketChannel, we implement a wrapper class called SSLDataChannel for reading and writing. The SSLDataChannel uses the javax.net.ssl.SSLEngine. Before any read or write can happen, we need to do the SSL handshake by calling methods in SSLEngine. One of those methods is SSLEngine#getDelegatedTask(). The returned task needs to be executed before the handshake can proceed. After the task is done, we need to continue processing read and write events on the connection. The connection read and write events are all handled by a class called NioEndpointHandler.

One requirement for our client is that it supports an asynchronous API, and therefore the whole stack must implement non-blocking methods. The tasks from the SSLEngine could take a long time and we do not want them to block our other connection events; this is where the ForkJoinPool is used. We run the SSL tasks in the ForkJoinPool, and after the tasks are done we arrange to run the NioEndpointHandler callbacks to proceed with the read and write events.
The much simplified code looks somewhat like the following.

```
class NioEndpointHandler {

    /** The SSL channel. */
    private final SSLDataChannel sslDataChannel;

    /** The runnable to execute to handle read after the SSL tasks are done. */
    private final Runnable handleReadAfterSSLTask = () -> onRead();

    /** The handler state. */
    State state;

    /** Executes the SSL tasks until there is no task to run, then runs the callback. */
    private void executeSSLTask(ExecutorService executor, Runnable callback) {
        executor.submit(() -> {
            Runnable task;
            while ((task = sslDataChannel.getSSLEngine().getDelegatedTask()) != null) {
                task.run();
            }
            try {
                callback.run();
            } catch (Throwable t) {
                /* logging the exception. */
            }
        });
    }

    /** Handle a read event. */
    private void onRead() {
        if (sslDataChannel.needsHandshake()) {
            /* do handshake */
            /* One of the handshake steps is to check if there is any SSL task to run. */
            if (sslDataChannel.needExecuteTask()) {
                executeSSLTask(ForkJoinPool.commonPool(), handleReadAfterSSLTask);
            }
        }
    }

    private void terminate() {
        state = TERMINATED;
        /* Other clean up tasks, however, tasks submitted to the ForkJoinPool are not cancelled. */
    }
}
```

> What are these handlers? Are they classes which implement Runnable or
> are they something else? What does termination of handler mean in this
> context? Do you use any java.util.concurrent.* APIs to "cancel" such
> terminated handlers?

Please see the much simplified handler code above. The tasks submitted to the ForkJoinPool queue are Runnables that are fields of the NioEndpointHandler. What we have observed is that there are a lot of tasks in the fork join pool that hold a reference to the lambda inside NioEndpointHandler#executeSSLTask, which in turn holds a reference to the NioEndpointHandler. Those NioEndpointHandler instances are in the TERMINATED state. These tasks are the only remaining references to those NioEndpointHandler instances; without them, the handlers could be garbage collected once termination has cleaned up all the other references.
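For illustration only, one possible way to break the retention chain described above (queued task -> lambda -> NioEndpointHandler) is to let the queued task hold its owner through a WeakReference. This is a minimal sketch under stated assumptions, not the fix that was actually applied: WeakCallbackTask is a hypothetical helper, and it only helps if the action passed in does not itself capture the owner (for example an unbound method reference or a non-capturing lambda).

```java
import java.lang.ref.WeakReference;
import java.util.function.Consumer;

/**
 * Hypothetical helper (not part of the original NioEndpointHandler code).
 * The queued task references its owner only weakly, so a task that is
 * stuck in an executor queue does not keep the owner reachable.
 */
final class WeakCallbackTask<T> implements Runnable {

    private final WeakReference<T> ownerRef;

    /** Must not capture the owner, or the weak reference is pointless. */
    private final Consumer<T> action;

    WeakCallbackTask(T owner, Consumer<T> action) {
        this.ownerRef = new WeakReference<>(owner);
        this.action = action;
    }

    @Override
    public void run() {
        T owner = ownerRef.get();
        if (owner == null) {
            // The owner was already garbage collected; nothing left to do.
            return;
        }
        // The owner is strongly reachable through the local variable while the action runs.
        action.accept(owner);
    }
}
```

In the handler above, the submission could then look like executor.submit(new WeakCallbackTask<>(this, h -> h.runSSLTasksAndContinue())), where runSSLTasksAndContinue is an assumed instance method that contains the body of the original lambda.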
Termination of the handler means those connections are at the end of their life cycle. We clean up things such as signalling the end of the life cycle for all the associated request/response pairs, closing the SSLDataChannel, etc.

No, we have not used the cancel method to cancel the submitted tasks. I agree that this is an oversight and it would be cleaner to cancel them. However, my current theory is that this is not the root cause. From my understanding of the code, the cancel method only changes the state of the task. It does not remove the task from the queue of the ForkJoinPool. Therefore, those tasks, even if cancelled, would still stay in the queue, preventing the terminated NioEndpointHandler from being garbage collected. Currently, I am strongly biased toward my own theory that somehow there is no ForkJoinPool thread polling tasks out of the queue, and I am trying to use the ctl field in the ForkJoinPool as evidence to back up my theory. I am wondering if I am making some mistake with my theory.

> Finally, what does the OutOfMemoryError exception stacktrace look like
> and what is the JVM parameters (heap size for example) used to launch
> this application?

Our client creates about 155 threads and quite a lot of them have OOME on their stack. I am not quite sure how to reply to this question. Going through the stack traces, I do not find anything very suspicious. They are just exercising their most frequent code paths: some I/O threads waiting for I/O events, some execution threads waiting for more work to do, etc.

It is worth mentioning that there are no ForkJoinWorkerThread stacks in the thread dump from the heap dump. From my understanding, the only time when there is no such thread is when there are no tasks to run. But there are quite a lot of tasks in the queue.
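The theory above (tasks queued but no worker thread polling them) could also be checked on a live process, without decoding the ctl field, through the pool's public monitoring methods. The following is a minimal sketch; the class name CommonPoolDiagnostics is hypothetical, and the values returned by these methods are approximations by design, so they are only suitable for monitoring.

```java
import java.util.concurrent.ForkJoinPool;

/** Hypothetical diagnostics helper built only on public ForkJoinPool methods. */
final class CommonPoolDiagnostics {

    private CommonPoolDiagnostics() {
    }

    /** Returns a one-line, best-effort snapshot of the given pool's state. */
    static String snapshot(ForkJoinPool pool) {
        return "parallelism=" + pool.getParallelism()
            + " poolSize=" + pool.getPoolSize()                   // worker threads started and not yet terminated
            + " active=" + pool.getActiveThreadCount()            // estimate of threads running or stealing tasks
            + " running=" + pool.getRunningThreadCount()          // estimate of threads not blocked
            + " queuedSubmissions=" + pool.getQueuedSubmissionCount()
            + " queuedTasks=" + pool.getQueuedTaskCount()
            + " steals=" + pool.getStealCount();
    }
}
```

Logging CommonPoolDiagnostics.snapshot(ForkJoinPool.commonPool()) periodically from a thread outside the pool would show whether poolSize stays at 0 while queuedSubmissions keeps growing, which would be consistent with the theory.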
Here are our JVM arguments:

```
-Xms1G
-Xmx1G
-Djava.util.logging.config.file=/var/lib/andc/config/params/sender.logging.properties
-Djavax.net.ssl.trustStore=/var/lib/andc/wallet/client.trust
-Doracle.kv.security=/var/lib/andc/config/security/login.properties
-Doci.javasdk.extra.stream.logs.enabled=false
-XX:G1HeapRegionSize=32m
-XX:+DisableExplicitGC
-Xlog:all=warning,gc*=info,safepoint=info:file=/var/lib/andc/log/sender/sender.gc:utctime:filecount=10,filesize=10000000
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/lib/andc/log/sender/
```

We have creation and termination timestamps in the NioEndpointHandler object. From what I can see in the heap dump, the SSL tasks in the ForkJoinPool are associated with NioEndpointHandler instances that were created at intervals on the order of seconds (retry attempts with second-magnitude backoff). Each NioEndpointHandler is terminated after a fixed 5-second timeout because it is unable to connect. The time span covered by those NioEndpointHandler instances is about 2 hours. This creates

```
2 hours * 3600 seconds / hour
        * 1 NioEndpointHandler / second
        * 1 SSLDataChannel / NioEndpointHandler
        * 65K bytes / SSLDataChannel
~= 468M bytes.
```

With a 1G heap size, this eventually caused the OOME.

We are adding fixes so that the SSL tasks would not prevent the NioEndpointHandler from being garbage collected. However, the root cause is still a mystery and I am wondering if I am on the right track to figure it out.

Thanks a lot for your time and patience.

Xiao Yu

On Fri, Feb 2, 2024 at 5:35 AM Jaikiran Pai <jai.forums2...@gmail.com> wrote:

Hello Xiao,

I don't have enough knowledge of this area to provide any insight into the issue. However, just to try and get the discussion started, do you have any sample code of your application which shows how the application uses the ForkJoinPool? More specifically what APIs do you use in the application? Few other questions inline below.

On 12/01/24 11:30 am, Xiao Yu wrote:
> ....
> Here is the full background. One of our process experienced an OOME and a heap
> dump was obtained. We know there was a concurrent issue of our system happening
> on some other machines such that network failure and retries occurred in this
> process at the same time. Upon analyzing the heap dump, we observed a lot of
> our network connection handlers being frequently created and terminated which
> is expected due to the network failure and retry attempts mentioned above.
> However, those terminated handlers are not being GC'ed because of there were
> references to tasks submitted to the ForkJoinPool during the connection
> attempts. The tasks stayed in the queue until OOME happened as there is no
> threads to execute them.

What are these handlers? Are they classes which implement Runnable or are they something else? What does termination of handler mean in this context? Do you use any java.util.concurrent.* APIs to "cancel" such terminated handlers?

Finally, what does the OutOfMemoryError exception stacktrace look like and what is the JVM parameters (heap size for example) used to launch this application?

-Jaikiran