xy720 opened a new pull request, #46959: URL: https://github.com/apache/doris/pull/46959
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #28608

Problem Summary:

In `TabletStatMgr`, we call `stream().parallel()` or `parallelStream()` inside a ForkJoinTask. When either method is called, the stream distributes its `ForEach` tasks across multiple threads. However, when the stream runs inside a ForkJoinTask, it tries to take those threads from that task's ForkJoinPool. When the ForkJoinPool has only a few threads, thread contention is very likely, and it can ultimately lead to a deadlock.

Here is the deadlock stack from a 4-core FE:

```
"tablet stat mgr" #28 daemon prio=5 os_prio=0 cpu=12322.96ms elapsed=2159051.95s allocated=8527M defined_classes=5 tid=0x00007f4d241d6800 nid=0x24b6 in Object.wait()  [0x00007f4cfb37a000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(java.base@11.0.24/Native Method)
	- waiting on <no object reference available>
	at java.util.concurrent.ForkJoinTask.externalAwaitDone(java.base@11.0.24/ForkJoinTask.java:330)
	- waiting to re-lock in wait() <0x00000005debf6e00> (a java.util.concurrent.ForkJoinTask$AdaptedRunnableAction)
	at java.util.concurrent.ForkJoinTask.doJoin(java.base@11.0.24/ForkJoinTask.java:398)
	at java.util.concurrent.ForkJoinTask.join(java.base@11.0.24/ForkJoinTask.java:721)
	at org.apache.doris.catalog.TabletStatMgr.runAfterCatalogReady(TabletStatMgr.java:85)
	at org.apache.doris.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:58)
	at org.apache.doris.common.util.Daemon.run(Daemon.java:119)

   Locked ownable synchronizers:
	- None

"ForkJoinPool-1-worker-13" #441579 daemon prio=5 os_prio=0 cpu=839.24ms elapsed=191462.96s allocated=356M defined_classes=0 tid=0x00007f4d88008000 nid=0xb2668 waiting on condition  [0x00007f4cf6807000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.24/Native Method)
	- parking to wait for  <0x00000005c4abe5a8> (a java.util.concurrent.ForkJoinPool)
	at java.util.concurrent.locks.LockSupport.parkUntil(java.base@11.0.24/LockSupport.java:275)
	at java.util.concurrent.ForkJoinPool.runWorker(java.base@11.0.24/ForkJoinPool.java:1619)
	at java.util.concurrent.ForkJoinWorkerThread.run(java.base@11.0.24/ForkJoinWorkerThread.java:183)

   Locked ownable synchronizers:
	- None

"ForkJoinPool-2-worker-9" #444184 daemon prio=5 os_prio=0 cpu=2.16ms elapsed=179817.30s allocated=1076K defined_classes=0 tid=0x00007f4d60dc6000 nid=0xd4a06 waiting on condition  [0x00007f4ce315f000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.24/Native Method)
	- parking to wait for  <0x00000005cc189d48> (a java.util.concurrent.ForkJoinPool)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.24/LockSupport.java:194)
	at java.util.concurrent.ForkJoinPool.runWorker(java.base@11.0.24/ForkJoinPool.java:1628)
	at java.util.concurrent.ForkJoinWorkerThread.run(java.base@11.0.24/ForkJoinWorkerThread.java:183)

   Locked ownable synchronizers:
	- None

"ForkJoinPool-2-worker-11" #444199 daemon prio=5 os_prio=0 cpu=1.27ms elapsed=179757.30s allocated=555K defined_classes=0 tid=0x00007f4d802a1800 nid=0xd4cd6 waiting on condition  [0x00007f4cdc32e000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.24/Native Method)
	- parking to wait for  <0x00000005cc189d48> (a java.util.concurrent.ForkJoinPool)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.24/LockSupport.java:194)
	at java.util.concurrent.ForkJoinPool.runWorker(java.base@11.0.24/ForkJoinPool.java:1628)
	at java.util.concurrent.ForkJoinWorkerThread.run(java.base@11.0.24/ForkJoinWorkerThread.java:183)

   Locked ownable synchronizers:
	- None
```

This commit dynamically adjusts the thread count of the ForkJoinPool according to the number of backends. The thread count is at least 8 and at most 64, and it is rounded up to a multiple of 8.
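
For reviewers, the sizing rule described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the class name `PoolSizeSketch`, the method `poolSizeFor`, and the assumption that the raw target is one thread per backend before rounding and clamping are all hypothetical.

```java
// Hypothetical sketch of the sizing rule: round the backend count up to a
// multiple of 8, then clamp the result to the range [8, 64].
class PoolSizeSketch {
    static final int MIN_THREADS = 8;   // lower bound stated in the PR description
    static final int MAX_THREADS = 64;  // upper bound stated in the PR description
    static final int STEP = 8;          // thread count is a multiple of 8

    // Assumption: one thread per backend is the raw target before rounding.
    static int poolSizeFor(int backendCount) {
        int nonNegative = Math.max(backendCount, 0);
        // Integer round-up to the next multiple of STEP.
        int rounded = ((nonNegative + STEP - 1) / STEP) * STEP;
        return Math.min(MAX_THREADS, Math.max(MIN_THREADS, rounded));
    }

    public static void main(String[] args) {
        for (int n : new int[] {0, 3, 8, 9, 20, 100}) {
            System.out.println(n + " backends -> " + poolSizeFor(n) + " threads");
        }
    }
}
```

Under these assumptions, 3 backends map to 8 threads, 9 backends to 16, 20 backends to 24, and anything above 64 backends is capped at 64.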

### Release note

None

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [x] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason? -->

- Behavior changed:
    - [x] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [x] No.
    - [ ] Yes. <!-- Add document PR link here. eg: https://github.com/apache/doris-website/pull/1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->

--

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

For additional commands, e-mail: commits-h...@doris.apache.org