[ https://issues.apache.org/jira/browse/SPARK-51365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17932822#comment-17932822 ]
Yang Jie commented on SPARK-51365:
----------------------------------

Sharing some findings. When I run
{code:java}
build/mvn test -Dtest.include.tags=org.apache.spark.tags.ExtendedSQLTest -pl sql/core
{code}
locally with Maven and observe the resource usage of SQLQueryTestSuite, I see a large number of threads like the following in the test process:

* 829 `ResultQueryStageExecution` threads (the count fluctuates, so 829 may not be the maximum).

{code:java}
"ResultQueryStageExecution-1107" prio=0 tid=0x0 nid=0x0 waiting on condition
   java.lang.Thread.State: TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@58eaa9b3
        at java.base@17.0.14/jdk.internal.misc.Unsafe.park(Native Method)
        at java.base@17.0.14/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252)
        at java.base@17.0.14/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679)
        at java.base@17.0.14/java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:460)
        at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1061)
        at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1122)
        at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base@17.0.14/java.lang.Thread.run(Thread.java:840)
{code}

* 1024 `shuffle-exchange` threads (the count fluctuates, so 1024 may not be the maximum).
{code:java}
"shuffle-exchange-1000" prio=0 tid=0x0 nid=0x0 waiting on condition
   java.lang.Thread.State: TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@5e6488d2
        at java.base@17.0.14/jdk.internal.misc.Unsafe.park(Native Method)
        at java.base@17.0.14/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252)
        at java.base@17.0.14/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679)
        at java.base@17.0.14/java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:460)
        at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1061)
        at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1122)
        at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base@17.0.14/java.lang.Thread.run(Thread.java:840)
{code}

* 100 threads each for `block-manager-ask-thread-pool`, `block-manager-storage-async-thread-pool`, and `broadcast-exchange`; these counts appear to be fixed, as they remained unchanged across multiple jstack runs.
{code:java}
"block-manager-ask-thread-pool-0" prio=0 tid=0x0 nid=0x0 waiting on condition
   java.lang.Thread.State: TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3ad48349
        at java.base@17.0.14/jdk.internal.misc.Unsafe.park(Native Method)
        at java.base@17.0.14/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252)
        at java.base@17.0.14/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679)
        at java.base@17.0.14/java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:460)
        at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1061)
        at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1122)
        at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base@17.0.14/java.lang.Thread.run(Thread.java:840)
{code}
{code:java}
"block-manager-storage-async-thread-pool-0" prio=0 tid=0x0 nid=0x0 waiting on condition
   java.lang.Thread.State: TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@783b4363
        at java.base@17.0.14/jdk.internal.misc.Unsafe.park(Native Method)
        at java.base@17.0.14/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252)
        at java.base@17.0.14/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679)
        at java.base@17.0.14/java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:460)
        at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1061)
        at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1122)
        at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base@17.0.14/java.lang.Thread.run(Thread.java:840)
{code}
{code:java}
"broadcast-exchange-451" prio=0 tid=0x0 nid=0x0 waiting on condition
   java.lang.Thread.State: TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@5b039a8d
        at java.base@17.0.14/jdk.internal.misc.Unsafe.park(Native Method)
        at java.base@17.0.14/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252)
        at java.base@17.0.14/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679)
        at java.base@17.0.14/java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:460)
        at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1061)
        at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1122)
        at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base@17.0.14/java.lang.Thread.run(Thread.java:840)
{code}

The above is what I observed, and the number of threads does seem abnormal (perhaps related to the design of SharedSparkSession?). Also, since we configure -Xss4m (and in the Hive module we even need -Xss64m ...), in extreme cases this may indeed make it impossible to create new threads.

> OOM occurred during macOS daily tests
> -------------------------------------
>
>                 Key: SPARK-51365
>                 URL: https://issues.apache.org/jira/browse/SPARK-51365
>             Project: Spark
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 4.1.0
>            Reporter: Yang Jie
>            Priority: Major
>
> * [https://github.com/apache/spark/actions/runs/13316147273/job/37299839380]
> {code:java}
> Warning: [343.044s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 4096k, guardsize: 16k, detached.
> Warning: [343.044s][warning][os,thread] Failed to start the native thread for java.lang.Thread "shuffle-exchange-1529"
> *** RUN ABORTED ***
> An exception or error caused a run to abort: unable to create native thread: possibly out of memory or process/resource limits reached
>   java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
>   at java.base/java.lang.Thread.start0(Native Method)
>   at java.base/java.lang.Thread.start(Thread.java:1553)
>   at java.base/java.lang.System$2.start(System.java:2577)
>   at java.base/jdk.internal.vm.SharedThreadContainer.start(SharedThreadContainer.java:152)
>   at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:953)
>   at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1364)
>   at scala.concurrent.impl.ExecutionContextImpl.execute(ExecutionContextImpl.scala:21)
>   at java.base/java.util.concurrent.CompletableFuture.asyncSupplyStage(CompletableFuture.java:1782)
>   at java.base/java.util.concurrent.CompletableFuture.supplyAsync(CompletableFuture.java:2005)
>   at org.apache.spark.sql.execution.SQLExecution$.withThreadLocalCaptured(SQLExecution.scala:329)
>   ...
> Warning: The requested profile "volcano" could not be activated because it does not exist.
> Warning: The requested profile "hive" could not be activated because it does not exist.
> Error: Failed to execute goal org.scalatest:scalatest-maven-plugin:2.2.0:test (test) on project spark-sql_2.13: There are test failures -> [Help 1]
> Error:
> Error: To see the full stack trace of the errors, re-run Maven with the -e switch.
> Error: Re-run Maven using the -X switch to enable full debug logging.
> Error:
> Error: For more information about the errors and possible solutions, please read the following articles:
> Error: [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
> Error: Process completed with exit code 1. {code}
>
> * [https://github.com/apache/spark/actions/runs/13316147273/job/37299839259]
>
> {code:java}
> - group-by-ordinal.sql
> - group-by-ordinal.sql_analyzer_test
> Warning: [495.950s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 4096k, guardsize: 16k, detached.
> Warning: [495.950s][warning][os,thread] Failed to start the native thread for java.lang.Thread "shuffle-exchange-1799"
> 16:17:17.464 ERROR org.apache.spark.sql.SQLQueryTestSuite: Error using configs: spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=CODEGEN_ONLY
>
> *** RUN ABORTED ***
> An exception or error caused a run to abort: unable to create native thread: possibly out of memory or process/resource limits reached
>   java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
>   at java.base/java.lang.Thread.start0(Native Method)
>   at java.base/java.lang.Thread.start(Thread.java:1553)
>   at java.base/java.lang.System$2.start(System.java:2577)
>   at java.base/jdk.internal.vm.SharedThreadContainer.start(SharedThreadContainer.java:152)
>   at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:953)
>   at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1364)
>   at scala.concurrent.impl.ExecutionContextImpl.execute(ExecutionContextImpl.scala:21)
>   at java.base/java.util.concurrent.CompletableFuture.asyncSupplyStage(CompletableFuture.java:1782)
>   at java.base/java.util.concurrent.CompletableFuture.supplyAsync(CompletableFuture.java:2005)
>   at org.apache.spark.sql.execution.SQLExecution$.withThreadLocalCaptured(SQLExecution.scala:329)
>   ... {code}
>
> The root cause is unknown for now, and we need to investigate it.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
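The comment's analysis pairs roughly two thousand idle pool threads with the -Xss4m setting. A back-of-the-envelope sketch of what that implies (the thread counts below are the approximate point-in-time values from the jstack observations, not exact maxima, and -Xss sets the maximum stack size per thread, so this is reserved address space rather than guaranteed resident memory):

```java
// Rough estimate of stack address space reserved by the observed pool threads,
// assuming -Xss4m. Counts are the approximate jstack snapshots quoted above.
public class StackReservation {
    public static long reservedMiB(long threads, long stackMiB) {
        return threads * stackMiB;
    }

    public static void main(String[] args) {
        long threads = 829      // ResultQueryStageExecution
                     + 1024     // shuffle-exchange
                     + 3 * 100; // block-manager-* and broadcast-exchange pools
        System.out.println(threads + " threads x 4 MiB (-Xss4m) = "
                + reservedMiB(threads, 4) + " MiB reserved for stacks");
        // prints: 2153 threads x 4 MiB (-Xss4m) = 8612 MiB reserved for stacks
    }
}
```

Around 8.4 GiB of stack reservation on top of the heap would plausibly hit per-process resource limits on the macOS runners, which is consistent with pthread_create failing with EAGAIN in the quoted logs.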