[ 
https://issues.apache.org/jira/browse/SPARK-51365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17932822#comment-17932822
 ] 

Yang Jie commented on SPARK-51365:
----------------------------------

Sharing some findings:

 

When I run
{code:java}
build/mvn test -Dtest.include.tags=org.apache.spark.tags.ExtendedSQLTest -pl sql/core {code}
locally with Maven and observe the resource usage of the SQLQueryTestSuite process, the thread count of the test JVM stands out.
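For context, the per-prefix thread counts below come from plain jstack dumps of the forked test JVM; a throwaway helper along these lines (purely illustrative, the class name is made up and it is not part of the build) is enough to tally threads per pool prefix:
{code:java}
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Illustrative only: groups the thread names in a `jstack <pid>` dump by their
// pool prefix (dropping the trailing "-<n>") and prints a count per prefix.
// Run directly with Java 11+: java ThreadDumpHistogram.java dump.txt
public class ThreadDumpHistogram {
    public static void main(String[] args) throws Exception {
        Pattern threadName = Pattern.compile("^\"([^\"]+)\"");  // thread name at the start of each dump entry
        Pattern trailingId = Pattern.compile("-\\d+$");         // per-thread numeric suffix, e.g. "-1107"

        Map<String, Long> counts = Files.lines(Path.of(args[0]))
            .map(threadName::matcher)
            .filter(Matcher::find)
            .map(m -> trailingId.matcher(m.group(1)).replaceAll(""))
            .collect(Collectors.groupingBy(p -> p, Collectors.counting()));

        counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}
{code}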

 

I noticed a large number of threads like the following in the test process:

 
 * 829 `ResultQueryStageExecution` threads (the number of these threads fluctuates, so 829 may not be the maximum count; see the pool sketch after the dumps below).

 
{code:java}
"ResultQueryStageExecution-1107" prio=0 tid=0x0 nid=0x0 waiting on condition
     java.lang.Thread.State: TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@58eaa9b3
    at java.base@17.0.14/jdk.internal.misc.Unsafe.park(Native Method)
    at java.base@17.0.14/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252)
    at java.base@17.0.14/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679)
    at java.base@17.0.14/java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:460)
    at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1061)
    at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1122)
    at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base@17.0.14/java.lang.Thread.run(Thread.java:840) {code}
 
 * 1024 `shuffle-exchange` threads (the number of these threads fluctuates, so 1024 may not be the maximum count).

 

 
{code:java}
"shuffle-exchange-1000" prio=0 tid=0x0 nid=0x0 waiting on condition
     java.lang.Thread.State: TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@5e6488d2
    at java.base@17.0.14/jdk.internal.misc.Unsafe.park(Native Method)
    at java.base@17.0.14/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252)
    at java.base@17.0.14/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679)
    at java.base@17.0.14/java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:460)
    at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1061)
    at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1122)
    at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base@17.0.14/java.lang.Thread.run(Thread.java:840) {code}
 

 
 * 100 threads each for block-manager-ask-thread-pool, block-manager-storage-async-thread-pool, and broadcast-exchange. The number of these threads appears to be fixed, as the counts remain unchanged across multiple jstack runs.
 
{code:java}
"block-manager-ask-thread-pool-0" prio=0 tid=0x0 nid=0x0 waiting on condition
     java.lang.Thread.State: TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3ad48349
    at java.base@17.0.14/jdk.internal.misc.Unsafe.park(Native Method)
    at java.base@17.0.14/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252)
    at java.base@17.0.14/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679)
    at java.base@17.0.14/java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:460)
    at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1061)
    at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1122)
    at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base@17.0.14/java.lang.Thread.run(Thread.java:840) {code}


 

 
{code:java}
"block-manager-storage-async-thread-pool-0" prio=0 tid=0x0 nid=0x0 waiting on condition
     java.lang.Thread.State: TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@783b4363
    at java.base@17.0.14/jdk.internal.misc.Unsafe.park(Native Method)
    at java.base@17.0.14/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252)
    at java.base@17.0.14/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679)
    at java.base@17.0.14/java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:460)
    at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1061)
    at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1122)
    at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base@17.0.14/java.lang.Thread.run(Thread.java:840) {code}
 

 

 
{code:java}
"broadcast-exchange-451" prio=0 tid=0x0 nid=0x0 waiting on condition
     java.lang.Thread.State: TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@5b039a8d
    at java.base@17.0.14/jdk.internal.misc.Unsafe.park(Native Method)
    at java.base@17.0.14/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252)
    at java.base@17.0.14/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679)
    at java.base@17.0.14/java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:460)
    at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1061)
    at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1122)
    at java.base@17.0.14/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base@17.0.14/java.lang.Thread.run(Thread.java:840)
 {code}
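All of these idle threads are parked in the timed LinkedBlockingQueue.poll inside ThreadPoolExecutor.getTask, i.e. they are pool workers waiting out a keep-alive, which would explain why the ResultQueryStageExecution and shuffle-exchange counts fluctuate between dumps. A minimal JDK-only sketch of that behavior follows; the bound, keep-alive, and thread names are arbitrary illustration values, not Spark's actual pool settings:
{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: a bounded "cached" pool whose idle workers time out, which is
// the behavior the dumps above suggest (timed poll in ThreadPoolExecutor.getTask).
public class CachedPoolSketch {
    public static void main(String[] args) throws Exception {
        AtomicInteger id = new AtomicInteger();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            1024, 1024,                        // assumed upper bound, core == max
            5L, TimeUnit.SECONDS,              // short keep-alive so the demo finishes quickly
            new LinkedBlockingQueue<>(),
            r -> {
                Thread t = new Thread(r, "demo-exchange-" + id.getAndIncrement());
                t.setDaemon(true);
                return t;
            });
        pool.allowCoreThreadTimeOut(true);     // lets idle core workers be reclaimed

        // A burst of short tasks spins up (close to) the full 1024 workers...
        for (int i = 0; i < 1024; i++) {
            pool.execute(() -> {
                try { Thread.sleep(100); } catch (InterruptedException ignored) { }
            });
        }
        Thread.sleep(1_000);
        // ...which then sit parked in LinkedBlockingQueue.poll (TIMED_WAITING),
        // exactly the frame visible in the dumps, until the keep-alive expires.
        System.out.println("workers right after the burst: " + pool.getPoolSize());
        Thread.sleep(10_000);
        System.out.println("workers after the keep-alive:  " + pool.getPoolSize());
        pool.shutdown();
    }
}
{code}
In other words, a burst of concurrent queries can legitimately drive such a pool up to its bound, and the workers then linger in TIMED_WAITING until the keep-alive expires, so a dump taken shortly after a burst can show hundreds of threads per pool even though nothing is running.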
 

The above is what I observed, and the number of threads does look abnormal (perhaps related to the design of SharedSparkSession?). Also, since we configure -Xss4m (and in the Hive module we even need -Xss64m ...), in extreme cases this could indeed make it impossible to create new native threads.
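As a rough back-of-the-envelope for the -Xss point (the thread counts are just the ones observed above; the class is only an illustration):
{code:java}
// Back-of-the-envelope only: -Xss sets the Java thread stack size, so every
// extra thread reserves roughly that much stack (plus guard pages and native overhead).
public class StackReservationEstimate {
    public static void main(String[] args) {
        long xssBytes = 4L * 1024 * 1024;             // -Xss4m, as configured for these tests
        long observedThreads = 829 + 1024 + 3 * 100;  // counts seen in the dumps above
        double reservedGiB = observedThreads * xssBytes / (1024.0 * 1024 * 1024);
        System.out.printf("~%d threads x 4 MiB stack ~= %.1f GiB of reserved stack%n",
            observedThreads, reservedGiB);
        // The CI failure itself is pthread_create(EAGAIN), so the per-process
        // thread limit may bite before memory does; with -Xss64m (hive module)
        // the reservation would be 16x larger for the same thread count.
    }
}
{code}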

 

> OOM occurred during macOS daily tests
> -------------------------------------
>
>                 Key: SPARK-51365
>                 URL: https://issues.apache.org/jira/browse/SPARK-51365
>             Project: Spark
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 4.1.0
>            Reporter: Yang Jie
>            Priority: Major
>
> * [https://github.com/apache/spark/actions/runs/13316147273/job/37299839380]
> {code:java}
> Warning: [343.044s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 4096k, guardsize: 16k, detached.
> Warning: [343.044s][warning][os,thread] Failed to start the native thread for java.lang.Thread "shuffle-exchange-1529"
> *** RUN ABORTED ***
> An exception or error caused a run to abort: unable to create native thread: possibly out of memory or process/resource limits reached
>   java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
>   at java.base/java.lang.Thread.start0(Native Method)
>   at java.base/java.lang.Thread.start(Thread.java:1553)
>   at java.base/java.lang.System$2.start(System.java:2577)
>   at java.base/jdk.internal.vm.SharedThreadContainer.start(SharedThreadContainer.java:152)
>   at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:953)
>   at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1364)
>   at scala.concurrent.impl.ExecutionContextImpl.execute(ExecutionContextImpl.scala:21)
>   at java.base/java.util.concurrent.CompletableFuture.asyncSupplyStage(CompletableFuture.java:1782)
>   at java.base/java.util.concurrent.CompletableFuture.supplyAsync(CompletableFuture.java:2005)
>   at org.apache.spark.sql.execution.SQLExecution$.withThreadLocalCaptured(SQLExecution.scala:329)
>   ...
> Warning:  The requested profile "volcano" could not be activated because it does not exist.
> Warning:  The requested profile "hive" could not be activated because it does not exist.
> Error:  Failed to execute goal org.scalatest:scalatest-maven-plugin:2.2.0:test (test) on project spark-sql_2.13: There are test failures -> [Help 1]
> Error:  
> Error:  To see the full stack trace of the errors, re-run Maven with the -e switch.
> Error:  Re-run Maven using the -X switch to enable full debug logging.
> Error:  
> Error:  For more information about the errors and possible solutions, please read the following articles:
> Error:  [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
> Error: Process completed with exit code 1. {code}
>  
>  * [https://github.com/apache/spark/actions/runs/13316147273/job/37299839259]
>  
> {code:java}
> - group-by-ordinal.sql
> - group-by-ordinal.sql_analyzer_test
> Warning: [495.950s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 4096k, guardsize: 16k, detached.
> Warning: [495.950s][warning][os,thread] Failed to start the native thread for java.lang.Thread "shuffle-exchange-1799"
> 16:17:17.464 ERROR org.apache.spark.sql.SQLQueryTestSuite: Error using configs: spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=CODEGEN_ONLY
> 
> *** RUN ABORTED ***
> An exception or error caused a run to abort: unable to create native thread: possibly out of memory or process/resource limits reached
>   java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
>   at java.base/java.lang.Thread.start0(Native Method)
>   at java.base/java.lang.Thread.start(Thread.java:1553)
>   at java.base/java.lang.System$2.start(System.java:2577)
>   at java.base/jdk.internal.vm.SharedThreadContainer.start(SharedThreadContainer.java:152)
>   at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:953)
>   at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1364)
>   at scala.concurrent.impl.ExecutionContextImpl.execute(ExecutionContextImpl.scala:21)
>   at java.base/java.util.concurrent.CompletableFuture.asyncSupplyStage(CompletableFuture.java:1782)
>   at java.base/java.util.concurrent.CompletableFuture.supplyAsync(CompletableFuture.java:2005)
>   at org.apache.spark.sql.execution.SQLExecution$.withThreadLocalCaptured(SQLExecution.scala:329)
>   ... {code}
>  
>  
> The root cause is unknown for now, and we need to investigate it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
