[ 
https://issues.apache.org/jira/browse/SPARK-55620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063098#comment-18063098
 ] 

Tae Hwan Eom commented on SPARK-55620:
--------------------------------------

Hi,
I tried to reproduce this issue on my local macOS machine, but I haven't 
been able to.

That may be because my development machine is faster than the CI environment.

Still, I’ve thought about how we could adjust the code to reduce the likelihood 
of this issue occurring.
If I modify the code and open a PR, would you be able to test the updated 
version?

!image-2026-03-05-13-06-16-885.png|width=372,height=244!

> test_connect_session flaky timeout due to shutdown deadlock
> -----------------------------------------------------------
>
>                 Key: SPARK-55620
>                 URL: https://issues.apache.org/jira/browse/SPARK-55620
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect, PySpark
>    Affects Versions: 4.2.0
>            Reporter: Yicong Huang
>            Priority: Minor
>         Attachments: image-2026-03-05-13-06-16-885.png
>
>
> h2. Description
> {{test_connect_session}} occasionally times out (450 seconds) in CI. The test 
> normally completes in 20 seconds but sometimes hangs indefinitely during 
> shutdown, causing flaky test failures.
> h2. Reproduce
> This is a flaky bug with ~33% failure rate:
> 1. Run {{python/run-tests.py --testnames 
> pyspark.sql.tests.connect.test_connect_session}}
> 2. The test may hang until it hits the 450-second timeout
> *Evidence from CI runs:*
> - [Run 
> 22196465437|https://github.com/Yicong-Huang/spark/actions/runs/22196465437]: 
> Cancelled after 4m10s
> - [Run 
> 22196593939|https://github.com/Yicong-Huang/spark/actions/runs/22196593939]: 
> Timeout after 1h22m (hung at 450s)
> - [Run 
> 22237720726|https://github.com/Yicong-Huang/spark/actions/runs/22237720726]: 
> Success in 20s ✓
> h2. Root Cause
> Deadlock during Python shutdown when {{ReleaseExecute}} cleanup tasks are 
> still executing:
> {code}
> Session.__del__()
>   → client.close() waits: concurrent.futures.wait(self._release_futures)
>     → Worker thread executes: ReleaseExecute() gRPC call
>       → gRPC attempts: threading.Thread().start()
>         → Python 3.12 blocks thread creation during shutdown
>           → DEADLOCK (main waits for worker, worker waits for thread)
> {code}
> Thread stacks show:
> - Main thread: blocked in {{concurrent.futures.wait()}}
> - Worker thread: blocked in {{threading.Thread.start() -> self._started.wait()}}
> The {{ReleaseExecute}} tasks are asynchronous cleanup submitted during test 
> execution. If they haven't completed when Python shuts down, gRPC's attempt 
> to spawn I/O threads gets blocked.
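
One way to reduce the chance of an indefinite hang is to bound the wait on the 
cleanup futures instead of waiting without a timeout. The sketch below is only 
an illustration of that idea and is not the actual Spark Connect client code: 
the helper name wait_for_release_futures, the timeout_s parameter, and the demo 
function are hypothetical, and a threading.Event stands in for a gRPC call that 
is stuck during shutdown. It does not address the blocked thread creation 
itself; it only keeps the waiting side from blocking forever.

```python
import concurrent.futures
import threading
from concurrent.futures import ThreadPoolExecutor


def wait_for_release_futures(futures, timeout_s=5.0):
    """Wait at most timeout_s seconds for cleanup futures (hypothetical helper).

    Returns (done, not_done). Futures that have not started yet are
    cancelled; futures that are already running are left alone, because
    Future.cancel() cannot interrupt a running call.
    """
    done, not_done = concurrent.futures.wait(futures, timeout=timeout_s)
    for fut in not_done:
        fut.cancel()  # returns False (no effect) if the call is running
    return done, not_done


def demo():
    """Simulate one stuck cleanup task and one task queued behind it."""
    gate = threading.Event()  # stands in for a gRPC call that never returns
    pool = ThreadPoolExecutor(max_workers=1)
    stuck = pool.submit(gate.wait)            # occupies the only worker
    queued = pool.submit(lambda: "released")  # still pending in the queue

    done, not_done = wait_for_release_futures([stuck, queued], timeout_s=0.2)
    result = (len(done), len(not_done), queued.cancelled())

    gate.set()                # unblock the worker so the demo exits cleanly
    pool.shutdown(wait=True)
    return result
```

Calling demo() returns after roughly the timeout rather than hanging. Whether 
abandoning unfinished ReleaseExecute calls at shutdown is acceptable is a 
design decision for the actual fix.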



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
