[
https://issues.apache.org/jira/browse/SPARK-55620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yicong Huang updated SPARK-55620:
---------------------------------
Description:
h2. Description
{{test_connect_session}} occasionally hits the 450-second timeout in CI. The
test normally completes in about 20 seconds, but sometimes hangs indefinitely
during shutdown, causing flaky test failures.
h2. Reproduce
This is a flaky bug with a ~33% failure rate:
1. Run {{python/run-tests.py --testnames
pyspark.sql.tests.connect.test_connect_session}}
2. The test may hang until the 450-second timeout fires
*Evidence from CI runs:*
- [Run
22196465437|https://github.com/Yicong-Huang/spark/actions/runs/22196465437]:
Cancelled after 4m10s
- [Run
22196593939|https://github.com/Yicong-Huang/spark/actions/runs/22196593939]:
Timeout after 1h22m (hung at 450s)
- [Run
22237720726|https://github.com/Yicong-Huang/spark/actions/runs/22237720726]:
Success in 20s ✓
h2. Root Cause
Deadlock during Python shutdown when {{ReleaseExecute}} cleanup tasks are still
executing:
{code}
Session.__del__()
→ client.close() waits: concurrent.futures.wait(self._release_futures)
→ Worker thread executes: ReleaseExecute() gRPC call
→ gRPC attempts: threading.Thread().start()
→ Python 3.12 blocks thread creation during shutdown
→ DEADLOCK (main waits for worker, worker waits for thread)
{code}
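The cycle above can be simulated outside of interpreter shutdown. In this sketch (illustrative only, not the actual pyspark client code), the "spawn a gRPC I/O thread" step is replaced by an {{Event}} that never fires, so the main thread's wait can only return via a timeout:

```python
import concurrent.futures
import threading

# Stand-in for the shutdown deadlock: the worker blocks on an event that is
# never set, mirroring threading.start() -> self._started.wait() when thread
# creation cannot complete during Python 3.12 finalization.
thread_started = threading.Event()

def release_execute():
    # Stand-in for the ReleaseExecute gRPC call stuck spawning an I/O thread.
    thread_started.wait()

pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
fut = pool.submit(release_execute)

# Main thread waits on the worker, as client.close() does. Without the
# timeout argument this wait() would never return.
done, not_done = concurrent.futures.wait([fut], timeout=0.2)
print(len(done), len(not_done))  # 0 1

thread_started.set()  # unblock the worker so the demo itself can exit
pool.shutdown(wait=True)
```

With no timeout, this is exactly the main-waits-for-worker, worker-waits-for-thread cycle in the trace above.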
Thread stacks show:
- Main thread: blocked in {{concurrent.futures.wait()}}
- Worker thread: blocked in {{threading.start() -> self._started.wait()}}
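Stacks like these can be captured automatically. A sketch using the standard-library {{faulthandler}} module (the 440-second value is an assumption chosen to fire just before the 450 s CI kill):

```python
import faulthandler
import tempfile

# In CI, registering faulthandler.dump_traceback_later(440, exit=True) at
# test start would dump every thread's traceback right before the 450 s
# timeout kills the job. Here we dump immediately to a temp file to show
# the output shape.
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f, all_threads=True)
    f.seek(0)
    trace = f.read()

print("most recent call first" in trace)  # True: each stack is labeled
```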
The {{ReleaseExecute}} tasks are asynchronous cleanup work submitted during
test execution. If they have not completed by the time Python shuts down,
gRPC's attempt to spawn I/O threads blocks, and the wait in {{client.close()}}
never returns.
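One possible mitigation (a sketch under assumptions, not the actual {{SparkConnectClient.close()}} implementation) is to bound the wait with a deadline and best-effort cancel whatever has not started:

```python
import concurrent.futures
import time

def close_with_deadline(release_futures, timeout=2.0):
    # Hypothetical bounded version of the close() wait: give pending
    # ReleaseExecute futures a deadline, then cancel the rest instead of
    # blocking interpreter shutdown forever.
    done, not_done = concurrent.futures.wait(release_futures, timeout=timeout)
    for fut in not_done:
        fut.cancel()  # only succeeds for tasks that have not started running
    return done, not_done

# Demo with sleeps standing in for slow ReleaseExecute RPCs:
pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
slow = pool.submit(time.sleep, 0.5)    # occupies the single worker
queued = pool.submit(time.sleep, 0.5)  # still queued when the deadline hits
done, not_done = close_with_deadline([slow, queued], timeout=0.1)
print(len(done), len(not_done))  # 0 2
pool.shutdown(wait=True)
```

The trade-off is that a cancelled {{ReleaseExecute}} leaves server-side cleanup to the session timeout, but the interpreter can exit.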
> test_connect_session flaky timeout due to shutdown deadlock
> -----------------------------------------------------------
>
> Key: SPARK-55620
> URL: https://issues.apache.org/jira/browse/SPARK-55620
> Project: Spark
> Issue Type: Bug
> Components: Connect, PySpark
> Affects Versions: 4.2.0
> Reporter: Yicong Huang
> Priority: Minor
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]