Yicong Huang created SPARK-55620:
------------------------------------

             Summary: test_connect_session flaky timeout due to shutdown 
deadlock
                 Key: SPARK-55620
                 URL: https://issues.apache.org/jira/browse/SPARK-55620
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Connect
    Affects Versions: 4.2.0
            Reporter: Yicong Huang


h2. Description

{{test_connect_session}} occasionally times out (450 seconds) in CI. The test 
normally completes in 20 seconds but sometimes hangs indefinitely during 
shutdown, causing flaky test failures.

h2. Reproduce

This is a flaky failure with a ~33% failure rate:

1. Run {{python/run-tests.py --testnames 
pyspark.sql.tests.connect.test_connect_session}}
2. The test may hang until it hits the 450-second timeout

*Evidence from CI runs:*
- Run 22196465437: Cancelled after 4m10s
- Run 22196593939: Timeout after 1h22m (hung at 450s)
- Run 22237720726: Success in 20s ✓

h2. Root Cause

A deadlock occurs during Python shutdown when {{ReleaseExecute}} cleanup tasks 
are still executing:

{code}
Session.__del__()
  → client.close() waits: concurrent.futures.wait(self._release_futures)
    → Worker thread executes: ReleaseExecute() gRPC call
      → gRPC attempts: threading.Thread().start()
        → Python 3.12 blocks thread creation during shutdown
          → DEADLOCK (main waits for worker, worker waits for thread)
{code}

Thread stacks show:
- Main thread: blocked in {{concurrent.futures.wait()}}
- Worker thread: blocked in {{threading.Thread.start() -> self._started.wait()}}

The {{ReleaseExecute}} tasks are asynchronous cleanup submitted during test 
execution. If they haven't completed when Python shuts down, gRPC's attempt to 
spawn I/O threads gets blocked.
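
The wait pattern described above can be modelled in isolation. The following is a minimal sketch, not the PySpark client code: {{simulate_blocked_release}} and {{release_execute}} are hypothetical names, and a {{threading.Event}} stands in for the thread start that never completes during shutdown. It uses a bounded {{concurrent.futures.wait()}} only so that the demo can observe the stuck future and then exit cleanly.

```python
import concurrent.futures
import threading


def simulate_blocked_release(timeout_s=0.5):
    """Model a main thread waiting on a worker that cannot make progress."""
    # Stand-in for the condition the worker is stuck on (in the report,
    # Thread.start() waiting on self._started during interpreter shutdown).
    never_started = threading.Event()

    def release_execute():
        # Hypothetical stand-in for the ReleaseExecute gRPC call.
        never_started.wait()
        return "released"

    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(release_execute)

    # An unbounded wait here would hang, as the main thread does in the
    # report. A bounded wait returns and leaves the future in not_done.
    done, not_done = concurrent.futures.wait([future], timeout=timeout_s)
    counts = (len(done), len(not_done))

    # Unblock the worker so this demo terminates.
    never_started.set()
    executor.shutdown(wait=True)
    return counts[0], counts[1], future.result()
```

Calling {{simulate_blocked_release()}} reports zero completed futures and one pending future while the worker is blocked, which mirrors the two thread stacks listed above.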



