Daisuke Taniwaki created SPARK-57425:
----------------------------------------

             Summary: Reattach iterator cannot recover when short-TTL credentials expire mid-stream
                 Key: SPARK-57425
                 URL: https://issues.apache.org/jira/browse/SPARK-57425
             Project: Spark
          Issue Type: Bug
          Components: Connect
    Affects Versions: 4.0.0, 4.1.0, 4.2.0, 5.0.0
            Reporter: Daisuke Taniwaki


`ExecutePlanResponseReattachableIterator`
(`python/pyspark/sql/connect/client/reattach.py`) has a reattach mechanism
designed to recover when the underlying gRPC stream is broken before
`ResultComplete`. That recovery is structurally impossible when the
server enforces a short auth-token TTL (e.g. AWS Athena Spark, 30 min):

1. `ExecutePlan` is started with a fresh credential.
2. The query runs past the TTL; the server kills the stream with
   `PERMISSION_DENIED`.
3. The default retry policy does not treat `PERMISSION_DENIED` as
   retryable, so the iterator never even attempts to reattach.
4. Even if reattach were attempted, `self._metadata` still holds the
   expired token captured at `__init__`, so it would immediately fail
   with the same `PERMISSION_DENIED` error.

The iterator's own contract ("recover from broken stream") is violated
for any deployment that combines short token TTLs with long-running
streams. Both gaps must be fixed for the reattach machinery to do what
it was designed to do.
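
The failure sequence above, and why both gaps have to be closed together, can be
illustrated with a toy model. This is only a sketch under stated assumptions: it
does not use gRPC or PySpark, and `ToyServer`, `ReattachingIterator`, `Code`,
`StreamError`, and `run` are hypothetical names invented for illustration. They
do not reflect the actual structure of `reattach.py` or of the proposed patch.

```python
import enum
from typing import Callable, Iterator, List, Set, Tuple

Metadata = List[Tuple[str, str]]


class Code(enum.Enum):
    """Hypothetical stand-in for the gRPC status codes relevant here."""
    UNAVAILABLE = "UNAVAILABLE"
    PERMISSION_DENIED = "PERMISSION_DENIED"


class StreamError(Exception):
    def __init__(self, code: Code):
        super().__init__(code.value)
        self.code = code


class ToyServer:
    """Serves rows 0..total-1 and rotates the accepted token at row `expire_at`."""

    def __init__(self, total: int, expire_at: int):
        self.total, self.expire_at = total, expire_at
        self.valid_token = "token-1"

    def stream(self, start: int, metadata: Metadata) -> Iterator[int]:
        for row in range(start, self.total):
            if row == self.expire_at:
                self.valid_token = "token-2"  # the TTL elapses mid-stream
            if dict(metadata).get("authorization") != self.valid_token:
                raise StreamError(Code.PERMISSION_DENIED)
            yield row


class ReattachingIterator:
    """Toy client: resumes from the last received row after a retryable error."""

    def __init__(self, server: ToyServer, provider: Callable[[], Metadata],
                 retryable: Set[Code], refresh_on_reattach: bool,
                 max_attempts: int = 3):
        self._server = server
        self._provider = provider
        self._metadata = provider()  # captured once at construction (step 4)
        self._retryable = retryable
        self._refresh = refresh_on_reattach
        self._max_attempts = max_attempts

    def collect(self) -> List[int]:
        rows: List[int] = []
        attempts = 0
        while True:
            try:
                for row in self._server.stream(len(rows), self._metadata):
                    rows.append(row)
                return rows
            except StreamError as err:
                attempts += 1
                # Gap 1: a non-retryable code ends the stream with no reattach.
                if err.code not in self._retryable or attempts >= self._max_attempts:
                    raise
                # Gap 2: without a refresh, the reattach reuses the expired token.
                if self._refresh:
                    self._metadata = self._provider()


def run(retry_denied: bool, refresh: bool) -> str:
    server = ToyServer(total=5, expire_at=3)
    # Assumption: the provider can always obtain the currently valid token.
    provider = lambda: [("authorization", server.valid_token)]
    retryable = {Code.UNAVAILABLE}
    if retry_denied:
        retryable.add(Code.PERMISSION_DENIED)
    try:
        return f"ok {ReattachingIterator(server, provider, retryable, refresh).collect()}"
    except StreamError as err:
        return f"failed {err.code.value}"


if __name__ == "__main__":
    for retry_denied in (False, True):
        for refresh in (False, True):
            print(retry_denied, refresh, run(retry_denied, refresh))
```

In this model only `run(True, True)` completes the stream. Enabling the retry
alone exhausts its attempts against the stale token, and enabling the refresh
alone never reaches the reattach path. The sketch treats `PERMISSION_DENIED` as
unconditionally retryable for simplicity; a real fix may need a narrower check.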

This has not surfaced in typical deployments because four conditions
must align (short server TTL, a stream that outlives it, a server that
actively kills the stream on expiry, and reattach firing). Local dev
without auth, on-prem with long-lived tokens, and short ad-hoc queries
each fail to meet at least one. Managed federated-credential environments
hit all four; Athena Spark Connect with its 30-minute auth token is the
canonical trigger.

The dbt-athena Spark adapter ships runtime monkey-patches today as a
verified workaround. They have been in production use long enough to
confirm the behaviour is safe. The fix here folds the moving parts into
upstream so the workaround becomes unnecessary.

Backport requested to branch-4.0, branch-4.1, and branch-4.2, since 4.x
is what managed environments actually run.



