[ https://issues.apache.org/jira/browse/SPARK-52571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao updated SPARK-52571:
-----------------------------
    Fix Version/s:     (was: 4.0.1)

> ExecuteGrpcResponseSender reports "Deadline Exceeded" when size limits are 
> hit, causing debugging confusion
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-52571
>                 URL: https://issues.apache.org/jira/browse/SPARK-52571
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect
>    Affects Versions: 4.0.0
>         Environment: *Spark Version:* 4.0.0
> *Deployment:* Docker container using {{apache/spark:4.0.0}}
> *Client:* Scala application using Spark Connect
> *Data Source:* Delta tables on S3 (s3a://)
> *Dataset Size:* 20+ million rows
> *Docker Configuration:*
> {code:yaml}
> services:
>   spark:
>     image: apache/spark:4.0.0
>     mem_limit: 12g
>     environment:
>       SPARK_MODE: master
>     ports:
>       - "15002:15002"    # Spark-Connect gRPC
> {code}
> *Spark Connect Server Configuration:*
> {code:bash}
> /opt/spark/sbin/start-connect-server.sh \
>   --conf spark.driver.memory=10g \
>   --conf spark.driver.maxResultSize=8g \
>   --conf spark.connect.execute.reattachable.senderMaxStreamDuration=1200s \
>   --conf spark.connect.execute.reattachable.senderMaxStreamSize=2g \
>   --conf spark.connect.grpc.maxInboundMessageSize=268435456 \
>   --conf spark.connect.grpc.deadline=1200s \
>   --conf spark.network.timeout=1200s
> {code}
>            Reporter: Callum Dempsey Leach
>            Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> h4. Issue
> *ExecuteGrpcResponseSender reports "Deadline Exceeded" when size limits are 
> hit, causing debugging confusion*
> I encountered significant debugging issues when streaming large result sets 
> (20M+ rows) from Delta tables using Spark Connect. The 
> {{ExecuteGrpcResponseSender}} was reporting "Deadline reached" errors, which 
> led me to focus entirely on time-based configurations for hours, when the 
> actual issue was size limits being exceeded.
> The root cause is a misleading log message in 
> [{{ExecuteGrpcResponseSender.scala}}|https://github.com/apache/spark/blob/master/sql/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteGrpcResponseSender.scala]: 
> the termination check covers both time AND size limits, but the log message 
> only ever reports "Deadline reached":
> {code:scala}
> // Line ~240: the condition checks BOTH time AND size
> def deadlineLimitReached =
>   sentResponsesSize > maximumResponseSize || deadlineTimeNs < System.nanoTime()
>
> // Line ~320: but the log message only mentions the deadline
> logInfo(log"Deadline reached, shutting down stream for opId=...")
> {code}
> *My Experience:* I was seeing "Deadline reached" errors and spent 
> considerable time adjusting time-based settings like 
> {{senderMaxStreamDuration}}, when the real issue was 
> {{CONNECT_EXECUTE_REATTACHABLE_SENDER_MAX_STREAM_SIZE}} (default 1GB) being 
> exceeded by my large streaming responses.
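> (For rough intuition, assuming something on the order of 100 bytes per 
> Arrow-encoded row: 20M rows is already ~2 GB of response data, comfortably 
> past the 1 GB default.)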
> *Documentation Gap:* The situation was made worse because these critical 
> streaming configurations ({{senderMaxStreamDuration}}, 
> {{senderMaxStreamSize}}) are completely missing from the official Spark 
> Configuration documentation, making it nearly impossible to discover the 
> correct parameters to tune.
> *My Working Solution:*
> After identifying both the size and time limit issues, I configured:
> {code:bash}
> --conf spark.connect.execute.reattachable.senderMaxStreamDuration=1200s
> --conf spark.connect.execute.reattachable.senderMaxStreamSize=2g
> --conf spark.connect.grpc.maxInboundMessageSize=268435456
> --conf spark.connect.grpc.deadline=1200s
> --conf spark.network.timeout=1200s
> {code}
> This achieved 500K rows/s throughput from my Delta tables in S3.
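> For context, here is a minimal sketch of the kind of Spark Connect Scala 
> client that exercises this streaming path (it assumes the 
> {{spark-connect-client-jvm}} artifact on the classpath; the connection string 
> and table path are placeholders, not my actual values):
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> object LargeStreamSketch {
>   def main(args: Array[String]): Unit = {
>     // Spark Connect endpoint exposed by the container on port 15002.
>     val spark = SparkSession.builder()
>       .remote("sc://localhost:15002")
>       .getOrCreate()
>
>     // Placeholder Delta table on S3; substitute a real s3a:// path.
>     val df = spark.read.format("delta").load("s3a://some-bucket/some-table")
>
>     // Iterate instead of collecting everything at once; results still come 
>     // back as a gRPC response stream, which is the path governed by the 
>     // sender size/duration limits discussed above.
>     val rows = df.toLocalIterator()
>     var count = 0L
>     while (rows.hasNext) {
>       rows.next()
>       count += 1
>     }
>     println(s"Streamed $count rows")
>
>     spark.stop()
>   }
> }
> {code}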
> h4. Proposed Improvements
> *1. Fix Misleading Error Messages (High Priority):*
> - Distinguish between time-based and size-based stream termination in log 
> messages (see the sketch after this list)
> - Include specific configuration suggestions in error messages
> - Example: "Size limit reached (2.1GB > 1GB). Consider increasing 
> spark.connect.execute.reattachable.senderMaxStreamSize"
> - Example: "Time deadline reached (300s). Consider increasing 
> spark.connect.execute.reattachable.senderMaxStreamDuration"
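> A rough illustration of the distinction proposed above (illustrative only, not 
> a patch against {{ExecuteGrpcResponseSender}}; the parameters simply mirror 
> the existing {{sentResponsesSize}}, {{maximumResponseSize}} and 
> {{deadlineTimeNs}} values):
> {code:scala}
> // Name the limit that was actually hit instead of a generic "Deadline reached".
> def terminationMessage(
>     sentResponsesSize: Long,
>     maximumResponseSize: Long,
>     deadlineTimeNs: Long): Option[String] = {
>   if (sentResponsesSize > maximumResponseSize) {
>     Some(s"Size limit reached ($sentResponsesSize bytes > $maximumResponseSize bytes). " +
>       "Consider increasing spark.connect.execute.reattachable.senderMaxStreamSize")
>   } else if (deadlineTimeNs < System.nanoTime()) {
>     Some("Time deadline reached. Consider increasing " +
>       "spark.connect.execute.reattachable.senderMaxStreamDuration")
>   } else {
>     None  // neither limit hit; keep streaming
>   }
> }
> {code}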
> *2. Documentation Enhancement (Critical):*
> - Add {{senderMaxStreamDuration}} and {{senderMaxStreamSize}} to the official 
> Configuration documentation
> - These parameters are essential for large streaming workloads but completely 
> undocumented
> - Provide guidance on sizing timeouts and limits based on expected data 
> volumes
> - Include troubleshooting section for large streaming queries
> *3. Default Value Review:*
> - Consider increasing default {{senderMaxStreamDuration}} from 2m to 10m
> - Evaluate increasing default {{senderMaxStreamSize}} from 1GB for 
> large-scale streaming scenarios
> h4. Environment
> - Spark Version: 4.0.0
> - Deployment: Docker container using apache/spark:4.0.0
> - Data Source: Delta tables on S3 (s3a://)
> - Dataset Size: 20+ million rows



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
