[ https://issues.apache.org/jira/browse/SPARK-52571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kent Yao updated SPARK-52571:
-----------------------------
Fix Version/s: (was: 4.0.1)

ExecuteGrpcResponseSender reports "Deadline Exceeded" when size limits are hit, causing debugging confusion
-----------------------------------------------------------------------------------------------------------

Key: SPARK-52571
URL: https://issues.apache.org/jira/browse/SPARK-52571
Project: Spark
Issue Type: Bug
Components: Connect
Affects Versions: 4.0.0
Environment:
*Spark Version:* 4.0.0
*Deployment:* Docker container using {{apache/spark:4.0.0}}
*Client:* Scala application using Spark Connect
*Data Source:* Delta tables on S3 (s3a://)
*Dataset Size:* 20+ million rows

*Docker Configuration:*
{code}
services:
  spark:
    image: apache/spark:4.0.0
    mem_limit: 12g
    environment:
      SPARK_MODE: master
    ports:
      - "15002:15002"  # Spark-Connect gRPC
{code}

*Spark Connect Server Configuration:*
{code}
/opt/spark/sbin/start-connect-server.sh \
  --conf spark.driver.memory=10g \
  --conf spark.driver.maxResultSize=8g \
  --conf spark.connect.execute.reattachable.senderMaxStreamDuration=1200s \
  --conf spark.connect.execute.reattachable.senderMaxStreamSize=2g \
  --conf spark.connect.grpc.maxInboundMessageSize=268435456 \
  --conf spark.connect.grpc.deadline=1200s \
  --conf spark.network.timeout=1200s
{code}

Reporter: Callum Dempsey Leach
Priority: Major
Original Estimate: 24h
Remaining Estimate: 24h

h4. Issue

*ExecuteGrpcResponseSender reports "Deadline Exceeded" when size limits are hit, causing debugging confusion*

I encountered significant debugging trouble when streaming large result sets (20M+ rows) from Delta tables using Spark Connect. The {{ExecuteGrpcResponseSender}} was reporting "Deadline reached" errors, which led me to focus entirely on time-based configuration for hours, when the actual issue was the size limit being exceeded.

The root cause is misleading error messages in [{{ExecuteGrpcResponseSender.scala}}|https://github.com/apache/spark/blob/master/sql/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteGrpcResponseSender.scala]. The termination check covers both the time AND the size limit, but the log message only ever says "Deadline reached":

{code:java}
// Line ~240: The condition checks BOTH time AND size
def deadlineLimitReached =
  sentResponsesSize > maximumResponseSize || deadlineTimeNs < System.nanoTime()

// Line ~320: But the error message only mentions the deadline
logInfo(log"Deadline reached, shutting down stream for opId=...")
{code}

*My Experience:* I was seeing "Deadline reached" errors and spent considerable time adjusting time-based settings like {{senderMaxStreamDuration}}, when the real issue was {{CONNECT_EXECUTE_REATTACHABLE_SENDER_MAX_STREAM_SIZE}} (default 1GB) being exceeded by my large streaming responses.

*Documentation Gap:* The situation was made worse because these critical streaming configurations ({{senderMaxStreamDuration}}, {{senderMaxStreamSize}}) are completely missing from the official Spark Configuration documentation, making it nearly impossible to discover the correct parameters to tune.
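For context, here is a minimal client-side sketch of the kind of read that hit this. The connection string and table path are placeholders rather than my real setup; the point is only that a large {{toLocalIterator}} pull goes through the reattachable execute stream.

{code:scala}
import org.apache.spark.sql.SparkSession

object LargeResultRepro {
  def main(args: Array[String]): Unit = {
    // Spark Connect Scala client; host/port are placeholders.
    val spark = SparkSession.builder()
      .remote("sc://localhost:15002")
      .getOrCreate()

    // Pulling 20M+ rows through the reattachable execute stream. Once the
    // accumulated response size crosses senderMaxStreamSize (1GB by default),
    // the server logs "Deadline reached" even when no time limit was hit.
    val df = spark.read.format("delta").load("s3a://my-bucket/my-table")
    val it = df.toLocalIterator()
    var count = 0L
    while (it.hasNext) { it.next(); count += 1 }
    println(s"Fetched $count rows")

    spark.stop()
  }
}
{code}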
*My Working Solution:*

After identifying both the size and time limit issues, I configured:

{code:java}
--conf spark.connect.execute.reattachable.senderMaxStreamDuration=1200s
--conf spark.connect.execute.reattachable.senderMaxStreamSize=2g
--conf spark.connect.grpc.maxInboundMessageSize=268435456
--conf spark.connect.grpc.deadline=1200s
--conf spark.network.timeout=1200s
{code}

This achieved 500K rows/s throughput from my Delta tables in S3.

h4. Proposed Improvements

*1. Fix Misleading Error Messages (High Priority):*
- Distinguish between time-based and size-based stream termination in log messages
- Include specific configuration suggestions in error messages (see the sketch at the end of this description)
- Example: "Size limit reached (2.1GB > 1GB). Consider increasing spark.connect.execute.reattachable.senderMaxStreamSize"
- Example: "Time deadline reached (300s). Consider increasing spark.connect.execute.reattachable.senderMaxStreamDuration"

*2. Documentation Enhancement (Critical):*
- Add {{senderMaxStreamDuration}} and {{senderMaxStreamSize}} to the official Configuration documentation
- These parameters are essential for large streaming workloads but are completely undocumented
- Provide guidance on sizing timeouts and limits based on expected data volumes
- Include a troubleshooting section for large streaming queries

*3. Default Value Review:*
- Consider increasing the default {{senderMaxStreamDuration}} from 2m to 10m
- Evaluate increasing the default {{senderMaxStreamSize}} from 1GB for large-scale streaming scenarios

h4. Environment
- Spark Version: 4.0.0
- Deployment: Docker container using apache/spark:4.0.0
- Data Source: Delta tables on S3 (s3a://)
- Dataset Size: 20+ million rows
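To make improvement 1 concrete, here is a rough standalone sketch of the kind of message split I have in mind. The object and method names are made up for illustration and are not the actual ExecuteGrpcResponseSender internals; only {{sentResponsesSize}}, {{maximumResponseSize}}, and {{deadlineTimeNs}} come from the excerpt quoted above.

{code:scala}
// Illustrative only: name the limit that actually fired and the config to
// tune, instead of a generic "Deadline reached".
object StreamLimitMessages {
  def message(
      sentResponsesSize: Long,
      maximumResponseSize: Long,
      deadlineTimeNs: Long): String = {
    val sizeLimitReached = sentResponsesSize > maximumResponseSize
    val timeLimitReached = deadlineTimeNs < System.nanoTime()
    if (sizeLimitReached) {
      s"Size limit reached ($sentResponsesSize > $maximumResponseSize bytes). " +
        "Consider increasing spark.connect.execute.reattachable.senderMaxStreamSize"
    } else if (timeLimitReached) {
      "Time deadline reached. " +
        "Consider increasing spark.connect.execute.reattachable.senderMaxStreamDuration"
    } else {
      "Stream still within limits"
    }
  }

  def main(args: Array[String]): Unit = {
    // Example: roughly 2.1GB sent against a 1GB limit, deadline still in the future.
    println(message(
      sentResponsesSize = 2254857830L,
      maximumResponseSize = 1073741824L,
      deadlineTimeNs = System.nanoTime() + 300L * 1000000000L))
  }
}
{code}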