Callum Dempsey Leach created SPARK-52571:
--------------------------------------------
             Summary: ExecuteGrpcResponseSender Deadline Exceeded Occurs when Size Limits are hit
                 Key: SPARK-52571
                 URL: https://issues.apache.org/jira/browse/SPARK-52571
             Project: Spark
          Issue Type: Bug
          Components: Connect
    Affects Versions: 4.0.0
            Reporter: Callum Dempsey Leach
             Fix For: 4.0.1

h2. Environment

*Spark Version:* 4.0.0
*Deployment:* Docker container using {{apache/spark:4.0.0}}
*Client:* Scala application using Spark Connect
*Data Source:* Delta tables on S3 (s3a://)
*Dataset Size:* 20+ million rows

*Docker Configuration:*
{code:yaml}
services:
  spark:
    image: apache/spark:4.0.0
    mem_limit: 12g
    environment:
      SPARK_MODE: master
    ports:
      - "15002:15002"  # Spark Connect gRPC
{code}

*Spark Connect Server Configuration:*
{code:bash}
/opt/spark/sbin/start-connect-server.sh \
  --conf spark.driver.memory=10g \
  --conf spark.driver.maxResultSize=8g \
  --conf spark.connect.execute.reattachable.senderMaxStreamDuration=1200s \
  --conf spark.connect.execute.reattachable.senderMaxStreamSize=2g \
  --conf spark.connect.grpc.maxInboundMessageSize=268435456 \
  --conf spark.connect.grpc.deadline=1200s \
  --conf spark.network.timeout=1200s
{code}

h4. Issue

When streaming large result sets (20M+ rows) from Delta tables using Spark Connect, {{ExecuteGrpcResponseSender}} hits {{DEADLINE_EXCEEDED}} errors because of the default timeout configuration and misleading error handling.

1. The default {{senderMaxStreamDuration}} of 2 minutes is inadequate for long-running queries that need to stream substantial amounts of data back to the client. This setting is missing from the Configuration docs and should be documented.
2. Even after configuring it, I was still running into problems. Looking at the source of [{{ExecuteGrpcResponseSender.scala}}|https://github.com/apache/spark/blob/master/sql/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteGrpcResponseSender.scala], the termination check covers both the time limit and the size limit, but only ever reports "Deadline reached":

{code:java}
// Line ~240: the condition checks BOTH time AND size
def deadlineLimitReached =
  sentResponsesSize > maximumResponseSize || deadlineTimeNs < System.nanoTime()
{code}

So users see "Deadline reached" even when the stream was actually cut off by the size limit. As a user, I was focused on the time-based configurations when the real issue was also the size-based limit {{CONNECT_EXECUTE_REATTACHABLE_SENDER_MAX_STREAM_SIZE}}. I resolved it by configuring some gnarly values, giving me a cool 500K rows/s of throughput from my Delta table in S3:

{code:bash}
--conf spark.connect.execute.reattachable.senderMaxStreamDuration=1200s
--conf spark.connect.execute.reattachable.senderMaxStreamSize=2g
--conf spark.connect.grpc.maxInboundMessageSize=268435456
--conf spark.connect.grpc.deadline=1200s
--conf spark.network.timeout=1200s
{code}
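For reference, here is a minimal client-side sketch of the kind of job that exercises this path. It is illustrative only, not my actual application: the object name and table path are made up, and it assumes the {{spark-connect-client-jvm}} artifact plus Delta support configured on the connect server.

{code:scala}
// Illustrative Spark Connect client, not the reporter's actual application.
import org.apache.spark.sql.SparkSession

object StreamLargeResult {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .remote("sc://localhost:15002") // port exposed by the Docker service above
      .getOrCreate()

    // Hypothetical 20M+ row Delta table on S3.
    val df = spark.read.format("delta").load("s3a://example-bucket/example-table")

    // Streaming all rows back to the client is what pushes the response stream
    // past the default senderMaxStreamDuration / senderMaxStreamSize limits.
    var rows = 0L
    val it = df.toLocalIterator()
    while (it.hasNext) {
      it.next()
      rows += 1
    }
    println(s"Streamed $rows rows")

    spark.stop()
  }
}
{code}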
h4. Proposed Improvements

# *Documentation Enhancement*:
** Add clear documentation about timeout configurations for long-running streaming queries
** Provide guidance on sizing timeouts based on expected data volumes
# *Default Value Review*:
** Consider increasing the default {{senderMaxStreamDuration}} from 2m to a more practical value (e.g., 10m)
** Evaluate whether the default {{senderMaxStreamSize}} of 1GB is adequate for large-scale streaming scenarios
# *Better Error Messages*:
** Fix misleading "Deadline reached" messages that can be triggered by size limits (see the sketch after this list)
** Distinguish between time-based and size-based stream termination in log messages
** Improve error messages to clearly indicate whether a stream was terminated by the time limit or the size limit
** Suggest the specific configuration parameters to adjust in timeout/size error messages
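To make the last point concrete, here is a rough, self-contained sketch of how the termination message could distinguish the two causes. Only the condition mirrors the {{deadlineLimitReached}} check quoted above; the object, case class, and method names are illustrative, not the actual server code.

{code:scala}
// Sketch of the proposed message split, not the real ExecuteGrpcResponseSender code.
object DeadlineMessageSketch {
  // Example of the state a sender would track (field names mirror the quoted condition).
  final case class SenderState(
      sentResponsesSize: Long,   // bytes sent so far
      maximumResponseSize: Long, // from senderMaxStreamSize
      deadlineTimeNs: Long)      // absolute deadline from senderMaxStreamDuration

  def sizeLimitReached(s: SenderState): Boolean =
    s.sentResponsesSize > s.maximumResponseSize

  def timeLimitReached(s: SenderState, nowNs: Long = System.nanoTime()): Boolean =
    s.deadlineTimeNs < nowNs

  // One message per cause instead of a blanket "Deadline reached".
  def terminationMessage(s: SenderState): String =
    if (sizeLimitReached(s)) {
      s"Stream size limit reached (sent ${s.sentResponsesSize} of max ${s.maximumResponseSize} bytes); " +
        "consider raising spark.connect.execute.reattachable.senderMaxStreamSize."
    } else if (timeLimitReached(s)) {
      "Stream duration limit reached; " +
        "consider raising spark.connect.execute.reattachable.senderMaxStreamDuration."
    } else {
      "Stream still within its limits."
    }

  def main(args: Array[String]): Unit = {
    // Simulate a sender that blew through a 1 GiB size limit but not the deadline.
    val hitSizeLimit = SenderState(
      sentResponsesSize = 2L * 1024 * 1024 * 1024,
      maximumResponseSize = 1L * 1024 * 1024 * 1024,
      deadlineTimeNs = Long.MaxValue)
    println(terminationMessage(hitSizeLimit)) // reports the size limit, not "Deadline reached"
  }
}
{code}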