[ https://issues.apache.org/jira/browse/SPARK-46702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-46702:
----------------------------------
    Priority: Critical  (was: Blocker)

> Spark Cluster Crashing
> ----------------------
>
>                 Key: SPARK-46702
>                 URL: https://issues.apache.org/jira/browse/SPARK-46702
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Spark Docker
>    Affects Versions: 3.4.0, 3.5.0
>            Reporter: Mohamad Haidar
>            Priority: Critical
>              Labels: databricks
>         Attachments: CV62A4~1.LOG, cveshv-events-streaming-TRACE (2).zip, 
> cveshv-events-streaming-core-cp-type2-filter-driver-0_zrdm71bnrt201_1201240818.log,
>  image-2024-01-12-10-44-45-717.png, image-2024-01-12-10-45-18-905.png, 
> image-2024-01-12-10-45-30-398.png, image-2024-01-12-10-45-40-397.png, 
> image-2024-01-12-10-45-50-427.png, 
> logs_cveshv-events-streaming-core-cp-type2-filter-driver-0_051023_1500_prev.log
>
>
> h3. Description:
>  * We have a Spark cluster installed on a Kubernetes (k8s) cluster with one 
> driver and 120 executors.
>  * The batch duration is configured to 30 seconds.
>  * The Spark cluster reads from a 120-partition Kafka topic and writes to an 
> hourly index in Elasticsearch.
>  * Elasticsearch has 30 data nodes, with 1 shard per data node for each index.
>  * The configuration of the driver StatefulSet (STS) is in the Appendix.
>  * The driver is observed restarting periodically. The restarts do not always 
> occur, but when they do, they happen every 10 minutes.
>  * The restart frequency increases as the throughput increases.
>  * When the restarts happen, we see OptionalDataException; the attached 
> “logs_cveshv-events-streaming-core-cp-type2-filter-driver-0_051023_1500_prev.log”
>  is the log of a driver restart.
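> As a hedged sketch only, the deployment described above might be submitted to 
> Kubernetes along the following lines (the API server address, container 
> image, and application jar are placeholders, not values from this report; the 
> executor count and speculation flag are the ones discussed here):
>
> ```shell
> # Placeholder submission sketch; not the reporter's actual command.
> spark-submit \
>   --master k8s://https://<k8s-apiserver>:443 \
>   --deploy-mode cluster \
>   --conf spark.kubernetes.container.image=<streaming-core-image> \
>   --conf spark.executor.instances=120 \
>   --conf spark.speculation=false \
>   local:///opt/app/<streaming-core>.jar
> ```
>
> One executor per Kafka partition (120 of each) is a common sizing choice; 
> spark.speculation is the flag toggled between the two scenarios in the 
> Analysis section below.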
> h3. Analysis:
>  # We ran a test at 250K records/second, and processing was healthy, taking 
> between 15 and 20 seconds per batch.
>  # We were able to avoid all the restarts by simply disabling the liveness 
> checks.
>  # This resulted in NO RESTARTS of Streaming Core. We tried the above in two 
> scenarios:
>  * Speculation disabled --> After 10 to 20 minutes the batch duration 
> increased to minutes and processing eventually became very slow. During this 
> period, the main error logged is {*}The executor with id 7 exited with exit 
> code 50 (Uncaught exception).{*} Logs at WARN and TRACE level were collected:
>  * {*}WARN{*}: Logs attached 
> “cveshv-events-streaming-core-cp-type2-filter-driver-0_liveness_300000_failed_120124_0336_2.log”
>  * {*}TRACE{*}: Logs attached “cveshv-events-streaming-TRACE (2).zip”
>  * Speculation enabled --> The batch duration increased to minutes (a large 
> lag) only after around 2 hours; the related log is 
> “cveshv-events-streaming-core-cp-type2-filter-driver-0_zrdm71bnrt201_1201240818.log”.
> h3. Conclusion:
>  * The liveness check is failing and is thus causing the restarts.
>  * The logs indicate that there are unhandled exceptions in the executors.
>  * The issue may lie elsewhere as well; below is the liveness check that was 
> disabled and that was initially causing the restarts every 10 minutes, after 
> 3 consecutive failures.
>  
> !image-2024-01-12-10-44-45-717.png!
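> For reference, a Kubernetes driver liveness probe of the general shape shown 
> in the screenshot above would look like the following hedged sketch (the 
> endpoint, port, and timing values here are hypothetical illustrations; the 
> actual values are in the screenshot):
>
> ```yaml
> # Hypothetical probe sketch on the driver StatefulSet pod template.
> livenessProbe:
>   httpGet:
>     path: /            # placeholder health endpoint
>     port: 4040         # Spark driver UI default port
>   periodSeconds: 200   # illustrative: 3 failures x ~200s is roughly 10 min
>   failureThreshold: 3  # pod is restarted after 3 consecutive failures
> ```
>
> With 3 consecutive failures required before kubelet restarts the pod, a 
> probe of this shape would reproduce the "every 10 minutes after 3 
> occurrences" cadence described above.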
> h3. Next Action:
>  * Please help us identify the root cause (RC) of the issue. We have tried 
> many configurations, with two different Spark versions (3.4 and 3.5), and we 
> are not able to avoid the issue.
>  
> h3. Appendix:
>  
> !image-2024-01-12-10-45-18-905.png!
> !image-2024-01-12-10-45-30-398.png!
> !image-2024-01-12-10-45-40-397.png!
> !image-2024-01-12-10-45-50-427.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
