[ https://issues.apache.org/jira/browse/SPARK-46702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-46702:
----------------------------------
    Priority: Critical  (was: Blocker)

> Spark Cluster Crashing
> ----------------------
>
>                 Key: SPARK-46702
>                 URL: https://issues.apache.org/jira/browse/SPARK-46702
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Spark Docker
>    Affects Versions: 3.4.0, 3.5.0
>            Reporter: Mohamad Haidar
>            Priority: Critical
>              Labels: databricks
>         Attachments: CV62A4~1.LOG, cveshv-events-streaming-TRACE (2).zip,
> cveshv-events-streaming-core-cp-type2-filter-driver-0_zrdm71bnrt201_1201240818.log,
> image-2024-01-12-10-44-45-717.png, image-2024-01-12-10-45-18-905.png,
> image-2024-01-12-10-45-30-398.png, image-2024-01-12-10-45-40-397.png,
> image-2024-01-12-10-45-50-427.png,
> logs_cveshv-events-streaming-core-cp-type2-filter-driver-0_051023_1500_prev.log
>
>
> h3. Description:
>  * We have a Spark cluster running on a Kubernetes (k8s) cluster with one driver and multiple executors (120).
>  * We configure our batch duration to 30 seconds.
>  * The Spark cluster reads from a 120-partition Kafka topic and writes to an hourly index in Elasticsearch.
>  * Elasticsearch has 30 data nodes, with 1 shard per data node for each index.
>  * The configuration of the driver StatefulSet is in the Appendix.
>  * The driver is observed restarting periodically every 10 minutes. The restarts do not necessarily recur every 10 minutes, but when they do happen, they happen every 10 minutes.
>  * The restart frequency increases as the throughput increases.
>  * When the restarts are happening, we see OptionalDataException; the attached
>  "logs_cveshv-events-streaming-core-cp-type2-filter-driver-0_051023_1500_prev.log"
>  is the log leading up to a restart of the driver.
> h3. Analysis:
>  # We ran a test at 250K records/second, and processing times were good, between 15 and 20 seconds.
>  # We were able to avoid all the restarts by simply disabling the liveness checks.
>  # This resulted in NO RESTARTS of Streaming Core. We tried the above in two scenarios:
>  * Speculation disabled --> After 10 to 20 minutes the batch duration increased to minutes and processing eventually became very slow. During this period, the main error logs observed were about {*}The executor with id 7 exited with exit code 50 (Uncaught exception).{*} Logs at WARN level and TRACE level were collected:
>  * {*}WARN{*}: logs attached as
>  "cveshv-events-streaming-core-cp-type2-filter-driver-0_liveness_300000_failed_120124_0336_2.log"
>  * {*}TRACE{*}: logs attached as "cveshv-events-streaming-TRACE (2).zip"
>  * Speculation enabled --> the batch duration increased to minutes (big lag) only after around 2 hours; the related log is
>  "cveshv-events-streaming-core-cp-type2-filter-driver-0_zrdm71bnrt201_1201240818.log".
> h3. Conclusion:
>  * The liveness check is failing and is therefore causing the restarts.
>  * The logs indicate that there are some unhandled exceptions on the executors.
>  * The issue could be somewhere else as well; below is the liveness check that was disabled and that was initially causing a restart every 10 minutes after 3 failed checks.
>
> !image-2024-01-12-10-44-45-717.png!
> h3. Next Action:
>  * Please help us identify the root cause of the issue. We have tried many configurations, with two different Spark versions (3.4 and 3.5), and we are not able to avoid the issue.
>
> Appendix:
>
> !image-2024-01-12-10-45-18-905.png!
> !image-2024-01-12-10-45-30-398.png!
> !image-2024-01-12-10-45-40-397.png!
> !image-2024-01-12-10-45-50-427.png!
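
For orientation, below is a minimal sketch of the kind of pipeline the Description above reports (30-second batches from a 120-partition Kafka topic into an hourly Elasticsearch index). It assumes the DStream API (suggested by the "batch duration" wording), spark-streaming-kafka-0-10 and the elasticsearch-hadoop connector; every topic, index, group and host name here is a placeholder, not taken from the report.

{code:scala}
// Hedged sketch only: shape of the reported job, not the reporter's actual code.
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.elasticsearch.spark.rdd.EsSpark

object EventsStreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("events-streaming-sketch")
      .set("es.nodes", "elasticsearch-host") // placeholder Elasticsearch endpoint
      .set("es.port", "9200")

    // 30-second batch duration, as described in the report.
    val ssc = new StreamingContext(conf, Seconds(30))

    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "kafka-broker:9092", // placeholder
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.GROUP_ID_CONFIG -> "events-streaming",          // placeholder
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "latest"
    )

    // Each Kafka partition becomes one task per batch, so a 120-partition
    // topic lines up with the 120 executors mentioned in the Description.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events-topic"), kafkaParams)
    )

    stream.foreachRDD { rdd =>
      // Assumes the Kafka value is already a JSON document containing a
      // @timestamp field; elasticsearch-hadoop's {field|format} pattern is one
      // way to route records into an hourly index. The real job presumably
      // applies its own filtering/transformation first.
      EsSpark.saveJsonToEs(rdd.map(_.value()), "events-{@timestamp|yyyy-MM-dd-HH}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}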
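The two Analysis scenarios differ only in whether task speculation is turned on. For reference, these are the standard Spark properties involved; the values shown are the stock defaults from the Spark configuration documentation, not the reporter's configuration.

{code:scala}
// Speculation-related settings toggled between the two scenarios above.
val conf = new org.apache.spark.SparkConf()
  .set("spark.speculation", "true")            // "Speculation enabled" scenario; "false" disables it
  .set("spark.speculation.interval", "100ms")  // how often Spark checks for slow tasks
  .set("spark.speculation.multiplier", "1.5")  // a task is "slow" if slower than 1.5x the median
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before checking
{code}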