I have a Spark Streaming job that runs great the first time around (Elastic MapReduce 4.1.0). When it recovers from a checkpoint in S3, though, the job still runs, but Spark itself seems to be jacked-up in lots of little ways:
- Executors, which are normally stable for days, are terminated within a couple of hours. I can see the termination notices in the logs, but no related exceptions. The nodes stay active in YARN, but Spark doesn't pick them up again.
- The Hadoop web proxy can't find the Spark web UI ("no route to host").
- When I do get to the web UI, the Streaming tab is missing.
- The web UI appears to stop updating after a few thousand jobs.

I'm kind of at wit's end here. I've been banging my head against this for a couple of weeks now, and any help would be greatly appreciated. Below is the configuration that I'm sending to EMR.

- dpk

```json
[
  {
    "Classification": "emrfs-site",
    "Properties": { "fs.s3.consistent": "true" }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.default.parallelism": "8",
      "spark.dynamicAllocation.enabled": "true",
      "spark.dynamicAllocation.minExecutors": "1",
      "spark.executor.cores": "4",
      "spark.executor.memory": "4148M",
      "spark.streaming.receiver.writeAheadLog.enable": "true",
      "spark.yarn.executor.memoryOverhead": "460"
    }
  },
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": { "SPARK_YARN_MODE": "true" }
      }
    ]
  },
  {
    "Classification": "spark-log4j",
    "Properties": { "log4j.rootCategory": "WARN,console" }
  }
]
```
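For context, the driver follows the standard `StreamingContext.getOrCreate` recovery pattern, roughly like the sketch below. The checkpoint path, app name, and batch interval here are placeholders, and the actual DStream graph is elided:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingJob {
  // Placeholder path; the real job checkpoints to its own bucket.
  val checkpointDir = "s3://my-bucket/checkpoints/streaming-job"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("streaming-job")
    val ssc = new StreamingContext(conf, Seconds(30))
    ssc.checkpoint(checkpointDir)
    // Sources, transformations, and output operations go here.
    ssc
  }

  def main(args: Array[String]): Unit = {
    // First run: createContext() builds a fresh context.
    // On restart, the context (DStream graph and SparkConf included)
    // is deserialized from the checkpoint data in S3 instead.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The first run, the one that actually goes through `createContext()`, behaves perfectly; all of the symptoms above only show up on the run that recovers from the checkpoint.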