I have a Spark Streaming job that runs great the first time around (Elastic MapReduce 4.1.0). When it recovers from a checkpoint in S3, though, the job still runs, but Spark itself seems to be jacked-up in lots of little ways:
- Executors, which are normally stable for days, are terminated within a couple of hours. I can see the termination notices in the logs, but no related exceptions. The nodes stay active in YARN, but Spark doesn't pick them up again.
- The Hadoop web proxy can't find the Spark web UI ("no route to host").
- When I do get to the web UI, the Streaming tab is missing.
- The web UI appears to stop updating after a few thousand jobs.

I'm kind of at wit's end here. I've been banging my head against this for a couple of weeks now, and any help would be greatly appreciated. Below is the configuration that I'm sending to EMR.

- dpk

```json
[
  {
    "Classification": "emrfs-site",
    "Properties": { "fs.s3.consistent": "true" }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.default.parallelism": "8",
      "spark.dynamicAllocation.enabled": "true",
      "spark.dynamicAllocation.minExecutors": "1",
      "spark.executor.cores": "4",
      "spark.executor.memory": "4148M",
      "spark.streaming.receiver.writeAheadLog.enable": "true",
      "spark.yarn.executor.memoryOverhead": "460"
    }
  },
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": { "SPARK_YARN_MODE": "true" }
      }
    ]
  },
  {
    "Classification": "spark-log4j",
    "Properties": { "log4j.rootCategory": "WARN,console" }
  }
]
```
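For context, the driver follows the standard `StreamingContext.getOrCreate` recovery pattern, roughly like the sketch below. The checkpoint path, app name, and batch interval here are placeholders, and the actual DStream graph is elided:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingJob {
  // Placeholder path; the real job checkpoints to its own bucket.
  val checkpointDir = "s3://my-bucket/checkpoints/streaming-job"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("streaming-job")
    val ssc = new StreamingContext(conf, Seconds(30))
    ssc.checkpoint(checkpointDir)
    // Sources, transformations, and output operations go here.
    ssc
  }

  def main(args: Array[String]): Unit = {
    // First run: createContext() builds a fresh context.
    // On restart, the context (DStream graph and SparkConf included)
    // is deserialized from the checkpoint data in S3 instead.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The first run, the one that actually goes through `createContext()`, behaves perfectly; all of the symptoms above only show up on the run that recovers from the checkpoint.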