Hi, I have a Spark Streaming application that pumps data from Kafka into HDFS and Elasticsearch. The application runs on a Spark Standalone cluster (in client mode).
Whenever one of the executors fails (or is killed), a new executor for the application is spawned on every node in the cluster. Since the executors have a fairly high memory allocation, this typically leads to a cascading failure. The application uses updateStateByKey and checkpointing (on HDFS).

I was able to reproduce the issue on Spark 1.6.0 and 1.6.1. I also tried to reproduce the behavior with the streaming.HdfsWordCount example, but there the executor restart worked fine.

Thanks,
V.
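
P.S. In case it helps, the relevant part of the application is wired up roughly like this (a simplified sketch only; the app name, topic, broker list, checkpoint path and batch interval are placeholders, not the real values):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object StreamingJob {
  // Placeholder checkpoint location on HDFS.
  val checkpointDir = "hdfs:///checkpoints/streaming-job"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("kafka-to-hdfs-es")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)

    // Direct Kafka stream; broker list and topic are placeholders.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    // Running per-key state, checkpointed to HDFS by Spark.
    val counts = stream
      .map { case (_, value) => (value, 1L) }
      .updateStateByKey[Long]((newValues: Seq[Long], state: Option[Long]) =>
        Some(state.getOrElse(0L) + newValues.sum))

    // Output action; the real job writes to HDFS and Elasticsearch here instead.
    counts.print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint if it exists, otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}

In the real job the output is done via foreachRDD actions that write to HDFS and Elasticsearch, but the stateful part is exactly this updateStateByKey + checkpoint pattern.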