Re: statefulStreaming checkpointing too often

2017-06-02 Thread Tathagata Das
There are two kinds of checkpointing going on here: metadata and data. The 100-second interval that you have configured is the data checkpointing (expensive, large data), where the RDD data is written to HDFS. The 10-second one is the metadata checkpoint (cheap, small data), where the metadata of the q
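
[Illustration, not part of the original message.] The distinction above can be sketched with a small timeline model. This is plain Python, not Spark code. It assumes the metadata checkpoint is written once per batch and the data checkpoint only when the batch time is a multiple of the configured checkpoint interval. The function name and its parameters are hypothetical.

```python
def checkpoint_schedule(batch_interval_s, data_checkpoint_interval_s, num_batches):
    """Model which checkpoint kinds fire at each batch time (in seconds).

    Assumption: a metadata checkpoint accompanies every batch, while a
    data (RDD) checkpoint fires only on multiples of the data interval.
    """
    schedule = []
    for i in range(1, num_batches + 1):
        t = i * batch_interval_s
        kinds = ["metadata"]  # cheap, small: written every batch
        if t % data_checkpoint_interval_s == 0:
            kinds.append("data")  # expensive, large: RDD data written out
        schedule.append((t, kinds))
    return schedule


# Values from the thread: 10-second batches, 100-second data checkpoint interval.
schedule = checkpoint_schedule(10, 100, 20)
data_times = [t for t, kinds in schedule if "data" in kinds]
metadata_times = [t for t, kinds in schedule if "metadata" in kinds]
```

Under these assumptions, checkpoint activity every 10 seconds is expected even though the expensive RDD checkpoint happens only every 100 seconds.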

statefulStreaming checkpointing too often

2017-06-01 Thread David Rosenstrauch
I'm running into a weird issue with a stateful streaming job I'm running. (Spark 2.1.0, reading from a Kafka 0-10 input stream.) From what I understand from the docs, by default the checkpoint interval for stateful streaming is 10 * batchInterval. Since I'm running a batch interval of 10 seconds, I