Hi, I've got three questions/issues regarding checkpointing; I was hoping someone could help shed some light on them.
We've got a Spark Streaming consumer reading data from a Kafka topic. It generally works fine until I switch it to checkpointing mode by calling the 'checkpoint' method on the streaming context and pointing it at a directory in HDFS. I can see that files get written to that directory; however, I don't see new Kafka content being processed.

*Question 1.* Is it possible that the checkpointed consumer is off base in its understanding of where the offsets are on the topic, and how could I troubleshoot that? Is it possible that some "confusion" arises if a consumer is switched back and forth between checkpointed and non-checkpointed modes? How could we tell?

*Question 2.* About spark.streaming.receiver.writeAheadLog.enable: by default this is false. The docs say: "All the input data received through receivers will be saved to write ahead logs that will allow it to be recovered after driver failures." So if we don't set this to true, what *will* get saved into the checkpoint, and what data *will* be recovered when the driver restarts?

*Question 3.* We want the RDDs to be treated as successfully processed only once we have performed all the necessary transformations and actions on the data. By default, will Spark Streaming checkpointing mark the topic offsets as processed as soon as the data has been received by Spark, or only once the data has been successfully processed by the driver and the workers? If the former, how can we configure checkpointing to do the latter?

Thanks,
- Dmitry
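P.S. In case it helps, below is a minimal sketch of roughly what our consumer does (Scala, Spark 1.x receiver-based Kafka API; the app name, ZooKeeper quorum, consumer group, topic, batch interval, and checkpoint path are all placeholders, not our real values):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object OurConsumer {
  // Placeholder checkpoint directory in HDFS
  val checkpointDir = "hdfs:///user/dmitry/spark-checkpoints"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("OurKafkaConsumer")
    // Question 2 refers to this setting, which we have left at its default (false):
    // conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)

    // Receiver-based Kafka stream: (ZK quorum, consumer group, topic -> receiver threads)
    val messages = KafkaUtils.createStream(
      ssc, "zkhost:2181", "our-group", Map("our-topic" -> 1))

    messages.map(_._2).foreachRDD { rdd =>
      // ... all of our transformations and actions happen here ...
      println(s"Batch size: ${rdd.count()}")
    }
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, getOrCreate is supposed to rebuild the context from the
    // checkpoint directory instead of invoking createContext again.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}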