Hi,

I've got three questions/issues regarding checkpointing and was hoping someone
could help shed some light on them.

We've got a Spark Streaming consumer consuming data from a Kafka topic. It
generally works fine until I switch it to checkpointing mode by calling the
'checkpoint' method on the streaming context and pointing it at a directory
in HDFS.
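
For context, our setup looks roughly like the sketch below (simplified and
untested; the app, ZK quorum, group, topic, and path names are all made up):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object OurConsumer {
      val checkpointDir = "hdfs:///user/dmitry/checkpoints" // made-up path

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("our-consumer")
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint(checkpointDir)
        // Receiver-based Kafka stream (made-up quorum / group / topic).
        val stream = KafkaUtils.createStream(
          ssc, "zk-host:2181", "our-group", Map("our-topic" -> 1))
        stream.map(_._2).print() // stand-in for our real transformations/actions
        ssc
      }

      def main(args: Array[String]): Unit = {
        // Recover from the checkpoint if one exists, else build a fresh context.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }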

I can see that files get written to that directory; however, I don't see new
Kafka content being processed.

*Question 1.* Is it possible that the checkpointed consumer has an incorrect
view of where the offsets are on the topic, and how could I troubleshoot
that? Is it possible that some "confusion" arises when a consumer is switched
back and forth between checkpointed and non-checkpointed modes? How could we
tell?
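
One troubleshooting idea: if we switched to the direct stream API, I believe
we could log the offset ranges each batch actually covers and compare them
with what the checkpoint believes has been consumed, roughly like this
(untested sketch; broker and topic names are made up):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    // Log the offset range each batch covers, per topic/partition.
    def logOffsets(ssc: StreamingContext): Unit = {
      val kafkaParams = Map("metadata.broker.list" -> "broker:9092") // made up
      val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("our-topic"))
      stream.foreachRDD { rdd =>
        val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        ranges.foreach { r =>
          println(s"topic=${r.topic} partition=${r.partition} " +
            s"from=${r.fromOffset} until=${r.untilOffset}")
        }
      }
    }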

*Question 2.* About spark.streaming.receiver.writeAheadLog.enable, which
defaults to false. The documentation says: "All the input data received
through receivers will be saved to write ahead logs that will allow it to be
recovered after driver failures." So if we don't set this to true, what
*will* get saved into the checkpoint, and what data *will* be recovered when
the driver restarts?
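
For reference, my understanding is that turning the WAL on is just a conf
setting alongside the checkpoint directory, e.g.:

    import org.apache.spark.SparkConf

    // Enable the receiver write-ahead log; checkpointing must also be set
    // on the streaming context for the WAL to have somewhere to live.
    val conf = new SparkConf()
      .setAppName("our-consumer")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")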

*Question 3.* We want the RDDs to be treated as successfully processed only
once we have performed all the necessary transformations and actions on the
data. By default, will Spark Streaming's checkpointing simply mark the topic
offsets as processed once the data has been received by Spark, or only once
the data has been successfully processed by the driver and the workers? If
the former, how can we configure checkpointing to do the latter?
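
To make concrete what we're after, the ordering would be something like the
sketch below (again assuming the direct stream; processBatch and saveOffsets
are hypothetical placeholders for our transformations/actions and for our own
offset store):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    // The ordering we want: run the full batch first, and record the
    // offsets as processed only after the batch succeeds.
    def processThenCommit(stream: DStream[(String, String)]): Unit = {
      stream.foreachRDD { rdd =>
        val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        processBatch(rdd)   // hypothetical: all our transformations + actions
        saveOffsets(ranges) // hypothetical: persist offsets to our own store
      }
    }

    def processBatch(rdd: RDD[(String, String)]): Unit = ??? // our real work
    def saveOffsets(ranges: Array[OffsetRange]): Unit = ???  // e.g. ZK or a DB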

Thanks,
- Dmitry
