Hi,

We are using Structured Streaming (Spark 2.2.0) to process data from
Kafka. We read from a Kafka topic, do some conversions and computation,
and then use the FileSink to store the data to a partitioned path in HDFS.
Checkpointing is enabled (using a directory in HDFS).
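
For context, here is roughly what the pipeline looks like (the topic, paths,
and the transformation are made-up placeholders, not our actual job):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()
import spark.implicits._

// Read from Kafka
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")                      // placeholder topic
  .load()

// Stand-in for our conversions/computation
val transformed = input
  .selectExpr("CAST(value AS STRING) AS json", "timestamp")
  .withColumn("date", to_date($"timestamp"))

// Write to a partitioned path in HDFS via the FileSink, checkpoint in HDFS
val query = transformed.writeStream
  .format("parquet")
  .option("path", "hdfs:///warehouse/events")
  .option("checkpointLocation", "hdfs:///checkpoints/events")
  .partitionBy("date")
  .start()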

For cases where a bad code push goes out to the streaming job, we want to
replay data from Kafka (I was able to do that using custom starting offsets).
During replay, how do I make the FileSink overwrite the existing data?
From the code (
https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L99)
it looks like it checks the latest batchId and skips already-committed
batches. Is there a recommended way to avoid that? I am thinking of deleting
the output files and the corresponding entries in _spark_metadata based on
last modified time (and the time from which I want to replay).
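
To make the idea concrete, here is a rough sketch of that cleanup (not
something we have run yet; the paths and cutoff timestamp are placeholders,
and compacted *.compact metadata files would need extra care since they also
carry entries for older batches):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val outputDir = new Path("hdfs:///warehouse/events")   // sink output path (placeholder)
val replayFromMillis = 1509494400000L                  // time from which we want to replay

// Walk the sink directory recursively; this covers both the partitioned data
// files and the batch log files under _spark_metadata.
val files = fs.listFiles(outputDir, true)
while (files.hasNext) {
  val status = files.next()
  // Drop anything written at or after the replay point so the sink no longer
  // treats those batch ids as committed and will process them again.
  if (status.getModificationTime >= replayFromMillis) {
    fs.delete(status.getPath, false)
  }
}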

Is there a better solution?

Thank you
