There is no automated way to do this today, but you are on the right track with that hack. If you delete both the entries in _spark_metadata and the corresponding entries in the checkpoint's offsets of the streaming query, it will reprocess the corresponding section of the Kafka stream.
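If it helps, the cleanup could look something like the sketch below. It is untested, the paths, batch ID, and helper are illustrative, and it assumes the standard checkpoint layout (offsets/ and commits/ subdirectories) with the query stopped first. It removes the per-batch log files whose ID is at or past the first batch you want recomputed:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Illustrative locations -- substitute your query's checkpoint and sink dirs.
    val checkpointDir   = new Path("hdfs:///checkpoints/my-query")
    val outputDir       = new Path("hdfs:///data/my-output")
    val replayFromBatch = 42L // first batch ID to recompute

    val fs = FileSystem.get(new Configuration())

    // Delete the per-batch log files in `dir` with batch ID >= replayFromBatch.
    // Log files are named by their numeric batch ID; this skips compacted
    // files (e.g. "9.compact"), which would need separate handling.
    def rollBack(dir: Path): Unit =
      if (fs.exists(dir)) {
        fs.listStatus(dir).foreach { status =>
          val name = status.getPath.getName
          if (name.forall(_.isDigit) && name.toLong >= replayFromBatch) {
            fs.delete(status.getPath, false)
          }
        }
      }

    rollBack(new Path(checkpointDir, "offsets"))
    rollBack(new Path(checkpointDir, "commits"))
    rollBack(new Path(outputDir, "_spark_metadata"))

You will also want to delete the output files those batches wrote (as suggested below, e.g. by last modified time), otherwise the replayed data will sit alongside the old files. A stateful query would additionally need its state/ directory handled, which this sketch ignores.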
On Wed, Sep 20, 2017 at 6:20 PM, Bandish Chheda <bandish.chh...@gmail.com> wrote:
> Hi,
>
> We are using Structured Streaming (Spark 2.2.0) for processing data from
> Kafka. We read from a Kafka topic, do some conversions and computation, and
> then use FileSink to store the data to a partitioned path in HDFS. We have
> enabled checkpointing (using a dir in HDFS).
>
> When there is a bad code push to the streaming job, we want to replay data
> from Kafka (I was able to do that using a custom starting offset). During
> replay, how do I make FileSink overwrite the existing data?
> From the code
> (https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L99)
> it looks like it checks for the latest batchId and skips the processing.
> Any recommended way to avoid that? I am thinking of deleting files and the
> corresponding entries in _spark_metadata based on last modified time (and
> the time from which I want to replay).
>
> Any other better solutions?
>
> Thank you
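For reference, a replay run with explicit starting offsets might look like the sketch below; brokers, topic, and offsets are placeholders. Note that startingOffsets is only honored when the query starts without checkpointed offsets, which is another reason the checkpoint entries have to be rolled back first.

    // Minimal sketch of restarting the read from explicit Kafka offsets
    // (Spark 2.2 Kafka source); `spark` is an existing SparkSession.
    val replayed = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker-1:9092")
      .option("subscribe", "my-topic")
      // JSON map of topic -> partition -> starting offset.
      .option("startingOffsets", """{"my-topic":{"0":12340,"1":12365}}""")
      .load()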