There is no automated way to do this today, but you are on the right track
for a hack.  If you delete both the entries in _spark_metadata and the
corresponding entries in the checkpoint's offsets log for the streaming
query, it will reprocess the corresponding section of the Kafka stream.

On Wed, Sep 20, 2017 at 6:20 PM, Bandish Chheda <bandish.chh...@gmail.com>
wrote:

> Hi,
>
> We are using StructuredStreaming (Spark 2.2.0) for processing data from
> Kafka. We read from a Kafka topic, do some conversions, computation and
> then use FileSink to store data to partitioned path in HDFS. We have
> enabled checkpoint (using a dir in HDFS).
>
> For cases when there is a bad code push to the streaming job, we want to
> replay data from Kafka (I was able to do that using a custom starting
> offset). During replay, how do I make the FileSink overwrite the
> existing data?
> From the code (https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L99)
> it looks like it checks for the latest batchId and skips the processing.
> Any recommended way to avoid that? I am thinking of deleting files and the
> corresponding entries in _spark_metadata based on last-modified time (and
> the time from which I want to replay).
>
> Any other better solutions?
>
> Thank you
>
