Re: [Structured Streaming] How to replay data and overwrite using FileSink

2017-09-22 Thread Michael Armbrust
There is no automated way to do this today, but you are on the right track for a hack. If you delete both the entries in _spark_metadata and the corresponding entries from the checkpoint/offsets of the streaming query, it will reprocess the corresponding section of the Kafka stream.
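The hack amounts to trimming matching batch IDs from both the sink's _spark_metadata directory and the query's checkpoint, then restarting the query so it re-reads those offsets from Kafka. Below is a minimal local simulation in Python; the paths, batch numbers, and the presence of a commits directory are illustrative assumptions (on a real deployment these files live in HDFS and you would remove them with `hdfs dfs -rm` while the query is stopped):

```python
import os
import tempfile

# Simulate the layout of a FileSink output dir and a streaming checkpoint.
# Each completed micro-batch leaves one file named after its batch ID.
root = tempfile.mkdtemp()
dirs = {
    "metadata": os.path.join(root, "output", "_spark_metadata"),
    "offsets": os.path.join(root, "checkpoint", "offsets"),
    "commits": os.path.join(root, "checkpoint", "commits"),  # assumption: present in newer Spark versions
}
for d in dirs.values():
    os.makedirs(d)
    for batch in range(3):  # pretend batches 0, 1, 2 have completed
        open(os.path.join(d, str(batch)), "w").close()

def replay_from(batch_id):
    """Delete all entries for batches >= batch_id so the restarted
    query treats them as never having run."""
    for d in dirs.values():
        for name in os.listdir(d):
            if name.isdigit() and int(name) >= batch_id:
                os.remove(os.path.join(d, name))

replay_from(1)  # keep batch 0; batches 1 and 2 will be re-read from Kafka

print(sorted(os.listdir(dirs["offsets"])))  # ['0']
```

The key point is that the sink metadata and the checkpoint must be trimmed to the same batch ID; if they disagree, the restarted query can either skip data or write duplicate files that downstream readers of _spark_metadata will then ignore.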

[Structured Streaming] How to replay data and overwrite using FileSink

2017-09-20 Thread Bandish Chheda
Hi, We are using Structured Streaming (Spark 2.2.0) for processing data from Kafka. We read from a Kafka topic, do some conversions and computation, and then use FileSink to store data to a partitioned path in HDFS. We have enabled checkpointing (using a dir in HDFS). For cases when there is a bad code push …