Unless it's some sink metadata to be maintained by the framework (e.g. sink
state that needs to be passed back to the sink), would it make sense
to keep it under the checkpoint dir?

Maybe I am missing the motivation of the proposed approach, but I guess
the sink mostly needs to store the last seen batchId to discard duplicate
data during a batch replay. It would be ideal for the sink to store this
information in the external store (along with the data) for de-duplication
to work correctly.
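
To make the idea concrete, here is a minimal sketch of what I mean. It is
plain Python and does not use any Spark API; ExternalStore, IdempotentSink,
commit and add_batch are hypothetical names made up for illustration, and
the in-memory store only stands in for a real transactional one.

```python
from dataclasses import dataclass, field


@dataclass
class ExternalStore:
    """Hypothetical in-memory stand-in for a transactional external store."""
    rows: list = field(default_factory=list)
    last_batch_id: int = -1  # -1 means no batch has been committed yet

    def commit(self, batch_id, rows):
        # A real store would write the rows and the batch id in a single
        # transaction so that the two can never get out of sync.
        self.rows.extend(rows)
        self.last_batch_id = batch_id


class IdempotentSink:
    """Hypothetical sink that discards batches it has already committed."""

    def __init__(self, store):
        self.store = store

    def add_batch(self, batch_id, rows):
        # A replayed batch carries an id the store has already recorded.
        if batch_id <= self.store.last_batch_id:
            return False  # duplicate, skipped
        self.store.commit(batch_id, rows)
        return True


store = ExternalStore()
sink = IdempotentSink(store)
sink.add_batch(0, ["a", "b"])
sink.add_batch(1, ["c"])
replayed = sink.add_batch(1, ["c"])  # replay of batch 1 after a simulated failure
```

Because the batchId lives next to the data, the replayed batch is recognized
and skipped even if the checkpoint directory and the store disagree about
what was last written.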

Thanks,
Arun



On Mon, 25 Feb 2019 at 22:13, Jungtaek Lim <kabh...@gmail.com> wrote:

> Hi devs,
>
> I was about to give it a try, but since it relates to DSv2 I decided to
> start a new thread before doing any actual work. I also don't think this
> needs to be part of the DSv2 discussion, since the change would be minor.
>
> While dealing with SPARK-24295 [1] and SPARK-26411 [2], I felt the need
> to keep sink metadata in the checkpoint directory. Unlike a source,
> whose metadata directory is provided as a subdirectory of the checkpoint
> directory, a sink doesn't receive its own metadata directory.
>
> For example, FileStreamSink creates its metadata directory in the output
> directory (this is somewhat intentional, so that it can be shared between
> queries), but sometimes we may want to couple it with the query checkpoint.
>
> What do you think about passing a metadata path to the sink (we have only
> one sink per query) so that sink metadata can be coupled with the query
> checkpoint?
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. https://issues.apache.org/jira/browse/SPARK-24295
> 2. https://issues.apache.org/jira/browse/SPARK-26411
>
>
