I guess that's roughly it.

As of now there's no built-in support to coordinate the commits across the
executors in an atomic way, so you need to commit the batch (a global commit)
at the driver.
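
To make that concrete, here is a rough sketch of how that usually looks with
the 2.3.x DataSourceV2 write path: each task's DataWriter only stages data
and returns a WriterCommitMessage, and the driver's StreamWriter.commit is
the single place where the batch is published. The class names, the staging
approach and the commented-out helpers are placeholders I made up; only
DataWriter, DataWriterFactory, WriterCommitMessage and StreamWriter are the
actual interfaces, and the factory signature shown is the 2.3.x one (it
changed in later releases).

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.v2.writer.{DataWriter, DataWriterFactory, WriterCommitMessage}
    import org.apache.spark.sql.sources.v2.writer.streaming.StreamWriter

    // Commit message sent from each task back to the driver; it carries just
    // enough information for the driver to publish (or discard) that task's
    // staged output. The fields here are illustrative only.
    case class StagedFiles(partitionId: Int, paths: Seq[String]) extends WriterCommitMessage

    // Runs on executors: writes records to a staging location only.
    // Nothing becomes visible to readers until the driver commits.
    class StagingDataWriter(partitionId: Int, attemptNumber: Int) extends DataWriter[Row] {
      private val staged = scala.collection.mutable.ArrayBuffer[String]()

      override def write(record: Row): Unit = {
        // Hypothetical helper: append the row to a staging file keyed by
        // (partitionId, attemptNumber) so a failed attempt can be discarded.
        // staged += StagingStore.append(partitionId, attemptNumber, record)
      }

      // Task-level commit: report what was staged, but do NOT publish it.
      override def commit(): WriterCommitMessage = StagedFiles(partitionId, staged)

      // Task-level abort: clean up this attempt's staged output.
      override def abort(): Unit = staged.clear()
    }

    class StagingWriterFactory extends DataWriterFactory[Row] {
      // Signature as in Spark 2.3.x.
      override def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[Row] =
        new StagingDataWriter(partitionId, attemptNumber)
    }

    // Runs on the driver: the only place where the batch becomes visible.
    class MyStreamWriter extends StreamWriter {
      override def createWriterFactory(): DataWriterFactory[Row] = new StagingWriterFactory

      override def commit(epochId: Long, messages: Array[WriterCommitMessage]): Unit = {
        // Global commit: atomically publish everything the tasks staged for
        // this epoch (e.g. rename the staging files, or swap in a manifest).
        // publish(epochId, messages.collect { case s: StagedFiles => s })
      }

      override def abort(epochId: Long, messages: Array[WriterCommitMessage]): Unit = {
        // Discard whatever was staged for this epoch.
      }
    }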

And when a batch is replayed, if any of the intermediate operations are not
idempotent or have side effects, the result produced during the replay may
differ from what was originally committed; that replayed result should be
ignored.
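
One way to get that "ignore the replay" behaviour is to make the driver-side
commit idempotent on the epoch id: persist the last committed epochId
together with the data, and skip any epoch you have already committed. A
minimal sketch, assuming hypothetical readLastCommittedEpoch and
publishAtomically helpers backed by whatever transactional store your sink
uses:

    import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage

    // Driver-side idempotent commit (sketch only; helpers are hypothetical).
    class IdempotentCommitter {
      // Stand-ins for the sink's durable, transactional storage.
      private def readLastCommittedEpoch(): Long = ???
      private def publishAtomically(epochId: Long,
                                    messages: Array[WriterCommitMessage]): Unit = ???

      def commit(epochId: Long, messages: Array[WriterCommitMessage]): Unit = {
        if (epochId <= readLastCommittedEpoch()) {
          // Replay of an epoch that was already committed: even if the
          // recomputed output differs (non-deterministic operations or side
          // effects upstream), ignore it and keep the original commit.
        } else {
          // Publish the staged output and record the new epochId in one
          // atomic action (single transaction, atomic rename, manifest swap).
          publishAtomically(epochId, messages)
        }
      }
    }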

Thanks,
Arun

On Wed, 5 Dec 2018 at 11:36, Eric Wohlstadter <wohls...@gmail.com> wrote:

> Hi all,
>  We are working on implementing a streaming sink on 2.3.1 with the
> DataSourceV2 APIs.
>
> Can anyone help check if my understanding is correct, with respect to the
> failure modes which need to be covered?
>
> We are assuming that a Reliable Receiver (such as Kafka) is used as the
> stream source. And we only want to support micro-batch execution at this
> time (not yet Continuous Processing).
>
> I believe the possible failures that need to be covered are:
>
> 1. Task failure: If a task fails, it may have written data to the sink
> output before failure. Subsequent attempts for a failed task must be
> idempotent, so that no data is duplicated in the output.
> 2. Driver failure: If the driver fails, upon recovery, it might replay a
> micro-batch that was already seen by the sink (if a failure occurs after
> the sink has committed output but before the driver has updated the
> checkpoint). In this case, the sink must be idempotent when a micro-batch
> is replayed so that no data is duplicated in the output.
>
> Are there any other cases where data might be duplicated in the stream?
> i.e., if neither of these two failures occurs, is there still a case where
> data can be duplicated?
>
> Thanks for any help to check if my understanding is correct.
>
