Thanks Arun.
In our case, we only commit sink task output to the datastore by
coordinating with the driver.
Sink tasks write their output to a "staging" area, and the driver commits
the staged data to the datastore only once all tasks for the micro-batch
have reported success back to the driver.
In the cas...
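
A minimal sketch of this staging-then-commit pattern using foreachBatch (a
real Structured Streaming API) might look like the following; the Kafka
source options, paths, and the commitStagedData helper are illustrative
placeholders I'm assuming, not actual code from this thread.

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("staged-sink").getOrCreate()

// Hypothetical helper: atomically promote the staged files into the datastore.
def commitStagedData(stagingPath: String, batchId: Long): Unit = ???

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

val query = stream.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    val stagingPath = s"/staging/batch=$batchId"

    // Executor tasks write the micro-batch into the staging area; this call
    // returns on the driver only after every task has reported success.
    batchDF.write.mode("overwrite").parquet(stagingPath)

    // Driver-side "global commit": promote the staged data into the datastore.
    commitStagedData(stagingPath, batchId)
  }
  .option("checkpointLocation", "/checkpoints/staged-sink")
  .start()
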
Hi Eric,
I think it will depend on how you implement the sink and when the data in
the sink partitions are committed.
I think the batch can be repeated during task retries, as well as when the
driver fails before the batch id is committed in Spark's checkpoint. In the
first case, maybe the sink had n...
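
For the driver-failure case (the batch is replayed because its id was not
yet committed to Spark's checkpoint), one common guard is to record the last
committed batch id in the sink and skip replays. A rough sketch, assuming
stream is the streaming DataFrame from the earlier sketch and the two
helpers are hypothetical functions backed by the datastore:

import org.apache.spark.sql.DataFrame

// Hypothetical helpers backed by the datastore.
def getLastCommittedBatchId(): Long = ???
def writeAndCommit(batchDF: DataFrame, batchId: Long): Unit = ???

stream.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    if (batchId <= getLastCommittedBatchId()) {
      // Replay: the driver failed after the sink commit but before Spark
      // advanced its checkpoint, so this micro-batch was already written.
      ()
    } else {
      // Write the rows and record batchId together so a partial failure
      // cannot leave the data and the batch id out of sync.
      writeAndCommit(batchDF, batchId)
    }
  }
  .option("checkpointLocation", "/checkpoints/idempotent-sink")
  .start()
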
Hi Arun, Gabor,
Thanks for the feedback. We are using the "Exactly-once using
transactional writes" approach, so we don't rely on message keys for
idempotent writes.
So I should clarify that my question is specific to the "Exactly-once using
transactional writes" approach.
We are following the ap...
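
For reference, the "Exactly-once using transactional writes" idea could be
sketched roughly as below, assuming a JDBC-compatible datastore. The table
names, connection URL, and the driver-side collect() are simplifications for
illustration, not a production sink.

import java.sql.DriverManager
import org.apache.spark.sql.DataFrame

def writeBatchTransactionally(batchDF: DataFrame, batchId: Long): Unit = {
  // Collecting to the driver keeps the sketch short; a real sink would write
  // per partition or go through a staging area as discussed in this thread.
  val rows = batchDF.collect()

  val conn = DriverManager.getConnection("jdbc:postgresql://db:5432/sink", "user", "pass")
  try {
    conn.setAutoCommit(false)

    // Has this batch already been committed (i.e. is this a replay)?
    val check = conn.prepareStatement("SELECT 1 FROM committed_batches WHERE batch_id = ?")
    check.setLong(1, batchId)
    if (check.executeQuery().next()) {
      conn.rollback() // replayed batch: nothing to do
    } else {
      val insertRow = conn.prepareStatement("INSERT INTO events(value) VALUES (?)")
      rows.foreach { r => insertRow.setString(1, r.toString); insertRow.executeUpdate() }

      val markBatch = conn.prepareStatement("INSERT INTO committed_batches(batch_id) VALUES (?)")
      markBatch.setLong(1, batchId)
      markBatch.executeUpdate()

      // The rows and the batch id become visible atomically, or not at all.
      conn.commit()
    }
  } catch {
    case e: Exception => conn.rollback(); throw e
  } finally {
    conn.close()
  }
}
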
Hi Eric,
In order to have exactly-once semantics, one needs a replayable source and
an idempotent sink.
The cases you've mentioned cover the two main groups of issues.
Practically any kind of programming problem can end up producing duplicated
data (even in the code which feeds Kafka).
Don't know why you have a...
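
On the remark that duplicates can also originate in the code that feeds
Kafka: the upstream producer can at least be made idempotent (and optionally
transactional). A small sketch; the broker address, topic, and transactional
id are placeholders.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")   // no duplicates on producer retries
props.put(ProducerConfig.ACKS_CONFIG, "all")                  // required for idempotence
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "feeder-1") // optional: transactional sends
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
producer.initTransactions()
producer.beginTransaction()
producer.send(new ProducerRecord("events", "key-1", "value-1"))
producer.commitTransaction()
producer.close()
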
I guess that's roughly it.
As of now there's no built-in support to coordinate the commits across the
executors in an atomic way, so you need to commit the batch (a global
commit) at the driver.
And when the batch is replayed, if any of the intermediate operations are
not idempotent or can cause...
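
To illustrate that last point: a side effect inside a transformation re-runs
whenever the batch or a task is replayed, regardless of how carefully the
sink commits, so it breaks exactly-once on its own. This is a sketch of what
to avoid; stream and spark are as in the first sketch above, and
notifyDownstream is a hypothetical non-idempotent call.

import spark.implicits._

// Hypothetical non-idempotent side effect (e.g. an HTTP POST to another system).
def notifyDownstream(payload: String): Unit = ???

val risky = stream
  .selectExpr("CAST(value AS STRING) AS payload")
  .as[String]
  .map { payload =>
    // This runs on the executors for every attempt of every task: a replayed
    // batch or a retried task sends the notification again, duplicating it
    // downstream even if the sink itself commits exactly once.
    notifyDownstream(payload)
    payload
  }
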