Hi Niels!

In general, exactly once output requires transactional cooperation from the
target system. Kafka has that on the roadmap, we should be able to
integrate that once it is out.
That means output is "committed" upon completed checkpoints, which
guarantees nothing is written multiple times.

Chesnay is working on an interesting prototype as a generic solution (also
for Kafka, while they don't have that feature):
It buffers the data in the sink persistently (using the fault tolerance
state backends) and pushes the results out on notification of a completed
checkpoint.
That gives you exactly once semantics, but involves an extra
materialization of the data.


I think that there is actually a fundamental latency issue with "exactly
once sinks", no matter how you implement them in any systems:
You can only commit once you are sure that everything went well, to a
specific point where you are sure no replay will ever be needed.

So the latency in Flink for an exactly-once output would be at least the
checkpoint interval.

I'm eager to hear your thoughts on this.

Greetings,
Stephan


On Fri, Feb 5, 2016 at 11:17 AM, Niels Basjes <[email protected]> wrote:

> Hi,
>
> It is my understanding that the exactly-once semantics regarding the input
> from Kafka is based on the checkpointing in the source component retaining
> the offset where it was at the checkpoint moment.
>
> My question is how does that work for a sink? How can I make sure that (in
> light of failures) each message that is read from Kafka (my input) is
> written to Kafka (my output) exactly once?
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
>

Reply via email to