Hi Amara, how are you validating whether or not you have duplicates in your output?
If you are just writing the output to another Kafka topic or printing it to standard out, you will see duplicates even when exactly-once works. Flink does not provide exactly-once delivery to external systems; it provides exactly-once semantics for registered state. This means the sink needs to cooperate with the system to achieve end-to-end exactly-once. For files, for example, you need to remove invalid data left behind by previously failed checkpoints; our BucketingSink does exactly that. A minimal sketch of such a setup is below the quoted message.

On Tue, May 30, 2017 at 9:01 AM, F.Amara <fath...@wso2.com> wrote:

> Hi Gordon,
>
> Thanks a lot for the reply.
> The events are produced using a KafkaProducer, submitted to a topic and
> thereby consumed by the Flink application using a FlinkKafkaConsumer. I
> verified that during a failure recovery scenario (of the Flink application)
> the KafkaProducer was not interrupted, so no duplicated values were sent
> from the data source. I observed the output from the FlinkKafkaConsumer
> and noticed duplicates starting from that point onwards.
> Is the FlinkKafkaConsumer capable of introducing duplicates?
>
> How can I implement exactly-once processing for my application? Could you
> please guide me on what I might have missed?
>
> Thanks,
> Amara
>
> --
> View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Duplicated-data-when-using-Externalized-Checkpoints-in-a-Flink-Highly-Available-cluster-tp13301p13379.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.
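
For reference, this is roughly what a cooperating setup looks like. A minimal sketch only, assuming the Kafka 0.10 connector and the filesystem BucketingSink; the topic name, bootstrap servers, group id, and output path are all placeholders:

import java.util.Properties;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class ExactlyOnceToFiles {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpoint every 5s; this gives exactly-once guarantees for
        // registered state, not for external delivery.
        env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "my-group");

        // The consumer registers its Kafka offsets as Flink state, so they
        // are rewound consistently when the job recovers from a checkpoint.
        FlinkKafkaConsumer010<String> consumer =
                new FlinkKafkaConsumer010<>("input-topic", new SimpleStringSchema(), props);

        // The BucketingSink cooperates with checkpointing: files are only
        // finalized once the checkpoint covering them completes, and data
        // written by failed checkpoints is discarded on restore.
        BucketingSink<String> sink = new BucketingSink<>("/tmp/flink-output");

        env.addSource(consumer).addSink(sink);
        env.execute("Exactly-once Kafka to files");
    }
}

By contrast, printing to standard out or writing to Kafka with a plain (non-transactional) producer sink is at-least-once, so records replayed after a failure show up as duplicates even though the job's state itself is exactly-once.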