Hi Robert,

Thanks a lot for the reply. 

To further explain how I verify the presence of duplicates: I write the
output stream received by the FlinkKafkaConsumer (after being sent from the
KafkaProducer) to a CSV file.
The content of the file is then scanned to check whether we received exactly
the number of events sent by the KafkaProducer, and to look for values that
appear more than once, which would indicate duplicates.
In my case the total number of events received is always higher than the
number sent.

The following diagram explains the procedure.

|---------------------------|       |---------|       |------------------------|
| KafkaProducer             |------>|  Kafka  |------>| FlinkKafkaConsumer     |
| (a separate Java process  |       |         |       | (starting point of the |
|  which generates data     |       |         |       |  Flink application)    |
|  and writes to Kafka)     |       |         |       |                        |
|---------------------------|       |---------|       |------------------------|
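For reference, the duplicate check described above can be sketched roughly as
follows. This is a minimal standalone script, not part of the Flink job; it
assumes each CSV row holds a single event value in its first column (the
actual column layout in my file may differ).

```python
from collections import Counter
import csv
import io


def scan_for_duplicates(csv_text):
    """Return (total events received, {value: count} for values seen > 1 time)."""
    values = [row[0] for row in csv.reader(io.StringIO(csv_text)) if row]
    counts = Counter(values)
    duplicates = {v: n for v, n in counts.items() if n > 1}
    return len(values), duplicates


# Hypothetical run: 4 distinct events were sent, but 5 rows were received,
# so event "2" shows up twice.
total, dups = scan_for_duplicates("1\n2\n3\n2\n4\n")
print(total, dups)  # 5 {'2': 2}
```

Comparing the returned total against the number of events the KafkaProducer
emitted, plus a non-empty duplicates map, is how I conclude events were
replayed rather than lost.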


Thanks,
Amara



--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Duplicated-data-when-using-Externalized-Checkpoints-in-a-Flink-Highly-Available-cluster-tp13301p13481.html
Sent from the Apache Flink User Mailing List archive at Nabble.com.