I am looking at the ways one might end up with data loss and duplication in a Kafka
cluster and would appreciate some help/pointers/discussion.
So far, here's what I have come up with:
Loss at producer-side
Since the send() call actually just adds data to a client-side cache/buffer, a
crash of the producer can lose any records still sitting in that buffer.
Another scenario for data loss is a producer exiting without closing (and thereby
flushing) the producer.
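To make that buffer point concrete, here is a minimal Java sketch (the broker address, topic name, and callback handling are placeholder assumptions, not from any real setup): send() returns as soon as the record is in the client buffer, so the callback plus a final flush()/close() are what actually tell you the data reached the broker.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BufferAwareProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<>(props);
        // send() only appends to the client-side buffer and returns immediately;
        // the callback fires once the broker has acked (or the send has failed).
        producer.send(new ProducerRecord<>("my-topic", "key", "value"),
                      (metadata, exception) -> {
            if (exception != null) {
                // this record never made it to the broker - log or re-queue it
                exception.printStackTrace();
            }
        });
        // If the process dies before this point, buffered records are simply gone.
        producer.flush();   // force out everything still in the buffer
        producer.close();   // also flushes, then releases resources
    }
}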
Loss at broker-side
I think there are several situations here, all of which are triggered by a broker
or controller crash, or by network issues with ZooKeeper (which in effect simulate
broker crashes).
If I understand correctly, KAFKA-1211
(https://issues.apache.org/jira/browse/KAFKA-1211) implies that when acks is
set to 0 or 1 and the leader crashes, there is a real risk of data loss.
Hopefully the implementation of leader generations (leader epochs) will help
avoid this
(https://issues.apache.org/jira/browse/KAFKA-1211?focusedCommentId=15402622&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15402622)
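In the meantime, shrinking that window mostly comes down to the producer's acks setting plus the broker/topic-level min.insync.replicas, so the leader only acknowledges a write once enough replicas have it. An illustrative sketch (the settings are real Kafka configs; the values are assumptions, not recommendations for any particular workload):

# producer config (illustrative)
acks=all            # leader acks only after the full in-sync replica set has the record
retries=2147483647  # retry transient errors instead of silently dropping the record

# topic/broker config, so that "all" means more than one copy:
# min.insync.replicas=2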
And a unique situation, described in KAFKA-3410
(https://issues.apache.org/jira/browse/KAFKA-3410), can cause a broker or cluster
shutdown, leading to the data loss described in KAFKA-3924 (resolved in 0.10.0.1).
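On the broker side, the usual durability knobs for these crash scenarios look roughly like this in server.properties (a sketch; unclean.leader.election.enable, min.insync.replicas, and default.replication.factor are real broker settings, the values are assumptions). Disabling unclean leader election trades availability for durability, since an out-of-sync replica can no longer become leader and truncate acknowledged writes:

# server.properties (illustrative values)
unclean.leader.election.enable=false  # never promote an out-of-sync replica to leader
min.insync.replicas=2                 # with acks=all, reject writes rather than under-replicate
default.replication.factor=3          # keep enough copies for the settings above to matter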
And data duplication can be attributed primarily to consumer offset management,
which is done in batches at periodic intervals.
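To illustrate where that duplication window comes from (the group id, topic name, and process() handler below are hypothetical), here is a sketch with auto-commit disabled and a manual commit after processing: a crash between process() and commitSync() replays the last batch on restart, which is exactly at-least-once delivery. Auto-commit has the same window, just governed by auto.commit.interval.ms instead.

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder address
        props.put("group.id", "my-group");                  // hypothetical group
        props.put("enable.auto.commit", "false");           // commit offsets ourselves
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("my-topic"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records) {
                process(record);   // hypothetical handler
            }
            // A crash between process() and commitSync() means this whole batch
            // is consumed again after restart => duplicates (at-least-once).
            consumer.commitSync();
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.offset() + ": " + record.value());
    }
}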
Can anyone think of, or does anyone know of, any other scenarios?
Thanks,
Jayesh


