Hello folks,

I am trying to build fault tolerance on the consumer side, to make sure that all failure scenarios are handled. On the data integrity side, there are two primary requirements:
1. No data loss
2. No data duplication

I'm particularly interested in data duplication. For example, the following steps happen on the consumer, in this order, during each consume cycle:

1. Connect
2. Consume
3. Write the offset back to ZooKeeper/Kafka (0.8/0.9)
4. Process the message (which is done by other code, not the consumer API)

Please correct the above steps if I'm wrong.

Now, failures (machine down, process down, unhandled exceptions, or bugs) can occur at each of the above steps. In particular, if a failure occurs after consuming a message but before writing its offset back to ZooKeeper/Kafka, the same message will be re-consumed when the consumer restarts. If step 4 is asynchronous, or more generally if processing the message happens before the offset is written back, that re-consumption leads to duplicate processing of the message after the restart. A sketch of what I mean is below, after my questions.

Is this a valid scenario? Also, are there any other scenarios that need to be taken into consideration when consuming?
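To make the failure window concrete, here is a minimal sketch of the ordering I'm describing, written against the 0.9 new-consumer Java API with auto-commit disabled. The broker address, topic name, group id, and the process() helper are all placeholders I made up for illustration, not anything from a real setup:

import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "my-group");                // placeholder group id
        props.put("enable.auto.commit", "false");         // commit offsets manually
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("my-topic"));    // step 1: connect/subscribe

        try {
            while (true) {
                // Step 2: consume a batch of records.
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // step 4 done first: process the message
                }
                // Step 3 done last: commit only after processing succeeds.
                // A crash between process() and commitSync() means the batch
                // is re-delivered on restart, so it gets processed twice
                // (duplicates). Committing *before* process() instead risks
                // losing a message that was committed but never processed.
                consumer.commitSync();
            }
        } finally {
            consumer.close();
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // Placeholder for the downstream processing code.
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
    }
}

My understanding is that flipping the order, i.e. calling commitSync() before process(), changes the risk from duplication to loss (a crash after the commit but before processing drops the message), so the choice with this API looks like at-least-once versus at-most-once, with duplicates normally handled by making the processing idempotent. Please correct me if that reading is wrong.

Thanks,
Prabhjot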