Hello Folks,

I am trying to build fault tolerance on the consumer side, to make sure
that all failure scenarios are handled.
On the data integrity side, there are two primary requirements:

1. No data loss
2. No data duplication

I'm particularly interested in data duplication. For example, the following
steps happen, in this order, on the consumer during each consume cycle
(a code sketch of this loop follows below):

1. connect
2. consume
3. write the offset back to ZooKeeper (0.8) / Kafka (0.9)
4. process the message (done by separate application code, not the
consumer API)

Please correct the above steps if I'm wrong.
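
To make the cycle concrete, here is roughly how I picture it with the 0.9
KafkaConsumer API and manual offset commits (the broker address, group id,
and topic name are just placeholders). Note that in this sketch the commit
(step 3) runs after processing (step 4), which is exactly the ordering that
can reconsume already-processed messages on restart:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumeCycle {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "my-group");                // placeholder group id
        props.put("enable.auto.commit", "false");         // step 3 is done manually
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        // Step 1: connect
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic

        try {
            while (true) {
                // Step 2: consume a batch of messages
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // step 4: hand off to the processing code
                }
                // Step 3: commit offsets only after processing succeeded;
                // a crash between process() and commitSync() means these
                // records are redelivered (and reprocessed) on restart
                consumer.commitSync();
            }
        } finally {
            consumer.close();
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // placeholder for the downstream processing logic
        System.out.printf("partition=%d offset=%d value=%s%n",
                record.partition(), record.offset(), record.value());
    }
}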

Now, failures (machine down, process down, unhandled exceptions, or bugs)
can occur at any of the above steps.
In particular, if a failure occurs after consuming a message but before
writing the offset back to ZooKeeper/Kafka, the same message will be
reconsumed when the consumer restarts. If the message had already been
processed before the failure (i.e. step 4 ran before the offset was
committed, or runs asynchronously), this reconsumption leads to duplicate
processing of that message after the consumer restarts!

Is this a valid scenario?
Also, are there any other scenarios that need to be taken into
consideration when consuming?
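
For context, the mitigation I'm currently considering is to make step 4
idempotent, keyed on (topic, partition, offset), so a redelivered message is
skipped rather than processed twice. A rough sketch (the class name and the
in-memory set are only illustrative; a real implementation would persist the
seen keys atomically with the processing result):

import java.util.HashSet;
import java.util.Set;

import org.apache.kafka.clients.consumer.ConsumerRecord;

// Illustrative only: dedupes redelivered messages by (topic, partition, offset).
// An in-memory set does not survive a restart, so a real implementation would
// store these keys in the same durable store as the processing results,
// ideally in one atomic write.
public class IdempotentProcessor {
    private final Set<String> processed = new HashSet<>();

    public void process(ConsumerRecord<String, String> record) {
        String key = record.topic() + "-" + record.partition() + "-" + record.offset();
        if (!processed.add(key)) {
            return; // duplicate delivery after a restart/rebalance: skip it
        }
        // ... actual message handling goes here ...
    }
}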


Thanks,
Prabhjot
