[ https://issues.apache.org/jira/browse/KAFKA-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ben Stopford updated KAFKA-2909:
--------------------------------
Description:
This seems very similar to Rajini's reported KAFKA-2891.

*Context*
The context is the security rolling-upgrade test with a 30s consumer timeout. There was a 2s sleep between restarts. Throughput was limited to 1000 messages per second.

*Failure*
At least one acked message did not appear in the consumed messages.

acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 64403, 64406, 64409, 36799)

The missing data was checked against the Kafka data files:
{quote}
value 36802 -> partition 1, offset: 12216

kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files worker7/kafka-data-logs/test_topic-1/00000000000000000000.log | grep 'offset: 12216'

-> offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 compresscodec: NoCompressionCodec crc: 3001177408 payload: 47482

In this case offset 12216 does not contain the value we expected. The implication is that the data was either not written or was lost during rebalancing.
{quote}

The first missing value was written at 20:42:30,185, which is around the time the third node goes down.

The failed writes correlate with the consumer logging NOT_COORDINATOR_FOR_GROUP and marking the coordinator dead. There are many of these messages over a long period, though, so it is hard to infer this as the cause or to correlate it specifically with the error.

*Timeline*
{quote}
grep -r 'shutdown complete' *
20:42:06,132 - Node 1 shutdown completed
20:42:18,560 - Node 2 shutdown completed
20:42:30,185 - *Writes that never make it are written by producer*
20:42:31,164 - Node 3 shutdown completed
20:42:57,872 - Node 1 shutdown completed
…
{quote}

Logging was verbose during this run, so the log files can be found here: [click me|https://www.dropbox.com/s/owwzh37cs304qh4/2015-11-26--001%20%283%29.tar.gz?dl=0]
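For clarity, the acked_minus_consumed check above is a plain set difference between the values the producer received acks for and the values the consumer actually read. A minimal illustrative sketch (toy values and names; this is not the actual system-test code):

```python
# Toy sketch of the validation step: the producer records every value it
# received an ack for, the consumer records every value it read, and the
# difference is the set of "lost" messages reported by the test.
acked = set(range(10))          # values the producer got acks for
consumed = acked - {3, 7}       # simulate two acked values never consumed

acked_minus_consumed = acked - consumed
if acked_minus_consumed:
    print("acked_minus_consumed: %s" % sorted(acked_minus_consumed))
```

A non-empty difference is exactly the failure condition reported here: messages 3 and 7 in the toy run, values 36802 etc. in the real one.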
> Example of Gap in Consumption after Restart
> -------------------------------------------
>
>                 Key: KAFKA-2909
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2909
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: consumer
>            Reporter: Ben Stopford
>            Assignee: Jason Gustafson
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)