[ https://issues.apache.org/jira/browse/KAFKA-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ben Stopford updated KAFKA-2908: -------------------------------- Summary: Consumer Sporadically Stops Consuming From Partition After Server Restart (was: Another, possibly different, Gap in Consumption after Restart) > Consumer Sporadically Stops Consuming From Partition After Server Restart > ------------------------------------------------------------------------- > > Key: KAFKA-2908 > URL: https://issues.apache.org/jira/browse/KAFKA-2908 > Project: Kafka > Issue Type: Sub-task > Components: consumer > Reporter: Ben Stopford > Assignee: Jason Gustafson > Attachments: 2015-11-28--001.tar.gz > > > *Context:* > Instance of the rolling upgrade test. 10s sleeps have been put around node > restarts to ensure stability when subsequent nodes go down. Consumer timeout > has been set to 60s. Proudcer has been throttled to 100 messages per second. > Failure is rare: It occurred once in 60 executions (6 executions per run #276 > -> #285 of the system_test_branch_builder) > *Reported Failure:* > At least one acked message did not appear in the consumed messages. > acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406, > 16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439, > 16442, ...plus 1669 more > *Immediate Observations:* > * The list of messages not consumed are all in partition 1. > * Production and Consumption continues throughout the test (there is no > complete write or read failure as we have seen elsewhere) > * The messages ARE present in the data files: > e.g. > {quote} > Examining missing value 16385. This was written to p1,offset:5453 > => there is an entry in all data files for partition 1 for this (presumably > meaning it was replicated correctly) > kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files > worker10/kafka-data-logs/test_topic-1/00000000000000000000.log | grep > 'offset: 5453' > worker10,9,8: > offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 > compresscodec: NoCompressionCodec crc: 1502953075 payload: 16385 > => ID 16385 is definitely present in the Kafka logs suggesting a problem > consumer-side > {quote} > These entries definitely do not appear in the consumer stdout file either. > *Timeline:* > {quote} > 15:27:02,232 - Producer sends first message > ..servers 1, 2 are restarted (clean shutdown) > 15:29:42,718 - Server 3 shutdown complete > 15:29:42,712 - (Controller fails over): Broker 2 starting become controller > state transition (kafka.controller.KafkaController) > 15:29:42,743 - New leasder is 2 (LeaderChangeListner) > 15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2 > with correlation id 0 epoch 7 for partition (test_topic,1) since its > associated leader epoch 8 is old. Current leader epoch is 8 > (state.change.logger) > 15:29:45,642 - Producer starts writing messages that are never consumed > 15:30:10,804 - Last message sent by producer > 15:31:10,983 - Consumer times out after 60s wait without messages > {quote} > Logs for this run are attached -- This message was sent by Atlassian JIRA (v6.3.4#6332)