[ https://issues.apache.org/jira/browse/KAFKA-3040 ]
Ismael Juma updated KAFKA-3040:
-------------------------------
    Labels: reliability  (was: )

> Broker didn't report new data after change in leader
> ----------------------------------------------------
>
>                 Key: KAFKA-3040
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3040
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.9.0.0
>         Environment: Debian 3.2.54-2 x86_64 GNU/Linux
>            Reporter: Imran Patel
>            Priority: Critical
>              Labels: reliability
>
> Recently we had an event that caused large Kafka backlogs to develop suddenly. This happened across multiple partitions. We noticed that after a brief connection loss to Zookeeper, Kafka brokers were reporting no new data to our (SimpleConsumer) consumer although the producers were enqueueing fine. This went on until another zk blip led to a reconfiguration, which suddenly caused the consumers to "see" the data. Our consumers and our monitoring tools did not see the offsets move during the outage window. Here is the sequence of events for a single partition (with logs attached below). The brokers are running 0.9, the producer is using library version kafka_2.10:0.8.2.1, and the consumer is using kafka_2.10:0.8.0 (both are Java programs). Our monitoring tool uses kafka-python-9.0.
> Can you tell us if this could be due to a consumer bug (the libraries being too "old" to operate with a 0.9 broker, for example)? Or does it look like a Kafka core issue? Please note that we recently upgraded the brokers to 0.9 and hadn't seen a similar issue prior to that.
> - After a brief connection loss to zookeeper, the partition leader (broker 9 for partition 29 in the logs below) came back and shrank the ISR to itself.
> - Producers kept successfully sending data to Kafka, and the remaining replicas (brokers 3 and 4) recorded this data. AFAICT, 3 was the new leader. Broker 9 did NOT replicate this data; it just kept printing the ISR-shrinking message over and over.
> - The consumer, on the other hand, reported no new data, presumably because it was talking to 9 and that broker was doing nothing (see the fetch-loop sketch after this list).
> - 6 hours later, another zookeeper blip caused the brokers to reconfigure, and the consumers started seeing new data.
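>
> A note on why the consumer could sit still for six hours without failing over: the 0.8-era SimpleConsumer does no leader tracking of its own. The application fetches from whichever broker it last resolved as the leader, and only goes back to the metadata API when a fetch returns an error code. Below is a minimal sketch of such a fetch loop, assuming the 0.8.x kafka.javaapi classes; the host, starting offset, and "p29-reader" client id are taken from the logs purely for illustration and are not the reporter's actual code.
> {code:java}
> import kafka.api.FetchRequest;
> import kafka.api.FetchRequestBuilder;
> import kafka.common.ErrorMapping;
> import kafka.javaapi.FetchResponse;
> import kafka.javaapi.consumer.SimpleConsumer;
> import kafka.message.MessageAndOffset;
>
> public class PartitionFetchLoop {
>     public static void main(String[] args) {
>         // Broker, offset and client id mirror the logs; purely illustrative.
>         SimpleConsumer consumer = new SimpleConsumer(
>                 "kafka009-prod.c.foo.internal", 9092,
>                 100000,        // socket timeout (ms)
>                 64 * 1024,     // receive buffer (bytes)
>                 "p29-reader");
>         String topic = "messages";
>         int partition = 29;
>         long offset = 14169556262L;
>
>         while (true) {
>             FetchRequest req = new FetchRequestBuilder()
>                     .clientId("p29-reader")
>                     .addFetch(topic, partition, offset, 100000)
>                     .build();
>             FetchResponse resp = consumer.fetch(req);
>
>             if (resp.hasError()) {
>                 // Only an explicit error code triggers leader re-discovery.
>                 short code = resp.errorCode(topic, partition);
>                 if (code == ErrorMapping.NotLeaderForPartitionCode()) {
>                     break;  // re-resolve the leader via metadata, then retry
>                 }
>                 continue;
>             }
>
>             // A successful-but-empty response is indistinguishable from
>             // "no new data", so the loop stays pinned to the same broker.
>             long start = offset;
>             int count = 0;
>             for (MessageAndOffset mo : resp.messageSet(topic, partition)) {
>                 offset = mo.nextOffset();
>                 count++;
>             }
>             System.out.println(count + " messages read from partition "
>                     + partition + " starting from offset " + start);
>         }
>         consumer.close();
>     }
> }
> {code}
> As long as broker 9 kept answering fetches without an error code, a loop like this would report "0 messages read" indefinitely, which matches the consumer output below; it only moves on when the broker finally returns NotLeaderForPartition.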
> Broker 9:
> [2015-12-16 19:46:01,523] INFO Partition [messages,29] on broker 9: Expanding ISR for partition [messages,29] from 9,4 to 9,4,3 (kafka.cluster.Partition)
> [2015-12-18 00:59:25,511] INFO New leader is 9 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2015-12-18 01:00:18,451] INFO Partition [messages,29] on broker 9: Shrinking ISR for partition [messages,29] from 9,4,3 to 9 (kafka.cluster.Partition)
> [2015-12-18 01:00:18,458] INFO Partition [messages,29] on broker 9: Cached zkVersion [472] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> [2015-12-18 07:04:44,552] INFO Truncating log messages-29 to offset 14169556269. (kafka.log.Log)
> [2015-12-18 07:04:44,649] INFO [ReplicaFetcherManager on broker 9] Added fetcher for partitions List([[messages,61], initOffset 14178575900 to broker BrokerEndPoint(6,kafka006-prod.c.foo.internal,9092)] , [[messages,13], initOffset 14156091271 to broker BrokerEndPoint(2,kafka002-prod.c.foo.internal,9092)] , [[messages,45], initOffset 14135826155 to broker BrokerEndPoint(4,kafka004-prod.c.foo.internal,9092)] , [[messages,41], initOffset 14157926400 to broker BrokerEndPoint(1,kafka001-prod.c.foo.internal,9092)] , [[messages,29], initOffset 14169556269 to broker BrokerEndPoint(3,kafka003-prod.c.foo.internal,9092)] , [[messages,57], initOffset 14175218230 to broker BrokerEndPoint(1,kafka001-prod.c.foo.internal,9092)] ) (kafka.server.ReplicaFetcherManager)
> Broker 3:
> [2015-12-18 01:00:01,763] INFO [ReplicaFetcherManager on broker 3] Removed fetcher for partitions [messages,29] (kafka.server.ReplicaFetcherManager)
> [2015-12-18 07:09:04,631] INFO Partition [messages,29] on broker 3: Expanding ISR for partition [messages,29] from 4,3 to 4,3,9 (kafka.cluster.Partition)
> [2015-12-18 07:09:49,693] INFO [ReplicaFetcherManager on broker 3] Removed fetcher for partitions [messages,29] (kafka.server.ReplicaFetcherManager)
> Broker 4:
> [2015-12-18 01:00:01,783] INFO [ReplicaFetcherManager on broker 4] Removed fetcher for partitions [messages,29] (kafka.server.ReplicaFetcherManager)
> [2015-12-18 01:00:01,866] INFO [ReplicaFetcherManager on broker 4] Added fetcher for partitions List([[messages,29], initOffset 14169556262 to broker BrokerEndPoint(3,kafka003-prod.c.foo.internal,9092)] ) (kafka.server.ReplicaFetcherManager)
> [2015-12-18 07:09:50,191] ERROR [ReplicaFetcherThread-0-3], Error for partition [messages,29] to broker 3:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
> Consumer:
> 2015-12-18 01:00:01.791 [P29-Reader] INFO com.example.utils.kafkalib.KafkaConsumer - 7 messages read from partition 29 starting from offset 14169556262
> 2015-12-18 01:00:01.791 [P29-Reader] INFO com.example.utils.kafkalib.KafkaConsumer - 0 messages read from partition 29 starting from offset 14169556269
> 2015-12-18 07:04:44.293 [P29-Reader] INFO com.example.utils.kafkalib.KafkaConsumer - 0 messages read from partition 29 starting from offset 14169556269
> 2015-12-18 07:04:44.303 [P29-Reader] WARN com.example.project.consumer.kafka.PartitionReader - Error fetching data from the Broker:kafka009-prod.c.foo.internal Reason: NotLeaderForPartitionCode
> 2015-12-18 07:04:44.304 [P29-Reader] INFO com.example.project.consumer.kafka.PartitionData - Attempting to connectAndRead leader for topic: messages partition: 29
> 2015-12-18 07:04:44.309 [P29-Reader] INFO com.example.project.consumer.kafka.PartitionData - Leader for topic: messages partition: 29 set to kafka003-prod.c.foo.internal
> 2015-12-18 07:04:44.749 [P29-Reader] INFO com.example.utils.kafkalib.KafkaConsumer - 6514 messages read from partition 29 starting from offset 14169556269
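>
> For reference, the "Attempting to connectAndRead leader" step above is, in an 0.8-style client, typically a topic-metadata round trip along the lines of the sketch below (again assuming the 0.8.x kafka.javaapi classes; the seed broker and "leader-lookup" client id are illustrative, not the reporter's code). The catch relevant to this ticket: the answer is only as good as the broker you ask, so a broker serving stale metadata will keep pointing clients at the stale leader until the cluster reconfigures.
> {code:java}
> import java.util.Collections;
>
> import kafka.javaapi.PartitionMetadata;
> import kafka.javaapi.TopicMetadata;
> import kafka.javaapi.TopicMetadataRequest;
> import kafka.javaapi.TopicMetadataResponse;
> import kafka.javaapi.consumer.SimpleConsumer;
>
> public class LeaderLookup {
>     // Ask one live broker which broker it believes leads topic/partition.
>     static String findLeaderHost(String seedHost, int seedPort,
>                                  String topic, int partition) {
>         SimpleConsumer consumer = new SimpleConsumer(
>                 seedHost, seedPort, 100000, 64 * 1024, "leader-lookup");
>         try {
>             TopicMetadataRequest request =
>                     new TopicMetadataRequest(Collections.singletonList(topic));
>             TopicMetadataResponse response = consumer.send(request);
>             for (TopicMetadata tm : response.topicsMetadata()) {
>                 for (PartitionMetadata pm : tm.partitionsMetadata()) {
>                     if (pm.partitionId() == partition && pm.leader() != null) {
>                         return pm.leader().host();  // this broker's view only
>                     }
>                 }
>             }
>             return null;  // leader unknown to this broker
>         } finally {
>             consumer.close();
>         }
>     }
>
>     public static void main(String[] args) {
>         // Per the consumer log above, this resolved to
>         // kafka003-prod.c.foo.internal once the client finally re-asked.
>         System.out.println(findLeaderHost(
>                 "kafka004-prod.c.foo.internal", 9092, "messages", 29));
>     }
> }
> {code}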