Hi everyone,

Our setup: a 3-node Kafka (0.9.0.1) / 3-node ZooKeeper (3.4.6-1569965) cluster on 6 separate VMware instances. We have around 10 topics with 32 partitions each. Offsets are managed by Kafka.
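For context, the topics were created roughly like this (a sketch, not the literal command: replication factor 3 and min.insync.replicas=2 are inferred from the --describe output and the error message further down, min.insync.replicas may just as well be a broker-level default, and the zookeeper address and topic name are placeholders):

  ./kafka-topics.sh --zookeeper <zk-host>:<zk-port> --create --topic <topic> \
      --partitions 32 --replication-factor 3 \
      --config min.insync.replicas=2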
We had a production issue and we suspect it was caused by VMware snapshot backups leading to read timeouts on the ZooKeeper followers, causing a re-election. The ZooKeeper leader changed from zk2 to zk3, zk2 came back and re-joined, then there was another timeout logged by zk2 but it re-joined quickly. So far so good; I would call this expected behavior.

Unfortunately our (standard Kafka Java) producers had a problem afterwards, specifically with kafka2. It reported errors in its log such as

  Number of insync replicas for partition [<partition>,29] is [1], below required minimum [2]

and the publish attempts failed with the same message, which caused end-user requests to fail. I suspect this happened only for certain partitions (those where kafka2 was supposed to be the leader). Messages could be processed fine by kafka1 and kafka3; they just shrank their ISR from 1,2,3 (or similar) to 1,3. So for some reason kafka2 apparently remained the leader for those partitions and kept receiving produce requests, even though it saw only 1 in-sync replica and therefore declined to write them. How could this happen?

After a quick restart of all ZooKeeper and Kafka instances (in order to restore production asap) everything worked fine again. The only information we have is the log files; we don't have the output of a "--describe" of the topics or anything like that. I can provide the logs, but I have to upload them somewhere and strip out sensitive data first.

I then tried to reproduce the behavior locally with a similar set of 6 VMs and the same config. I created a test topic with 32 partitions and started a producer and a consumer.

The producer is running like this:

  while true; do date; sleep 1; done | ./kafka-run-class.sh kafka.tools.ConsoleProducer \
      --broker-list 192.168.104.6:17111,192.168.104.7:17111,192.168.104.8:17111 \
      --topic test32 --request-required-acks -1

The consumer:

  ./kafka-run-class.sh kafka.tools.ConsoleConsumer \
      --bootstrap-server 192.168.104.6:17111,192.168.104.7:17111,192.168.104.8:17111 \
      --topic test32 --from-beginning --new-consumer

Then I disturbed the network on the ZooKeeper leader using iptables, causing re-elections at random intervals (see this script: https://thepasteb.in/p/0ghDcWxKQq9g8uz). I ran the script like so:

  while true; do ./iptables_block_zk_leader.sh block; sleep $((RANDOM%10)); ./iptables_block_zk_leader.sh unblock; done
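In essence, blocking the leader boils down to something like this (a simplified sketch, not the actual script; the peer IPs are placeholders for the other two ZooKeeper nodes of my test setup, the real details are in the link above):

  #!/bin/bash
  # Run on the current zookeeper leader: drop all traffic to/from the other
  # zookeeper nodes so the followers hit timeouts and force a re-election.
  ZK_PEERS="192.168.104.4 192.168.104.5"   # placeholder IPs for the other two zk nodes

  case "$1" in
    block)
      for peer in $ZK_PEERS; do
        iptables -A INPUT  -s "$peer" -j DROP
        iptables -A OUTPUT -d "$peer" -j DROP
      done
      ;;
    unblock)
      for peer in $ZK_PEERS; do
        iptables -D INPUT  -s "$peer" -j DROP
        iptables -D OUTPUT -d "$peer" -j DROP
      done
      ;;
    *)
      echo "usage: $0 block|unblock" >&2
      exit 1
      ;;
  esac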
It ran over night, and the next day I had a somewhat broken state regarding Kafka (at that point I stopped the script so I could analyze the state):

1. Only kafka2 is registered with ZooKeeper (ls /brokers/ids -> [2]). The ISR only shows kafka1 and kafka2; kafka3 seems to be out of the game.

  ./kafka-topics.sh --zookeeper 192.168.104.3:17001 --topic test32 --describe
  Topic:test32  PartitionCount:32  ReplicationFactor:3  Configs:
    Topic: test32  Partition: 0  Leader: 2  Replicas: 3,1,2  Isr: 2,1
    Topic: test32  Partition: 1  Leader: 2  Replicas: 1,2,3  Isr: 2,1
    [...]

2. kafka1 is constantly logging:

  [2016-11-04 17:14:51,928] INFO Partition [__consumer_offsets,19] on broker 1: Cached zkVersion [120] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
  [2016-11-04 17:15:01,927] INFO Partition [__consumer_offsets,19] on broker 1: Shrinking ISR for partition [__consumer_offsets,19] from 2,1,3 to 1,3 (kafka.cluster.Partition)
  [2016-11-04 17:15:01,927] INFO Partition [__consumer_offsets,19] on broker 1: Shrinking ISR for partition [__consumer_offsets,19] from 2,1,3 to 1,3 (kafka.cluster.Partition)
  [2016-11-04 17:15:01,930] INFO Partition [__consumer_offsets,19] on broker 1: Cached zkVersion [120] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
  [2016-11-04 17:15:01,930] INFO Partition [__consumer_offsets,19] on broker 1: Cached zkVersion [120] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
  [2016-11-04 17:15:11,926] INFO Partition [__consumer_offsets,19] on broker 1: Shrinking ISR for partition [__consumer_offsets,19] from 2,1,3 to 1,3 (kafka.cluster.Partition)
  [2016-11-04 17:15:11,926] INFO Partition [__consumer_offsets,19] on broker 1: Shrinking ISR for partition [__consumer_offsets,19] from 2,1,3 to 1,3 (kafka.cluster.Partition)

3. kafka2 is not logging any issues.

4. kafka3 is constantly logging:

  [2016-11-04 14:34:13,246] INFO Partition [test32,30] on broker 3: Shrinking ISR for partition [test32,30] from 3,1,2 to 3 (kafka.cluster.Partition)
  [2016-11-04 14:34:13,246] INFO Partition [test32,30] on broker 3: Shrinking ISR for partition [test32,30] from 3,1,2 to 3 (kafka.cluster.Partition)
  [2016-11-04 14:34:13,252] INFO Partition [test32,30] on broker 3: Cached zkVersion [97] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
  [2016-11-04 14:34:13,252] INFO Partition [test32,30] on broker 3: Cached zkVersion [97] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
  [2016-11-04 14:34:14,109] ERROR [ReplicaFetcherThread-0-1], Error for partition [test32,13] to broker 1:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
  [2016-11-04 14:34:14,109] ERROR [ReplicaFetcherThread-0-1], Error for partition [test32,13] to broker 1:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
  [2016-11-04 14:34:14,109] ERROR [ReplicaFetcherThread-0-1], Error for partition [__consumer_offsets,47] to broker 1:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
  [2016-11-04 14:34:14,109] ERROR [ReplicaFetcherThread-0-1], Error for partition [__consumer_offsets,47] to broker 1:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

5. ZooKeeper is in a stable state as far as I can tell: I see no errors in the logs, ruok returns imok on all 3 nodes, and the zxid is equal on all of them.

6. The producer and consumer are still running fine! There were occasional errors overnight like:

  ...
  [2016-11-04 02:43:04,620] WARN Got error produce response with correlation id 33230 on topic-partition test32-5, retrying (1 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
  [2016-11-04 02:43:04,620] ERROR Error when sending message to topic test32 with key: null, value: 28 bytes with error: The server disconnected before a response was received. (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
  [2016-11-04 02:43:04,620] WARN Got error produce response with correlation id 33229 on topic-partition test32-20, retrying (0 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
  [2016-11-04 02:43:04,620] WARN Got error produce response with correlation id 33229 on topic-partition test32-14, retrying (2 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
  ...
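If it helps with the analysis, the state can also be cross-checked from the ZooKeeper side, roughly like this (sketch commands against my test cluster, output omitted; I am assuming the 0.9 tools accept these options and that zookeeper-shell.sh takes commands on stdin, which I believe they do):

  # which brokers are registered, and which broker zookeeper thinks is the controller
  echo "ls /brokers/ids" | ./zookeeper-shell.sh 192.168.104.3:17001
  echo "get /controller" | ./zookeeper-shell.sh 192.168.104.3:17001

  # partitions whose ISR is smaller than the replica set
  ./kafka-topics.sh --zookeeper 192.168.104.3:17001 --describe --under-replicated-partitions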
But if the producer is directed only to the "broken" kafka3 (--broker-list 192.168.104.8:17111), it gets warnings like these:

  [kafka_xx@kafka2 /apps/kafka/logs/kafka]$ while true; do date; sleep 1; done | /apps/kafka/kafka/bin/kafka-run-class.sh kafka.tools.ConsoleProducer --broker-list 192.168.104.8:17111 --topic test32 --request-required-acks -1
  [2016-11-04 14:24:28,708] WARN Got error produce response with correlation id 2 on topic-partition test32-19, retrying (2 attempts left). Error: NOT_LEADER_FOR_PARTITION (org.apache.kafka.clients.producer.internals.Sender)
  [2016-11-04 14:24:30,204] WARN Got error produce response with correlation id 1 on topic-partition test32-27, retrying (2 attempts left). Error: REQUEST_TIMED_OUT (org.apache.kafka.clients.producer.internals.Sender)
  [2016-11-04 14:24:30,204] WARN Got error produce response with correlation id 1 on topic-partition test32-12, retrying (2 attempts left). Error: REQUEST_TIMED_OUT (org.apache.kafka.clients.producer.internals.Sender)
  [2016-11-04 14:24:30,204] WARN Got error produce response with correlation id 1 on topic-partition test32-28, retrying (2 attempts left). Error: REQUEST_TIMED_OUT (org.apache.kafka.clients.producer.internals.Sender)
  [2016-11-04 14:24:30,204] WARN Got error produce response with correlation id 1 on topic-partition test32-18, retrying (2 attempts left). Error: REQUEST_TIMED_OUT (org.apache.kafka.clients.producer.internals.Sender)
  [2016-11-04 14:24:30,204] WARN Got error produce response with correlation id 1 on topic-partition test32-3, retrying (2 attempts left). Error: REQUEST_TIMED_OUT (org.apache.kafka.clients.producer.internals.Sender)
  [2016-11-04 14:24:30,204] WARN Got error produce response with correlation id 1 on topic-partition test32-9, retrying (2 attempts left). Error: REQUEST_TIMED_OUT (org.apache.kafka.clients.producer.internals.Sender)

Still, the messages are being produced and can be consumed by the console consumer.

So I am a bit puzzled by this state. Maybe I have reproduced the original issue this way, but the Java producer in our application behaved differently and had cached metadata, so messages still got routed to the "sick" Kafka node.

Any help would be very much appreciated!

Thanks,
Thomas