Hi everyone,

Our setup: a 3-node Kafka (0.9.0.1) / 3-node ZooKeeper (3.4.6-1569965) cluster on 6 separate VMware instances. We have around 10 topics with 32 partitions each. Offsets are managed by Kafka.
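For context, the topics were created roughly like this (a sketch, not the literal command: replication factor 3 and min.insync.replicas=2 are inferred from the --describe output and the error message further down, min.insync.replicas may just as well be a broker-level default, and the zookeeper address and topic name are placeholders):

  ./kafka-topics.sh --zookeeper <zk-host>:<zk-port> --create --topic <topic> \
      --partitions 32 --replication-factor 3 \
      --config min.insync.replicas=2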
We had a production issue and we suspect it was caused by VMware snapshot backups leading to read timeouts on the ZooKeeper followers, causing a re-election. The ZooKeeper leader changed from zk2 to zk3, zk2 came back and re-joined, then there was another timeout logged by zk2 but it re-joined quickly. So far so good; I would call this expected behavior.

Unfortunately our (standard Kafka Java) producers had a problem afterwards, specifically with kafka2. It reported errors in its log such as

  Number of insync replicas for partition [<partition>,29] is [1], below required minimum [2]

and the publish attempts failed with the same message, which caused end-user requests to fail. I suspect this happened only for certain partitions (those where kafka2 was supposed to be the leader). Messages could be processed fine by kafka1 and kafka3; they just shrank their ISR from 1,2,3 (or similar) to 1,3. So for some reason kafka2 apparently remained the leader for those partitions and kept receiving produce requests, even though it saw only 1 in-sync replica and therefore declined to write them. How could this happen?

After a quick restart of all ZooKeeper and Kafka instances (in order to restore production asap) everything worked fine again. The only information we have is the log files; we don't have the output of a "--describe" of the topics or anything like that. I can provide the logs, but I have to upload them somewhere and strip out sensitive data first.

I then tried to reproduce the behavior locally with a similar set of 6 VMs and the same config. I created a test topic with 32 partitions and started a producer and a consumer.

The producer is running like this:

  while true; do date; sleep 1; done | ./kafka-run-class.sh kafka.tools.ConsoleProducer \
      --broker-list 192.168.104.6:17111,192.168.104.7:17111,192.168.104.8:17111 \
      --topic test32 --request-required-acks -1

The consumer:

  ./kafka-run-class.sh kafka.tools.ConsoleConsumer \
      --bootstrap-server 192.168.104.6:17111,192.168.104.7:17111,192.168.104.8:17111 \
      --topic test32 --from-beginning --new-consumer

Then I disturbed the network on the ZooKeeper leader using iptables, causing re-elections at random intervals (see this script: https://thepasteb.in/p/0ghDcWxKQq9g8uz). I ran the script like so:

  while true; do ./iptables_block_zk_leader.sh block; sleep $((RANDOM%10)); ./iptables_block_zk_leader.sh unblock; done
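In essence, blocking the leader boils down to something like this (a simplified sketch, not the actual script; the peer IPs are placeholders for the other two ZooKeeper nodes of my test setup, the real details are in the link above):

  #!/bin/bash
  # Run on the current zookeeper leader: drop all traffic to/from the other
  # zookeeper nodes so the followers hit timeouts and force a re-election.
  ZK_PEERS="192.168.104.4 192.168.104.5"   # placeholder IPs for the other two zk nodes

  case "$1" in
    block)
      for peer in $ZK_PEERS; do
        iptables -A INPUT  -s "$peer" -j DROP
        iptables -A OUTPUT -d "$peer" -j DROP
      done
      ;;
    unblock)
      for peer in $ZK_PEERS; do
        iptables -D INPUT  -s "$peer" -j DROP
        iptables -D OUTPUT -d "$peer" -j DROP
      done
      ;;
    *)
      echo "usage: $0 block|unblock" >&2
      exit 1
      ;;
  esac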
It ran over night, and the next day I had a somewhat broken state regarding Kafka (at that point I stopped the script so I could analyze the state):

1. Only kafka2 is registered with ZooKeeper (ls /brokers/ids -> [2]). The ISR only shows kafka1 and kafka2; kafka3 seems to be out of the game.

  ./kafka-topics.sh --zookeeper 192.168.104.3:17001 --topic test32 --describe
  Topic:test32  PartitionCount:32  ReplicationFactor:3  Configs:
    Topic: test32  Partition: 0  Leader: 2  Replicas: 3,1,2  Isr: 2,1
    Topic: test32  Partition: 1  Leader: 2  Replicas: 1,2,3  Isr: 2,1
    [...]

2. kafka1 is constantly logging:

  [2016-11-04 17:14:51,928] INFO Partition [__consumer_offsets,19] on broker 1: Cached zkVersion [120] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
  [2016-11-04 17:15:01,927] INFO Partition [__consumer_offsets,19] on broker 1: Shrinking ISR for partition [__consumer_offsets,19] from 2,1,3 to 1,3 (kafka.cluster.Partition)
  [2016-11-04 17:15:01,927] INFO Partition [__consumer_offsets,19] on broker 1: Shrinking ISR for partition [__consumer_offsets,19] from 2,1,3 to 1,3 (kafka.cluster.Partition)
  [2016-11-04 17:15:01,930] INFO Partition [__consumer_offsets,19] on broker 1: Cached zkVersion [120] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
  [2016-11-04 17:15:01,930] INFO Partition [__consumer_offsets,19] on broker 1: Cached zkVersion [120] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
  [2016-11-04 17:15:11,926] INFO Partition [__consumer_offsets,19] on broker 1: Shrinking ISR for partition [__consumer_offsets,19] from 2,1,3 to 1,3 (kafka.cluster.Partition)
  [2016-11-04 17:15:11,926] INFO Partition [__consumer_offsets,19] on broker 1: Shrinking ISR for partition [__consumer_offsets,19] from 2,1,3 to 1,3 (kafka.cluster.Partition)

3. kafka2 is not logging any issues.

4. kafka3 is constantly logging:

  [2016-11-04 14:34:13,246] INFO Partition [test32,30] on broker 3: Shrinking ISR for partition [test32,30] from 3,1,2 to 3 (kafka.cluster.Partition)
  [2016-11-04 14:34:13,246] INFO Partition [test32,30] on broker 3: Shrinking ISR for partition [test32,30] from 3,1,2 to 3 (kafka.cluster.Partition)
  [2016-11-04 14:34:13,252] INFO Partition [test32,30] on broker 3: Cached zkVersion [97] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
  [2016-11-04 14:34:13,252] INFO Partition [test32,30] on broker 3: Cached zkVersion [97] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
  [2016-11-04 14:34:14,109] ERROR [ReplicaFetcherThread-0-1], Error for partition [test32,13] to broker 1:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
  [2016-11-04 14:34:14,109] ERROR [ReplicaFetcherThread-0-1], Error for partition [test32,13] to broker 1:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
  [2016-11-04 14:34:14,109] ERROR [ReplicaFetcherThread-0-1], Error for partition [__consumer_offsets,47] to broker 1:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
  [2016-11-04 14:34:14,109] ERROR [ReplicaFetcherThread-0-1], Error for partition [__consumer_offsets,47] to broker 1:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

5. ZooKeeper is in a stable state as far as I can tell: I see no errors in the logs, ruok returns imok on all 3 nodes, and the zxid is equal on all of them.

6. The producer and consumer are still running fine! There were occasional errors overnight like:

  ...
  [2016-11-04 02:43:04,620] WARN Got error produce response with correlation id 33230 on topic-partition test32-5, retrying (1 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
  [2016-11-04 02:43:04,620] ERROR Error when sending message to topic test32 with key: null, value: 28 bytes with error: The server disconnected before a response was received. (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
  [2016-11-04 02:43:04,620] WARN Got error produce response with correlation id 33229 on topic-partition test32-20, retrying (0 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
  [2016-11-04 02:43:04,620] WARN Got error produce response with correlation id 33229 on topic-partition test32-14, retrying (2 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
  ...
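If it helps with the analysis, the state can also be cross-checked from the ZooKeeper side, roughly like this (sketch commands against my test cluster, output omitted; I am assuming the 0.9 tools accept these options and that zookeeper-shell.sh takes commands on stdin, which I believe they do):

  # which brokers are registered, and which broker zookeeper thinks is the controller
  echo "ls /brokers/ids" | ./zookeeper-shell.sh 192.168.104.3:17001
  echo "get /controller" | ./zookeeper-shell.sh 192.168.104.3:17001

  # partitions whose ISR is smaller than the replica set
  ./kafka-topics.sh --zookeeper 192.168.104.3:17001 --describe --under-replicated-partitions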
But if the producer is directed only to the "broken" kafka3 (--broker-list 192.168.104.8:17111), it gets warnings like these:

  [kafka_xx@kafka2 /apps/kafka/logs/kafka]$ while true; do date; sleep 1; done | /apps/kafka/kafka/bin/kafka-run-class.sh kafka.tools.ConsoleProducer --broker-list 192.168.104.8:17111 --topic test32 --request-required-acks -1
  [2016-11-04 14:24:28,708] WARN Got error produce response with correlation id 2 on topic-partition test32-19, retrying (2 attempts left). Error: NOT_LEADER_FOR_PARTITION (org.apache.kafka.clients.producer.internals.Sender)
  [2016-11-04 14:24:30,204] WARN Got error produce response with correlation id 1 on topic-partition test32-27, retrying (2 attempts left). Error: REQUEST_TIMED_OUT (org.apache.kafka.clients.producer.internals.Sender)
  [2016-11-04 14:24:30,204] WARN Got error produce response with correlation id 1 on topic-partition test32-12, retrying (2 attempts left). Error: REQUEST_TIMED_OUT (org.apache.kafka.clients.producer.internals.Sender)
  [2016-11-04 14:24:30,204] WARN Got error produce response with correlation id 1 on topic-partition test32-28, retrying (2 attempts left). Error: REQUEST_TIMED_OUT (org.apache.kafka.clients.producer.internals.Sender)
  [2016-11-04 14:24:30,204] WARN Got error produce response with correlation id 1 on topic-partition test32-18, retrying (2 attempts left). Error: REQUEST_TIMED_OUT (org.apache.kafka.clients.producer.internals.Sender)
  [2016-11-04 14:24:30,204] WARN Got error produce response with correlation id 1 on topic-partition test32-3, retrying (2 attempts left). Error: REQUEST_TIMED_OUT (org.apache.kafka.clients.producer.internals.Sender)
  [2016-11-04 14:24:30,204] WARN Got error produce response with correlation id 1 on topic-partition test32-9, retrying (2 attempts left). Error: REQUEST_TIMED_OUT (org.apache.kafka.clients.producer.internals.Sender)

Still, the messages are being produced and can be consumed by the console consumer.

So I am a bit puzzled by this state. Maybe I have reproduced the original issue this way, but the Java producer in our application behaved differently and had cached metadata, so messages still got routed to the "sick" Kafka node.

Any help would be very much appreciated!

Thanks,
Thomas