Hi,

Recently, in a 3.8.1 Kafka cluster with 3 brokers, the topic __consumer_offsets 
became leaderless:

$ /kafka-topics.sh --zookeeper <zookeeper_addresses> --describe --under-replicated-partitions
    Topic: __consumer_offsets  Partition: 0  Leader: none  Replicas: 103,101,102  Isr:
    Topic: __consumer_offsets  Partition: 1  Leader: none  Replicas: 101,102,103  Isr:
    Topic: __consumer_offsets  Partition: 2  Leader: none  Replicas: 102,103,101  Isr:
    Topic: __consumer_offsets  Partition: 3  Leader: none  Replicas: 103,102,101  Isr:
    Topic: __consumer_offsets  Partition: 4  Leader: none  Replicas: 101,103,102  Isr:
    Topic: __consumer_offsets  Partition: 5  Leader: none  Replicas: 102,101,103  Isr:
    Topic: __consumer_offsets  Partition: 6  Leader: none  Replicas: 103,101,102  Isr:
    …
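
To help frame the alerting question, here is a minimal Python sketch of a check that scans describe output like the above for partitions without a leader. It is only an illustration: the function name leaderless_partitions is hypothetical, the regular expression assumes the "Topic: ... Partition: ... Leader: ..." layout shown above, and treating "-1" as leaderless is an assumption about how other versions might print it.

```python
import re

# Matches one partition entry of `kafka-topics.sh --describe` output.
# \s+ also matches newlines, so entries wrapped across lines still match.
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+"
    r"Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)"
)


def leaderless_partitions(describe_output):
    """Return (topic, partition) pairs whose leader is reported as missing."""
    found = []
    for match in PARTITION_LINE.finditer(describe_output):
        # "none" is what the output above shows; "-1" is an assumed variant.
        if match.group("leader") in ("none", "-1"):
            found.append((match.group("topic"), int(match.group("partition"))))
    return found


sample = """\
Topic: __consumer_offsets  Partition: 0  Leader: none  Replicas: 103,101,102  Isr:
Topic: __consumer_offsets  Partition: 1  Leader: 101   Replicas: 101,102,103  Isr: 101,102,103
"""
print(leaderless_partitions(sample))  # [('__consumer_offsets', 0)]
```

A non-empty result could be used to raise an alert. This depends on running the CLI periodically rather than on JMX, so it would be a stopgap at best.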

When this happened, consumers were unable to consume, with the following error:

o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-2, groupId=foo] Sending FindCoordinator request to broker <IP:port> (id: 102 rack: <region>)
o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-2, groupId=foo] Received FindCoordinator response ClientResponse(receivedTimeMs=1639436595264, latencyMs=98, disconnected=false, requestHeader=RequestHeader(apiKey=bar, apiVersion=2, clientId=consumer-2, correlationId=117), responseBody=FindCoordinatorResponseData(throttleTimeMs=0, errorCode=15, errorMessage='The coordinator is not available.', nodeId=-1, host='', port=-1))
o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-2, groupId=foo] Group coordinator lookup failed: The coordinator is not available.
o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-2, groupId=foo] Coordinator discovery failed, refreshing metadata
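
The errorCode=15 in the response above is the Kafka protocol error COORDINATOR_NOT_AVAILABLE. As a rough sketch, client logs like these could be scanned for failed coordinator lookups. The function name coordinator_errors and the small error table are illustrative choices, and the pattern assumes the FindCoordinatorResponseData(...) log format shown above.

```python
import re

# Subset of Kafka protocol error codes related to coordinator lookup.
ERROR_NAMES = {
    14: "COORDINATOR_LOAD_IN_PROGRESS",
    15: "COORDINATOR_NOT_AVAILABLE",
    16: "NOT_COORDINATOR",
}

# DOTALL lets the match span log entries that were wrapped across lines.
ERROR_CODE = re.compile(
    r"FindCoordinatorResponseData\(.*?errorCode=(\d+)", re.DOTALL
)


def coordinator_errors(log_text):
    """Return the error name for every non-zero FindCoordinator errorCode."""
    names = []
    for match in ERROR_CODE.finditer(log_text):
        code = int(match.group(1))
        if code != 0:  # 0 means the lookup succeeded
            names.append(ERROR_NAMES.get(code, "UNKNOWN(%d)" % code))
    return names


sample = (
    "responseBody=FindCoordinatorResponseData(throttleTimeMs=0, "
    "errorCode=15, errorMessage='The coordinator is not available.', "
    "nodeId=-1, host='', port=-1))"
)
print(coordinator_errors(sample))  # ['COORDINATOR_NOT_AVAILABLE']
```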

We resolved this by simply restarting all brokers, without much investigation, 
because it was causing an outage. Unfortunately, no broker logs are available. 
During the incident, the JMX metrics 
kafka.controller:type=KafkaController,name=OfflinePartitionsCount and 
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions both reported 0.

I’m trying to figure out:

1. What could have caused this issue?
2. What JMX metrics could we use to get notified of this issue in the future?

Thanks in advance,
Miguel