Hi,

Recently, in a Kafka 3.8.1 cluster with 3 brokers, the topic __consumer_offsets became leaderless:

$ /kafka-topics.sh --zookeeper <zookeeper_addresses> --describe --under-replicated-partitions
Topic: __consumer_offsets Partition: 0 Leader: none Replicas: 103,101,102 Isr:
Topic: __consumer_offsets Partition: 1 Leader: none Replicas: 101,102,103 Isr:
Topic: __consumer_offsets Partition: 2 Leader: none Replicas: 102,103,101 Isr:
Topic: __consumer_offsets Partition: 3 Leader: none Replicas: 103,102,101 Isr:
Topic: __consumer_offsets Partition: 4 Leader: none Replicas: 101,103,102 Isr:
Topic: __consumer_offsets Partition: 5 Leader: none Replicas: 102,101,103 Isr:
Topic: __consumer_offsets Partition: 6 Leader: none Replicas: 103,101,102 Isr:
…

When this happened, consumers were unable to consume and logged the following errors:

o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=consumer-2, groupId=foo] Sending FindCoordinator request to broker <IP:port> (id: 102 rack: <region>)
o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=consumer-2, groupId=foo] Received FindCoordinator response ClientResponse(receivedTimeMs=1639436595264, latencyMs=98, disconnected=false, requestHeader=RequestHeader(apiKey=bar, apiVersion=2, clientId=consumer-2, correlationId=117), responseBody=FindCoordinatorResponseData(throttleTimeMs=0, errorCode=15, errorMessage='The coordinator is not available.', nodeId=-1, host='', port=-1))
o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=consumer-2, groupId=foo] Group coordinator lookup failed: The coordinator is not available.
o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=consumer-2, groupId=foo] Coordinator discovery failed, refreshing metadata

We resolved the issue by restarting all brokers, without much investigation, because it was causing an outage. Unfortunately, no broker logs are available.

During the incident, the JMX metrics kafka.controller:type=KafkaController,name=OfflinePartitionsCount and kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions both reported 0.

I’m trying to figure out:

1. What could have caused this issue?
2. What JMX metrics could we use to get notified of this issue in the future?

Thanks in advance,
Miguel
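
P.S. Until a suitable metric is identified, one possible stopgap might be to scan the describe output for leaderless partitions. Below is a minimal Python sketch, assuming the output format shown above; the function name and the regular expression are illustrative only and are not part of Kafka.

```python
import re

# Matches one partition entry of `kafka-topics.sh --describe` output,
# assuming the "Topic: ... Partition: ... Leader: ..." layout shown above.
PARTITION_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+"
    r"Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)"
)


def leaderless_partitions(describe_output):
    """Return (topic, partition) pairs whose leader is reported as 'none'."""
    found = []
    for match in PARTITION_RE.finditer(describe_output):
        if match.group("leader") == "none":
            found.append((match.group("topic"), int(match.group("partition"))))
    return found


# Hypothetical sample: one leaderless partition and one healthy partition.
sample = (
    "Topic: __consumer_offsets Partition: 0 Leader: none Replicas: 103,101,102 Isr:\n"
    "Topic: __consumer_offsets Partition: 1 Leader: 101 Replicas: 101,102,103 Isr: 101,102\n"
)

print(leaderless_partitions(sample))
```

A non-empty result could be used to raise an alert from a periodic job. This only detects the symptom; it does not explain why the metrics above stayed at 0.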