Hi,

I have an application running as 6 instances on Kubernetes. All 6 instances (pods) are identical and use the same consumer group id. Recently we have seen that when the application is restarted (rolling restart on K8s), the triggered rebalance sometimes gets stuck and the Kafka client never leaves the rebalancing state. Occasionally it finishes after 30-60 minutes; sometimes it does not finish at all.
When it is stuck, we can recover by stopping the application, waiting until kafka-consumer-groups.sh no longer shows the group, and then starting the application again. The initial rebalance then finishes just fine and all is good... until, some hours or days later, a rolling restart triggers the whole thing again.

I grabbed some logs from a period when the group was continuously rebalancing. The logs are mixed from the 6 pods, but all pods log the same messages. (The Kafka brokers appear to be running on localhost, but that is not the case; traffic is routed through a service mesh.)

2021-02-05T17:00:18.261422532Z: fin-df8d589bd-95bsz: INFO: Camel (camel-1) thread #2 - KafkaConsumer[topicX]: org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer clientId=consumer-fin-3, groupId=fin] Group coordinator localhost:9204 (id: 2147482641 rack: null) is unavailable or invalid
2021-02-05T17:00:18.261454952Z: fin-df8d589bd-95bsz: INFO: Camel (camel-1) thread #2 - KafkaConsumer[topicX]: org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer clientId=consumer-fin-3, groupId=fin] Rebalance failed.: org.apache.kafka.common.errors.DisconnectException: null
2021-02-05T17:00:18.499108799Z: fin-df8d589bd-85zf9: INFO: Camel (camel-1) thread #42 - KafkaConsumer[topicY]: org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer clientId=consumer-fin-43, groupId=fin] Discovered group coordinator localhost:9204 (id: 2147482641 rack: null)
2021-02-05T17:00:18.499300612Z: fin-df8d589bd-85zf9: INFO: Camel (camel-1) thread #42 - KafkaConsumer[topicY]: org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer clientId=consumer-fin-43, groupId=fin] (Re-)joining group

After this there are no more logs from the Kafka consumer. It seems that the rebalance does not finish at all: I don't see logs in any of the pods about partition assignments being calculated, so my _guess_ is that the rebalance gets stuck in the PreparingRebalance phase and never progresses from there.
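For anyone who wants to repeat the "no pod got past (Re-)joining group" check on the mixed pod logs, here is a minimal sketch in Python. It assumes the line format quoted above (timestamp, pod name, level, thread, logger, then the "[Consumer clientId=..., groupId=...]" prefix) and that the input is already in chronological order. The function and variable names are hypothetical, not part of any Kafka tooling.

```python
import re

# Assumed layout, taken from the log lines quoted above:
# <timestamp>: <pod>: <LEVEL>: <thread>: <logger>: [Consumer clientId=X, groupId=Y] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+): (?P<pod>[^:]+): (?P<level>[A-Z]+): .*?"
    r"\[Consumer clientId=(?P<client>[^,]+), groupId=(?P<group>[^\]]+)\] (?P<msg>.*)$"
)


def last_event_per_consumer(lines):
    """Map (pod, clientId) to the (timestamp, message) of its last log line.

    Lines that do not match the assumed format are skipped. Input order is
    trusted, so "last" means the last matching line seen for that consumer.
    """
    latest = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        latest[(match["pod"], match["client"])] = (match["ts"], match["msg"])
    return latest


# The four log lines from the first excerpt, unchanged.
SAMPLE_LINES = [
    "2021-02-05T17:00:18.261422532Z: fin-df8d589bd-95bsz: INFO: Camel (camel-1) thread #2 - KafkaConsumer[topicX]: org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer clientId=consumer-fin-3, groupId=fin] Group coordinator localhost:9204 (id: 2147482641 rack: null) is unavailable or invalid",
    "2021-02-05T17:00:18.261454952Z: fin-df8d589bd-95bsz: INFO: Camel (camel-1) thread #2 - KafkaConsumer[topicX]: org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer clientId=consumer-fin-3, groupId=fin] Rebalance failed.: org.apache.kafka.common.errors.DisconnectException: null",
    "2021-02-05T17:00:18.499108799Z: fin-df8d589bd-85zf9: INFO: Camel (camel-1) thread #42 - KafkaConsumer[topicY]: org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer clientId=consumer-fin-43, groupId=fin] Discovered group coordinator localhost:9204 (id: 2147482641 rack: null)",
    "2021-02-05T17:00:18.499300612Z: fin-df8d589bd-85zf9: INFO: Camel (camel-1) thread #42 - KafkaConsumer[topicY]: org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer clientId=consumer-fin-43, groupId=fin] (Re-)joining group",
]

if __name__ == "__main__":
    for (pod, client), (ts, msg) in last_event_per_consumer(SAMPLE_LINES).items():
        print(f"{pod} {client} {ts} {msg}")
```

Run over the full log of an incident, any consumer whose last message is still "(Re-)joining group" or "Rebalance failed." never reported a completed join, which is what the guess about PreparingRebalance relies on.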
--- About 2 minutes 10 seconds later (sometimes the gap here is 1 minute 10 seconds):

2021-02-05T17:02:29.615402388Z: fin-df8d589bd-95bsz: INFO: kafka-coordinator-heartbeat-thread | fin: org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer clientId=consumer-fin-9, groupId=fin] Group coordinator localhost:9204 (id: 2147482641 rack: null) is unavailable or invalid, will attempt rediscovery
2021-02-05T17:02:29.615520075Z: fin-df8d589bd-95bsz: INFO: Camel (camel-1) thread #28 - KafkaConsumer[twcard.plastic.events.finance.reconciliation]: org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer clientId=consumer-fin-29, groupId=fin] Rebalance failed.: org.apache.kafka.common.errors.RebalanceInProgressException: The group is rebalancing, so a rejoin is needed.

--- This last line sometimes shows a different reason for the failed rebalance instead: "Rebalance failed.: org.apache.kafka.common.errors.DisconnectException: null"

2021-02-05T17:02:29.74932507Z: fin-df8d589bd-j8mw6: INFO: Camel (camel-1) thread #2 - KafkaConsumer[topicX]: org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer clientId=consumer-fin-3, groupId=fin] Discovered group coordinator localhost:9204 (id: 2147482641 rack: null)
2021-02-05T17:02:29.749488204Z: fin-df8d589bd-j8mw6: INFO: Camel (camel-1) thread #2 - KafkaConsumer[topicX]: org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer clientId=consumer-fin-3, groupId=fin] (Re-)joining group

... and the same repeats forever.

Kafka Client version: 2.6.x
Kafka Broker version: 2.4.1

What can be the reason for this failing rebalance? I found this bug reported against 2.4.1; is it possible that I am hitting it?

https://issues.apache.org/jira/browse/KAFKA-9752
"Consumer rebalance can be stuck after new member timeout with old JoinGroup version"

Thanks for the help,
Peter