Hi,

I have an application running as 6 instances on Kubernetes. All 6
instances (pods) are identical and use the same consumer group id.
Recently we have seen that when the application is restarted (rolling
restart on K8s), the triggered rebalance sometimes doesn't finish at all
and the Kafka client gets stuck rebalancing. Occasionally it finishes after
30-60 minutes; sometimes it never does.

When it is stuck, if we stop the application, wait until
kafka-consumer-groups.sh no longer shows the group, and then start the
application again, the initial rebalance finishes just fine and all is
good... until, some hours or days later, a rolling restart triggers the
problem again.

I grabbed some logs from a period when it was continuously rebalancing.
The logs are mixed from the 6 pods, but all pods log the same messages.
(The Kafka brokers appear to be running on localhost, but that's not the
case; traffic is routed through a service mesh...)

2021-02-05T17:00:18.261422532Z:  fin-df8d589bd-95bsz: INFO: Camel (camel-1)
thread #2 - KafkaConsumer[topicX]:
org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
clientId=consumer-fin-3, groupId=fin] Group coordinator localhost:9204 (id:
2147482641 rack: null) is unavailable or invalid
2021-02-05T17:00:18.261454952Z:  fin-df8d589bd-95bsz: INFO: Camel (camel-1)
thread #2 - KafkaConsumer[topicX]:
org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
clientId=consumer-fin-3, groupId=fin] Rebalance failed.:
org.apache.kafka.common.errors.DisconnectException: null

2021-02-05T17:00:18.499108799Z:  fin-df8d589bd-85zf9: INFO: Camel (camel-1)
thread #42 - KafkaConsumer[topicY]:
org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
clientId=consumer-fin-43, groupId=fin] Discovered group coordinator
localhost:9204 (id: 2147482641 rack: null)
2021-02-05T17:00:18.499300612Z:  fin-df8d589bd-85zf9: INFO: Camel (camel-1)
thread #42 - KafkaConsumer[topicY]:
org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
clientId=consumer-fin-43, groupId=fin] (Re-)joining group

After this there are no more logs from the Kafka consumer. It seems that
the rebalance doesn't finish at all: I don't see logs in any of the pods
about partition assignments being calculated, so my _guess_ is that the
rebalance gets stuck in the PreparingRebalance phase and never progresses
from there.

--- About 2 minutes 10 seconds later (sometimes the gap here is only
1 minute 10 seconds).

2021-02-05T17:02:29.615402388Z:  fin-df8d589bd-95bsz: INFO:
kafka-coordinator-heartbeat-thread | fin:
org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
clientId=consumer-fin-9, groupId=fin] Group coordinator localhost:9204 (id:
2147482641 rack: null) is unavailable or invalid, will attempt rediscovery
2021-02-05T17:02:29.615520075Z:  fin-df8d589bd-95bsz: INFO: Camel (camel-1)
thread #28 - KafkaConsumer[twcard.plastic.events.finance.reconciliation]:
org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
clientId=consumer-fin-29, groupId=fin] Rebalance failed.:
org.apache.kafka.common.errors.RebalanceInProgressException: The group is
rebalancing, so a rejoin is needed.

--- This last line sometimes shows a different reason for the failed
rebalance: "Rebalance failed.:
org.apache.kafka.common.errors.DisconnectException: null"

2021-02-05T17:02:29.74932507Z:  fin-df8d589bd-j8mw6: INFO: Camel (camel-1)
thread #2 - KafkaConsumer[topicX]:
org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
clientId=consumer-fin-3, groupId=fin] Discovered group coordinator
localhost:9204 (id: 2147482641 rack: null)
2021-02-05T17:02:29.749488204Z:  fin-df8d589bd-j8mw6: INFO: Camel (camel-1)
thread #2 - KafkaConsumer[topicX]:
org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
clientId=consumer-fin-3, groupId=fin] (Re-)joining group

... and the same repeats forever.
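
To measure the gap between a "(Re-)joining group" line and the next
coordinator failure exactly, the timestamps can be subtracted with a few
lines of Python. This is a minimal sketch: parse_ts and gap_seconds are
hypothetical helper names, and the fractional part is truncated to
microseconds because the log timestamps carry up to nine digits.

```python
from datetime import datetime, timezone


def parse_ts(stamp):
    """Parse e.g. '2021-02-05T17:00:18.499300612Z' into an aware datetime.

    The fraction may have any number of digits; it is padded or truncated
    to six (microseconds).
    """
    whole, _, frac = stamp.rstrip("Z").partition(".")
    micros = int((frac + "000000")[:6])
    base = datetime.strptime(whole, "%Y-%m-%dT%H:%M:%S")
    return base.replace(microsecond=micros, tzinfo=timezone.utc)


def gap_seconds(earlier, later):
    """Seconds elapsed between two log timestamps."""
    return (parse_ts(later) - parse_ts(earlier)).total_seconds()
```

For the two timestamps quoted in this mail,
gap_seconds("2021-02-05T17:00:18.499300612Z", "2021-02-05T17:02:29.615402388Z")
comes out at roughly 131 seconds, which matches the "about 2 minutes 10
seconds" above.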

Kafka Client version: 2.6.x
Kafka Broker version: 2.4.1


What could be the reason for this failing rebalance?

I found this bug, which affects 2.4.1. Is it possible that I am hitting it?
https://issues.apache.org/jira/browse/KAFKA-9752
"Consumer rebalance can be stuck after new member timeout with old
JoinGroup version"


Thanks for the help,
Peter
