[ https://issues.apache.org/jira/browse/KAFKA-19558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhangtongr updated KAFKA-19558:
-------------------------------
    Component/s: metrics
                 offset manager
                     (was: consumer)

> kafka-consumer-groups.sh --describe --all-groups command times out on large clusters with many consumer groups
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-19558
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19558
>             Project: Kafka
>          Issue Type: Bug
>          Components: metrics, offset manager
>    Affects Versions: 2.7.1
>            Reporter: zhangtongr
>            Priority: Blocker
>
> Description:
> When running the following command in a Kafka cluster with a large number of
> consumer groups (over 380) and topics (over 500), the
> kafka-consumer-groups.sh --describe --all-groups operation consistently times
> out and fails to return results.
>
> Command used:
> ./kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --all-groups
>
> Observed behavior:
> The command fails with a TimeoutException, and no consumer group information
> is returned. The following stack trace is observed:
>
> java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeConsumerGroups, deadlineMs=1753170317381, tries=1, nextAllowedTryMs=1753170317482) timed out at 1753170317382 after 1 attempt(s)
>     at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
>     at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
>     ...
> Caused by: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeConsumerGroups, deadlineMs=..., tries=1, ...) timed out
> Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: describeConsumerGroups
>
> Expected behavior:
> The command should be able to return the description of all consumer groups,
> or at least fail more gracefully. Ideally, there should be:
> * a way to paginate or batch the describe operation;
> * or configuration options to increase internal timeout thresholds;
> * or better recommendations for dealing with large clusters.
>
> Additional context:
> Manually describing individual consumer groups via --group performs as
> expected and returns data quickly.
> The issue appears to scale linearly with the number of consumer groups.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
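The report notes that describing a single group with --group returns quickly, which suggests a client-side stopgap: list the group IDs and describe them one at a time. The sketch below is a hypothetical workaround, not a fix for the underlying issue. `KAFKA_CG`, `BOOTSTRAP`, `TIMEOUT_MS` and `describe_groups_one_by_one` are illustrative names. The 30000 ms value is an assumed per-group bound, passed through the tool's existing `--timeout` option (default 5000 ms).

```bash
#!/usr/bin/env bash
# Hypothetical workaround sketch: describe each consumer group separately
# instead of in one --all-groups call. Variable names and defaults below are
# illustrative assumptions, not part of the original report.
set -euo pipefail

KAFKA_CG="${KAFKA_CG:-./kafka-consumer-groups.sh}"  # path to the Kafka CLI tool
BOOTSTRAP="${BOOTSTRAP:-localhost:9092}"            # broker from the report's command
TIMEOUT_MS="${TIMEOUT_MS:-30000}"                   # assumed per-group --timeout (ms)

describe_groups_one_by_one() {
  local groups group failed=0

  # --list prints one group ID per line; give up if even listing fails.
  groups="$("$KAFKA_CG" --bootstrap-server "$BOOTSTRAP" --list)" || {
    echo "failed to list consumer groups" >&2
    return 1
  }

  while IFS= read -r group; do
    [ -n "$group" ] || continue  # skip blank lines
    # One describe call per group, so each group gets its own deadline.
    if ! "$KAFKA_CG" --bootstrap-server "$BOOTSTRAP" --describe \
        --group "$group" --timeout "$TIMEOUT_MS"; then
      echo "failed to describe group: $group" >&2
      failed=$((failed + 1))
    fi
  done <<< "$groups"

  # Succeed only if every group was described.
  [ "$failed" -eq 0 ]
}
```

To use it, source the file and call `describe_groups_one_by_one`, redirecting stdout to collect the output. Because each group gets its own describe call and deadline, a slow group fails only its own describe instead of aborting the whole run. The cost is one JVM start per group, which for roughly 380 groups is slow but predictable. Passing a larger `--timeout` to the original `--all-groups` invocation may also be worth trying. Whether that is enough depends on why the call could not be sent before its deadline.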