Hello,

I am looking for guidance on an issue we are having with our KStream clusters running version 3.1.1. We observe that some consumers of the topology's input topic get into a paused state and never resume consuming.
The effect is clearly visible as consumer lag rising on a few affected partitions of the topology's input topic. It requires manual intervention: we replace a random KStream cluster host, which triggers a full rebalance. After that, KS tasks are migrated to other nodes, the cluster is operational again, and consumption continues normally.

We have enabled debug logs on the client side and found this:

{"level":"DEBUG","logger":"org.apache.kafka.clients.consumer.internals.Fetcher","thread":"AAAA-StreamThread-1","message":"[Consumer instanceId=ABCDEF-1, clientId=AAAA-StreamThread-1-consumer, groupId=AAAA] Skipping fetching records for assigned partition Input.topic-28 because it is paused"}

We suspect that broker throttling may be behind this. The brokers (version 2.8.1) are set up to limit fetch operations (each consumer client id may fetch at most X Mb per sec), and some KS consumers turn into the paused state and never get back to advancing the stream after being throttled. It occurs mostly (nearly always) when we do a full redeployment of a cluster, where all state is retrieved from internal Kafka topics and we sometimes get throttled by the brokers as we reach the quota. Redeployment may take hours in these cases.

Another important detail is that the input topic uses Kafka transactions for publishing records. The publisher is version 2.4.1.

Is there some config on the client side we can enable to reduce the paused state?
Is this a bug in KStreams?
What logging should we enable to track this on the client or broker side?

Any guidance appreciated.

Thank you,
Peter Cipov