Hello,

I am looking for guidance on an issue we are having with our KStream
clusters (version 3.1.1). We observe that some consumers of the topology's
input topic get into a paused state and never resume consuming.

The effect is clearly visible as consumer lag rising on a few affected
partitions of the topology's input topic. Recovery requires manual
intervention: we replace an arbitrary KStream cluster host, which triggers
a full rebalance. After that, the KS tasks are migrated to other nodes and
consumption continues normally.

We enabled debug logging on the client side and found this:
{"level":"DEBUG","logger":"org.apache.kafka.clients.consumer.internals.Fetcher","thread":"AAAA-StreamThread-1","message":"[Consumer
instanceId=ABCDEF-1, clientId=AAAA-StreamThread-1-consumer, groupId=AAAA]
Skipping fetching records for assigned partition Input.topic-28 because it
is paused"}
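
To see which partitions are affected, messages like the one above can be
tallied per partition. Below is a minimal sketch in plain Java, assuming the
message text has already been extracted from the JSON log lines. The class
name PausedPartitionTally and the sample partition Input.topic-3 are
hypothetical, used only for illustration; they are not part of Kafka or of
our setup.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Kafka: counts "partition is paused" debug
// messages per topic-partition so the affected partitions stand out.
final class PausedPartitionTally {

    // Matches the Fetcher debug message quoted above. \s+ tolerates line
    // wrapping introduced by mail clients or log shippers.
    private static final Pattern PAUSED = Pattern.compile(
            "Skipping fetching records for assigned partition\\s+(\\S+)"
                    + "\\s+because\\s+it\\s+is\\s+paused");

    // Returns a sorted map of topic-partition name to number of occurrences.
    static Map<String, Integer> tally(List<String> logMessages) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String message : logMessages) {
            Matcher m = PAUSED.matcher(message);
            while (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample input; Input.topic-3 is an invented partition for the demo.
        List<String> sample = List.of(
                "Skipping fetching records for assigned partition Input.topic-28 because it is paused",
                "Skipping fetching records for assigned partition Input.topic-28 because it is paused",
                "Skipping fetching records for assigned partition Input.topic-3 because it is paused",
                "Some unrelated message");
        System.out.println(tally(sample));
    }
}
```

A partition that keeps showing up in this tally while its lag grows is a
candidate for the stuck state. As far as we understand, the same message
also appears briefly during normal operation, so the count alone does not
prove that a partition is stuck.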

We suspect that broker throttling may be behind this. The brokers
(version 2.8.1) are set up to limit fetch throughput (each consumer client
id may fetch at most X MB per second), and some KS consumers go into the
paused state and never resume advancing the stream after being throttled.
It occurs nearly always when we do a full redeployment of a cluster, where
all state is restored from internal Kafka topics and we are sometimes
throttled by the brokers as we reach the quota. Redeployment may take
hours in these cases.

The other important detail is that the input topic is written using Kafka
transactions. The publisher is version 2.4.1.

Is there some config on the client side we can enable to prevent consumers
from getting stuck in the paused state? Is this a bug in KStreams? What
logging should we enable to track this on the client or broker side? Any
guidance is appreciated.

Thank you
Peter Cipov
