Hello! Basically, I don't think we can simply conclude that consumer lag depends on the number of replica fetcher threads. Maybe the first thing to double-check is the lag itself, using the kafka-consumer-groups CLI instead of a lag exporter (in case you are using that kind of monitoring for consumer lag). Next, I'd check bytes_in/bytes_out per topic, in case there is a huge imbalance between produce and consume. Third, check the number and rate of rebalances for the consumer groups with high lag.
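For example, something along these lines (the broker address, group, topic, and client-id are just placeholders for your environment, and the rebalance metrics are only exposed by reasonably recent clients):

  # Lag straight from the broker, bypassing any exporter
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <your-group>

  # Per-topic produce/consume rates via broker JMX
  kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=<topic>
  kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=<topic>

  # Rebalance counts/rates via consumer-side JMX (newer clients)
  kafka.consumer:type=consumer-coordinator-metrics,client-id=<client-id>
    -> rebalance-rate-per-hour, rebalance-total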

For me, these steps help clarify the nature of the issue: whether it's a monitoring issue (some of your exporters are lying to you), a client issue (disproportionately high produce load or constant rebalancing of consumer groups), or a cluster performance issue.

On 6/2/21 5:25 PM, Marcus Horsley-Rai wrote:
Hi all,

Hoping someone can sanity check my logic!
A cluster I'm working on went into production with some topics poorly
configured; a replication factor of 1 being the main issue.

To avoid downtime as much as possible, I used the
kafka-reassign-partitions.sh tool to add extra replicas to topic partitions.
This worked like a charm for the majority of topics; except when I got to
our highest throughput one.
The async execution of the reassignment got stuck in a never-ending loop, and
I caused a slight live issue in that some of our consumer groups' lag shot
through the roof, meaning data was no longer real-time.
I backed some of the changes out and went back to the drawing board.

After more reading, I learned about monitoring ISR shrinks/expands, and
that settings like num.replica.fetchers probably needed tuning since
replication was not keeping up.
A line of documentation "A message is committed only after it has been
successfully copied to all the in-sync replicas" led me to conclude that
consumer lag had increased because of this delay in replication.

I planned to ratchet up the num.replica.fetchers until I saw ISR
shrinks/expands diminish.  In return I expected some extra CPU/Network/Disk
I/O on the brokers, but for consumer lag to decrease. Then I would go back
to increasing the RF on any remaining topics.
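(For context, num.replica.fetchers is a broker-level setting with a default of 1; roughly speaking the change looks like this - the broker address is just a placeholder, and the kafka-configs.sh route only applies if the brokers are new enough to support dynamic broker configs:)

  # Static: in server.properties on each broker, followed by a rolling restart
  num.replica.fetchers=3

  # Or dynamically, cluster-wide, without a restart
  kafka-configs.sh --bootstrap-server localhost:9092 --entity-type brokers \
    --entity-default --alter --add-config num.replica.fetchers=3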

The first part went OK - increasing fetcher threads from 1 to 3; I saw
Shrinks/Expands *decrease*, although not entirely to 0.
Contrary to what I expected though, the consumer lag *increased* for some
of our apps.
I couldn't see any resource bottleneck on the hosts the apps run on; can
anyone suggest whether there could be resource contention elsewhere, in Kafka
itself?

Many thanks in advance,

Marcus
