Hello! Basically, I don't think we can simply conclude that consumer lag depends on the number of replica fetcher threads. Maybe the first thing to double-check is the lag itself, using the kafka-consumer-groups CLI instead of a lag exporter (in case you are using that kind of monitoring for consumer lag). Next, I'd check bytes_in/bytes_out per topic, in case there is a huge imbalance between produce and consume. Third, check the number and rate of rebalances for the consumer groups with high lag.
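For example, something along these lines (the broker address, group, topic, and client-id are just placeholders for your environment, and the rebalance metrics are only exposed by reasonably recent clients):

  # Lag straight from the broker, bypassing any exporter
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <your-group>

  # Per-topic produce/consume rates via broker JMX
  kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=<topic>
  kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=<topic>

  # Rebalance counts/rates via consumer-side JMX (newer clients)
  kafka.consumer:type=consumer-coordinator-metrics,client-id=<client-id>
    -> rebalance-rate-per-hour, rebalance-total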

For me, these steps help clarify the nature of the issue: whether it's a monitoring issue (some of your exporters are lying to you), a client issue (disproportionately high produce load or constant rebalancing of consumer groups), or a cluster performance issue.

On 6/2/21 5:25 PM, Marcus Horsley-Rai wrote:
Hi all,

Hoping someone can sanity check my logic!
A cluster I'm working on went into production with some topics poorly
configured; a replication factor of 1 being the main issue.

To avoid downtime as much as possible, I used the
kafka-reassign-partitions.sh tool to add extra replicas to topic partitions.
This worked like a charm for the majority of topics; except when I got to
our highest throughput one.
The async execution of the reassignment got stuck in a never-ending loop, and
I caused a slight live issue in that some of our consumer groups' lag shot
through the roof, meaning data was no longer real-time.
I backed some of the changes out and went back to the drawing board.

After more reading, I learned about monitoring ISR shrinks/expands, and
that settings like num.replica.fetchers probably needed tuning since
replication was not keeping up.
A line of documentation "A message is committed only after it has been
successfully copied to all the in-sync replicas" led me to conclude that
consumer lag had increased because of this delay in replication.

I planned to ratchet up the num.replica.fetchers until I saw ISR
shrinks/expands diminish.  In return I expected some extra CPU/Network/Disk
I/O on the brokers, but for consumer lag to decrease. Then I would go back
to increasing the RF on any remaining topics.
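(For context, num.replica.fetchers is a broker-level setting with a default of 1; roughly speaking the change looks like this - the broker address is just a placeholder, and the kafka-configs.sh route only applies if the brokers are new enough to support dynamic broker configs:)

  # Static: in server.properties on each broker, followed by a rolling restart
  num.replica.fetchers=3

  # Or dynamically, cluster-wide, without a restart
  kafka-configs.sh --bootstrap-server localhost:9092 --entity-type brokers \
    --entity-default --alter --add-config num.replica.fetchers=3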

The first part went OK - increasing fetcher threads from 1 to 3; I saw
Shrinks/Expands *decrease*, although not entirely to 0.
Contrary to what I expected though, the consumer lag *increased* for some
of our apps.
I couldn't see any resource bottleneck on the hosts the apps run on; can
anyone suggest whether there could be resource contention elsewhere, in Kafka
itself?

Many thanks in advance,

Marcus
