Hi all,

Hoping someone can sanity check my logic!
A cluster I'm working on went into production with some topics poorly
configured, the main issue being a replication factor of 1.

To avoid downtime as much as possible, I used the
kafka-reassign-partitions.sh tool to add extra replicas to the topic
partitions. This worked like a charm for the majority of topics, except
when I got to our highest-throughput one.
The asynchronous reassignment got stuck and never completed, and I caused
a minor live issue: the lag of some of our consumer groups shot through
the roof, meaning data was no longer real-time.
I backed some of the changes out and went back to the drawing board.
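
For reference, the reassignment looked roughly like this (I've included a
--throttle here for completeness; the topic name, partition/broker ids and
values are placeholders, and older Kafka versions take --zookeeper instead
of --bootstrap-server):

    # reassign.json: add a second replica on broker 2 for each partition
    {"version":1,"partitions":[
      {"topic":"my-topic","partition":0,"replicas":[1,2]},
      {"topic":"my-topic","partition":1,"replicas":[2,1]}
    ]}

    # kick off the reassignment, throttling replication traffic (bytes/sec)
    bin/kafka-reassign-partitions.sh --bootstrap-server broker1:9092 \
      --reassignment-json-file reassign.json --throttle 50000000 --execute

    # poll until it reports complete; this also clears the throttle
    bin/kafka-reassign-partitions.sh --bootstrap-server broker1:9092 \
      --reassignment-json-file reassign.json --verify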

After more reading, I learned about monitoring ISR shrinks/expands, and
that settings like num.replica.fetchers probably needed tuning, since
replication was not keeping up.
A line of documentation, "A message is committed only after it has been
successfully copied to all the in-sync replicas", led me to conclude that
consumer lag had increased because of this delay in replication (as I
understand it, consumers can only read committed messages, i.e. up to the
high watermark).
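
These are the broker metrics I've been watching (a sketch; it assumes JMX
is exposed on port 9999 and the older kafka.tools.JmxTool class name, so
adjust for your setup):

    # rate of ISR shrinks; swap in IsrExpandsPerSec for the expand side
    bin/kafka-run-class.sh kafka.tools.JmxTool \
      --jmx-url service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi \
      --object-name kafka.server:type=ReplicaManager,name=IsrShrinksPerSec \
      --reporting-interval 10000

    # partitions whose ISR is currently smaller than the full replica set
    bin/kafka-topics.sh --bootstrap-server broker1:9092 --describe \
      --under-replicated-partitions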

I planned to ratchet up num.replica.fetchers until I saw the ISR
shrinks/expands diminish. In return I expected some extra CPU/network/disk
I/O on the brokers, but for consumer lag to decrease. Then I would go back
to increasing the RF on any remaining topics.
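
Concretely, the change was just this on each broker, applied with a
rolling restart (the value is illustrative, stepped up gradually from the
default of 1):

    # server.properties: more replica fetcher threads per source broker,
    # at the cost of some extra CPU/network on the brokers
    num.replica.fetchers=3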

The first part went OK: after increasing fetcher threads from 1 to 3, I
saw shrinks/expands *decrease*, although not entirely to zero.
Contrary to what I expected, though, consumer lag *increased* for some of
our apps.
I couldn't see any resource bottleneck on the hosts the apps run on; can
anyone suggest whether there could be resource contention elsewhere, in
Kafka itself?
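
In case it matters, this is how I'm measuring the lag (group name is a
placeholder):

    # per-partition lag = LOG-END-OFFSET minus CURRENT-OFFSET for the group
    bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
      --describe --group my-consumer-group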

Many thanks in advance,

Marcus
