Hi all,

Hoping someone can sanity-check my logic! A cluster I'm working on went into production with some topics poorly configured; the main issue being a replication factor of 1.
To avoid downtime as much as possible, I used the kafka-reassign-partitions.sh tool to add extra replicas to topic partitions. This worked like a charm for the majority of topics, but when I got to our highest-throughput one, the asynchronous reassignment got stuck and never completed, and I caused a minor live issue: the lag on some of our consumer groups shot through the roof, meaning data was no longer real-time. I backed some of the changes out and went back to the drawing board.

After more reading, I learned about monitoring ISR shrinks/expands, and that settings like num.replica.fetchers probably needed tuning since replication was not keeping up. A line of documentation, "A message is committed only after it has been successfully copied to all the in-sync replicas", led me to conclude that consumer lag had increased because of this delay in replication (consumers only see committed messages, so slow replication delays what they can read).

My plan was to ratchet up num.replica.fetchers until I saw ISR shrinks/expands diminish. In return I expected some extra CPU/network/disk I/O on the brokers, but for consumer lag to decrease. Then I would go back to increasing the replication factor on any remaining topics.

The first part went OK: after increasing the fetcher threads from 1 to 3, I saw shrinks/expands *decrease*, although not entirely to 0. Contrary to what I expected, though, consumer lag *increased* for some of our apps. I couldn't see any resource bottleneck on the hosts those apps run on; can anyone suggest where else there could be resource contention within Kafka itself?

Many thanks in advance,
Marcus
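
P.S. For context, the two changes looked roughly like the sketch below. The topic name and broker IDs are placeholders, and the exact flags differ a little between Kafka versions (newer releases take --bootstrap-server where older ones took --zookeeper); the fetcher change was just the single broker setting shown underneath, applied via a rolling restart.

  # reassign.json (placeholder topic / broker IDs, just to show the shape)
  {
    "version": 1,
    "partitions": [
      {"topic": "my-busy-topic", "partition": 0, "replicas": [1, 2, 3]},
      {"topic": "my-busy-topic", "partition": 1, "replicas": [2, 3, 1]}
    ]
  }

  # kick off the reassignment, then re-run with --verify to check completion
  bin/kafka-reassign-partitions.sh --bootstrap-server broker1:9092 \
    --reassignment-json-file reassign.json --execute
  bin/kafka-reassign-partitions.sh --bootstrap-server broker1:9092 \
    --reassignment-json-file reassign.json --verify

  # server.properties on each broker (default is 1)
  num.replica.fetchers=3

The shrink/expand counts I've been watching are the broker-side ReplicaManager JMX metrics (IsrShrinksPerSec / IsrExpandsPerSec).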