Hello All,

We are running Kafka 2.1.1 in production with 3 brokers. We have noticed that when a broker has been stopped for more than 10 minutes and is then started again, we see a performance degradation of around 90% for up to 4 minutes after start-up. During this period of around 4 minutes, CPU usage drops from 22% to 2% on all of the brokers. In addition, the broker that has just been started has a network-out of 7 MB/min and a network-in of 2.2 GB/min, whereas each of the other brokers has a network-out of 1.1 GB/min and a network-in of 55 MB/min. We assume this is because the broker that was stopped for more than 10 minutes must catch up on the messages that were processed while it was down. The performance degradation persists until all 3 brokers are in sync (we have min.insync.replicas=2 and a replication factor of 3).
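As a rough sanity check of this assumption, here is a back-of-envelope estimate of how long the catch-up should take. It is only a sketch: it uses the approximate traffic figures from this post (~5k messages/sec, ~3 KB average size), assumes "3kb" means kilobytes on disk (no compression), assumes the restarted broker holds a replica of every partition (replication factor 3 on 3 brokers), and ignores protocol overhead. All variable names are illustrative.

```python
# Back-of-envelope catch-up estimate. All inputs are approximate figures
# taken from this post; none of them are measured by this script.
MSGS_PER_SEC = 5_000                    # ~5k messages/sec
AVG_MSG_BYTES = 3_000                   # assumption: "3kb" means ~3 kilobytes
DOWNTIME_SEC = 10 * 60                  # broker stopped for ~10 minutes
NETWORK_IN_BYTES_PER_SEC = 2.2e9 / 60   # 2.2 GB/min observed on the restarted broker
DEGRADATION = 0.90                      # ~90% degradation during catch-up

# Data produced per second under normal load.
normal_ingest = MSGS_PER_SEC * AVG_MSG_BYTES

# Data the restarted broker missed while it was down
# (assumes it replicates every partition).
backlog_bytes = normal_ingest * DOWNTIME_SEC
backlog_gb = backlog_bytes / 1e9

# New data still arriving during catch-up, at the degraded rate.
live_ingest = normal_ingest * (1 - DEGRADATION)

# Network-in left over for working through the backlog.
catch_up_rate = NETWORK_IN_BYTES_PER_SEC - live_ingest
catch_up_minutes = backlog_bytes / catch_up_rate / 60

print(f"backlog: {backlog_gb:.1f} GB")
print(f"estimated catch-up time: {catch_up_minutes:.1f} min")
```

Under these assumptions the backlog is about 9 GB and the estimated catch-up time comes out at a little over 4 minutes, which is consistent with the duration of the degradation we observe.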
It is worth mentioning that we handle ~5k messages/sec with an average size of ~3 KB. We also tried increasing the number of brokers to 5 (with partitions rebalanced) and running with the production configuration from https://kafka.apache.org/081/documentation.html#prodconfig, and we still see ~35% performance degradation.

Thanks in advance.

Best regards,
Miroslav