Hello,
I'm doing some tests with rolling restarts in a Kafka cluster and I have a
couple of questions related to the impact of rolling restarts on Kafka
consumers/producers and on the overall process.
First, some context on my setup:
- Kafka cluster with 3 nodes.
- Topic replication factor of 3 with min.insync.replicas (minISR) of 2.
- All topics have a single partition (I intend to increase the
partition count in the future, but for now it's just 1 for testing
purposes).
- Kafka version is 3.2.3.
- I have two systems that communicate via these Kafka topics. The
high-level flow is:
1. System A sends a message to a Kafka topic (at a rate of ~10
events/sec).
2. System B consumes the message.
3. System B sends a reply to a Kafka topic.
4. System A consumes the reply.
- When the system is stable, I see end-to-end latencies (measured on
System A) of around 10 ms at the 99th percentile.
- System A is using Kafka client 3.3.1, and System B is using Kafka
client 3.4.0.
- Kafka consumers and producers on both systems use the default
configuration, except that the Kafka consumers have auto-commit disabled.
- All Kafka brokers are configured with controlled.shutdown.enable set
to true.
- The Kafka cluster is running in Kubernetes and deployed using Strimzi
(this is just for awareness).
- The rolling restart process is the following (both when Strimzi
manages it and when we do it manually):
1. Restart each broker, one at a time, by sending a SIGTERM to the
broker process. The controller broker is the last one to be restarted.
2. Only restart the next broker when the current broker reports the
broker state as RUNNING. Note: when we do this manually (without
Strimzi), we wait to see the end-to-end latencies stabilize before
moving to the next broker.
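For clarity, the procedure above corresponds roughly to the following Python sketch. The `restart` and `is_running` callbacks are hypothetical placeholders (for example, sending SIGTERM and polling the broker state); they are not Strimzi or Kafka APIs, and the timeout values are arbitrary assumptions.

```python
import time
from typing import Callable, List


def restart_order(broker_ids: List[int], controller_id: int) -> List[int]:
    """Return the broker ids with the controller broker moved to the end."""
    others = [b for b in broker_ids if b != controller_id]
    controller = [b for b in broker_ids if b == controller_id]
    return others + controller


def rolling_restart(
    broker_ids: List[int],
    controller_id: int,
    restart: Callable[[int], None],     # hypothetical: e.g. send SIGTERM to the broker
    is_running: Callable[[int], bool],  # hypothetical: broker state is RUNNING
    poll_seconds: float = 1.0,          # assumed polling interval
    timeout_seconds: float = 600.0,     # assumed upper bound per broker
) -> List[int]:
    """Restart brokers one at a time, waiting for each one to report RUNNING."""
    restarted = []
    for broker in restart_order(broker_ids, controller_id):
        restart(broker)
        deadline = time.monotonic() + timeout_seconds
        # Only move on once the current broker reports RUNNING.
        while not is_running(broker):
            if time.monotonic() > deadline:
                raise TimeoutError(f"broker {broker} did not reach RUNNING")
            time.sleep(poll_seconds)
        restarted.append(broker)
    return restarted
```

In the manual variant, `is_running` would be replaced by a check that the end-to-end latencies have stabilized.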
Now, my questions:
1. When we do this process with Strimzi (which waits for the broker state
to be RUNNING before moving to the next one), we've seen end-to-end
latencies grow to as much as 1-2 minutes (System A is not even able to
send events to the Kafka topic). This is unexpected because AFAIK the
configuration we are using is the one recommended for high availability
during rolling restarts. My question is: is it enough to wait for the
broker state to be RUNNING before moving on to the next broker?
2. When we do this process manually (we wait for end-to-end latencies to
stabilize and only then move to the next broker), we've seen end-to-end
latencies grow to as much as 1 second. While this is much better than what
we see in 1., my question is whether this latency increase is expected.
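One way to make question 1 concrete is to compare the RUNNING check with a stricter gate based on in-sync replicas. The sketch below is plain Python for illustration only, not a Kafka or Strimzi API: `safe_to_stop` is a hypothetical helper, and `snapshot` is a made-up example of replica and ISR assignments (the kind of information that `kafka-topics.sh --describe` reports). It assumes the producers use acks=all, which should be the default for 3.x clients.

```python
from typing import Dict, List


def safe_to_stop(broker_id: int,
                 partitions: Dict[str, Dict[str, List[int]]],
                 min_isr: int) -> bool:
    """True if stopping broker_id leaves every partition with at least
    min_isr in-sync replicas."""
    for info in partitions.values():
        remaining = [b for b in info["isr"] if b != broker_id]
        if len(remaining) < min_isr:
            return False
    return True


# Hypothetical snapshot: broker 1 has just restarted and reports RUNNING,
# but has not yet caught up and rejoined the ISR.
snapshot = {"requests-0": {"replicas": [0, 1, 2], "isr": [0, 2]}}

print(safe_to_stop(2, snapshot, min_isr=2))  # False: only broker 0 would stay in sync
```

With a replication factor of 3 and minISR of 2, stopping broker 2 in this state would leave a single in-sync replica, so acks=all writes would be rejected until another replica catches up. The observations above do not confirm whether this is what happens in the Strimzi-managed runs.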
Thanks in advance,
Luís Alves