In production-grade clusters, the downtime of a single broker shouldn't prevent consumers from reading messages and catching up on offsets, because the remaining in-sync replicas keep serving the partitions. What replication factor are you using?
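For reference, a minimal KafkaTopic sketch with replication factor 3 (topic name, namespace, and cluster label are placeholders for your setup):

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-topic                     # hypothetical topic name
  namespace: kafka                   # hypothetical namespace
  labels:
    strimzi.io/cluster: my-cluster   # must match your Kafka CR name
spec:
  partitions: 3
  replicas: 3                        # each partition kept on 3 brokers
  config:
    retention.ms: 86400000           # 24h, as in your scenario
    min.insync.replicas: 2           # with acks=all, writes survive one broker outage

With replicas: 3, consumers can keep reading from the remaining replicas while one broker is down, so they can catch up within the retention window instead of waiting for the failed broker to come back.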
On Wed, Nov 12, 2025 at 10:44 AM Prateek Kohli <[email protected]> wrote:
> Hi,
>
> I am looking for a reliable, production-safe strategy to avoid losing
> unread messages when a Kafka broker remains down longer than the topic's
> configured retention.ms.
>
> Since Kafka deletes segments purely based on timestamps, if a broker is
> down for (for example) 24 hours and the topic's retention.ms is also 24
> hours, the broker may start deleting segments immediately on startup, even
> if no consumers have read those messages yet.
>
> Is there a recommended way to prevent message loss in this scenario?
>
> I am running Kafka on Kubernetes using Strimzi, so all topic
> configurations are managed through KafkaTopic CRDs and the Topic Operator.
>
> One solution could be to alter the topic's retention configuration. But
> for that to work I would need to ensure that it's triggered before Kafka
> deletes the log segments. So could something be done during startup?
>
> For example, with a 3-broker cluster, I could prevent the brokers from
> fully starting after the first pod comes up, update the retention values
> in the Strimzi Kafka CR, and then let the operator complete the rollout
> so the cluster restarts with the new retention. Is this safe, or is there
> a better recommended approach to ensure that unread messages are
> preserved after long broker downtime?
>
> Regards,
> Prateek Kohli
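On the workaround in your mail: retention.ms is a dynamic topic config, so you shouldn't need to block broker startup. The Topic Operator applies the change through the live brokers, and a recovering broker should pick up the current topic config from the cluster metadata when it rejoins. A sketch of temporarily raising retention by editing the KafkaTopic (names and values are examples; revert once consumers have caught up):

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-topic                     # hypothetical topic name
  namespace: kafka
  labels:
    strimzi.io/cluster: my-cluster
spec:
  partitions: 3
  replicas: 3
  config:
    retention.ms: 259200000          # temporarily 72h instead of 24h

Applied with kubectl apply -f (or an equivalent kubectl patch). Note that brokers run the retention check periodically (log.retention.check.interval.ms, 5 minutes by default), so apply the change before restarting the downed broker rather than racing its startup.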
