I suppose taking a cluster down as a whole for a longer period (hours, or even a day) is something you would usually want to avoid in most production clusters anyway.
Besides the issues for consumers (and the potential loss of unconsumed
messages you describe), a full-cluster outage is also a lot of trouble for
producers, which of course can't write any new data to the cluster during
that downtime either. That matters for any cluster or topic that is being
written to continuously, which is the typical scenario for most Kafka
clusters.

But to your point: I think the most typical way to avoid this problem is
to set the retention period larger (maybe even substantially larger) than
any maintenance window you anticipate, if you can afford that in terms of
available storage. How much headroom you need of course depends on the
usual rate at which data is written into the cluster.

Alternatively, you could configure retention in terms of size rather than
time - i.e. look at the *retention.bytes* vs. *retention.ms* settings.
That should, technically at least, also get you around the specific
scenario you describe: Kafka then deletes segments based on size instead
of timestamps, so even a very long downtime of the whole cluster shouldn't
trigger removal of old segments when the cluster is started back up. (A
rough KafkaTopic sketch of this is at the bottom of this mail.)

In either case, I don't think there is a way to directly tie the deletion
of log segments to knowledge about which messages have been consumed and
which haven't. Usually the segments that become eligible for deletion
should be (way) past any point that is still of relevance to any consumer,
but ensuring that still requires proper monitoring and such.

On Wed, Nov 12, 2025 at 3:29 PM Prateek Kohli <[email protected]> wrote:

> Thanks for the response. You're right, a single broker downtime shouldn't
> impact consumer reads in a healthy replicated cluster.
>
> However, my concern is slightly different; I am referring to a scenario
> where the entire Kafka cluster is down (for example, due to a maintenance
> window or infrastructure issue) and is brought back up after the topic's
> retention period has already expired.
>
> In that case, since Kafka deletes segments purely based on timestamps, it
> might start deleting data immediately upon startup, even if the messages
> were never consumed.
>
> -----Original Message-----
> From: Artem Timchenko <[email protected]>
> Sent: 12 November 2025 19:13
> To: [email protected]
> Subject: Re: Query: Preventing Message Loss Due to Retention Expiry in
> Strimzi Kafka
>
> In production-grade clusters, downtime of a single broker shouldn't
> prevent consumers from reading messages and catching up on offsets. What
> replication factor are you using?
>
> On Wed, Nov 12, 2025 at 10:44 AM Prateek Kohli
> <[email protected]> wrote:
>
> > Hi,
> >
> > I am looking for a reliable, production-safe strategy to avoid losing
> > unread messages when a Kafka broker remains down longer than the
> > topic's configured retention.ms.
> >
> > Since Kafka deletes segments purely based on timestamps, if a broker
> > is down for (for example) 24 hours and the topic's retention.ms is
> > also 24 hours, the broker may start deleting segments immediately on
> > startup, even if no consumers have read those messages yet.
> >
> > Is there a recommended way to prevent message loss in this scenario?
> >
> > I am running Kafka on Kubernetes using Strimzi, so all topic
> > configurations are managed through KafkaTopic CRDs and the Topic
> > Operator.
> >
> > One solution could be to alter the topic's retention configuration.
> > But for that to work, I would need to ensure that it's triggered before
> > Kafka deletes the log segments. So could something be done during
> > startup?
> >
> > For example, with a 3-broker cluster, I could prevent the brokers from
> > fully starting after the first pod comes up, update the retention
> > values in the Strimzi Kafka CR, and then let the operator complete the
> > rollout so the cluster restarts with the new retention. Is this safe,
> > or is there a better recommended approach to ensure that unread
> > messages are preserved after long broker downtime?
> >
> > Regards,
> > Prateek Kohli
>
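P.S. For illustration, here is a rough sketch of what size-based retention
could look like on a Strimzi-managed topic. The topic name, namespace,
cluster name, and sizes below are all made up; note that retention.bytes
applies per partition, so size your disks accordingly:

    apiVersion: kafka.strimzi.io/v1beta2
    kind: KafkaTopic
    metadata:
      name: my-topic              # made-up topic name
      namespace: kafka            # made-up namespace
      labels:
        strimzi.io/cluster: my-cluster
    spec:
      partitions: 3
      replicas: 3
      config:
        # Disable time-based deletion so a long outage can't expire
        # data by timestamp; -1 means "no time limit".
        retention.ms: -1
        # Delete the oldest segments only once a partition exceeds
        # roughly 100 GiB.
        retention.bytes: 107374182400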

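And regarding the workaround in the quoted mail above: rather than
blocking brokers from fully starting, it may be simpler to raise retention
ahead of a planned maintenance window and lower it again afterwards. A
rough sketch of raising the broker-level default in the Strimzi Kafka CR,
assuming a cluster named my-cluster; be aware that any topic which sets
its own retention.ms overrides this default and would need a KafkaTopic
change as well:

    apiVersion: kafka.strimzi.io/v1beta2
    kind: Kafka
    metadata:
      name: my-cluster            # made-up cluster name
    spec:
      kafka:
        config:
          # Broker-wide default retention, raised to 7 days ahead of
          # the maintenance window; topic-level retention.ms takes
          # precedence over this value.
          log.retention.hours: 168
        # ... rest of the kafka spec (listeners, storage, ...) unchanged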