We just noticed that one of our topics has been horribly misbehaving.

retention.ms for the topic is set to 1209600000 ms (14 days).
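
For anyone double-checking the arithmetic, here is a quick sketch in Python that converts the configured value to days. The constant name is only illustrative; the number is the one quoted above.

```python
from datetime import timedelta

# Value of retention.ms quoted above (the constant name is illustrative).
RETENTION_MS = 1209600000

# Convert milliseconds to a timedelta to read it in days.
retention = timedelta(milliseconds=RETENTION_MS)
retention_days = retention.days

print(retention_days)  # 14
```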

However, segments are getting scheduled for deletion as soon as a new one
is rolled. Naturally, consumers are running into a
kafka.common.OffsetOutOfRangeException whenever this happens.

Is this a known bug? It is incredibly serious. We seem to have lost about
40 million messages on a single topic and have yet to figure out which
other topics are affected.

I thought of restarting Kafka but figured I'd leave it untouched while I
work out what I can capture to help find the root cause.

Meanwhile, to keep from losing any more data, I have a periodic job
that does a 'cp -al' of the partitions into a separate folder. That
way Kafka goes ahead and deletes the segment, but the data is not lost from
the filesystem.
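
For anyone who wants to do the same from a script, here is a minimal Python sketch of the hard-link idea. It is not the exact job I am running: the function name and the example paths are hypothetical. It assumes the backup folder is on the same filesystem as the Kafka log directory, since hard links cannot cross filesystems.

```python
import os


def hardlink_snapshot(src_dir, dst_dir):
    """Mirror src_dir into dst_dir using hard links (roughly `cp -al`).

    Returns the number of files newly linked on this run.
    """
    linked = 0
    for root, _dirs, files in os.walk(src_dir):
        rel = os.path.relpath(root, src_dir)
        target_root = dst_dir if rel == "." else os.path.join(dst_dir, rel)
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(target_root, name)
            if os.path.exists(dst):
                continue  # already preserved by an earlier run
            try:
                # The link shares the inode, so the data stays on disk
                # after the original directory entry is removed.
                os.link(src, dst)
                linked += 1
            except FileNotFoundError:
                # The file vanished between listing and linking.
                pass
    return linked
```

It could be called periodically as, for example, hardlink_snapshot("/data/kafka-logs/mytopic-0", "/data/kafka-backup/mytopic-0"), where both paths are made up for illustration. Because a hard link shares the inode with the original, a segment that is still being appended to keeps growing in the backup folder as well.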

If this is a previously unseen bug, what should I save from the running
instance?

By the way, this has affected all partitions and replicas of the topic and
is not limited to a specific host.
