We just noticed that one of our topics has been horribly misbehaving. retention.ms for the topic is set to 1209600000 ms (14 days).
However, segments are getting scheduled for deletion as soon as a new one is rolled over, and naturally consumers run into a kafka.common.OffsetOutOfRangeException whenever this happens. Is this a known bug? It is incredibly serious. We seem to have lost about 40 million messages on a single topic and have yet to figure out which other topics are affected.

I thought of restarting Kafka but figured I'd leave it untouched while I work out what I can capture for finding the root cause. Meanwhile, to keep from losing any more data, I have a periodic job that does a 'cp -al' of the partitions into a separate folder. That way Kafka goes ahead and deletes the segment, but the data is not lost from the filesystem, since the hard links keep it alive.

If this is a previously unseen bug, what should I save from the running instance? By the way, this has affected all partitions and replicas of the topic; it is not limited to a specific host.
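
For anyone who wants to do the same, here is a rough Python sketch of the hard-link snapshot idea behind the 'cp -al' job. It is not the actual job I am running: the function name and the paths in the usage note are hypothetical. It assumes partition directories are named <topic>-<partition> under the Kafka log directory, and that the backup folder is on the same filesystem, since hard links cannot cross filesystems.

```python
import os
from pathlib import Path


def hardlink_snapshot(log_dir, backup_dir, topic):
    """Hard-link every file in the topic's partition directories into backup_dir.

    Returns the number of new links created. Files that already have a link
    in the backup are skipped, so the function is safe to run periodically.
    """
    log_dir, backup_dir = Path(log_dir), Path(backup_dir)
    created = 0
    for part_dir in sorted(log_dir.iterdir()):
        # Match only "<topic>-<number>" so that a topic whose name merely
        # starts with the same prefix is not picked up by accident.
        name, _, partition = part_dir.name.rpartition("-")
        if not part_dir.is_dir() or name != topic or not partition.isdigit():
            continue
        target_dir = backup_dir / part_dir.name
        target_dir.mkdir(parents=True, exist_ok=True)
        for src in part_dir.iterdir():
            dst = target_dir / src.name
            if src.is_file() and not dst.exists():
                # A hard link shares the inode with the original file, so the
                # data survives after the original name is removed.
                os.link(src, dst)
                created += 1
    return created
```

Calling something like hardlink_snapshot("/var/kafka-logs", "/var/kafka-backup", "mytopic") from cron every minute or so would approximate the job (both paths are placeholders). Because the links share inodes with the live segments, appends to the active segment show up in the backup as well, and no extra disk space is used until Kafka deletes its own copy.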