Hey folks, I'm wondering if anyone has any insights into something I'm observing in Kafka. We have a 6-node cluster whose topics consume a lot of disk space, and all of our apps use transactions. We recently ran kafka-topics.sh --zookeeper --alter --topic --config to change the retention of a single topic from infinite (-1) to 1 year, which should free up approximately 3 TB of space.
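In case the exact change matters: we did it through the CLI, but something like this AdminClient sketch would be roughly equivalent (the broker address, topic name, and 365-day retention value below are placeholders, not our real ones):

    import java.util.Collection;
    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class SetTopicRetention {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // Topic name is a placeholder; 31536000000 ms = 365 days.
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
                AlterConfigOp setRetention = new AlterConfigOp(
                        new ConfigEntry("retention.ms", "31536000000"), AlterConfigOp.OpType.SET);
                Map<ConfigResource, Collection<AlterConfigOp>> change =
                        Collections.singletonMap(topic, Collections.singletonList(setRetention));
                admin.incrementalAlterConfigs(change).all().get(); // wait for the brokers to ack
            }
        }
    }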
Approximately 1 minute later, all of our apps that use the vanilla producer/consumer crashed with a TimeoutException, while all of the Kafka Streams apps using transactions (EOS) hit TaskMigratedException due to producer fencing. Tracing through the logs, the timeline of events seems to be:

1. We ran the config change, and Kafka received the config change from ZooKeeper on its notification handler.
2. Approximately 1 minute later, every one of the brokers started aborting/rolling back all transactions, logging "Completed rollback of ongoing transaction for transactionalId ... due to timeout".
3. That caused the apps to die and the producers in the Kafka Streams apps to get fenced.
4. Disk segment deletion started happening about 1-2 minutes later.
5. Our disk IOPS went through the roof as the Streams apps restored state.

One of our theories at the moment is that it has something to do with our use of HDDs, but I was wondering if anybody else knows what the cause might have been. How does log deletion work? Does Kafka have to memory-map those segments before deleting them, or something like that? We did observe Kafka logging the deletion of the segments after all the transactions had been aborted, though. Is there some blocking operation on receiving the notification from ZooKeeper, or does that notification necessarily cause Kafka to abort all active transactions? Should we expect all transactions to be aborted every time we run that script?

Any insights into the internals of how Kafka deletes segments or handles the notification from ZooKeeper would be appreciated, thanks! (This is Kafka 2.8, but we haven't tried turning on the config for ZooKeeper-less Kafka.)
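For context on the producer side, here is a stripped-down sketch of what our vanilla transactional producers roughly look like (the broker address, transactional id, topic, and timeout value are placeholders, not our real settings):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.KafkaException;
    import org.apache.kafka.common.errors.ProducerFencedException;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class TransactionalProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "example-txn-id");  // placeholder
            // The coordinator aborts any transaction still open after this long
            // (the "Completed rollback ... due to timeout" log line); default is 60000 ms.
            props.put(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, 60000);

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.initTransactions();
                try {
                    producer.beginTransaction();
                    producer.send(new ProducerRecord<>("example-topic", "key", "value"));
                    // commitTransaction() can throw TimeoutException if the commit cannot
                    // complete within max.block.ms.
                    producer.commitTransaction();
                } catch (ProducerFencedException fenced) {
                    // Fatal: this producer has been fenced and can only be closed.
                    // Streams surfaces the same condition as TaskMigratedException.
                } catch (KafkaException e) {
                    // Abortable error: abort and retry the batch in a new transaction.
                    producer.abortTransaction();
                }
            }
        }
    }

(The Streams apps are the same idea under the hood, just running with processing.guarantee set to exactly_once.)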