Hi peng,

Thanks for the suggestion. In fact, we already have an implementation of auto deletion in Kafka, and it has been running in our environment for more than a year.
The implementation adds four broker configs:
- `disk.used.threshold.enable`: whether to enable the disk-usage threshold for auto cleanup.
- `disk.used.threshold.percent`: the disk-usage threshold percentage. If this is exceeded and cleanup is enabled, old data is deleted to free up space.
- `log.min.retention.ms`: the minimum retention time of the log, in milliseconds.
- `log.min.retention.bytes`: the minimum retention size of the log, in bytes.

In this implementation, if disk usage exceeds `disk.used.threshold.percent`, old data is deleted in proportion to the disk space currently occupied by each topic partition. In actual use, we found that this cannot completely prevent disk-full failures, mainly for several reasons:
- Users have mandatory requirements for data retention time and cannot accept an auto-delete strategy, or they set the minimum retention time to be very long.
- Kafka deletes data at segment-file granularity and must retain the active segment, which can cause the strategy to fail in some cases (especially when there are many partitions).
- Kafka can fill up the disk within the default deletion interval (5 minutes).

So we want to add a new strategy that checks disk usage more frequently and, in the worst case, rejects produce requests to avoid filling the disk. I'm thinking about how to combine these strategies. Currently, I would prefer to use reject-produce as the worst-case strategy, with auto-delete being optional and having a lower threshold than the one that denies writes.

Best regards,
mapan

peng <p1070048...@gmail.com> wrote on Fri, Aug 1, 2025 at 18:33:
>
> In most use cases, Kafka serves as a messaging middleware where messages
> that have already been consumed are typically no longer needed and can be
> safely deleted. Therefore, I propose enhancing the threshold strategy with
> an automatic deletion feature:
>
> When a broker's disk usage reaches 95%, it should automatically delete the
> oldest 10% of messages on the node to free up disk space, allowing new
> messages to be produced.
> This eliminates the need for manual cleanup while ensuring that new
> messages (which are almost always more critical than already-consumed
> data) take priority.
>
> Prevents disk-full scenarios by automatically removing stale data.
> No admin intervention required for basic cleanup.
> Fresh messages are never blocked by obsolete ones.
>
> The only potential risk arises if consumer groups experience significant
> lag, where unconsumed messages might be deleted prematurely. However, in
> such cases, the root issue is the backlog itself; teams should prioritize
> resolving the lag rather than relying on retention.
>
> To accommodate different needs, we could introduce a
> `disk.threshold.policy` parameter, allowing users to choose between:
> 1. Rejecting new messages
> 2. Auto deleting the oldest messages
>
> Best regards
>
> mapan <mapan0...@gmail.com> wrote on Thu, Jul 31, 2025 at 8:18 PM:
>
> > Hi all,
> >
> > I’d like to start a discussion about a new KIP:
> > https://cwiki.apache.org/confluence/x/Nw9JFg
> >
> > This KIP suggests adding disk threshold configs in Kafka and rejecting
> > new produce requests after reaching the threshold to prevent disk-full
> > failures.
> >
> > This strategy is similar to RocketMQ's diskMaxUsedSpaceRatio config or
> > RabbitMQ's disk_free_limit config, and I hope to implement this strategy
> > in our environment.
> >
> > Please share your feedback, questions, or concerns so we can refine
> > the proposal together.
> >
> > Best regards,
> > mapan
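
P.S. For concreteness, the combination I have in mind (optional auto-delete at a lower threshold, reject-produce as the worst-case backstop at a higher one) could be sketched roughly as below. This is only an illustration, not broker code; the threshold values and all names here (`choose_action`, `AUTO_DELETE_PERCENT`, `REJECT_PRODUCE_PERCENT`) are hypothetical, not proposed config names:

```python
# Illustrative sketch of a two-threshold disk policy. The constants and
# function names are hypothetical, chosen only to show the ordering of
# the two strategies; they are not actual or proposed Kafka configs.

AUTO_DELETE_PERCENT = 90     # lower threshold: try to free space first
REJECT_PRODUCE_PERCENT = 95  # higher threshold: worst case, deny writes


def disk_usage_percent(used_bytes: int, total_bytes: int) -> float:
    """Return disk usage as a percentage of total capacity."""
    return used_bytes / total_bytes * 100.0


def choose_action(used_bytes: int, total_bytes: int,
                  auto_delete_enabled: bool) -> str:
    """Decide what the broker should do at the current disk usage level.

    Reject-produce always wins at the higher threshold, so even with
    auto-delete disabled (or failing to keep up, e.g. because active
    segments cannot be deleted), the disk never fills completely.
    """
    usage = disk_usage_percent(used_bytes, total_bytes)
    if usage >= REJECT_PRODUCE_PERCENT:
        return "reject-produce"   # worst case: refuse new messages
    if auto_delete_enabled and usage >= AUTO_DELETE_PERCENT:
        return "auto-delete"      # free space by deleting old, eligible data
    return "accept"


# At 93% with auto-delete enabled, old data is deleted; at 96%, produce
# requests are rejected regardless of whether auto-delete is enabled.
print(choose_action(93, 100, True))   # -> auto-delete
print(choose_action(96, 100, True))   # -> reject-produce
print(choose_action(96, 100, False))  # -> reject-produce
```

The point of the ordering is that auto-delete gets a chance to resolve the pressure before any writes are denied, while reject-produce remains the unconditional backstop.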