Hi peng,

Thanks for the suggestion. In fact, we already have an implementation of auto
deletion on Kafka, and it has been running in our environment for more than a
year.

The implementation adds four broker configs:
- `disk.used.threshold.enable`: Whether to enable the disk usage threshold
  for auto cleanup.
- `disk.used.threshold.percent`: The disk usage threshold, in percent. If
  enabled and exceeded, old data will be deleted to free up space.
- `log.min.retention.ms`: The minimum retention time of the log, in milliseconds.
- `log.min.retention.bytes`: The minimum retention size of the log, in bytes.
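As a concrete illustration, these configs could be set in `server.properties`
like this (the values below are made-up examples, not our production settings):

```properties
# Enable threshold-based auto cleanup
disk.used.threshold.enable=true
# Start deleting old data once the disk is 85% full
disk.used.threshold.percent=85
# Never delete data younger than one hour
log.min.retention.ms=3600000
# Never shrink a partition log below 1 GiB
log.min.retention.bytes=1073741824
```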

In this implementation, if the disk usage exceeds `disk.used.threshold.percent`,
old data is deleted in proportion to the disk space currently occupied by
each topic partition.
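To make the proportional cleanup concrete, here is a minimal sketch (names and
numbers are hypothetical, not the actual implementation): each partition gives
up bytes in proportion to the disk space it currently occupies.

```python
def bytes_to_delete_per_partition(partition_sizes, bytes_to_free):
    """Split a cleanup quota across partitions, proportional to their size.

    partition_sizes: dict mapping "topic-partition" -> bytes on disk.
    bytes_to_free:   total bytes that must be reclaimed.
    """
    total = sum(partition_sizes.values())
    if total == 0:
        return {p: 0 for p in partition_sizes}
    return {p: int(bytes_to_free * size / total)
            for p, size in partition_sizes.items()}

# Example: topicA-0 holds 60% of the data, so it gives up 60% of the quota.
sizes = {"topicA-0": 600, "topicA-1": 300, "topicB-0": 100}
quota = bytes_to_delete_per_partition(sizes, 200)
```

In practice the per-partition quota would then be rounded down to whole
segment files, which is exactly where the granularity problem below comes in.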

In actual use, we found that this cannot completely prevent disk-full failures,
mainly for several reasons:
- Users have mandatory requirements for data retention time and cannot accept
  an auto-delete strategy, or they set the minimum retention time to be very long.
- Kafka deletes data with the segment file as the minimum granularity and must
  retain the active segment, which can cause the strategy to fail
  in some cases (especially when there are many partitions).
- Kafka can fill up the disk within the default delete-check interval (5 minutes).

So we want to add a new strategy that checks disk usage more frequently
and rejects produce requests in the worst case, to avoid filling the disk.

I'm thinking about how to combine these strategies. Currently, I would prefer
to use reject-produce as the worst-case strategy, with auto-delete being
optional and having a lower threshold than the reject-produce threshold.
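A rough sketch of how the two strategies could be layered (the threshold
values and names here are hypothetical, just to illustrate the ordering:
auto-delete fires first at a lower disk-usage ratio, and produce requests are
rejected only in the worst case, at a higher ratio):

```python
AUTO_DELETE_THRESHOLD = 0.85    # optional cleanup starts here
REJECT_PRODUCE_THRESHOLD = 0.95  # hard stop to avoid a full disk

def disk_action(disk_used_ratio, auto_delete_enabled=True):
    """Decide what the broker should do at the current disk usage ratio."""
    if disk_used_ratio >= REJECT_PRODUCE_THRESHOLD:
        return "reject-produce"
    if auto_delete_enabled and disk_used_ratio >= AUTO_DELETE_THRESHOLD:
        return "auto-delete"
    return "ok"
```

With auto-delete disabled, the broker would skip straight from "ok" to
rejecting produce requests, which matches the idea of keeping auto-delete
optional.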

Best regards,
mapan


peng <p1070048...@gmail.com> wrote on Fri, Aug 1, 2025, at 18:33:
>
> In most use cases, Kafka serves as a messaging middleware where messages
> that have already been consumed are typically no longer needed and can be
> safely deleted. Therefore, I propose enhancing the threshold strategy with
> an automatic deletion feature:
>
> When a broker's disk usage reaches 95%, it should automatically delete the
> oldest 10% of messages on the node to free up disk space, allowing new
> messages to be produced. This eliminates the need for manual cleanup while
> ensuring that new messages (which are almost always more critical than
> already-consumed data) take priority.
>
> - Prevents disk-full scenarios by automatically removing stale data.
> - No admin intervention required for basic cleanup.
> - Fresh messages are never blocked by obsolete ones.
>
> The only potential risk arises if consumer groups experience significant
> lag where unconsumed messages might be deleted prematurely. However, in
> such cases, the root issue is the backlog itself; teams should prioritize
> resolving the lag rather than relying on retention.
>
>
> To accommodate different needs, we could introduce a
> `disk.threshold.policy` parameter, allowing users to choose between:
> 1. Rejecting new messages
> 2. Auto deleting the oldest messages
>
>
> Best regards
>
> mapan <mapan0...@gmail.com> wrote on Thu, Jul 31, 2025, at 20:18:
>
> > Hi all,
> >
> > I’d like to start a discussion about a new KIP:
> > https://cwiki.apache.org/confluence/x/Nw9JFg
> >
> > This KIP suggests adding disk threshold configs in Kafka and rejecting new
> > produce requests after the threshold is reached, to prevent disk-full
> > failures.
> >
> > This strategy is similar to RocketMQ's diskMaxUsedSpaceRatio config or
> > RabbitMQ's
> > disk_free_limit config, and I hope to implement this strategy in our
> > environment.
> >
> > Please share your feedback, questions, or concerns so we can refine
> > the proposal together.
> >
> > Best regards,
> > mapan
> >
