Hi Carlos,

This is a really useful feature, and we would like to have it as well. I
think high_watermark == log_start_offset is a good starting point, but a
topic can also be empty only because the clients producing to it are
temporarily offline, so we might end up garbage collecting a topic that is
still in use. A configurable time period that a topic must stay empty before
it can be deleted would help in this case. We should also check whether any
consumers are still reading from the topic.

It would be good to have a KIP for this that also covers the edge cases.
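
To make the combined check concrete, here is a rough sketch in plain Python.
It is only an illustration of the idea, not a proposal for the actual
implementation: PartitionState, TopicState, empty_since_ms,
active_consumer_groups and grace_period_ms are all made-up names, not
existing Kafka APIs or configs, and how the broker or a tool would collect
these values is left open.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PartitionState:
    """Hypothetical snapshot of one partition as seen on its leader."""
    log_start_offset: int
    high_watermark: int


@dataclass
class TopicState:
    """Hypothetical snapshot of one topic; not a Kafka API."""
    name: str
    partitions: List[PartitionState]
    # When the topic was first observed empty; None if that is unknown.
    empty_since_ms: Optional[int]
    # Number of consumer groups with live members reading the topic.
    active_consumer_groups: int


def is_empty(topic: TopicState) -> bool:
    """A topic is empty when every partition has high_watermark == log_start_offset."""
    return all(p.high_watermark == p.log_start_offset for p in topic.partitions)


def is_gc_candidate(topic: TopicState, now_ms: int, grace_period_ms: int) -> bool:
    """Flag a topic only if it is empty, unread, and has stayed empty long enough."""
    if not topic.partitions or not is_empty(topic):
        return False
    if topic.active_consumer_groups > 0:
        # Someone is still reading from it, so leave it alone.
        return False
    if topic.empty_since_ms is None:
        # Without a known "empty since" time, err on the side of keeping it.
        return False
    return now_ms - topic.empty_since_ms >= grace_period_ms
```

With a grace period of seven days, for example, a topic whose producers were
offline for a weekend would not be flagged, while one that has been empty
and unread for a month would.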

Thanks,
Harsha


On Sun, Jun 23, 2019, at 9:40 PM, Carlos Manuel Duclos-Vergara wrote:
> Hi,
> Thanks for the answer. Looking at high water mark, then the logic would be
> to flag the partitions that have
> 
> high_watermark == log_start_offset
> 
> In addition, I think it is enough for the leader to meet that criterion in
> order to flag a partition; the replicas could be checked only if the user
> requests it.
> 
> 
> On Fri, Jun 21, 2019 at 23:35, Colin McCabe <cmcc...@apache.org> wrote:
> 
> > I don't think this requires a change in the protocol.  It seems like you
> > should be able to use the high water mark to figure something out here?
> >
> > best,
> > Colin
> >
> >
> > On Fri, Jun 21, 2019, at 04:56, Carlos Manuel Duclos-Vergara wrote:
> > > Hi,
> > >
> > > This is an ancient task, but I feel it is still relevant today
> > > (especially since, as somebody who deals with a Kafka cluster, I know
> > > that this happens more often than not).
> > >
> > > The task is about garbage collection of topics in a sort of automated
> > > way.
> > > After some consideration I started a prototype implementation based on a
> > > manual process:
> > >
> > > 1. Using the cli, I can use the --describe-topic to get a list of topics
> > > that have size 0
> > > 2. Massage that list into something that can be then fed into the cli and
> > > remove the topics that have size 0.
> > >
> > > The guiding principle here is the assumption that abandoned topics will
> > > eventually have size 0, because all records will expire. This is not true
> > > for all topics, but it covers a large portion of them and having
> > > something like this would help admins to find "suspicious" topics at
> > > least.
> > >
> > > I started implementing this change and I realized that it would require a
> > > change in the protocol, because the sizes are never sent over the wire.
> > > Funnily enough, we collect the sizes of the log files, but we do not send
> > > them.
> > >
> > > I think this kind of change will require a KIP, but I wanted to ask what
> > > others think about this.
> > >
> > > The in-progress implementation of this can be found here:
> > >
> > > https://github.com/carlosduclos/kafka/commit/0dffe5e131c3bd32b77f56b9be8eded89a96df54
> > >
> > > Comments?
> > >
> > > --
> > > Carlos Manuel Duclos Vergara
> > > Backend Software Developer
> > >
> >
>
