Hi folks,

Perhaps one option is to only rename partitions to "whatever-topic-x.stray"
when processing the LeaderAndIsrRequest, and delete them with a periodic
task (so not with a fixed delay, but with a thread that scans and deletes
them periodically). I think this has the advantage of being similar to the
approach already used for deletion and compaction, and it won't cause
immediate mass deletion.
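[Editor's note: a minimal sketch of the periodic scan-and-delete thread described above. The `.stray` suffix, scan interval, and function names are illustrative assumptions, not part of the KIP.]

```python
import os
import shutil
import threading

STRAY_SUFFIX = ".stray"  # illustrative suffix; actual naming is up to the KIP

def delete_stray_dirs(log_dir):
    """Scan a log directory and delete any partition directories that
    were previously renamed with the stray suffix."""
    deleted = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if name.endswith(STRAY_SUFFIX) and os.path.isdir(path):
            shutil.rmtree(path)
            deleted.append(name)
    return deleted

def start_periodic_cleanup(log_dir, interval_secs=300.0):
    """Run the scan on a background thread, re-scheduling itself, similar
    to how log deletion and compaction run as periodic background work."""
    def run():
        delete_stray_dirs(log_dir)
        timer = threading.Timer(interval_secs, run)
        timer.daemon = True
        timer.start()
    run()
```

Because renaming and deleting are decoupled, a burst of renames never turns into an immediate mass deletion; the scanner removes them gradually on its own schedule.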
Viktor

On Thu, Jan 16, 2020 at 11:35 PM Colin McCabe <cmcc...@apache.org> wrote:
> On Thu, Jan 16, 2020, at 10:29, Dhruvil Shah wrote:
> > Hi Colin,
> >
> > That’s fair, though I am unsure if a delay + metric + log message would
> > really serve our purpose. There would be no action required from the
> > operator in almost all cases. A signal that is not actionable in 99% of
> > cases may not be very useful, in my opinion.
>
> As I understand it, the case we're trying to solve is where a broker has
> gone away for a while and then comes back, but some of its partitions
> have been moved to a different broker. Because this case is already
> relatively rare, I don't think we need to worry too much about adding
> non-actionable signals.
>
> Maybe more importantly, broker downtime will also independently trigger
> alerts in a well-managed cluster. So what we are adding is a metric that
> indicates that "something bad is happening" that is highly correlated
> with other "something bad is happening" metrics. This is similar to
> URPs, or even under-min-ISR partitions, which are all worth monitoring
> and possibly alerting on, and which will all tend to show activity at
> the same time.
>
> > Additionally, if we add in a delay, we would need to reason about the
> > behavior when the same topic is recreated while a stray partition has
> > been queued for deletion.
>
> This is a good question, but I think the current code already handles a
> very similar case. The broker currently handles topic deletions in a
> two-step process. The first step is renaming the topic directory. The
> directory's new name will contain a UUID and end with ".deleted". The
> second step is actually deleting the directory. (It was done this way to
> allow deletion to be done asynchronously.) I would expect the proposed
> delay mechanism to do something like this, such that a new topic created
> with the same name would not have a name collision.
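[Editor's note: the two-step delete Colin describes could be sketched as below. The directory layout and helper names are illustrative; only the rename-with-UUID-plus-".deleted" scheme comes from the thread.]

```python
import os
import shutil
import uuid

def mark_for_deletion(partition_dir):
    """Step 1: rename the partition directory out of the way. The new
    name contains a UUID and ends with ".deleted", so a recreated topic
    with the same name cannot collide with it."""
    renamed = f"{partition_dir}.{uuid.uuid4().hex}.deleted"
    os.rename(partition_dir, renamed)
    return renamed

def purge_marked(log_dir):
    """Step 2 (run later by an async/background task): physically remove
    everything previously marked for deletion."""
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        if name.endswith(".deleted") and os.path.isdir(path):
            shutil.rmtree(path)
```

The rename is the cheap, synchronous part; the expensive filesystem removal can then happen at any later time without a new topic of the same name ever seeing a naming conflict.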
> > I would be in support of adding a configuration to disable stray
> > partition deletion. This way, if users find abnormal behavior when
> > testing / upgrading development environments, they could choose to
> > disable the feature altogether.
> >
> > Let me know what you think. It would be good to hear what others
> > think as well.
>
> I feel strongly that this should come with a delay period and advance
> warning. We just had too much pain with lost data as a result of bugs
> in HDFS leading to rapid deletion. These bugs didn't manifest in
> testing or routine upgrades.
>
> best,
> Colin
>
> > Thanks,
> > Dhruvil
> >
> > On Thu, Jan 16, 2020 at 3:24 AM Colin McCabe <cmcc...@apache.org> wrote:
> > > On Wed, Jan 15, 2020, at 03:54, Dhruvil Shah wrote:
> > > > Hi Colin,
> > > >
> > > > We could add a configuration to disable stray partition deletion
> > > > if needed, but I wasn't sure if an operator would really want to
> > > > disable it. Perhaps if the implementation were buggy, the
> > > > configuration could be used to disable the feature until a bug
> > > > fix is made. Is that the kind of use case you were thinking of?
> > > >
> > > > I was thinking that there would not be any delay between
> > > > detection and deletion of stray logs. We would schedule an async
> > > > task to do the actual deletion, though.
> > >
> > > Based on my experience in HDFS, immediately deleting data that
> > > looks out of place can cause severe issues when a bug occurs. See
> > > https://issues.apache.org/jira/browse/HDFS-6186 for details. So I
> > > really do think there should be a delay, and a metric + log message
> > > in the meantime to alert the operators to what is about to happen.
> > >
> > > best,
> > > Colin
> > >
> > > > Thanks,
> > > > Dhruvil
> > > >
> > > > On Tue, Jan 14, 2020 at 11:04 PM Colin McCabe
> > > > <cmcc...@apache.org> wrote:
> > > > > Hi Dhruvil,
> > > > >
> > > > > Thanks for the KIP.
> > > > > I think there should be some way to turn this off, in case
> > > > > that becomes necessary. I'm also curious how long we intend to
> > > > > wait between detecting the duplication and deleting the extra
> > > > > logs. The KIP says "scheduled for deletion" but doesn't give a
> > > > > time frame -- is it assumed to be immediate?
> > > > >
> > > > > best,
> > > > > Colin
> > > > >
> > > > > On Tue, Jan 14, 2020, at 05:56, Dhruvil Shah wrote:
> > > > > > If there are no more questions or concerns, I will start a
> > > > > > vote thread tomorrow.
> > > > > >
> > > > > > Thanks,
> > > > > > Dhruvil
> > > > > >
> > > > > > On Mon, Jan 13, 2020 at 6:59 PM Dhruvil Shah
> > > > > > <dhru...@confluent.io> wrote:
> > > > > > > Hi Nikhil,
> > > > > > >
> > > > > > > Thanks for looking at the KIP. The kind of race condition
> > > > > > > you mention is not possible, as stray partition detection
> > > > > > > is done synchronously while handling the
> > > > > > > LeaderAndIsrRequest. In other words, we atomically evaluate
> > > > > > > the partitions the broker must host and the extra
> > > > > > > partitions it is hosting, and schedule deletions based on
> > > > > > > that.
> > > > > > >
> > > > > > > One possible shortcoming of the KIP is that we do not have
> > > > > > > the ability to detect a stray partition if the topic has
> > > > > > > been recreated since. We will have the ability to
> > > > > > > disambiguate between different generations of a partition
> > > > > > > with KIP-516.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Dhruvil
> > > > > > >
> > > > > > > On Sat, Jan 11, 2020 at 11:40 AM Nikhil Bhatia
> > > > > > > <nik...@confluent.io> wrote:
> > > > > > >> Thanks Dhruvil, the proposal looks reasonable to me.
> > > > > > >> Is there a potential race between a new topic being
> > > > > > >> assigned to the same node that is still performing a
> > > > > > >> cleanup of the stray partition? Topic ID will definitely
> > > > > > >> solve this issue.
> > > > > > >>
> > > > > > >> Thanks,
> > > > > > >> Nikhil
> > > > > > >>
> > > > > > >> On 2020/01/06 04:30:20, Dhruvil Shah <d...@confluent.io>
> > > > > > >> wrote:
> > > > > > >> > Here is the link to the KIP:
> > > > > > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-550%3A+Mechanism+to+Delete+Stray+Partitions+on+Broker
> > > > > > >> >
> > > > > > >> > On Mon, Jan 6, 2020 at 9:59 AM Dhruvil Shah
> > > > > > >> > <dh...@confluent.io> wrote:
> > > > > > >> > > Hi all, I would like to kick off discussion for
> > > > > > >> > > KIP-550, which proposes a mechanism to detect and
> > > > > > >> > > delete stray partitions on a broker. Suggestions and
> > > > > > >> > > feedback are welcome.
> > > > > > >> > >
> > > > > > >> > > - Dhruvil
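[Editor's note: the synchronous detection Dhruvil describes upthread -- atomically comparing the partitions the broker must host, per the LeaderAndIsrRequest, against the partitions it actually hosts -- amounts to a set difference. A sketch with hypothetical names; the real logic lives in the broker's request handling.]

```python
def find_stray_partitions(hosted_partitions, expected_partitions):
    """Partitions present on the broker but absent from the full
    partition set in the LeaderAndIsrRequest are strays. Computing this
    in one step while handling the request (rather than in a separate
    background pass) is what rules out the race Nikhil asked about."""
    return set(hosted_partitions) - set(expected_partitions)

# Example: the broker still hosts a partition the controller no longer
# assigns to it, e.g. because it was moved away while the broker was down.
hosted = {("topic-a", 0), ("topic-a", 1), ("topic-b", 0)}
expected = {("topic-a", 0), ("topic-a", 1)}
strays = find_stray_partitions(hosted, expected)  # {("topic-b", 0)}
```

Note that, as the thread points out, identifying partitions only by (topic, partition) pairs cannot distinguish generations of a recreated topic; that disambiguation requires the topic IDs introduced by KIP-516.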