Thanks Ismael, I added an extra paragraph to the motivation. We have certainly hit this in our internal Confluent reassignment software, and from a quick skim of the popular Cruise Control repository, I noticed that similar problems have been hit there too. Hopefully the examples in the KIP are sufficient to make the case.

On Wed, Sep 7, 2022 at 11:21 PM Ismael Juma <ism...@juma.me.uk> wrote:

> Thanks for the details, Colin. I understand how this can happen. But this
> API has been out for a long time. Are we saying that we have seen Cruise
> Control cause this kind of problem? If so, it would be good to mention it
> in the KIP as evidence that the current approach is brittle.
>
> Ismael
>
> On Wed, Sep 7, 2022 at 2:15 PM Colin McCabe <cmcc...@apache.org> wrote:
>
> > Hi Ismael,
> >
> > I think this issue comes up when people write software that
> > automatically creates partition reassignments to balance the cluster.
> > Cruise Control is one example; Confluent also has some software that
> > does this. If there is already a reassignment that is going on for some
> > partition and the software tries to create a new reassignment for that
> > partition, the software may inadvertently change the replication factor.
> >
> > In general, I think some people find it surprising that reassignment can
> > change the replication factor of a partition. When we outlined the
> > reassignment API in KIP-455 we maintained the ability to do this, since
> > the old ZK-based API had always been able to do it. But this was a bit
> > controversial. Maybe it would have been more intuitive to preserve
> > replication factor by default unless the user explicitly stated that
> > they wanted to change it. So in a sense, you could view this as a fix
> > for KIP-455 :) (in my opinion, at least)
> >
> > best,
> > Colin
> >
> >
> > On Wed, Sep 7, 2022, at 07:07, Ismael Juma wrote:
> > > Thanks for the KIP. Can we explain a bit more why this is an important
> > > use case to address? For example, do we have concrete examples of
> > > people running into this? The way the KIP is written, it sounds like a
> > > potential problem but no information is given on whether it's a real
> > > problem in practice.
> > >
> > > Ismael
> > >
> > > On Thu, Jul 28, 2022 at 2:00 AM Stanislav Kozlovski
> > > <stanis...@confluent.io.invalid> wrote:
> > >
> > >> Hey all,
> > >>
> > >> I'd like to start a discussion on a proposal to help API users from
> > >> inadvertently increasing the replication factor of a topic through
> > >> the alter partition reassignments API. The KIP describes two fairly
> > >> easy-to-hit race conditions in which this can happen.
> > >>
> > >> The KIP itself is pretty simple, yet has a couple of alternatives
> > >> that can help solve the same problem. I would appreciate thoughts
> > >> from the community on how you think we should proceed, and whether
> > >> the proposal makes sense in the first place.
> > >>
> > >> Thanks!
> > >>
> > >> KIP:
> > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-860%3A+Add+client-provided+option+to+guard+against+replication+factor+change+during+partition+reassignments
> > >> JIRA: https://issues.apache.org/jira/browse/KAFKA-14121
> > >>
> > >> --
> > >> Best,
> > >> Stanislav
> > >>
> >
>

--
Best,
Stanislav