Re: [DISCUSS] KIP-726: Make the CooperativeStickyAssignor as the default assignor

Sophie Blee-Goldman Wed, 14 Apr 2021 18:46:29 -0700

1) Once the short-circuit is triggered, the member will downgrade to the
EAGER protocol, but won't necessarily try to rejoin the group right away.


In the "happy path", the user has implemented #onPartitionsLost correctly
and will not attempt to commit partitions that are lost. And since these
partitions have indeed been revoked, the user application should not
attempt to commit those partitions after this point. In this case, there's
no reason for the consumer to immediately rejoin the group. Since a
non-cooperative assignor was selected, we know that all partitions have
been assigned. This member can continue on as usual, processing the
remaining un-revoked partitions, and will follow the EAGER protocol in the
next rebalance. There's no user-facing impact or handling required; all
that happens is that the work since the last commit on those revoked
partitions has been lost.

In the less-happy path, the user has implemented #onPartitionsLost
incorrectly or not implemented it at all, falling back on the default,
which invokes #onPartitionsRevoked, which in turn will attempt to commit
those partitions during the rebalance callback. In this case we rely on the
flag to prevent this commit request from being sent to the broker.
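
To make the two paths concrete, here is a minimal, dependency-free sketch.
The Listener interface below is a hypothetical stand-in for
ConsumerRebalanceListener: it only mirrors the fallback described above (a
default #onPartitionsLost that delegates to #onPartitionsRevoked), and the
"commit" is just recorded in a list rather than sent anywhere.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ConsumerRebalanceListener; it models only the
// default-method fallback discussed above, not the real client API.
interface Listener {
    void onPartitionsRevoked(List<Integer> partitions);

    // Default: lost partitions are handled exactly like revoked ones.
    default void onPartitionsLost(List<Integer> partitions) {
        onPartitionsRevoked(partitions);
    }
}

// Less-happy path: only #onPartitionsRevoked is implemented, so a commit is
// attempted for lost partitions as well.
class CommitOnRevokeListener implements Listener {
    final List<Integer> commitAttempts = new ArrayList<>();

    @Override
    public void onPartitionsRevoked(List<Integer> partitions) {
        commitAttempts.addAll(partitions); // stands in for a commit call
    }
}

// Happy path: #onPartitionsLost is overridden and never commits.
class LostAwareListener extends CommitOnRevokeListener {
    final List<Integer> cleanedUp = new ArrayList<>();

    @Override
    public void onPartitionsLost(List<Integer> partitions) {
        cleanedUp.addAll(partitions); // release local state only, no commit
    }
}
```

With the first listener, a call to onPartitionsLost ends up in
commitAttempts, which is exactly the request the flag has to stop. With the
second, nothing is ever committed for lost partitions.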

Originally I was thinking we should throw a CommitFailedException up
through the #onPartitionsLost callback, and eventually up through poll(),
then rejoin the group. But now I'm wondering if this is really necessary --
the important point in all cases is just to prevent the commit, but there's
no reason the consumer should not be allowed to continue processing its
other partitions, and it hasn't dropped out of the group. What do you think
about this slight amendment to my original proposal: if a user does end up
calling commit for whatever reason when #onPartitionsLost is invoked, we'll
just swallow the resulting CommitFailedException. So the user application
wouldn't see anything, and the only impact would be that the consumer was
not able to commit the last set of offsets on the revoked partitions.

WDYT? My only concern there is that the user might have some implicit
assumption that unless a CommitFailedException was thrown, the offsets of
revoked partitions were successfully committed, and they may have some
downstream logic that should trigger only in this case. If that's a
concern, then I would keep the original proposal, which says a
CommitFailedException will be thrown up through poll(), and leave it up to
the user to decide if they want to trigger a new rebalance/rejoin the group
or not.

Regarding the flag which prevents committing the revoked partitions, this
will need to continue blocking such commit attempts until the next time the
consumer rejoins the group, i.e. until the end of the next successful
rebalance. Technically this shouldn't matter: since the consumer no longer
owns those partitions, this member shouldn't attempt to commit them anyway.
Usually we can rely on the broker rejecting commit attempts on partitions
that are not owned, in which case the consumer will throw a
CommitFailedException. This is similar, except that we can't rely on the
broker having been informed of the change in ownership before this consumer
might attempt to commit. So to avoid this race condition, we'll keep the
"blockCommit" flag until the next rebalance, when we can be certain that
the broker is clear on this partition's ownership.
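
As a rough illustration of the flag's lifetime, here is a small
self-contained sketch. CommitGuard and its method names are made up for
this email and are not the actual consumer internals; the only thing taken
from the proposal is that commits on the revoked partitions stay blocked
from the short-circuit until the end of the next successful rebalance.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the "blockCommit" flag; not the real consumer code.
class CommitGuard {
    private final Set<Integer> blocked = new HashSet<>();

    // Short-circuit: block commits for the partitions we just gave up.
    void onShortCircuit(Set<Integer> revokedPartitions) {
        blocked.addAll(revokedPartitions);
    }

    // End of the next successful rebalance: the broker now knows the owners,
    // so it can be trusted to reject stale commits by itself.
    void onRebalanceComplete() {
        blocked.clear();
    }

    // Returns true if a commit for this partition may be sent to the broker.
    boolean maySendCommit(int partition) {
        return !blocked.contains(partition);
    }
}
```

Tracking the blocked partitions as a set (rather than one boolean) is just
a choice made for the sketch, so that commits on the partitions the member
still owns are visibly unaffected.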

2) I guess maybe the wording here is unclear -- what I meant is that all
3.0 applications will *eventually* enable cooperative rebalancing in the
stable state. This doesn't mean that an application will select COOPERATIVE
when it first starts up, and in order for this dynamic protocol upgrade to
be safe we do indeed need to start off with EAGER and only upgrade once the
selected assignor indicates that it's safe to do so. (This only applies if
multiple assignors are used; if "cooperative-sticky" is the only assignor,
then the consumer will just start out and forever remain on COOPERATIVE,
like in Streams.)

Since it's just the first rebalance, the choice of COOPERATIVE vs EAGER
actually doesn't matter at all, since the consumer won't own any partitions
until it's joined the group. So we may as well continue the initial
protocol selection strategy of "highest commonly supported protocol", but
the point is that 3.0 applications will upgrade to COOPERATIVE as soon as
they have any partitions. If you can think of a better way to phrase "New
applications on 3.0 will enable cooperative rebalancing by default" then
please let me know.
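
In case it helps, the selection rules above can be sketched in a few lines.
This is only a model: ProtocolModel and its methods are names I made up,
and the supported-protocol table is an assumption based on this thread
("cooperative-sticky" supports both protocols, "range" supports EAGER
only).

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

enum Protocol { EAGER, COOPERATIVE }

// Hypothetical model of the selection rules; names are illustrative only.
class ProtocolModel {
    // Assumed support table: "cooperative-sticky" -> both, anything else
    // (e.g. "range") -> EAGER only.
    static Set<Protocol> supported(String assignor) {
        return assignor.equals("cooperative-sticky")
            ? EnumSet.of(Protocol.EAGER, Protocol.COOPERATIVE)
            : EnumSet.of(Protocol.EAGER);
    }

    // Startup: the highest protocol supported by every configured assignor.
    static Protocol initial(List<String> assignors) {
        Set<Protocol> common = EnumSet.allOf(Protocol.class);
        for (String assignor : assignors) {
            common.retainAll(supported(assignor));
        }
        return common.contains(Protocol.COOPERATIVE)
            ? Protocol.COOPERATIVE : Protocol.EAGER;
    }

    // After a rebalance: follow what the selected assignor supports. Moving
    // from COOPERATIVE back to EAGER here is the short-circuit case.
    static Protocol afterRebalance(String selectedAssignor) {
        return supported(selectedAssignor).contains(Protocol.COOPERATIVE)
            ? Protocol.COOPERATIVE : Protocol.EAGER;
    }
}
```

Under these assumptions, ["cooperative-sticky", "range"] starts on EAGER
and moves to COOPERATIVE once "cooperative-sticky" is the selected
assignor, while ["cooperative-sticky"] alone starts on COOPERATIVE.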


Thanks for the response -- hope this makes sense so far, but I'm happy to
elaborate on any aspects of the proposal which aren't clear. I'll also
update the ticket description for KAFKA-12477 with the latest.


On Wed, Apr 14, 2021 at 12:03 PM Guozhang Wang <wangg...@gmail.com> wrote:

> Hello Sophie,
>
> Thanks for the detailed explanation, a few clarifying questions:
>
> 1) when the short-circuit is triggered, what would happen next? Would the
> consumers switch back to EAGER, and try to re-join the group, and then upon
> succeeding the next rebalance reset the flag to allow committing? Or would
> we just fail the consumer immediately.
>
> 2) at the overview you mentioned "New applications on 3.0 will enable
> cooperative rebalancing by default", but in the detailed description as
> "With ["cooperative-sticky", "range”], the initial protocol will be EAGER
> when the member first joins the group." which seems contradictory? If we
> want to have cooperative behavior be the default, then with the
> default ["cooperative-sticky", "range”] the member would start with
> COOPERATIVE protocol right away.
>
>
> Guozhang
>
>
>
> On Mon, Apr 12, 2021 at 5:19 AM Chris Egerton <chr...@confluent.io.invalid
> >
> wrote:
>
> > Whoops, small correction--meant to say
> > ConsumerRebalanceListener::onPartitionsLost, not
> Consumer::onPartitionsLost
> >
> > On Mon, Apr 12, 2021 at 8:17 AM Chris Egerton <chr...@confluent.io>
> wrote:
> >
> > > Hi Sophie,
> > >
> > > This sounds fantastic. I've made a note on KAFKA-12487 about being sure
> > to
> > > implement Consumer::onPartitionsLost to avoid unnecessary task failures
> > on
> > > consumer protocol downgrade, but besides that, I don't think things
> could
> > > get any smoother for Connect users or developers. The automatic
> protocol
> > > upgrade/downgrade behavior appears safe, intuitive, and pain-free.
> > >
> > > Really excited for this development and hoping we can see it come to
> > > fruition in time for the 3.0 release!
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Fri, Apr 9, 2021 at 2:43 PM Sophie Blee-Goldman
> > > <sop...@confluent.io.invalid> wrote:
> > >
> > >> 1) Yes, all of the above will be part of KAFKA-12477 (not KIP-726)
> > >>
> > >> 2) No, KAFKA-12638 would be nice to have but I don't think it's
> > >> appropriate
> > >> to remove
> > >> the default implementation of #onPartitionsLost in 3.0 since we never
> > gave
> > >> any indication
> > >> yet that we intend to remove it
> > >>
> > >> 3) Yes, this would be similar to when a Consumer drops out of the
> group.
> > >> It's always been
> > >> possible for a member to miss a rebalance and have its partition be
> > >> reassigned to another
> > >> member, during which time both members would claim to own said
> > partition.
> > >> But this is safe
> > >> because the member who dropped out is blocked from committing offsets
> on
> > >> that partition.
> > >>
> > >> On Fri, Apr 9, 2021 at 2:46 AM Luke Chen <show...@gmail.com> wrote:
> > >>
> > >> > Hi Sophie,
> > >> > That sounds great to take care of each case I can think of.
> > >> > Questions:
> > >> > 1. Do you mean the short-Circuit will also be implemented in
> > >> KAFKA-12477?
> > >> > 2. I don't think KAFKA-12638 is the blocker of this KIP-726, Am I
> > right?
> > >> > 3. So, does that mean we still have possibility to have multiple
> > >> consumer
> > >> > owned the same topic partition? And in this situation, we avoid them
> > >> doing
> > >> > committing, and waiting for next rebalance (should be soon). Is my
> > >> > understanding correct?
> > >> >
> > >> > Thank you very much for finding this great solution.
> > >> >
> > >> > Luke
> > >> >
> > >> > On Fri, Apr 9, 2021 at 11:37 AM Sophie Blee-Goldman
> > >> > <sop...@confluent.io.invalid> wrote:
> > >> >
> > >> > > Alright, here's the detailed proposal for KAFKA-12477. This
> assumes
> > we
> > >> > will
> > >> > > change the default assignor to ["cooperative-sticky", "range"] in
> > >> > KIP-726.
> > >> > > It also acknowledges that users may attempt any kind of upgrade
> > >> without
> > >> > > reading the docs, and so we need to put in safeguards against data
> > >> > > corruption rather than assume everyone will follow the safe
> upgrade
> > >> path.
> > >> > >
> > >> > > With this proposal,
> > >> > > 1) New applications on 3.0 will enable cooperative rebalancing by
> > >> default
> > >> > > 2) Existing applications which don’t set an assignor can safely
> > >> upgrade
> > >> > to
> > >> > > 3.0 using a single rolling bounce with no extra steps, and will
> > >> > > automatically transition to cooperative rebalancing
> > >> > > 3) Existing applications which do set an assignor that uses EAGER
> > can
> > >> > > likewise upgrade their applications to COOPERATIVE with a single
> > >> rolling
> > >> > > bounce
> > >> > > 4) Once on 3.0, applications can safely go back and forth between
> > >> EAGER
> > >> > and
> > >> > > COOPERATIVE
> > >> > > 5) Applications can safely downgrade away from 3.0
> > >> > >
> > >> > > The high-level idea for dynamic protocol upgrades is that the
> group
> > >> will
> > >> > > leverage the assignor selected by the group coordinator to
> determine
> > >> when
> > >> > > it’s safe to upgrade to COOPERATIVE, and trigger a fail-safe to
> > >> protect
> > >> > the
> > >> > > group in case of rare events or user misconfiguration. The group
> > >> > > coordinator selects the most preferred assignor that’s supported
> by
> > >> all
> > >> > > members of the group, so we know that all members will support
> > >> > COOPERATIVE
> > >> > > once we receive the “cooperative-sticky” assignor after a
> rebalance.
> > >> At
> > >> > > this point, each member can upgrade their own protocol to
> > COOPERATIVE.
> > >> > > However, there may be situations in which an EAGER member may join
> > the
> > >> > > group even after upgrading to COOPERATIVE. For example, during a
> > >> rolling
> > >> > > upgrade if the last remaining member on the old bytecode misses a
> > >> > > rebalance, the other members will be allowed to upgrade to
> > >> COOPERATIVE.
> > >> > If
> > >> > > the old member rejoins and is chosen to be the group leader before
> > >> it’s
> > >> > > upgraded to 3.0, it won’t be aware that the other members of the
> > group
> > >> > have
> > >> > > not yet revoked their partitions when computing the assignment.
> > >> > >
> > >> > > Short Circuit:
> > >> > > The risk of mixing the cooperative and eager rebalancing protocols
> > is
> > >> > that
> > >> > > a partition may be assigned to one member while it has yet to be
> > >> revoked
> > >> > > from its previous owner. The danger is that the new owner may
> begin
> > >> > > processing and committing offsets for this partition while the
> > >> previous
> > >> > > owner is also committing offsets in its #onPartitionsRevoked
> > callback,
> > >> > > which is invoked at the end of the rebalance in the cooperative
> > >> protocol.
> > >> > > This can result in these consumers overwriting each other’s
> offsets
> > >> and
> > >> > > getting a corrupted view of the partition. Note that it’s not
> > >> possible to
> > >> > > commit during a rebalance, so we can protect against offset
> > >> corruption by
> > >> > > blocking further commits after we detect that the group leader may
> > not
> > >> > > understand COOPERATIVE, but before we invoke #onPartitionsRevoked.
> > >> This
> > >> > is
> > >> > > the “short-circuit” — if we detect that the group is in an unsafe
> > >> state,
> > >> > we
> > >> > > invoke #onPartitionsLost instead of #onPartitionsRevoked and
> > >> explicitly
> > >> > > prevent offsets from being committed on those revoked partitions.
> > >> > >
> > >> > > Consumer procedure:
> > >> > > Upon startup, the consumer will initially select the highest
> > >> > > commonly-supported protocol across its configured assignors. With
> > >> > > ["cooperative-sticky", "range”], the initial protocol will be
> EAGER
> > >> when
> > >> > > the member first joins the group. Following a rebalance, each
> member
> > >> will
> > >> > > check the selected assignor. If the chosen assignor supports
> > >> COOPERATIVE,
> > >> > > the member can upgrade their used protocol to COOPERATIVE and no
> > >> further
> > >> > > action is required. If the member is already on COOPERATIVE but
> the
> > >> > > selected assignor does NOT support it, then we need to trigger the
> > >> > > short-circuit. In this case we will invoke #onPartitionsLost
> instead
> > >> of
> > >> > > #onPartitionsRevoked, and set a flag to block any attempts at
> > >> committing
> > >> > > those partitions which have been revoked. If a commit is
> attempted,
> > as
> > >> > may
> > >> > > be the case if the user does not implement #onPartitionsLost (see
> > >> > > KAFKA-12638), we will throw a CommitFailedException which will be
> > >> bubbled
> > >> > > up through poll() after completing the rebalance. The member will
> > then
> > >> > > downgrade its protocol to EAGER for the next rebalance.
> > >> > >
> > >> > > Let me know what you think,
> > >> > > Sophie
> > >> > >
> > >> > > On Fri, Apr 2, 2021 at 7:08 PM Luke Chen <show...@gmail.com>
> wrote:
> > >> > >
> > >> > > > Hi Sophie,
> > >> > > > Making the default to "cooperative-sticky, range" is a smart
> idea,
> > >> to
> > >> > > > ensure we can at least fall back to rangeAssignor if consumers
> are
> > >> not
> > >> > > > following our recommended upgrade path. I updated the KIP
> > >> accordingly.
> > >> > > >
> > >> > > > Hi Chris,
> > >> > > > No problem, I updated the KIP to include the change in Connect.
> > >> > > >
> > >> > > > Thank you very much.
> > >> > > >
> > >> > > > Luke
> > >> > > >
> > >> > > > On Thu, Apr 1, 2021 at 3:24 AM Chris Egerton
> > >> > <chr...@confluent.io.invalid
> > >> > > >
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Hi all,
> > >> > > > >
> > >> > > > > @Sophie - I like the sound of the dual-protocol default. The
> > >> smooth
> > >> > > > upgrade
> > >> > > > > path it permits sounds fantastic!
> > >> > > > >
> > >> > > > > @Luke - Do you think we can also include Connect in this KIP?
> > >> Right
> > >> > now
> > >> > > > we
> > >> > > > > don't set any custom partition assignment strategies for the
> > >> consumer
> > >> > > > > groups we bring up for sink tasks, and if we continue to just
> > use
> > >> the
> > >> > > > > default, the assignment strategy for those consumer groups
> would
> > >> > change
> > >> > > > on
> > >> > > > > Connect clusters once people upgrade to 3.0. I think this is
> > fine
> > >> > > > (assuming
> > >> > > > > we can take care of
> > >> > https://issues.apache.org/jira/browse/KAFKA-12487
> > >> > > > > before then, which I'm fairly optimistic about), but it might
> be
> > >> > worth
> > >> > > a
> > >> > > > > sentence or two in the KIP explaining that the change in
> default
> > >> will
> > >> > > > > intentionally propagate to Connect. And, if we think Connect
> > >> should
> > >> > be
> > >> > > > left
> > >> > > > > out of this change and stay on the range assignor instead, we
> > >> should
> > >> > > > > probably call that fact out in the KIP as well and state that
> > >> Connect
> > >> > > > will
> > >> > > > > now override the default partition assignment strategy to be
> the
> > >> > range
> > >> > > > > assignor (assuming the user hasn't specified a value for
> > >> > > > > consumer.partition.assignment.strategy in their worker config
> or
> > >> for
> > >> > > > > consumer.override.partition.assignment.strategy in their
> > connector
> > >> > > > config).
> > >> > > > >
> > >> > > > > Cheers,
> > >> > > > >
> > >> > > > > Chris
> > >> > > > >
> > >> > > > > On Wed, Mar 31, 2021 at 12:18 AM Sophie Blee-Goldman
> > >> > > > > <sop...@confluent.io.invalid> wrote:
> > >> > > > >
> > >> > > > > > Ok I'm still fleshing out all the details of KAFKA-12477
> but I
> > >> > think
> > >> > > we
> > >> > > > > can
> > >> > > > > > simplify some things a bit, and avoid
> > >> > > > > > any kind of "fail-fast" which will require user
> intervention.
> > In
> > >> > > fact I
> > >> > > > > > think we can avoid requiring the user to make
> > >> > > > > > any changes at all for KIP-726, so we don't have to worry
> > about
> > >> > > whether
> > >> > > > > > they actually read our documentation:
> > >> > > > > >
> > >> > > > > > Instead of making ["cooperative-sticky"] the default, we
> > change
> > >> the
> > >> > > > > default
> > >> > > > > > to ["cooperative-sticky", "range"].
> > >> > > > > > Since "range" is the old default, this is equivalent to the
> > >> first
> > >> > > > rolling
> > >> > > > > > bounce of the safe upgrade path in KIP-429.
> > >> > > > > >
> > >> > > > > > Of course this also means that under the current protocol
> > >> selection
> > >> > > > > > mechanism we won't actually upgrade to
> > >> > > > > > cooperative rebalancing with the default assignor. But
> that's
> > >> where
> > >> > > > > > KAFKA-12477 will come in.
> > >> > > > > >
> > >> > > > > > @Guozhang Wang <guozh...@confluent.io>  I'll get back to
> you
> > >> with
> > >> > a
> > >> > > > > > concrete proposal and answer your questions, I just want to
> > >> point
> > >> > out
> > >> > > > > > that it's possible to side-step the risk of users shooting
> > >> > themselves
> > >> > > > in
> > >> > > > > > the foot (well, at least in this one specific case,
> > >> > > > > > obviously they always find a way)
> > >> > > > > >
> > >> > > > > > On Tue, Mar 30, 2021 at 10:37 AM Guozhang Wang <
> > >> wangg...@gmail.com
> > >> > >
> > >> > > > > wrote:
> > >> > > > > >
> > >> > > > > > > Hi Sophie,
> > >> > > > > > >
> > >> > > > > > > My question is more related to KAFKA-12477, but since your
> > >> latest
> > >> > > > > replies
> > >> > > > > > > are on this thread I figured we can follow-up on the same
> > >> venue.
> > >> > > Just
> > >> > > > > so
> > >> > > > > > I
> > >> > > > > > > understand your latest comments above about the approach:
> > >> > > > > > >
> > >> > > > > > > * I think, we would need to persist this decision so that
> > the
> > >> > group
> > >> > > > > would
> > >> > > > > > > never go back to the eager protocol, this bit would be
> > >> written to
> > >> > > the
> > >> > > > > > > internal topic's assignment message. Is that correct?
> > >> > > > > > > * Maybe you can describe the steps, after the group has
> > >> decided
> > >> > to
> > >> > > > move
> > >> > > > > > > forward with cooperative protocols, when:
> > >> > > > > > > 1) a new member joined the group with the old version, and
> > >> hence
> > >> > > only
> > >> > > > > > > recognized eager protocol and executing the eager protocol
> > >> with
> > >> > its
> > >> > > > > first
> > >> > > > > > > rebalance, what would happen.
> > >> > > > > > > 2) in addition to 1), the new member joined the group with
> > the
> > >> > old
> > >> > > > > > version
> > >> > > > > > > and only recognized the old subscription format, and was
> > >> selected
> > >> > > as
> > >> > > > > the
> > >> > > > > > > leader, what would happen.
> > >> > > > > > >
> > >> > > > > > > Guozhang
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > On Mon, Mar 29, 2021 at 10:30 PM Luke Chen <
> > show...@gmail.com
> > >> >
> > >> > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi Sophie & Ismael,
> > >> > > > > > > > Thank you for your feedback.
> > >> > > > > > > > No problem, let's pause this KIP and wait for this
> > >> improvement:
> > >> > > > > > > KAFKA-12477
> > >> > > > > > > > <https://issues.apache.org/jira/browse/KAFKA-12477>.
> > >> > > > > > > >
> > >> > > > > > > > Stay tuned :)
> > >> > > > > > > >
> > >> > > > > > > > Thank you.
> > >> > > > > > > > Luke
> > >> > > > > > > >
> > >> > > > > > > > On Tue, Mar 30, 2021 at 3:14 AM Ismael Juma <
> > >> ism...@juma.me.uk
> > >> > >
> > >> > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > > > Hi Sophie,
> > >> > > > > > > > >
> > >> > > > > > > > > I didn't analyze the KIP in detail, but the two
> > >> suggestions
> > >> > you
> > >> > > > > > > mentioned
> > >> > > > > > > > > sound like great improvements.
> > >> > > > > > > > >
> > >> > > > > > > > > A bit more context: breaking changes for a widely used
> > >> > product
> > >> > > > like
> > >> > > > > > > Kafka
> > >> > > > > > > > > are costly and hence why we try as hard as we can to
> > avoid
> > >> > > them.
> > >> > > > > When
> > >> > > > > > > it
> > >> > > > > > > > > comes to the brokers, they are often managed by a
> > central
> > >> > group
> > >> > > > (or
> > >> > > > > > > > they're
> > >> > > > > > > > > in the Cloud), so they're a bit easier to manage. Even
> > so,
> > >> > it's
> > >> > > > > still
> > >> > > > > > > > > possible to upgrade from 0.8.x directly to 2.7 since
> all
> > >> > > protocol
> > >> > > > > > > > versions
> > >> > > > > > > > > are still supported. When it comes to the basic
> clients
> > >> > > > (producer,
> > >> > > > > > > > > consumer, admin client), they're often embedded in
> > >> > applications
> > >> > > > so
> > >> > > > > we
> > >> > > > > > > > have
> > >> > > > > > > > > to be even more conservative.
> > >> > > > > > > > >
> > >> > > > > > > > > Ismael
> > >> > > > > > > > >
> > >> > > > > > > > > On Mon, Mar 29, 2021 at 10:50 AM Sophie Blee-Goldman
> > >> > > > > > > > > <sop...@confluent.io.invalid> wrote:
> > >> > > > > > > > >
> > >> > > > > > > > > > Ismael,
> > >> > > > > > > > > >
> > >> > > > > > > > > > It seems like given 3.0 is a breaking release, we
> have
> > >> to
> > >> > > rely
> > >> > > > on
> > >> > > > > > > users
> > >> > > > > > > > > > being aware of this and responsible
> > >> > > > > > > > > > enough to read the upgrade guide. Otherwise we could
> > >> never
> > >> > > ever
> > >> > > > > > make
> > >> > > > > > > > any
> > >> > > > > > > > > > breaking changes beyond just
> > >> > > > > > > > > > removing deprecated APIs or other
> compilation-breaking
> > >> > errors
> > >> > > > > that
> > >> > > > > > > > would
> > >> > > > > > > > > be
> > >> > > > > > > > > > immediately visible, no?
> > >> > > > > > > > > >
> > >> > > > > > > > > > That said, obviously it's better to have a
> > >> circuit-breaker
> > >> > > that
> > >> > > > > > will
> > >> > > > > > > > fail
> > >> > > > > > > > > > fast in case of a user misconfiguration
> > >> > > > > > > > > > rather than silently corrupting the consumer group
> > >> state --
> > >> > > eg
> > >> > > > > for
> > >> > > > > > > two
> > >> > > > > > > > > > consumers to overlap in their ownership
> > >> > > > > > > > > > of the same partition(s). We could definitely
> > implement
> > >> > this,
> > >> > > > and
> > >> > > > > > now
> > >> > > > > > > > > that
> > >> > > > > > > > > > I think about it this might solve a
> > >> > > > > > > > > > related problem in KAFKA-12477
> > >> > > > > > > > > > <https://issues.apache.org/jira/browse/KAFKA-12477
> >.
> > We
> > >> > just
> > >> > > > > add a
> > >> > > > > > > new
> > >> > > > > > > > > > field to the Assignment in which the group leader
> > >> > > > > > > > > > indicates whether it's on a recent enough version to
> > >> > > understand
> > >> > > > > > > > > cooperative
> > >> > > > > > > > > > rebalancing. If an upgraded member
> > >> > > > > > > > > > joins the group, it'll only be allowed to start
> > >> following
> > >> > the
> > >> > > > new
> > >> > > > > > > > > > rebalancing protocol after receiving the go-ahead
> > >> > > > > > > > > > from the group leader.
> > >> > > > > > > > > >
> > >> > > > > > > > > > If we do go ahead and add this new field in the
> > >> Assignment
> > >> > > then
> > >> > > > > I'm
> > >> > > > > > > > > pretty
> > >> > > > > > > > > > confident we can reduce the number
> > >> > > > > > > > > > of required rolling bounces to just one with
> > KAFKA-12477
> > >> > > > > > > > > > <https://issues.apache.org/jira/browse/KAFKA-12477
> >.
> > In
> > >> > that
> > >> > > > > case
> > >> > > > > > we
> > >> > > > > > > > > > should
> > >> > > > > > > > > > be in much better shape to
> > >> > > > > > > > > > feel good about changing the default to the
> > >> > > > > > > CooperativeStickyAssignor.
> > >> > > > > > > > > How
> > >> > > > > > > > > > does that sound?
> > >> > > > > > > > > >
> > >> > > > > > > > > > To be clear, I'm not proposing we do this as part of
> > >> > KIP-726.
> > >> > > > > > Here's
> > >> > > > > > > my
> > >> > > > > > > > > > take:
> > >> > > > > > > > > >
> > >> > > > > > > > > > Let's pause this KIP while I work on making these
> two
> > >> > > > > improvements
> > >> > > > > > in
> > >> > > > > > > > > > KAFKA-12477 <
> > >> > > https://issues.apache.org/jira/browse/KAFKA-12477
> > >> > > > >.
> > >> > > > > > > Once
> > >> > > > > > > > I
> > >> > > > > > > > > > can
> > >> > > > > > > > > > confirm the
> > >> > > > > > > > > > short-circuit and single rolling bounce will be
> > >> available
> > >> > for
> > >> > > > > 3.0,
> > >> > > > > > > I'll
> > >> > > > > > > > > > report back on this thread. Then we can move
> > >> > > > > > > > > > forward with this KIP again.
> > >> > > > > > > > > >
> > >> > > > > > > > > > Thoughts?
> > >> > > > > > > > > > Sophie
> > >> > > > > > > > > >
> > >> > > > > > > > > > On Mon, Mar 29, 2021 at 12:01 AM Luke Chen <
> > >> > > show...@gmail.com>
> > >> > > > > > > wrote:
> > >> > > > > > > > > >
> > >> > > > > > > > > > > Hi Ismael,
> > >> > > > > > > > > > > Thanks for your good question. Answer them below:
> > >> > > > > > > > > > > *1. Are we saying that every consumer upgraded
> would
> > >> have
> > >> > > to
> > >> > > > > > follow
> > >> > > > > > > > the
> > >> > > > > > > > > > > complex path described in the KIP? *
> > >> > > > > > > > > > > --> We suggest that every consumer did these 2
> steps
> > >> of
> > >> > > > rolling
> > >> > > > > > > > > upgrade.
> > >> > > > > > > > > > > And after KAFKA-12477 <
> > >> > > > > > > > > https://issues.apache.org/jira/browse/KAFKA-12477
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > is completed, it can be reduced to 1 rolling
> > upgrade.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > *2. what happens if they don't read the
> instructions
> > >> and
> > >> > > > > upgrade
> > >> > > > > > as
> > >> > > > > > > > > they
> > >> > > > > > > > > > > have in the past?*
> > >> > > > > > > > > > > --> The reason we want 2 steps of rolling upgrade
> is
> > >> that
> > >> > > we
> > >> > > > > want
> > >> > > > > > > to
> > >> > > > > > > > > > avoid
> > >> > > > > > > > > > > the situation where leader is on old byte-code and
> > >> only
> > >> > > > > recognize
> > >> > > > > > > > > > "eager",
> > >> > > > > > > > > > > but due to compatibility would still be able to
> > >> > deserialize
> > >> > > > the
> > >> > > > > > new
> > >> > > > > > > > > > > protocol data from newer versioned members, and
> > hence
> > >> > just
> > >> > > go
> > >> > > > > > ahead
> > >> > > > > > > > and
> > >> > > > > > > > > > do
> > >> > > > > > > > > > > the assignment while new versioned members did not
> > >> revoke
> > >> > > > their
> > >> > > > > > > > > > partitions
> > >> > > > > > > > > > > before joining the group.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > But I'd say, the new default assignor
> > >> > > > > "CooperativeStickyAssignor"
> > >> > > > > > > was
> > >> > > > > > > > > > > already introduced in V2.4.0, and it should be
> long
> > >> > enough
> > >> > > > for
> > >> > > > > > user
> > >> > > > > > > > to
> > >> > > > > > > > > > > upgrade to the new byte-code to recognize the
> > >> > "cooperative"
> > >> > > > > > > protocol.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > What do you think?
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > Thank you.
> > >> > > > > > > > > > > Luke
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > On Mon, Mar 29, 2021 at 12:14 PM Ismael Juma <
> > >> > > > > ism...@juma.me.uk>
> > >> > > > > > > > > wrote:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > > Thanks for the KIP. Are we saying that every
> > >> consumer
> > >> > > > > upgraded
> > >> > > > > > > > would
> > >> > > > > > > > > > have
> > >> > > > > > > > > > > > to follow the complex path described in the KIP?
> > >> Also,
> > >> > > what
> > >> > > > > > > happens
> > >> > > > > > > > > if
> > >> > > > > > > > > > > they
> > >> > > > > > > > > > > > don't read the instructions and upgrade as they
> > >> have in
> > >> > > the
> > >> > > > > > past?
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > Ismael
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > On Fri, Mar 26, 2021, 1:53 AM Luke Chen <
> > >> > > show...@gmail.com
> > >> > > > >
> > >> > > > > > > wrote:
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > > Hi everyone,
> > >> > > > > > > > > > > > > <Update the subject>
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > I'd like to discuss the following proposal to
> > make
> > >> > the
> > >> > > > > > > > > > > > > CooperativeStickyAssignor as the default
> > assignor.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-726%3A+Make+the+CooperativeStickyAssignor+as+the+default+assignor
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > Any comments are welcomed.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > Thank you.
> > >> > > > > > > > > > > > > Luke
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > --
> > >> > > > > > > -- Guozhang
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>
>
> --
> -- Guozhang
>
