Re: [EXTERNAL] Incremental Cooperative Rebalancing

Ryanne Dolan Tue, 16 Oct 2018 18:32:29 -0700

Konstantine, thanks for the explanation, makes sense.

Ryanne


On Tue, Oct 16, 2018, 1:51 PM Konstantine Karantasis <
konstant...@confluent.io> wrote:

> Matthias, Ryanne, Rhys, Guozhang, thank you all for your comments!
>
> Ryanne, to try to address your specific comments, let me start by saying
> that a key concept behind this proposal is the concept of overlapping
> 'communication' with 'computation', which is known for often reducing the
> overall cost (or latency if you prefer) of an operation that involves
> multiple processes compared to global barrier-type synchronization. Of
> course this does not rule out that there might be occasions where
> stop-the-world might incur smaller overall cost. But given that we'll
> always want to minimize communication and shuffling of resources and apply
> more sticky heuristics in assignments, I believe that such edge cases will
> be considerably fewer in practice.
>
> Specifically here, the goal is to exchange (by revoking and reassigning)
> only the necessary resources in the group and allow for the unaffected
> resources to continue being used. As mentioned in the motivation section,
> this is expected to have a positive effect in a number of use cases, for
> which stop-the-world is too strict. Given this key distinction between
> affected and unaffected resources (e.g. topic partitions, tasks in Kafka
> Connect etc) I anticipate that in most cases, even for resources that need
> to change hands, the overall rebalance phase will be faster (especially at
> larger scale) than it is today with all the processes participating in
> resource hand-off and re-assignment.
>
> -Konstantine
>
>
> On Fri, Oct 5, 2018 at 12:00 PM Guozhang Wang <wangg...@gmail.com> wrote:
>
> > Hello Konstantine,
> >
> > Thanks for the great write-up! Here are a few quick comments I have about
> > the proposals:
> >
> > 1. For "Kubernetes process death" and "Rolling bounce" case, there is
> > another parallel work on KIP-345 [1] (cc'ed contributor) that is aimed to
> > mitigate these two issues, but it is relying on the fact that we can
> > disable sending "leave group" request immediately on shutting down.
> Ideally
> > if KIP-345 works well for these cases, then Simple Cooperative
> Rebalancing
> > itself along with KIP-345 should cover most of the scenarios we've
> > described in the wiki. In addition, Delayed / Incremental Imbalance
> > approach can be done incrementally on top of Simple approach, so
> execution
> > wise I'd suggest we start with the Simple approach and observe how well
> it
> > works in practice (especially with K8s etc frameworks) before deciding if
> > we should go further and implemented the more complicated ones.
> >
> > 2. For the "events" section, I think it may worth mentioning if there are
> > any new client / coordinator failure events that need to be addressed
> with
> > the new protocol, as we listed in the original design [2] [3]. For
> example,
> > what if the leader received different client or resource listings during
> > two consecutive rebalances?
> >
> > 3. It's worth mentioning what are the key ideas in the updated protocol:
> >
> > 3.a) In the original protocol we require every member to revoke every
> > resource before joining the group, which can then be used as the
> > "synchronization barrier" and hence it does not matter for clients to
> > receive assignment at different point in time; in the new protocol we do
> > not require members to revoke everything, but instead leveraging on the
> > leader who has the "global picture" to make sure that there are no
> > conflicts between those shared resources, a.k.a as the synchronization
> > barrier.
> > 3.b) The new fields in the Assigned / RevokedPartitions fields in the
> > responses are now "deltas" instead of "overwrites" to the consumers. Any
> > modules relying on it, e.g. Streams who relies on ConsumerCoordinator,
> > needs to adjust their code (PartitionAssignor) correspondingly to
> > incorporate this semantic changes.
> >
> > 4. I've added a child page under yours for illustrating the implications
> > for Streams on rebalance cost reduction [4], since for Streams one key
> > characteristics is that standby tasks exist to help with rebalance
> incurred
> > unavailability, and hence need to be considered upfront how Streams
> should
> > leverage on the new protocol along with standby tasks to achieve the
> better
> > operational goals during rebalances.
> >
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-345%3A+Reduce+multiple+consumer+rebalances+by+specifying+member+id
> > [2]
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Client-side+Assignment+Proposal#KafkaClient-sideAssignmentProposal-CoordinatorStateMachine
> > [3]
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design#Kafka0.9ConsumerRewriteDesign-Interestingscenariostoconsider
> > [4]
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/Incremental+Cooperative+Rebalancing+for+Streams
> >
> >
> > On Thu, Oct 4, 2018 at 12:16 PM, McCaig, Rhys <rhys_mcc...@comcast.com>
> > wrote:
> >
> > > This is fantastic. Im really excited to see the work on this.
> > >
> > > > On Oct 2, 2018, at 4:22 PM, Konstantine Karantasis <
> > > konstant...@confluent.io> wrote:
> > > >
> > > > Hey everyone,
> > > >
> > > > I'd like to bring to your attention a general design document that
> was
> > > just
> > > > published in Apache Kafka's wiki space:
> > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/Incrementa
> > > l+Cooperative+Rebalancing%3A+Support+and+Policies
> > > >
> > > > It deals with the subject of Rebalancing of groups in Kafka and
> > proposes
> > > > basic infrastructure to support improvements on the current
> rebalancing
> > > > protocol as well as a set of policies that can be implemented to
> > optimize
> > > > rebalancing under a number of real-world scenarios.
> > > >
> > > > Currently, this wiki page is meant to serve as a reference to the
> > > > proposition of Incremental Cooperative Rebalancing overall. Specific
> > KIPs
> > > > will follow in order to describe in more detail - using the standard
> > KIP
> > > > format - the basic infrastructure and the first policies that will be
> > > > proposed for implementation in components such as Connect, the Kafka
> > > > Consumer and Streams.
> > > >
> > > > Stay tuned!
> > > > Konstantine
> > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>

Re: [EXTERNAL] Incremental Cooperative Rebalancing

Reply via email to