Hello Konstantine,

Thanks for the great write-up! Here are a few quick comments I have about
the proposals:

1. For the "Kubernetes process death" and "Rolling bounce" cases, there is
parallel work on KIP-345 [1] (cc'ed contributor) that aims to
mitigate these two issues, but it relies on being able to
disable sending the "leave group" request immediately on shutdown. Ideally,
if KIP-345 works well for these cases, then Simple Cooperative Rebalancing
together with KIP-345 should cover most of the scenarios
described in the wiki. In addition, the Delayed / Incremental Imbalance
approach can be built incrementally on top of the Simple approach, so
execution-wise I'd suggest we start with the Simple approach and observe how
well it works in practice (especially with frameworks such as K8s) before
deciding whether we should go further and implement the more complicated ones.

2. For the "events" section, I think it may be worth mentioning whether there
are any new client / coordinator failure events that need to be addressed with
the new protocol, as we listed in the original design [2] [3]. For example,
what if the leader receives different client or resource listings during
two consecutive rebalances?

3. It's worth mentioning what the key ideas in the updated protocol are:

3.a) In the original protocol we require every member to revoke every
resource before joining the group, which then serves as the
"synchronization barrier", so it does not matter that clients
receive their assignments at different points in time. In the new protocol we
do not require members to revoke everything; instead we rely on the
leader, who has the "global picture", to make sure that there are no
conflicts over shared resources, i.e. the leader acts as the synchronization
barrier.
3.b) The new Assigned / RevokedPartitions fields in the
responses are now "deltas" instead of "overwrites" to the consumers. Any
module relying on them, e.g. Streams, which relies on ConsumerCoordinator,
needs to adjust its code (PartitionAssignor) correspondingly to
incorporate this semantic change.
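
A rough Java sketch of 3.a and 3.b follows. Everything in it is an
assumption made for illustration: the class and method names
(CooperativeSketch, assignableNow, applyOverwrite, applyDelta) are
hypothetical, resources are modelled as plain strings, and none of this is
taken from the Kafka clients or from the wiki proposal. It only shows a
leader withholding a resource until its current owner has revoked it, and a
member applying a delta rather than an overwrite.

```java
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of the Kafka client API.
final class CooperativeSketch {
    private CooperativeSketch() {}

    // 3.a) Leader side: the leader sees what every member currently owns.
    // For each member, return the part of its target assignment that can be
    // handed out now. A resource still owned by a different member is
    // withheld until that member has revoked it, so no resource is ever
    // owned by two members at once.
    static Map<String, Set<String>> assignableNow(Map<String, Set<String>> owned,
                                                  Map<String, Set<String>> target) {
        Map<String, String> currentOwner = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : owned.entrySet()) {
            for (String resource : e.getValue()) {
                currentOwner.put(resource, e.getKey());
            }
        }
        Map<String, Set<String>> result = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : target.entrySet()) {
            Set<String> safe = new TreeSet<>();
            for (String resource : e.getValue()) {
                String owner = currentOwner.get(resource);
                if (owner == null || owner.equals(e.getKey())) {
                    safe.add(resource);
                }
            }
            result.put(e.getKey(), safe);
        }
        return result;
    }

    // 3.b) Member side, old semantics: the response carries the full
    // assignment, which replaces whatever the member owned before.
    static Set<String> applyOverwrite(Collection<String> assigned) {
        return new TreeSet<>(assigned);
    }

    // 3.b) Member side, new semantics: the response carries only changes.
    // Start from what is owned, drop the revoked resources, add the new ones.
    static Set<String> applyDelta(Set<String> owned,
                                  Collection<String> assigned,
                                  Collection<String> revoked) {
        Set<String> result = new TreeSet<>(owned);
        result.removeAll(revoked);
        result.addAll(assigned);
        return result;
    }
}
```

In this sketch a member that owns {t-0, t-1} and receives assigned = {t-2},
revoked = {t-1} ends up owning {t-0, t-2}; under overwrite semantics the same
assigned list would leave it with {t-2} only, which is why code written
against the old semantics has to change.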

4. I've added a child page under yours illustrating the implications
of rebalance cost reduction for Streams [4]. For Streams, one key
characteristic is that standby tasks exist to help with rebalance-incurred
unavailability, so we need to consider upfront how Streams should
leverage the new protocol along with standby tasks to achieve better
operational goals during rebalances.


[1]
https://cwiki.apache.org/confluence/display/KAFKA/KIP-345%3A+Reduce+multiple+consumer+rebalances+by+specifying+member+id
[2]
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Client-side+Assignment+Proposal#KafkaClient-sideAssignmentProposal-CoordinatorStateMachine
[3]
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design#Kafka0.9ConsumerRewriteDesign-Interestingscenariostoconsider
[4]
https://cwiki.apache.org/confluence/display/KAFKA/Incremental+Cooperative+Rebalancing+for+Streams


On Thu, Oct 4, 2018 at 12:16 PM, McCaig, Rhys <rhys_mcc...@comcast.com>
wrote:

> This is fantastic. I'm really excited to see the work on this.
>
> > On Oct 2, 2018, at 4:22 PM, Konstantine Karantasis <konstant...@confluent.io> wrote:
> >
> > Hey everyone,
> >
> > I'd like to bring to your attention a general design document that was
> > just published in Apache Kafka's wiki space:
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/Incremental+Cooperative+Rebalancing%3A+Support+and+Policies
> >
> > It deals with the subject of Rebalancing of groups in Kafka and proposes
> > basic infrastructure to support improvements on the current rebalancing
> > protocol as well as a set of policies that can be implemented to optimize
> > rebalancing under a number of real-world scenarios.
> >
> > Currently, this wiki page is meant to serve as a reference to the
> > proposition of Incremental Cooperative Rebalancing overall. Specific KIPs
> > will follow in order to describe in more detail - using the standard KIP
> > format - the basic infrastructure and the first policies that will be
> > proposed for implementation in components such as Connect, the Kafka
> > Consumer and Streams.
> >
> > Stay tuned!
> > Konstantine
>
>


-- 
-- Guozhang
