Hey Konstantine,

great work on making this happen! Incremental rebalancing is super important for avoiding unnecessary resource shuffling and improving the overall stability of the Connect framework.
After the first pass, two questions come to mind:

1. As I understand it, the general rebalancing case could be covered by configuring the workers as static members, so that we don't need to worry about the case of a worker temporarily leaving the group. Basically, KIP-345 could help avoid unexpected rebalances during a rolling bounce of the cluster, so I feel the same way as Stanislav that parts of the KIP-415 logic could be simplified. It would be great if we could look at these two initiatives holistically to help reduce the common workload.

2. Since I have never used Connect before, I hope you could enlighten me on the potential effort involved in transferring a task between workers. The reason I ask is to estimate how much burden we would introduce by starting a task on a brand new worker. Is there any local state to be replayed? It would be good to also provide this background in the KIP motivation, so that people can better understand the symptoms and give constructive feedback.

Thanks a lot!

Boyang

________________________________
From: Stanislav Kozlovski <stanis...@confluent.io>
Sent: Monday, January 14, 2019 3:15 PM
To: dev@kafka.apache.org
Subject: Re: [DISCUSS] KIP-415: Incremental Cooperative Rebalancing in Kafka Connect

Hey Konstantine,

This is a very exciting KIP and a fundamental improvement, thanks a lot for working on it!

Have you seen KIP-345 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-345>? I was wondering whether Connect would support static group membership, potentially limiting the need to handle "node bounce" cases through a rebalance (even though there wouldn't be downtime). I find it somewhat related to the `scheduled.rebalance.max.delay.ms` config described in KIP-415. The main difference, I think, is that the rebalance delay in KIP-345 is configurable through `session.timeout.ms`, which is tied to the liveness heartbeat, whereas here we have a separate config.
The original design document suggested

> Assignment response includes usual assignment information. Start processing any new partitions. (Since we expect sticky assignment, we could also optimize this and omit the assignment when it is just repeating a previous assignment)

Have we decided whether we will make use of this optimization, i.e. not sending an assignment that the worker already knows about?

I enjoyed reading the rebalancing examples. As a small readability improvement, could I suggest we clarify which worker (W1, W2, W3) is the leader in the "Initial group and assignment" part? For example, in `Leader bounces` I was left wondering whether the leaving W2 was the initial leader or not.

Thanks,
Stanislav

On Sat, Jan 12, 2019 at 1:44 AM Konstantine Karantasis <konstant...@confluent.io> wrote:

> Hi all,
>
> I just published KIP-415: Incremental Cooperative Rebalancing in Kafka
> Connect
> on the wiki here:
>
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
>
> This is the first KIP to suggest an implementation of incremental and
> cooperative rebalancing in the context of Kafka Connect. It aims to provide
> an adequate solution to the stop-the-world effect that occurs in a Connect
> cluster whenever a new connector configuration is submitted or a Connect
> Worker is added or removed from the cluster.
>
> Looking forward to your insightful feedback!
>
> Regards,
> Konstantine
>

--
Best,
Stanislav