Hi Konstantine,

Thanks for the updated KIP and for the PR as well (which is huge :)). I
briefly looked through both, and I have one minor comment about backward
compatibility (otherwise I'm a binding +1 on it as well). I'll use one
example to illustrate the issue:

1) Suppose workerA and workerB are both on the newer version and have
connect.protocol configured as "compatible". They will both send V0 and V1
to the leader (say it's workerA), which will choose V1 as the current
protocol. That choice is sent back to A and B, who will remember that the
current protocol version is now V1. So after this rebalance everyone
remembers that V1 can be used, which means that upon prepareJoin they will
not revoke all of their assigned tasks.
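
To make the selection in 1) concrete, here is a toy sketch in Python. It is
not Connect code: choose_protocol and the worker-to-versions map are names
made up for illustration, and it only assumes the leader picks the highest
version that every member advertises.

```python
def choose_protocol(supported_by_worker):
    """Return the highest protocol version that every worker advertises.

    `supported_by_worker` is a hypothetical map of worker name to the set
    of versions it sent in its join request (0 stands for V0, 1 for V1).
    """
    if not supported_by_worker:
        raise ValueError("empty group")
    common = set.intersection(*(set(v) for v in supported_by_worker.values()))
    if not common:
        raise ValueError("no protocol version is supported by every worker")
    return max(common)


# Both workers run the newer version in "compatible" mode: they send V0 and V1.
group = {"workerA": {0, 1}, "workerB": {0, 1}}
current_protocol = choose_protocol(group)  # V1 is chosen

# A V0-only worker joins: the only common version is V0.
group_with_old_worker = dict(group, workerC={0})
downgraded_protocol = choose_protocol(group_with_old_worker)
```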

2) Now let's say a new worker joins with the old version V0 (in practice
this is rare, but some common scenarios may fall into this category, e.g.
an existing worker being downgraded, which is essentially the same as being
kicked out of the group and then rejoining as a new member on the older
version). The leader realizes that at least one of the members does not
know V1 and hence falls back to version V0 to perform the assignment. The
V0 algorithm does an eager rebalance, which may immediately move some tasks
from the existing members to the newcomer, because it assumes that everyone
has revoked everything before joining (a.k.a. the sync barrier). But this
is actually not true, since everyone other than the old-versioned newcomer
still follows the V1 behavior, i.e. not revoking anything before sending
the join group request.
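
A toy model of the hazard in 2), again in Python and not Connect code. The
round-robin eager_assign below is only a stand-in for the V0 assignment
(the real algorithm may distribute differently), and all names are
hypothetical. The point is just that an assignment computed as if everyone
had revoked can hand a task to one worker while another is still running
it.

```python
def eager_assign(tasks, workers):
    """Round-robin all tasks across workers, as if nobody were running any."""
    assignment = {w: [] for w in workers}
    for i, task in enumerate(sorted(tasks)):
        assignment[workers[i % len(workers)]].append(task)
    return assignment


def duplicated_tasks(still_running, new_assignment):
    """Tasks given to one worker while a different worker still runs them."""
    duplicated = set()
    for worker, tasks in new_assignment.items():
        for task in tasks:
            for other, running in still_running.items():
                if other != worker and task in running:
                    duplicated.add(task)
    return duplicated


# workerA and workerB followed V1 and revoked nothing before rejoining;
# workerC is the V0 newcomer and runs nothing yet.
still_running = {
    "workerA": {"t0", "t1"},
    "workerB": {"t2", "t3"},
    "workerC": set(),
}
new_assignment = eager_assign(
    {"t0", "t1", "t2", "t3"}, ["workerA", "workerB", "workerC"]
)
conflicts = duplicated_tasks(still_running, new_assignment)
# t2 moves to workerC although workerB never stopped it, so it is in conflicts.
```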

This could be solved, though. For example, when the leader realizes that it
needs to use V0 while the previous "currentProtocol" value is V1, instead
of blindly following the V0 algorithm it could just reassign the existing
tasks to their current owners without migrating anything, while at the same
time telling everyone that the currentProtocol version is downgraded to V0.
Then they can trigger another rebalance based on V0, in which everyone will
revoke their tasks before sending join group requests.
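
A rough sketch of that idea in Python, under my own assumptions:
LeaderDecision, assign_on_leader and the needs_followup_rebalance flag are
hypothetical names, not anything in the KIP or the PR, and normal_assign
stands in for whatever assignment the chosen protocol would normally run.

```python
from dataclasses import dataclass

V0, V1 = 0, 1


@dataclass
class LeaderDecision:
    protocol: int                   # version announced to the group
    assignment: dict                # worker name -> list of tasks
    needs_followup_rebalance: bool  # hypothetical "rebalance again" signal


def assign_on_leader(previous_protocol, chosen_protocol, previous_assignment,
                     members, normal_assign):
    """Leader-side assignment with a guarded V1 -> V0 downgrade.

    `normal_assign(tasks, members)` is a placeholder for the usual
    assignment of the chosen protocol.
    """
    if previous_protocol == V1 and chosen_protocol == V0:
        # Downgrade round: nobody revoked, so keep every task where it is
        # (the newcomer gets nothing yet) and only announce V0.
        sticky = {m: list(previous_assignment.get(m, [])) for m in members}
        return LeaderDecision(V0, sticky, needs_followup_rebalance=True)
    # Otherwise the group already agrees on the revocation behavior.
    tasks = [t for owned in previous_assignment.values() for t in owned]
    return LeaderDecision(chosen_protocol, normal_assign(tasks, members), False)


previous = {"workerA": ["t0", "t1"], "workerB": ["t2", "t3"]}
decision = assign_on_leader(
    V1, V0, previous, ["workerA", "workerB", "workerC"],
    normal_assign=lambda tasks, members: {m: [] for m in members},
)
```

In the follow-up rebalance every member already remembers V0, so they all
revoke before joining and the regular V0 assignment is safe to apply.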


Guozhang

On Wed, Mar 6, 2019 at 2:28 PM Konstantine Karantasis <
konstant...@confluent.io> wrote:

> I'd like to open the vote on KIP-415: Incremental Cooperative Rebalancing
> in Kafka Connect
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-415%3A+Incremental+Cooperative+Rebalancing+in+Kafka+Connect
>
> a proposal that will allow Kafka Connect to scale significantly the number
> of connectors and tasks it can run in a cluster of Connect workers.
>
> Thanks,
> Konstantine
>


-- 
-- Guozhang
