Hey Boyang,

Thanks for the proposal! This is very useful. I have some comments below:

1) The motivation currently explicitly states that the goal is to improve
performance for heavy state application. It seems that the motivation can
be stronger with the following use-case. Currently for MirrorMaker cluster
with e.g. 100 MirrorMaker processes, it will take a long time to rolling
bounce the entire MirrorMaker cluster. Each MirrorMaker process restart
will trigger a rebalance which currently pause the consumption of the all
partitions of the MirrorMaker cluster. With the change stated in this
patch, as long as a MirrorMaker can restart within the specified timeout
(e.g. 2 minutes), then we only need constant number of rebalance (e.g. for
leader restart) for the entire rolling bounce, which will significantly
improves the availability of the MirrorMaker pipeline. In my opinion, the
main benefit of the KIP is to avoid unnecessary rebalance if the consumer
process can be restarted within soon, which helps performance even if
overhead of state shuffling for a given process is small.

2) In order to simplify the KIP reading, can you follow the writeup style
of other KIP (e.g. KIP-98) and list the interface change such as new
configs (e.g. registration timeout), new request/response, new AdminClient
API and new error code (e.g. DUPLICATE_STATIC_MEMBER)? Currently some of
these are specified in the Proposed Change section which makes it a bit
inconvenient to understand the new interface that will be exposed to user.
Explanation of the current two-phase rebalance protocol probably can be
moved out of public interface section.

3) There are currently two version of JoinGroupRequest in the KIP and only
one of them has field memberId. This seems confusing.

4) It is mentioned in the KIP that "An admin API to force rebalance could
be helpful here, but we will make a call once we finished the major
implementation". So this seems to be still an open question in the current
design. We probably want to agree on this before voting for the KIP.

5) The KIP currently adds new config MEMBER_NAME for consumer. Can you
specify the name of the config key and the default config value? Possible
default values include empty string or null (similar to transaction.id in
producer config).

6) Regarding the use of the topic "static_member_map" to persist member
name map, currently if consumer coordinator broker goes offline, rebalance
is triggered and consumers will try connect to the new coordinator. If
these consumers can connect to the new coordinator within
max.poll.interval.ms which by default is 5 minutes, given that broker can
use a deterministic algorithm to determine the partition -> member_name
mapping, each consumer should get assigned the same set of partitions
without requiring state shuffling. So it is not clear whether we have a
strong use-case for this new logic. Can you help clarify what is the
benefit of using topic "static_member_map" to persist member name map?

7) Regarding the introduction of the expensionTimeoutMs config, it is
mentioned that "we are using expansion timeout to replace rebalance
timeout, which is configured by max.poll.intervals from client side, and
using registration timeout to replace session timeout". Currently the
default max.poll.interval.ms is configured to be 5 minutes and there will
be only one rebalance if all new consumers can join within 5 minutes. So it
is not clear whether we have a strong use-case for this new config. Can you
explain what is the benefit of introducing this new config?

8) It is mentioned that "To distinguish between previous version of
protocol, we will also increase the join group request version to v4 when
MEMBER_NAME is set" and "If the broker version is not the latest (< v4),
the join group request shall be downgraded to v3 without setting the member
Id". It is probably simpler to just say that this feature is enabled if
JoinGroupRequest V4 is supported on both client and broker and MEMBER_NAME
is configured with non-empty string.

9) It is mentioned that broker may return NO_STATIC_MEMBER_INFO_SET error
in OffsetCommitResponse for "commit requests under static membership". Can
you clarify how broker determines whether the commit request is under
static membership?

Thanks,
Dong

Reply via email to