Hi Ismael,

Thanks for your feedback. Let me answer your questions inline.

> 1. I think it's premature to talk about target versions for deprecation and
> removal of the existing group protocol. Unlike KRaft, this affects a core
> client protocol and hence deprecation/removal will be heavily dependent on
> how quickly applications migrate to the new protocol.

That makes sense. I will remove it.

> 2. The KIP says we intend to release this in 4.x, but it wasn't made clear
> why. If we added that as a way to estimate when we'd deprecate and remove
> the group protocol, I also suggest removing this part.

Let me explain my reasoning. As explained, I plan to rewrite the group
coordinator in Java while we implement the new protocol. This means
that the internals will be slightly different (e.g. threading model).
Therefore, I wanted to tighten the switch from the old group
coordinator to the new group coordinator to a major release. The
alternative would be to use a flag to do the switch instead of relying
on the software upgrade.

> 3. We need to flesh out the details of the migration story. It sounds like
> we're saying we will support online migrations. Is that correct? We should
> explain this in detail. It could also be done as a separate KIP, if it's
> easier.

Yes, we will support online migrations for the group. That means that
a group using the old protocol will be able to switch to the new
protocol.

Let me briefly explain how that will work though. It is basically a
four step process:

1. The cluster must be upgraded or rolled to a software supporting the
new group coordinator. Both the old and the new coordinator will
support the old protocol and rely on the same persisted metadata so
they can work together. This point is an offline migration. We cannot
do this one live because it would require shutting down the current
coordinator and starting up the new one and that would cause
unavailabilities.
2. The cluster's metadata version/IBP must be upgraded to X in order
to enable the new protocol. This cannot be done before 1) is
terminated because the old coordinator doesn't support the new
protocol.
3. The consumers must be upgraded to a version supporting the online
migration (must have KIP-792). If the consumer is already there.
Nothing must be done at this point.
4. The consumers must be rolled with the feature flag turned on. The
consumer group is automatically converted when the first consumer
using the new protocol joins the group. While the members using the
old protocol are being upgraded, the old protocol is proxied into the
new one.

Let me clarify all of this in the KIP.

> 4. I am happy that we are pushing the pattern subscriptions to the server,
> but it seems like there could be some tricky compatibility issues. Will we
> have a mechanism for users to detect that they need to update their regex
> before switching to the new protocol?

I think that I am a bit more optimistic than you on this point. I
believe that the majority of the cases are simple regexes which should
work with the new engine. The coordinator will verify the regex anyway
and reject the consumer if the regex is not valid. Coming back to the
migration path, in the worst case, the first upgraded consumer joining
the group will be rejected. This should be used as the last defence, I
would say.

One way for customers to validate their regex before upgrading their
prod would be to test them with another group. For instance, that
could be done in a pre-prod environment. Another way would be to
extend the consumer-group tool to provide a regex validation
mechanism. Would this be enough in your opinion?

> 5. Related to the last question, will the Java client allow the users to
> stick with the current regex engine for compatibility reasons? For example,
> it may be handy to keep using client based regex at first to keep
> migrations simple and then migrate to server based regexes as a second step.

I understand your point but I am concerned that this would allow users
to actually stay in this mode. That would go against our goal of
simplifying the client because we would have to continue monitoring
the metadata on the client side. I would rather not do this.

> 6. When we say that the group coordinator will be responsible for storing
> the configurations and that the configurations will be deleted when the
> group is deleted. Will a transition to DEAD trigger deletion of
> configurations?

That's right. The configurations will be deleted when the group is
deleted. They go together.

> 7. Will the choice to store the configs in the group coordinator make it
> harder to list all cluster configs and their values?

I don't think so. The group configurations are overrides of cluster
configs. If you want to know all the overrides though, you would have
to ask all the group coordinators. You cannot rely on the metadata log
for instance.

> 8. How would someone configure a group before starting the consumers? Have
> we considered allowing the explicit creation of groups? Alternatively, the
> configs could be decoupled from the group lifecycle.

Yes. The group will be automatically created in this case. However,
the configs will be lost after the retention period of the group
passes.

> 9. Will the Consumer.subscribe method for the Java client still take a
> `java.util.regex.Pattern` of do we have to introduce an overload?

That's a very group question. I forgot about that one. As the
`java.util.regex.Pattern` is not fully compatible with the engine that
we plan to use, it might be better to deprecate it and use an overload
which takes a string. We would rely on the server side validation.
During the migration, I think that we could still try to toString the
regex and use it. That should work, I think, in the majority of the
cases.

> 10. I agree with Justine that we should be clearer about the reason to
> switch to IBP/metadata.version from the feature flag. Maybe we mean that we
> can switch the default for the feature flag to true based on the
> metadata.version once we want to make it the default.

My plan was to use that feature flag mainly during the development
phase. I should not have mentioned it, I think, because we could use
an internal config for it.

> 11. Some of the protocol APIs don't mention the required ACLs, it would be
> good to add that for consistency.

Noted.

> 12. It is a bit odd that ConsumerGroupHeartbeat requires "Read Group" even
> though it seems to do more than reading.

I agree. This is how the current protocol works though. We only
require "Read Group" to join a group. We could consider changing this
but I am not sure that it is worth it.

> 13. How is topic recreation handled by the consumer with the new group
> protocol? It would be good to have a section on this.

Noted. From a protocol perspective, the new topic will have a new
topic id so it will treat it like a topic with a different name. The
only issue is that the fetch/commit offsets APIs do not support topic
IDs so the consumer would reuse the offsets based on the same. I think
that we should update those APIs as well in order to be consistent end
to end. That would strengthen the semantics of the consumer.

> 14. The KIP mentions we will write the new coordinator in Java. Even though
> this is an implementation detail, do we plan to have a new gradle module
> for it?

Yes.

> 15. Do we have a scalability goal when it comes to how many members the new
> group protocol can support?

We don't have numbers at the moment. The protocol should support 1000s
of members per group. We will measure this when we have a first
implementation. Note that we might have other bottlenecks down the
road (e.g. offset commits).

> 16. Did we consider having SubscribedTopidIds instead of
> SubscribedTopicNames in ConsumerGroupHeartbeatRequest? Is the idea that
> since we have to resolve the regex on the server, we can do the same for
> the topic name? The difference is that sending the regex is more efficient
> whereas sending the topic names is less efficient. Furthermore, delete and
> recreation is easier to handle if we have topic ids.

The idea was to consolidate the metadata lookup on the server for both
paths but I do agree with your point. As a second though, using topic
ids may be better here for the delete and recreation case. Also, I
suppose that we may allow users to subscribe with topic ids in the
future because that is the only way to be really robust to topic
re-creation.

Best,
David

On Tue, Jul 12, 2022 at 1:38 PM David Jacot <dja...@confluent.io> wrote:
>
> Hi Justine,
>
> Thanks for your comments. Please find my answers below.
>
> - Yes, the new protocol relies on topic IDs with the exception of the
> topic names based in the ConsumerGroupHeartbeatRequest. I am not sure
> if using topic names is the right call here. I need to think about it
> a little more. Obviously, the KIP does not change the fetch/commit
> offsets RPCs to use topic IDs. This may be something that we should
> include though as it would give better overall guarantee in the
> producer.
> - You're right. I think that I should not have mentioned this flag at
> all. I will remove it. We can use an internal configuration while
> developing the feature.
> - Both cluster types will be supported. The change is orthogonal. The
> only requirement is that the cluster uses topic IDs.
>
> Best,
> David
>
> On Mon, Jul 11, 2022 at 9:53 PM Guozhang Wang <wangg...@gmail.com> wrote:
> >
> > Hi Ismael,
> >
> > Thanks for the feedback. Here are some replies inlined below:
> >
> > On Sat, Jul 9, 2022 at 2:53 AM Ismael Juma <ism...@juma.me.uk> wrote:
> >
> > > Thanks for the KIP. This has the potential to be a great improvement. A 
> > > few
> > > initial questions/comments:
> > >
> > > 1. I think it's premature to talk about target versions for deprecation 
> > > and
> > > removal of the existing group protocol. Unlike KRaft, this affects a core
> > > client protocol and hence deprecation/removal will be heavily dependent on
> > > how quickly applications migrate to the new protocol.
> > >
> >
> > Yeah I agree with you. I think we can remove the proposed timeline in the
> > `Compatibility, Deprecation, and Migration Plan` and instead just state
> > that we will decide in the future about when we would deprecate old
> > protocol and behaviors.
> >
> >
> > > 2. The KIP says we intend to release this in 4.x, but it wasn't made clear
> > > why. If we added that as a way to estimate when we'd deprecate and remove
> > > the group protocol, I also suggest removing this part.
> > >
> >
> > I think that's not specifically related to the deprecation/removal timeline
> > plan, but it's more for client upgrades. I.e. the broker-side
> > implementation may be done first, and then the client side, and we would
> > only mark it as "released" by the time clients implementations are done. At
> > that time, to enable the feature the clients need to first swap-in the
> > bytecode with a rolling bounce and then set the flag with a second rolling
> > bounce, and hence we feel it's better to be released in a major version.
> >
> >
> > > 3. We need to flesh out the details of the migration story. It sounds like
> > > we're saying we will support online migrations. Is that correct? We should
> > > explain this in detail. It could also be done as a separate KIP, if it's
> > > easier.
> > >
> >
> > Yes I think that's the part we can be more concrete about for sure (and
> > this is related to your question 2) above). We will work on making it more
> > explicit in parallel as we solicit more feedback.
> >
> >
> > > 4. I am happy that we are pushing the pattern subscriptions to the server,
> > > but it seems like there could be some tricky compatibility issues. Will we
> > > have a mechanism for users to detect that they need to update their regex
> > > before switching to the new protocol?
> > >
> >
> > Yes I think we need some tooling for non-java client users to sort of
> > "dry-run" the client before switching to the new protocol. I do not have a
> > specific idea on top of my head though, maybe others like @Matt Howlett can
> > chime-in here?
> >
> >
> > > 5. Related to the last question, will the Java client allow the users to
> > > stick with the current regex engine for compatibility reasons? For 
> > > example,
> > > it may be handy to keep using client based regex at first to keep
> > > migrations simple and then migrate to server based regexes as a second
> > > step.
> > >
> >
> > Honestly I have not thought about that for java clients, and we can discuss
> > that. What kind of compatibility issues do you have in mind?
> >
> >
> > > 6. When we say that the group coordinator will be responsible for storing
> > > the configurations and that the configurations will be deleted when the
> > > group is deleted. Will a transition to DEAD trigger deletion of
> > > configurations?
> > >
> >
> > Yes, since the DEAD state is an ending state (we would only transit to that
> > state when the group is EMPTY and also all of its metadata are gone), once
> > it's transited to DEAD this group would never be revived.
> >
> >
> > > 7. Will the choice to store the configs in the group coordinator make it
> > > harder to list all cluster configs and their values?
> > >
> >
> > That's a good question, and our thoughts are that the so-called "group
> > configurations" are overrides of the cluster-level configurations
> > customized per group so when an admin list cluster configs it's okay to
> > list just the cluster-level "defaults", not showing any per-group
> > customizations.
> >
> >
> > > 8. How would someone configure a group before starting the consumers? Have
> > > we considered allowing the explicit creation of groups? Alternatively, the
> > > configs could be decoupled from the group lifecycle.
> > >
> >
> > The configs can be created before the group itself as an independent entity
> > --- of course, this requires the corresponding request to be routed to the
> > right coordinator based on the group id --- the only thing that differs is,
> > when the group itself is gone we also check if there are any configuration
> > entities related to that group and delete as well.
> >
> > Admittedly this indeed introduces an asymmetry on the creation / deletion
> > lifecycles of the config entities, and we would like to hear everyone's
> > feelings whether we should aim for symmetry i.e. totally decouple group
> > configs and hence not delete them at all when the group is gone, but always
> > require explicit deletion operations by themselves.
> >
> >
> > > 9. Will the Consumer.subscribe method for the Java client still take a
> > > `java.util.regex.Pattern` of do we have to introduce an overload?
> > >
> >
> > I think we do not need to introduce an overload, but I'm all ears if there
> > may be some compatibility issues that we may overlook.
> >
> >
> > > 10. I agree with Justine that we should be clearer about the reason to
> > > switch to IBP/metadata.version from the feature flag. Maybe we mean that 
> > > we
> > > can switch the default for the feature flag to true based on the
> > > metadata.version once we want to make it the default.
> >
> > 11. Some of the protocol APIs don't mention the required ACLs, it would be
> > > good to add that for consistency.
> > >
> >
> > Ack, we can certainly do that.
> >
> >
> > > 12. It is a bit odd that ConsumerGroupHeartbeat requires "Read Group" even
> > > though it seems to do more than reading.
> > >
> >
> > I had that thought myself as well, but in the end we could not find a
> > better alternative: adding Write Group seems an overkill here since we do
> > not have it elsewhere (we only have Read / Delete and Describe on groups so
> > far). Would like to hear others thoughts.
> >
> >
> > > 13. How is topic recreation handled by the consumer with the new group
> > > protocol? It would be good to have a section on this.
> > >
> >
> > You mean with regex subscription right? Yes we can add a section about
> > that, but basically the idea is that consumer would be totally agnostic in
> > the new protocol as it's handled all by the brokers.
> >
> >
> > > 14. The KIP mentions we will write the new coordinator in Java. Even 
> > > though
> > > this is an implementation detail, do we plan to have a new gradle module
> > > for it?
> > >
> >
> > We have not thought about that. But I think the answer should be yes.
> >
> >
> > > 15. Do we have a scalability goal when it comes to how many members the 
> > > new
> > > group protocol can support?
> > >
> >
> > Within a group, I think we should shoot for 1000s of members. But that
> > scalability goals also depend on the offset management (commit, fetch)
> > capabilities of the coordinator which we did not cover in this KIP, so it's
> > hard to give a number that applies universally.
> >
> >
> > > 16. Did we consider having SubscribedTopidIds instead of
> > > SubscribedTopicNames in ConsumerGroupHeartbeatRequest? Is the idea that
> > > since we have to resolve the regex on the server, we can do the same for
> > > the topic name? The difference is that sending the regex is more efficient
> > > whereas sending the topic names is less efficient. Furthermore, delete and
> > > recreation is easier to handle if we have topic ids.
> > >
> >
> > The main reason to still let the clients send names is to keep the
> > reasoning of names -> ids on the broker / admin client only. Note that
> > although we added topic id in KIP-516, we never implemented the logic on
> > consumer/producers leveraging the related newer versioned RPCs, instead we
> > just set the topic id as empty UUID. We want to keep the consumer/producer
> > to be thin and only delegate the reasoning on broker and potentially admin
> > clients.
> >
> >
> > >
> > > Thanks,
> > > Ismael
> > >
> > > On Wed, Jul 6, 2022 at 10:45 AM David Jacot <dja...@confluent.io.invalid>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I would like to start a discussion thread on KIP-848: The Next
> > > > Generation of the Consumer Rebalance Protocol. With this KIP, we aim
> > > > to make the rebalance protocol (for consumers) more reliable, more
> > > > scalable, easier to implement for clients, and easier to debug for
> > > > operators.
> > > >
> > > > The KIP is here: https://cwiki.apache.org/confluence/x/HhD1D.
> > > >
> > > > Please take a look and let me know what you think.
> > > >
> > > > Best,
> > > > David
> > > >
> > > > PS: I will be away from July 18th to August 8th. That gives you a bit
> > > > of time to read and digest this long KIP.
> > > >
> > >
> >
> >
> > --
> > -- Guozhang

Reply via email to