Hi Colin,

Thanks for the feedback and suggestions! It is a great idea to provide a `--finalize-latest` flag. I agree it's a burden to ask the user to manually upgrade each feature to the latest version after a release.
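For illustration, the flow behind such a flag could be sketched like this (a minimal sketch only; the `FakeAdmin` class and its method names are hypothetical stand-ins for the tool's admin client, not the KIP's actual API):

```python
# Sketch of the proposed --finalize-latest flow: query the highest supported
# version of each feature, then finalize all of them in one update RPC.
# FakeAdmin is a hypothetical stand-in, not Kafka's real AdminClient.

class FakeAdmin:
    """Stand-in admin client holding the cluster's supported feature ranges."""
    def __init__(self, supported, finalized):
        self.supported = supported    # {feature: (min_version, max_version)}
        self.finalized = finalized    # {feature: finalized_max_version}

    def describe_features(self):
        # In the real tool this would be an ApiVersions/describe RPC.
        return dict(self.supported)

    def update_features(self, updates):
        # In the real tool this would be the feature-update RPC.
        self.finalized.update(updates)

def finalize_latest(admin):
    """Finalize every feature at the highest version the cluster supports."""
    supported = admin.describe_features()
    updates = {name: hi for name, (_lo, hi) in supported.items()}
    admin.update_features(updates)
    return updates

admin = FakeAdmin({"group_coordinator": (1, 10), "txn": (1, 3)}, {"txn": 2})
finalize_latest(admin)
```

With this flow the operator never has to know the latest version of each feature; the tool discovers it from the cluster and finalizes it in a second RPC.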
I have now updated the KIP, adding this idea.

> What about a simple solution to this problem where we add a flag to the command-line tool like --enable-latest? The command-line tool could query what the highest possible versions for each feature were (using the API) and then make another RPC to enable the latest features.

(Kowshik): I've updated the KIP with the above idea. Please look at this section (point #3 and the tooling example later): https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features#KIP-584:Versioningschemeforfeatures-Toolingsupport

> I think this is actually much easier than the version number solution. The version string solution requires us to maintain a complicated mapping table between version strings and features.
>
> In practice, we also have "internal versions" in ApiVersion.scala like 2.4IV0, 2.4IV1, and so on. This isn't simple for users to understand or use.
>
> It's also hard to know what the difference is between different version strings. For example, there's actually no difference between 2.5IV0 and 2.4IV1, but you wouldn't know that unless you read the comments in ApiVersion.scala. A system administrator who didn't know this might end up doing a cluster roll to upgrade the IBP that turned out to be unnecessary.

(Kowshik): Yes, I can see the disadvantages!

Cheers,
Kowshik

On Mon, Apr 6, 2020 at 3:46 PM Colin McCabe <cmcc...@apache.org> wrote:

> Hi Jun,
>
> I agree that asking the user to manually upgrade all features to the latest version is a burden. Then the user has to know what the latest version of every feature is when upgrading.
>
> What about a simple solution to this problem where we add a flag to the command-line tool like --enable-latest? The command-line tool could query what the highest possible versions for each feature were (using the API) and then make another RPC to enable the latest features.
>
> I think this is actually much easier than the version number solution.
> The version string solution requires us to maintain a complicated mapping table between version strings and features. In practice, we also have "internal versions" in ApiVersion.scala like 2.4IV0, 2.4IV1, and so on. This isn't simple for users to understand or use.
>
> It's also hard to know what the difference is between different version strings. For example, there's actually no difference between 2.5IV0 and 2.4IV1, but you wouldn't know that unless you read the comments in ApiVersion.scala. A system administrator who didn't know this might end up doing a cluster roll to upgrade the IBP that turned out to be unnecessary.
>
> best,
> Colin
>
> On Mon, Apr 6, 2020, at 12:06, Jun Rao wrote:
> > Hi, Kowshik,
> >
> > Thanks for the reply. A few more replies below.
> >
> > 100.6 You can look for the sentence "This operation requires ALTER on CLUSTER." in KIP-455. Also, you can check its usage in KafkaApis.authorize().
> >
> > 110. From the external client/tooling perspective, it's more natural to use the release version for features. If we can use the same release version for internal representation, it seems simpler (easier to understand, no mapping overhead, etc). Is there a benefit with separate external and internal versioning schemes?
> >
> > 111. To put this in context, when we had IBP, the default value is the current released version. So, if you are a brand new user, you don't need to configure IBP and all new features will be immediately available in the new cluster. If you are upgrading from an old version, you do need to understand and configure IBP. I see a similar pattern here for features. From the ease of use perspective, ideally, we shouldn't require a new user to have an extra step such as running a bootstrap script unless it's truly necessary.
> > If someone has a special need (all the cases you mentioned seem special cases?), they can configure a mode such that features are enabled/disabled manually.
> >
> > Jun
> >
> > On Fri, Apr 3, 2020 at 5:54 PM Kowshik Prakasam <kpraka...@confluent.io> wrote:
> >
> > > Hi Jun,
> > >
> > > Thanks for the feedback and suggestions. Please find my response below.
> > >
> > > > 100.6 For every new request, the admin needs to control who is allowed to issue that request if security is enabled. So, we need to assign the new request a ResourceType and possible AclOperations. See https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A+Create+an+Administrative+API+for+Replica+Reassignment as an example.
> > >
> > > (Kowshik): I don't see any reference to the words ResourceType or AclOperations in the KIP. Please let me know how I can use the KIP that you linked to know how to set up the appropriate ResourceType and/or ClusterOperation?
> > >
> > > > 105. If we change delete to disable, it's better to do this consistently in request protocol and admin api as well.
> > >
> > > (Kowshik): The API shouldn't be called 'disable' when it is deleting a feature. I've just changed the KIP to use 'delete'. I don't have a strong preference.
> > >
> > > > 110. The minVersion/maxVersion for features use int64. Currently, our release version schema is major.minor.bugfix (e.g. 2.5.0). It's possible for new features to be included in minor releases too. Should we make the feature versioning match the release versioning?
> > >
> > > (Kowshik): The release version can be mapped to a set of feature versions, and this can be done, for example, in the tool (or even external to the tool). Can you please clarify what I'm missing?
> > >
> > > > 111. "During regular operations, the data in the ZK node can be mutated only via a specific admin API served only by the controller." I am wondering why can't the controller auto finalize a feature version after all brokers are upgraded? For new users who download the latest version to build a new cluster, it's inconvenient for them to have to manually enable each feature.
> > >
> > > (Kowshik): I agree that there is a trade-off here, but it will help to decide whether the automation can be thought through in the future in a follow-up KIP, or right now in this KIP. We may invest in automation, but we have to decide whether we should do it now or later.
> > >
> > > For the inconvenience that you mentioned, do you think it can be overcome by asking the cluster operator to run a bootstrap script when he/she knows that a specific AK release has been almost completely deployed in a cluster for the first time? The idea is that the bootstrap script will know how to map a specific AK release to finalized feature versions, and run the `kafka-features.sh` tool appropriately against the cluster.
> > >
> > > Now, coming back to your automation proposal/question. I do see the value of automated feature version finalization, but I also see that this will open up several questions and some risks, as explained below. The answers to these depend on the definition of the automation we choose to build, and how well it fits into a Kafka deployment. Basically, it can be unsafe for the controller to finalize feature version upgrades automatically, without learning about the intent of the cluster operator.
> > > 1. We would sometimes want to lock feature versions only when we have externally verified the stability of the broker binary.
> > > 2. Sometimes only the cluster operator knows that a cluster upgrade is complete, and new brokers are highly unlikely to join the cluster.
> > > 3. Only the cluster operator knows that the intent is to deploy the same version of the new broker release across the entire cluster (i.e. the latest downloaded version).
> > > 4. For downgrades, it appears the controller still needs some external input (such as the proposed tool) to finalize a feature version downgrade.
> > >
> > > If we have automation, that automation can end up failing in some of the cases above. Then, we need a way to declare that the cluster is "not ready" if the controller cannot automatically finalize some basic required feature version upgrades across the cluster. We need to make the cluster operator aware in such a scenario (raise an alert or the like).
> > >
> > > > 112. DeleteFeaturesResponse: It seems the apiKey should be 49 instead of 48.
> > >
> > > (Kowshik): Done.
> > >
> > > Cheers,
> > > Kowshik
> > >
> > > On Fri, Apr 3, 2020 at 11:24 AM Jun Rao <j...@confluent.io> wrote:
> > >
> > > > Hi, Kowshik,
> > > >
> > > > Thanks for the reply. A few more comments below.
> > > >
> > > > 100.6 For every new request, the admin needs to control who is allowed to issue that request if security is enabled. So, we need to assign the new request a ResourceType and possible AclOperations. See https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A+Create+an+Administrative+API+for+Replica+Reassignment as an example.
> > > >
> > > > 105. If we change delete to disable, it's better to do this consistently in request protocol and admin api as well.
> > > >
> > > > 110. The minVersion/maxVersion for features use int64. Currently, our release version schema is major.minor.bugfix (e.g. 2.5.0).
It's > possible > > > > for new features to be included in minor releases too. Should we > make the > > > > feature versioning match the release versioning? > > > > > > > > 111. "During regular operations, the data in the ZK node can be > mutated > > > > only via a specific admin API served only by the controller." I am > > > > wondering why can't the controller auto finalize a feature version > after > > > > all brokers are upgraded? For new users who download the latest > version > > > to > > > > build a new cluster, it's inconvenient for them to have to manually > > > enable > > > > each feature. > > > > > > > > 112. DeleteFeaturesResponse: It seems the apiKey should be 49 > instead of > > > > 48. > > > > > > > > Jun > > > > > > > > > > > > On Fri, Apr 3, 2020 at 1:27 AM Kowshik Prakasam < > kpraka...@confluent.io> > > > > wrote: > > > > > > > > > Hey Jun, > > > > > > > > > > Thanks a lot for the great feedback! Please note that the design > > > > > has changed a little bit on the KIP, and we now propagate the > finalized > > > > > features metadata only via ZK watches (instead of > UpdateMetadataRequest > > > > > from the controller). > > > > > > > > > > Please find below my response to your questions/feedback, with the > > > prefix > > > > > "(Kowshik):". > > > > > > > > > > > 100. UpdateFeaturesRequest/UpdateFeaturesResponse > > > > > > 100.1 Since this request waits for responses from brokers, > should we > > > > add > > > > > a > > > > > > timeout in the request (like createTopicRequest)? > > > > > > > > > > (Kowshik): Great point! Done. I have added a timeout field. Note: > we no > > > > > longer > > > > > wait for responses from brokers, since the design has been changed > so > > > > that > > > > > the > > > > > features information is propagated via ZK. Nevertheless, it is > right to > > > > > have a timeout > > > > > for the request. > > > > > > > > > > > 100.2 The response schema is a bit weird. 
> > > > > > Typically, the response just shows an error code and an error message, instead of echoing the request.
> > > > >
> > > > > (Kowshik): Great point! Yeah, I have modified it to just return an error code and a message. Previously it was not echoing the "request"; rather, it was returning the latest set of cluster-wide finalized features (after applying the updates). But you are right, the additional info is not required, so I have removed it from the response schema.
> > > > >
> > > > > > 100.3 Should we add a separate request to list/describe the existing features?
> > > > >
> > > > > (Kowshik): This is already present in the KIP via the 'DescribeFeatures' Admin API, which, under the covers, uses the ApiVersionsRequest to list/describe the existing features. Please read the 'Tooling support' section.
> > > > >
> > > > > > 100.4 We are mixing ADD_OR_UPDATE and DELETE in a single request. For DELETE, the version field doesn't make sense. So, I guess the broker just ignores this? An alternative way is to have a separate DeleteFeaturesRequest
> > > > >
> > > > > (Kowshik): Great point! I have modified the KIP now to have 2 separate controller APIs serving these different purposes:
> > > > > 1. updateFeatures
> > > > > 2. deleteFeatures
> > > > >
> > > > > > 100.5 In UpdateFeaturesResponse, we have "The monotonically increasing version of the metadata for finalized features." I am wondering why the ordering is important?
> > > > >
> > > > > (Kowshik): In the latest KIP write-up, it is called epoch (instead of version), and it is just the ZK node version. Basically, this is the epoch for the cluster-wide finalized feature version metadata. This metadata is served to clients via the ApiVersionsResponse (for reads). We propagate updates from the '/features' ZK node to all brokers, via ZK watches set up by each broker on the '/features' node.
> > > > >
> > > > > Now here is why the ordering is important: ZK watches don't propagate at the same time. As a result, the ApiVersionsResponse is eventually consistent across brokers. This can introduce cases where clients see an older, lower epoch of the features metadata after a more recent, higher epoch was returned at a previous point in time. We expect clients to always employ the rule that the latest received higher epoch of metadata always trumps an older, smaller epoch. Those clients that are external to Kafka should strongly consider discovering the latest metadata once during startup from the brokers, and if required refresh the metadata periodically (to get the latest metadata).
> > > > >
> > > > > > 100.6 Could you specify the required ACL for this new request?
> > > > >
> > > > > (Kowshik): What is ACL, and how could I find out which one to specify? Could you please provide me some pointers? I'll be glad to update the KIP once I know the next steps.
> > > > >
> > > > > > 101. For the broker registration ZK node, should we bump up the version in the json?
> > > > >
> > > > > (Kowshik): Great point! Done. I've increased the version in the broker json by 1.
> > > > >
> > > > > > 102. For the /features ZK node, not sure if we need the epoch field. Each ZK node has an internal version field that is incremented on every update.
> > > > >
> > > > > (Kowshik): Great point! Done.
I'm using the ZK node version now, > > > instead > > > > of > > > > > explicitly > > > > > incremented epoch. > > > > > > > > > > > 103. "Enabling the actual semantics of a feature version > cluster-wide > > > > is > > > > > > left to the discretion of the logic implementing the feature > (ex: can > > > > be > > > > > > done via dynamic broker config)." Does that mean the broker > > > > registration > > > > > ZK > > > > > > node will be updated dynamically when this happens? > > > > > > > > > > (Kowshik): Not really. The text was just conveying that a broker > could > > > > > "know" of > > > > > a new feature version, but it does not mean the broker should have > also > > > > > activated the effects of the feature version. Knowing vs activation > > > are 2 > > > > > separate things, > > > > > and the latter can be achieved by dynamic config. I have reworded > the > > > > text > > > > > to > > > > > make this clear to the reader. > > > > > > > > > > > > > > > > 104. UpdateMetadataRequest > > > > > > 104.1 It would be useful to describe when the feature metadata is > > > > > included > > > > > > in the request. My understanding is that it's only included if > (1) > > > > there > > > > > is > > > > > > a change to the finalized feature; (2) broker restart; (3) > controller > > > > > > failover. > > > > > > 104.2 The new fields have the following versions. Why are the > > > versions > > > > 3+ > > > > > > when the top version is bumped to 6? > > > > > > "fields": [ > > > > > > {"name": "Name", "type": "string", "versions": "3+", > > > > > > "about": "The name of the feature."}, > > > > > > {"name": "Version", "type": "int64", "versions": "3+", > > > > > > "about": "The finalized version for the feature."} > > > > > > ] > > > > > > > > > > (Kowshik): With the new improved design, we have completely > eliminated > > > > the > > > > > need to > > > > > use UpdateMetadataRequest. 
This is because we now rely on ZK to > deliver > > > > the > > > > > notifications for changes to the '/features' ZK node. > > > > > > > > > > > 105. kafka-features.sh: Instead of using update/delete, perhaps > it's > > > > > better > > > > > > to use enable/disable? > > > > > > > > > > (Kowshik): For delete, yes, I have changed it so that we instead > call > > > it > > > > > 'disable'. > > > > > However for 'update', it can now also refer to either an upgrade > or a > > > > > forced downgrade. > > > > > Therefore, I have left it the way it is, just calling it as just > > > > 'update'. > > > > > > > > > > > > > > > Cheers, > > > > > Kowshik > > > > > > > > > > On Tue, Mar 31, 2020 at 6:51 PM Jun Rao <j...@confluent.io> wrote: > > > > > > > > > > > Hi, Kowshik, > > > > > > > > > > > > Thanks for the KIP. Looks good overall. A few comments below. > > > > > > > > > > > > 100. UpdateFeaturesRequest/UpdateFeaturesResponse > > > > > > 100.1 Since this request waits for responses from brokers, > should we > > > > add > > > > > a > > > > > > timeout in the request (like createTopicRequest)? > > > > > > 100.2 The response schema is a bit weird. Typically, the response > > > just > > > > > > shows an error code and an error message, instead of echoing the > > > > request. > > > > > > 100.3 Should we add a separate request to list/describe the > existing > > > > > > features? > > > > > > 100.4 We are mixing ADD_OR_UPDATE and DELETE in a single > request. For > > > > > > DELETE, the version field doesn't make sense. So, I guess the > broker > > > > just > > > > > > ignores this? An alternative way is to have a separate > > > > > > DeleteFeaturesRequest > > > > > > 100.5 In UpdateFeaturesResponse, we have "The monotonically > > > increasing > > > > > > version of the metadata for finalized features." I am wondering > why > > > the > > > > > > ordering is important? > > > > > > 100.6 Could you specify the required ACL for this new request? > > > > > > > > > > > > 101. 
For the broker registration ZK node, should we bump up the > > > version > > > > > in > > > > > > the json? > > > > > > > > > > > > 102. For the /features ZK node, not sure if we need the epoch > field. > > > > Each > > > > > > ZK node has an internal version field that is incremented on > every > > > > > update. > > > > > > > > > > > > 103. "Enabling the actual semantics of a feature version > cluster-wide > > > > is > > > > > > left to the discretion of the logic implementing the feature > (ex: can > > > > be > > > > > > done via dynamic broker config)." Does that mean the broker > > > > registration > > > > > ZK > > > > > > node will be updated dynamically when this happens? > > > > > > > > > > > > 104. UpdateMetadataRequest > > > > > > 104.1 It would be useful to describe when the feature metadata is > > > > > included > > > > > > in the request. My understanding is that it's only included if > (1) > > > > there > > > > > is > > > > > > a change to the finalized feature; (2) broker restart; (3) > controller > > > > > > failover. > > > > > > 104.2 The new fields have the following versions. Why are the > > > versions > > > > 3+ > > > > > > when the top version is bumped to 6? > > > > > > "fields": [ > > > > > > {"name": "Name", "type": "string", "versions": "3+", > > > > > > "about": "The name of the feature."}, > > > > > > {"name": "Version", "type": "int64", "versions": "3+", > > > > > > "about": "The finalized version for the feature."} > > > > > > ] > > > > > > > > > > > > 105. kafka-features.sh: Instead of using update/delete, perhaps > it's > > > > > better > > > > > > to use enable/disable? > > > > > > > > > > > > Jun > > > > > > > > > > > > On Tue, Mar 31, 2020 at 5:29 PM Kowshik Prakasam < > > > > kpraka...@confluent.io > > > > > > > > > > > > wrote: > > > > > > > > > > > > > Hey Boyang, > > > > > > > > > > > > > > Thanks for the great feedback! I have updated the KIP based on > your > > > > > > > feedback. 
> > > > > > > Please find my response below for your comments, look for > sentences > > > > > > > starting > > > > > > > with "(Kowshik)" below. > > > > > > > > > > > > > > > > > > > > > > 1. "When is it safe for the brokers to begin handling EOS > > > traffic" > > > > > > could > > > > > > > be > > > > > > > > converted as "When is it safe for the brokers to start > serving > > > new > > > > > > > > Exactly-Once(EOS) semantics" since EOS is not explained > earlier > > > in > > > > > the > > > > > > > > context. > > > > > > > > > > > > > > (Kowshik): Great point! Done. > > > > > > > > > > > > > > > 2. In the *Explanation *section, the metadata version number > part > > > > > > seems a > > > > > > > > bit blurred. Could you point a reference to later section > that we > > > > > going > > > > > > > to > > > > > > > > store it in Zookeeper and update it every time when there is > a > > > > > feature > > > > > > > > change? > > > > > > > > > > > > > > (Kowshik): Great point! Done. I've added a reference in the > KIP. > > > > > > > > > > > > > > > > > > > > > > 3. For the feature downgrade, although it's a Non-goal of the > > > KIP, > > > > > for > > > > > > > > features such as group coordinator semantics, there is no > legal > > > > > > scenario > > > > > > > to > > > > > > > > perform a downgrade at all. So having downgrade door open is > > > pretty > > > > > > > > error-prone as human faults happen all the time. I'm > assuming as > > > > new > > > > > > > > features are implemented, it's not very hard to add a flag > during > > > > > > feature > > > > > > > > creation to indicate whether this feature is "downgradable". > > > Could > > > > > you > > > > > > > > explain a bit more on the extra engineering effort for > shipping > > > > this > > > > > > KIP > > > > > > > > with downgrade protection in place? > > > > > > > > > > > > > > (Kowshik): Great point! I'd agree and disagree here. 
> > > > > > > While I agree that accidental downgrades can cause problems, I also think sometimes downgrades should be allowed for emergency reasons (not all downgrades cause issues). It depends on the feature being downgraded.
> > > > > > >
> > > > > > > To be more strict about feature version downgrades, I have modified the KIP proposing that we mandate a `--force-downgrade` flag be used in the UPDATE_FEATURES api and the tooling, whenever the human is downgrading a finalized feature version. Hopefully this should cover the requirement, until we find the need for advanced downgrade support.
> > > > > > >
> > > > > > > > 4. "Each broker’s supported dictionary of feature versions will be defined in the broker code." So this means in order to restrict a certain feature, we need to start the broker first and then send a feature gating request immediately, which introduces a time gap and the intended-to-close feature could actually serve request during this phase. Do you think we should also support configurations as well so that admin user could freely roll up a cluster with all nodes complying the same feature gating, without worrying about the turnaround time to propagate the message only after the cluster starts up?
> > > > > > >
> > > > > > > (Kowshik): This is a great point/question. One of the expectations out of this KIP, which is already followed in the broker, is the following.
> > > > > > > - Imagine at time T1 the broker starts up and registers its presence in ZK, along with advertising its supported features.
> > > > > > > - Imagine at a future time T2 the broker receives the UpdateMetadataRequest from the controller, which contains the latest finalized features as seen by the controller. The broker validates this data against its supported features to make sure there is no mismatch (it will shut down if there is an incompatibility).
> > > > > > >
> > > > > > > It is expected that during the time between the 2 events T1 and T2, the broker is almost a silent entity in the cluster. It does not add any value to the cluster, or carry out any important broker activities. By “important”, I mean it is not doing mutations on its persistence, not mutating critical in-memory state, and won’t be serving produce/fetch requests. Note it doesn’t even know its assigned partitions until it receives UpdateMetadataRequest from the controller. Anything the broker is doing up until this point is not damaging/useful.
> > > > > > >
> > > > > > > I’ve clarified the above in the KIP; see this new section: https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features#KIP-584:Versioningschemeforfeatures-Incompatiblebrokerlifetime
> > > > > > >
> > > > > > > > 5.
"adding a new Feature, updating or deleting an existing > > > > Feature", > > > > > > may > > > > > > > be > > > > > > > > I misunderstood something, I thought the features are > defined in > > > > > broker > > > > > > > > code, so admin could not really create a new feature? > > > > > > > > > > > > > > (Kowshik): Great point! You understood this right. Here adding > a > > > > > feature > > > > > > > means we are > > > > > > > adding a cluster-wide finalized *max* version for a feature > that > > > was > > > > > > > previously never finalized. > > > > > > > I have clarified this in the KIP now. > > > > > > > > > > > > > > > 6. I think we need a separate error code like > > > > > > FEATURE_UPDATE_IN_PROGRESS > > > > > > > to > > > > > > > > reject a concurrent feature update request. > > > > > > > > > > > > > > (Kowshik): Great point! I have modified the KIP adding the > above > > > (see > > > > > > > 'Tooling support -> Admin API changes'). > > > > > > > > > > > > > > > 7. I think we haven't discussed the alternative solution to > pass > > > > the > > > > > > > > feature information through Zookeeper. Is that mentioned in > the > > > KIP > > > > > to > > > > > > > > justify why using UpdateMetadata is more favorable? > > > > > > > > > > > > > > (Kowshik): Nice question! The broker reads finalized feature > info > > > > > stored > > > > > > in > > > > > > > ZK, > > > > > > > only during startup when it does a validation. When serving > > > > > > > `ApiVersionsRequest`, the > > > > > > > broker does not read this info from ZK directly. I'd imagine > the > > > risk > > > > > is > > > > > > > that it can increase > > > > > > > the ZK read QPS which can be a bottleneck for the system. > Today, in > > > > > Kafka > > > > > > > we use the > > > > > > > controller to fan out ZK updates to brokers and we want to > stick to > > > > > that > > > > > > > pattern to avoid > > > > > > > the ZK read bottleneck when serving `ApiVersionsRequest`. > > > > > > > > > > > > > > > 8. 
> > > > > > > > I was under the impression that user could configure a range of supported versions, what's the trade-off for allowing single finalized version only?
> > > > > > >
> > > > > > > (Kowshik): Great question! The finalized version of a feature basically refers to the cluster-wide finalized feature "maximum" version. For example, if the 'group_coordinator' feature has the finalized version set to 10, then it means that cluster-wide all versions up to v10 are supported for this feature. However, note that if some version (ex: v0) gets deprecated for this feature, then we don’t convey that using this scheme (also, supporting deprecation is a non-goal).
> > > > > > >
> > > > > > > (Kowshik): I’ve now modified the KIP at all points, referring to finalized feature "maximum" versions.
> > > > > > >
> > > > > > > > 9. One minor syntax fix: Note that here the "client" here may be a producer
> > > > > > >
> > > > > > > (Kowshik): Great point! Done.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Kowshik
> > > > > > >
> > > > > > > On Tue, Mar 31, 2020 at 1:17 PM Boyang Chen <reluctanthero...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hey Kowshik,
> > > > > > > >
> > > > > > > > thanks for the revised KIP. Got a couple of questions:
> > > > > > > >
> > > > > > > > 1. "When is it safe for the brokers to begin handling EOS traffic" could be converted as "When is it safe for the brokers to start serving new Exactly-Once(EOS) semantics" since EOS is not explained earlier in the context.
> > > > > > > >
> > > > > > > > 2.
In the *Explanation *section, the metadata version number > part > > > > > > seems a > > > > > > > > bit blurred. Could you point a reference to later section > that we > > > > > going > > > > > > > to > > > > > > > > store it in Zookeeper and update it every time when there is > a > > > > > feature > > > > > > > > change? > > > > > > > > > > > > > > > > 3. For the feature downgrade, although it's a Non-goal of the > > > KIP, > > > > > for > > > > > > > > features such as group coordinator semantics, there is no > legal > > > > > > scenario > > > > > > > to > > > > > > > > perform a downgrade at all. So having downgrade door open is > > > pretty > > > > > > > > error-prone as human faults happen all the time. I'm > assuming as > > > > new > > > > > > > > features are implemented, it's not very hard to add a flag > during > > > > > > feature > > > > > > > > creation to indicate whether this feature is "downgradable". > > > Could > > > > > you > > > > > > > > explain a bit more on the extra engineering effort for > shipping > > > > this > > > > > > KIP > > > > > > > > with downgrade protection in place? > > > > > > > > > > > > > > > > 4. "Each broker’s supported dictionary of feature versions > will > > > be > > > > > > > defined > > > > > > > > in the broker code." So this means in order to restrict a > certain > > > > > > > feature, > > > > > > > > we need to start the broker first and then send a feature > gating > > > > > > request > > > > > > > > immediately, which introduces a time gap and the > > > intended-to-close > > > > > > > feature > > > > > > > > could actually serve request during this phase. 
> Do you think we should also support configurations as well, so that an
> admin user could freely roll up a cluster with all nodes complying with
> the same feature gating, without worrying about the turnaround time to
> propagate the message only after the cluster starts up?
>
> 5. "adding a new Feature, updating or deleting an existing Feature":
> maybe I misunderstood something, but I thought the features are defined
> in broker code, so the admin could not really create a new feature?
>
> 6. I think we need a separate error code like FEATURE_UPDATE_IN_PROGRESS
> to reject a concurrent feature update request.
>
> 7. I think we haven't discussed the alternative solution of passing the
> feature information through Zookeeper. Is that mentioned in the KIP, to
> justify why using UpdateMetadata is more favorable?
>
> 8. I was under the impression that user could configure a range of
> supported versions, what's the trade-off for allowing single finalized
> version only?
>
> 9. One minor syntax fix: Note that here the "client" here may be a
> producer
>
> Boyang
>
> On Mon, Mar 30, 2020 at 4:53 PM Colin McCabe <cmcc...@apache.org> wrote:
>
> > On Thu, Mar 26, 2020, at 19:24, Kowshik Prakasam wrote:
> > > Hi Colin,
> > >
> > > Thanks for the feedback! I've changed the KIP to address your
> > > suggestions.
> > > Please find below my explanation. Here is a link to KIP-584:
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features
> > >
> > > 1. '__data_version__' is the version of the finalized feature
> > > metadata (i.e. the actual ZK node contents), while
> > > '__schema_version__' is the version of the schema of the data
> > > persisted in ZK. These serve different purposes. '__data_version__'
> > > is useful mainly to clients during reads, to differentiate between
> > > the 2 versions of the eventually consistent 'finalized features'
> > > metadata (i.e. the larger metadata version is more recent).
> > > '__schema_version__' provides an additional degree of flexibility,
> > > where if we decide to change the schema for the '/features' node in
> > > ZK (in the future), then we can manage broker rollouts suitably
> > > (i.e. serialization/deserialization of the ZK data can be handled
> > > safely).
> >
> > Hi Kowshik,
> >
> > If you're talking about a number that lets you know whether data is
> > more or less recent, we would typically call that an epoch, and not a
> > version. For the ZK data structures, the word "version" is typically
> > reserved for describing changes to the overall schema of the data that
> > is written to ZooKeeper.
> > We don't even really change the "version" of those schemas that much,
> > since most changes are backwards-compatible. But we do include that
> > version field just in case.
> >
> > I don't think we really need an epoch here, though, since we can just
> > look at the broker epoch. Whenever the broker registers, its epoch
> > will be greater than the previous broker epoch. And the newly
> > registered data will take priority. This will be a lot simpler than
> > adding a separate epoch system, I think.
> >
> > > 2. Regarding the admin client needing min and max information - you
> > > are right! I've changed the KIP such that the Admin API also allows
> > > the user to read 'supported features' from a specific broker. Please
> > > look at the section "Admin API changes".
> >
> > Thanks.
> >
> > > 3. Regarding the use of `long` vs `Long` - it was not deliberate.
> > > I've improved the KIP to just use `long` at all places.
> >
> > Sounds good.
> >
> > > 4. Regarding the kafka.admin.FeatureCommand tool - you are right!
> > > I've updated the KIP sketching the functionality provided by this
> > > tool, with some examples. Please look at the section "Tooling
> > > support examples".
> > >
> > > Thank you!
> >
> > Thanks, Kowshik.
> >
> > cheers,
> > Colin
> >
> > > Cheers,
> > > Kowshik
> > >
> > > On Wed, Mar 25, 2020 at 11:31 PM Colin McCabe <cmcc...@apache.org>
> > > wrote:
> > >
> > > > Thanks, Kowshik, this looks good.
> > > >
> > > > In the "Schema" section, do we really need both __schema_version__
> > > > and __data_version__? Can we just have a single version field
> > > > here?
> > > >
> > > > Shouldn't the Admin(Client) function have some way to get the min
> > > > and max information that we're exposing as well? I guess we could
> > > > have min, max, and current. Unrelated: is the use of Long rather
> > > > than long deliberate here?
> > > >
> > > > It would be good to describe how the command line tool
> > > > kafka.admin.FeatureCommand will work, for example the flags that
> > > > it will take and the output that it will generate to STDOUT.
> > > >
> > > > cheers,
> > > > Colin
> > > >
> > > > On Tue, Mar 24, 2020, at 17:08, Kowshik Prakasam wrote:
> > > > > Hi all,
> > > > >
> > > > > I've opened KIP-584, which is intended to provide a versioning
> > > > > scheme for features. I'd like to use this thread to discuss the
> > > > > same. I'd appreciate any feedback on this. Here is a link to
> > > > > KIP-584:
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features
> > > > >
> > > > > Thank you!
> > > > >
> > > > > Cheers,
> > > > > Kowshik
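[Editor's illustration] The rule running through this thread is that each broker advertises a supported [min, max] version range per feature, while the cluster stores a single finalized "maximum" version per feature: a broker is compatible only if every finalized maximum falls inside its supported range. The sketch below is purely illustrative; the function name and data shapes are hypothetical, not Kafka's actual code.

```python
def broker_is_compatible(supported, finalized):
    """Check a broker against cluster-wide finalized feature versions.

    supported: {feature_name: (min_version, max_version)} advertised by
               the broker in its code.
    finalized: {feature_name: finalized_max_version} stored cluster-wide
               (e.g. under the '/features' ZK node in the KIP draft).
    """
    for feature, finalized_max in finalized.items():
        rng = supported.get(feature)
        if rng is None:
            # Broker does not know this feature at all.
            return False
        lo, hi = rng
        if not (lo <= finalized_max <= hi):
            return False
    return True


supported = {"group_coordinator": (1, 10)}

# Finalized max of 10 means versions up to v10 are in use cluster-wide,
# which this broker supports.
assert broker_is_compatible(supported, {"group_coordinator": 10})

# A broker supporting only up to v10 cannot join a cluster finalized at v11.
assert not broker_is_compatible(supported, {"group_coordinator": 11})
```

Note that under this scheme, deprecating an old version (ex: v0) is not expressible, which matches the non-goal Kowshik states above.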