Hi Jun, Thanks for the feedback! I have updated the KIP-584 addressing your comments. Please find my response below.
> 100.6 You can look for the sentence "This operation requires ALTER on > CLUSTER." in KIP-455. Also, you can check its usage in > KafkaApis.authorize(). (Kowshik): Done. Great point! For the newly introduced UPDATE_FEATURES api, I have added a requirement that AclOperation.ALTER is required on ResourceType.CLUSTER. > 110. Keeping the feature version as int is probably fine. I just felt that > for some of the common user interactions, it's more convenient to > relate that to a release version. For example, if a user wants to downgrade > to a release 2.5, it's easier for the user to use the tool like "tool > --downgrade 2.5" instead of "tool --downgrade --feature X --version 6". (Kowshik): Great point. Generally, maximum feature version levels are not downgradable after they are finalized in the cluster. This is because, as a guideline bumping feature version level usually is used mainly to convey important breaking changes. Despite the above, there may be some extreme/rare cases where a user wants to downgrade all features to a specific previous release. The user may want to do this just prior to rolling back a Kafka cluster to a previous release. To support the above, I have made a change to the KIP explaining that the CLI tool is versioned. The CLI tool internally has knowledge about a map of features to their respective max versions supported by the Broker. The tool's knowledge of features and their version values, is limited to the version of the CLI tool itself i.e. the information is packaged into the CLI tool when it is released. Whenever a Kafka release introduces a new feature version, or modifies an existing feature version, the CLI tool shall also be updated with this information, Newer versions of the CLI tool will be released as part of the Kafka releases. Therefore, to achieve the downgrade need, the user just needs to run the version of the CLI tool that's part of the particular previous release that he/she is downgrading to. To help the user with this, there is a new command added to the CLI tool called `downgrade-all`. This essentially downgrades max version levels of all features in the cluster to the versions known to the CLI tool internally. I have explained the above in the KIP under these sections: Tooling support (have explained that the CLI tool is versioned): https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features#KIP-584:Versioningschemeforfeatures-Toolingsupport Regular CLI tool usage (please refer to point #3, and see the tooling example) https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features#KIP-584:Versioningschemeforfeatures-RegularCLItoolusage > 110. Similarly, if the client library finds a feature mismatch with the broker, > the client likely needs to log some error message for the user to take some > actions. It's much more actionable if the error message is "upgrade the > broker to release version 2.6" than just "upgrade the broker to feature > version 7". (Kowshik): That's a really good point! If we use ints for feature versions, the best message that client can print for debugging is "broker doesn't support feature version 7", and alongside that print the supported version range returned by the broker. Then, does it sound reasonable that the user could then reference Kafka release logs to figure out which version of the broker release is required be deployed, to support feature version 7? I couldn't think of a better strategy here. > 120. When should a developer bump up the version of a feature? (Kowshik): Great question! In the KIP, I have added a section: 'Guidelines on feature versions and workflows' providing some guidelines on when to use the versioned feature flags, and what are the regular workflows with the CLI tool. Link to the relevant sections: Guidelines: https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features#KIP-584:Versioningschemeforfeatures-Guidelinesonfeatureversionsandworkflows Regular CLI tool usage: https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features#KIP-584:Versioningschemeforfeatures-RegularCLItoolusage Advanced CLI tool usage: https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features#KIP-584:Versioningschemeforfeatures-AdvancedCLItoolusage Cheers, Kowshik On Fri, Apr 10, 2020 at 4:25 PM Jun Rao <j...@confluent.io> wrote: > Hi, Kowshik, > > Thanks for the reply. A few more comments. > > 110. Keeping the feature version as int is probably fine. I just felt that > for some of the common user interactions, it's more convenient to > relate that to a release version. For example, if a user wants to downgrade > to a release 2.5, it's easier for the user to use the tool like "tool > --downgrade 2.5" instead of "tool --downgrade --feature X --version 6". > Similarly, if the client library finds a feature mismatch with the broker, > the client likely needs to log some error message for the user to take some > actions. It's much more actionable if the error message is "upgrade the > broker to release version 2.6" than just "upgrade the broker to feature > version 7". > > 111. Sounds good. > > 120. When should a developer bump up the version of a feature? > > Jun > > On Tue, Apr 7, 2020 at 12:26 AM Kowshik Prakasam <kpraka...@confluent.io> > wrote: > > > Hi Jun, > > > > I have updated the KIP for the item 111. > > I'm in the process of addressing 100.6, and will provide an update soon. > > I think item 110 is still under discussion given we are now providing a > way > > to finalize > > all features to their latest version levels. In any case, please let us > > know > > how you feel in response to Colin's comments on this topic. > > > > > 111. To put this in context, when we had IBP, the default value is the > > > current released version. So, if you are a brand new user, you don't > need > > > to configure IBP and all new features will be immediately available in > > the > > > new cluster. If you are upgrading from an old version, you do need to > > > understand and configure IBP. I see a similar pattern here for > > > features. From the ease of use perspective, ideally, we shouldn't > require > > a > > > new user to have an extra step such as running a bootstrap script > unless > > > it's truly necessary. If someone has a special need (all the cases you > > > mentioned seem special cases?), they can configure a mode such that > > > features are enabled/disabled manually. > > > > (Kowshik): That makes sense, thanks for the idea! Sorry if I didn't > > understand > > this need earlier. I have updated the KIP with the approach that whenever > > the '/features' node is absent, the controller by default will bootstrap > > the node > > to contain the latest feature levels. Here is the new section in the KIP > > describing > > the same: > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features#KIP-584:Versioningschemeforfeatures-Controller:ZKnodebootstrapwithdefaultvalues > > > > Next, as I explained in my response to Colin's suggestions, we are now > > providing a `--finalize-latest-features` flag with the tooling. This lets > > the sysadmin finalize all features known to the controller to their > latest > > version > > levels. Please look at this section (point #3 and the tooling example > > later): > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features#KIP-584:Versioningschemeforfeatures-Toolingsupport > > > > > > Do you feel this addresses your comment/concern? > > > > > > Cheers, > > Kowshik > > > > On Mon, Apr 6, 2020 at 12:06 PM Jun Rao <j...@confluent.io> wrote: > > > > > Hi, Kowshik, > > > > > > Thanks for the reply. A few more replies below. > > > > > > 100.6 You can look for the sentence "This operation requires ALTER on > > > CLUSTER." in KIP-455. Also, you can check its usage in > > > KafkaApis.authorize(). > > > > > > 110. From the external client/tooling perspective, it's more natural to > > use > > > the release version for features. If we can use the same release > version > > > for internal representation, it seems simpler (easier to understand, no > > > mapping overhead, etc). Is there a benefit with separate external and > > > internal versioning schemes? > > > > > > 111. To put this in context, when we had IBP, the default value is the > > > current released version. So, if you are a brand new user, you don't > need > > > to configure IBP and all new features will be immediately available in > > the > > > new cluster. If you are upgrading from an old version, you do need to > > > understand and configure IBP. I see a similar pattern here for > > > features. From the ease of use perspective, ideally, we shouldn't > > require a > > > new user to have an extra step such as running a bootstrap script > unless > > > it's truly necessary. If someone has a special need (all the cases you > > > mentioned seem special cases?), they can configure a mode such that > > > features are enabled/disabled manually. > > > > > > Jun > > > > > > On Fri, Apr 3, 2020 at 5:54 PM Kowshik Prakasam < > kpraka...@confluent.io> > > > wrote: > > > > > > > Hi Jun, > > > > > > > > Thanks for the feedback and suggestions. Please find my response > below. > > > > > > > > > 100.6 For every new request, the admin needs to control who is > > allowed > > > to > > > > > issue that request if security is enabled. So, we need to assign > the > > > new > > > > > request a ResourceType and possible AclOperations. See > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A+Create+an+Administrative+API+for+Replica+Reassignment > > > > > as an example. > > > > > > > > (Kowshik): I don't see any reference to the words ResourceType or > > > > AclOperations > > > > in the KIP. Please let me know how I can use the KIP that you linked > to > > > > know how to > > > > setup the appropriate ResourceType and/or ClusterOperation? > > > > > > > > > 105. If we change delete to disable, it's better to do this > > > consistently > > > > in > > > > > request protocol and admin api as well. > > > > > > > > (Kowshik): The API shouldn't be called 'disable' when it is deleting > a > > > > feature. > > > > I've just changed the KIP to use 'delete'. I don't have a strong > > > > preference. > > > > > > > > > 110. The minVersion/maxVersion for features use int64. Currently, > our > > > > > release version schema is major.minor.bugfix (e.g. 2.5.0). It's > > > possible > > > > > for new features to be included in minor releases too. Should we > make > > > the > > > > > feature versioning match the release versioning? > > > > > > > > (Kowshik): The release version can be mapped to a set of feature > > > versions, > > > > and this can be done, for example in the tool (or even external to > the > > > > tool). > > > > Can you please clarify what I'm missing? > > > > > > > > > 111. "During regular operations, the data in the ZK node can be > > mutated > > > > > only via a specific admin API served only by the controller." I am > > > > > wondering why can't the controller auto finalize a feature version > > > after > > > > > all brokers are upgraded? For new users who download the latest > > version > > > > to > > > > > build a new cluster, it's inconvenient for them to have to manually > > > > enable > > > > > each feature. > > > > > > > > (Kowshik): I agree that there is a trade-off here, but it will help > > > > to decide whether the automation can be thought through in the future > > > > in a follow up KIP, or right now in this KIP. We may invest > > > > in automation, but we have to decide whether we should do it > > > > now or later. > > > > > > > > For the inconvenience that you mentioned, do you think the problem > that > > > you > > > > mentioned can be overcome by asking for the cluster operator to run > a > > > > bootstrap script when he/she knows that a specific AK release has > been > > > > almost completely deployed in a cluster for the first time? Idea is > > that > > > > the > > > > bootstrap script will know how to map a specific AK release to > > finalized > > > > feature versions, and run the `kafka-features.sh` tool appropriately > > > > against > > > > the cluster. > > > > > > > > Now, coming back to your automation proposal/question. > > > > I do see the value of automated feature version finalization, but I > > also > > > > see > > > > that this will open up several questions and some risks, as explained > > > > below. > > > > The answers to these depend on the definition of the automation we > > choose > > > > to build, and how well does it fit into a kafka deployment. > > > > Basically, it can be unsafe for the controller to finalize feature > > > version > > > > upgrades automatically, without learning about the intent of the > > cluster > > > > operator. > > > > 1. We would sometimes want to lock feature versions only when we have > > > > externally verified > > > > the stability of the broker binary. > > > > 2. Sometimes only the cluster operator knows that a cluster upgrade > is > > > > complete, > > > > and new brokers are highly unlikely to join the cluster. > > > > 3. Only the cluster operator knows that the intent is to deploy the > > same > > > > version > > > > of the new broker release across the entire cluster (i.e. the latest > > > > downloaded version). > > > > 4. For downgrades, it appears the controller still needs some > external > > > > input > > > > (such as the proposed tool) to finalize a feature version downgrade. > > > > > > > > If we have automation, that automation can end up failing in some of > > the > > > > cases > > > > above. Then, we need a way to declare that the cluster is "not ready" > > if > > > > the > > > > controller cannot automatically finalize some basic required feature > > > > version > > > > upgrades across the cluster. We need to make the cluster operator > aware > > > in > > > > such a scenario (raise an alert or alike). > > > > > > > > > 112. DeleteFeaturesResponse: It seems the apiKey should be 49 > instead > > > of > > > > 48. > > > > > > > > (Kowshik): Done. > > > > > > > > > > > > Cheers, > > > > Kowshik > > > > > > > > On Fri, Apr 3, 2020 at 11:24 AM Jun Rao <j...@confluent.io> wrote: > > > > > > > > > Hi, Kowshik, > > > > > > > > > > Thanks for the reply. A few more comments below. > > > > > > > > > > 100.6 For every new request, the admin needs to control who is > > allowed > > > to > > > > > issue that request if security is enabled. So, we need to assign > the > > > new > > > > > request a ResourceType and possible AclOperations. See > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A+Create+an+Administrative+API+for+Replica+Reassignment > > > > > as > > > > > an example. > > > > > > > > > > 105. If we change delete to disable, it's better to do this > > > consistently > > > > in > > > > > request protocol and admin api as well. > > > > > > > > > > 110. The minVersion/maxVersion for features use int64. Currently, > our > > > > > release version schema is major.minor.bugfix (e.g. 2.5.0). It's > > > possible > > > > > for new features to be included in minor releases too. Should we > make > > > the > > > > > feature versioning match the release versioning? > > > > > > > > > > 111. "During regular operations, the data in the ZK node can be > > mutated > > > > > only via a specific admin API served only by the controller." I am > > > > > wondering why can't the controller auto finalize a feature version > > > after > > > > > all brokers are upgraded? For new users who download the latest > > version > > > > to > > > > > build a new cluster, it's inconvenient for them to have to manually > > > > enable > > > > > each feature. > > > > > > > > > > 112. DeleteFeaturesResponse: It seems the apiKey should be 49 > instead > > > of > > > > > 48. > > > > > > > > > > Jun > > > > > > > > > > > > > > > On Fri, Apr 3, 2020 at 1:27 AM Kowshik Prakasam < > > > kpraka...@confluent.io> > > > > > wrote: > > > > > > > > > > > Hey Jun, > > > > > > > > > > > > Thanks a lot for the great feedback! Please note that the design > > > > > > has changed a little bit on the KIP, and we now propagate the > > > finalized > > > > > > features metadata only via ZK watches (instead of > > > UpdateMetadataRequest > > > > > > from the controller). > > > > > > > > > > > > Please find below my response to your questions/feedback, with > the > > > > prefix > > > > > > "(Kowshik):". > > > > > > > > > > > > > 100. UpdateFeaturesRequest/UpdateFeaturesResponse > > > > > > > 100.1 Since this request waits for responses from brokers, > should > > > we > > > > > add > > > > > > a > > > > > > > timeout in the request (like createTopicRequest)? > > > > > > > > > > > > (Kowshik): Great point! Done. I have added a timeout field. Note: > > we > > > no > > > > > > longer > > > > > > wait for responses from brokers, since the design has been > changed > > so > > > > > that > > > > > > the > > > > > > features information is propagated via ZK. Nevertheless, it is > > right > > > to > > > > > > have a timeout > > > > > > for the request. > > > > > > > > > > > > > 100.2 The response schema is a bit weird. Typically, the > response > > > > just > > > > > > > shows an error code and an error message, instead of echoing > the > > > > > request. > > > > > > > > > > > > (Kowshik): Great point! Yeah, I have modified it to just return > an > > > > error > > > > > > code and a message. > > > > > > Previously it was not echoing the "request", rather it was > > returning > > > > the > > > > > > latest set of > > > > > > cluster-wide finalized features (after applying the updates). But > > you > > > > are > > > > > > right, > > > > > > the additional info is not required, so I have removed it from > the > > > > > response > > > > > > schema. > > > > > > > > > > > > > 100.3 Should we add a separate request to list/describe the > > > existing > > > > > > > features? > > > > > > > > > > > > (Kowshik): This is already present in the KIP via the > > > > 'DescribeFeatures' > > > > > > Admin API, > > > > > > which, underneath covers uses the ApiVersionsRequest to > > list/describe > > > > the > > > > > > existing features. Please read the 'Tooling support' section. > > > > > > > > > > > > > 100.4 We are mixing ADD_OR_UPDATE and DELETE in a single > request. > > > For > > > > > > > DELETE, the version field doesn't make sense. So, I guess the > > > broker > > > > > just > > > > > > > ignores this? An alternative way is to have a separate > > > > > > DeleteFeaturesRequest > > > > > > > > > > > > (Kowshik): Great point! I have modified the KIP now to have 2 > > > separate > > > > > > controller APIs > > > > > > serving these different purposes: > > > > > > 1. updateFeatures > > > > > > 2. deleteFeatures > > > > > > > > > > > > > 100.5 In UpdateFeaturesResponse, we have "The monotonically > > > > increasing > > > > > > > version of the metadata for finalized features." I am wondering > > why > > > > the > > > > > > > ordering is important? > > > > > > > > > > > > (Kowshik): In the latest KIP write-up, it is called epoch > (instead > > of > > > > > > version), and > > > > > > it is just the ZK node version. Basically, this is the epoch for > > the > > > > > > cluster-wide > > > > > > finalized feature version metadata. This metadata is served to > > > clients > > > > > via > > > > > > the > > > > > > ApiVersionsResponse (for reads). We propagate updates from the > > > > > '/features' > > > > > > ZK node > > > > > > to all brokers, via ZK watches setup by each broker on the > > > '/features' > > > > > > node. > > > > > > > > > > > > Now here is why the ordering is important: > > > > > > ZK watches don't propagate at the same time. As a result, the > > > > > > ApiVersionsResponse > > > > > > is eventually consistent across brokers. This can introduce cases > > > > > > where clients see an older lower epoch of the features metadata, > > > after > > > > a > > > > > > more recent > > > > > > higher epoch was returned at a previous point in time. We expect > > > > clients > > > > > > to always employ the rule that the latest received higher epoch > of > > > > > metadata > > > > > > always trumps an older smaller epoch. Those clients that are > > external > > > > to > > > > > > Kafka should strongly consider discovering the latest metadata > once > > > > > during > > > > > > startup from the brokers, and if required refresh the metadata > > > > > periodically > > > > > > (to get the latest metadata). > > > > > > > > > > > > > 100.6 Could you specify the required ACL for this new request? > > > > > > > > > > > > (Kowshik): What is ACL, and how could I find out which one to > > > specify? > > > > > > Please could you provide me some pointers? I'll be glad to update > > the > > > > > > KIP once I know the next steps. > > > > > > > > > > > > > 101. For the broker registration ZK node, should we bump up the > > > > version > > > > > > in > > > > > > the json? > > > > > > > > > > > > (Kowshik): Great point! Done. I've increased the version in the > > > broker > > > > > json > > > > > > by 1. > > > > > > > > > > > > > 102. For the /features ZK node, not sure if we need the epoch > > > field. > > > > > Each > > > > > > > ZK node has an internal version field that is incremented on > > every > > > > > > update. > > > > > > > > > > > > (Kowshik): Great point! Done. I'm using the ZK node version now, > > > > instead > > > > > of > > > > > > explicitly > > > > > > incremented epoch. > > > > > > > > > > > > > 103. "Enabling the actual semantics of a feature version > > > cluster-wide > > > > > is > > > > > > > left to the discretion of the logic implementing the feature > (ex: > > > can > > > > > be > > > > > > > done via dynamic broker config)." Does that mean the broker > > > > > registration > > > > > > ZK > > > > > > > node will be updated dynamically when this happens? > > > > > > > > > > > > (Kowshik): Not really. The text was just conveying that a broker > > > could > > > > > > "know" of > > > > > > a new feature version, but it does not mean the broker should > have > > > also > > > > > > activated the effects of the feature version. Knowing vs > activation > > > > are 2 > > > > > > separate things, > > > > > > and the latter can be achieved by dynamic config. I have reworded > > the > > > > > text > > > > > > to > > > > > > make this clear to the reader. > > > > > > > > > > > > > > > > > > > 104. UpdateMetadataRequest > > > > > > > 104.1 It would be useful to describe when the feature metadata > is > > > > > > included > > > > > > > in the request. My understanding is that it's only included if > > (1) > > > > > there > > > > > > is > > > > > > > a change to the finalized feature; (2) broker restart; (3) > > > controller > > > > > > > failover. > > > > > > > 104.2 The new fields have the following versions. Why are the > > > > versions > > > > > 3+ > > > > > > > when the top version is bumped to 6? > > > > > > > "fields": [ > > > > > > > {"name": "Name", "type": "string", "versions": "3+", > > > > > > > "about": "The name of the feature."}, > > > > > > > {"name": "Version", "type": "int64", "versions": > "3+", > > > > > > > "about": "The finalized version for the feature."} > > > > > > > ] > > > > > > > > > > > > (Kowshik): With the new improved design, we have completely > > > eliminated > > > > > the > > > > > > need to > > > > > > use UpdateMetadataRequest. This is because we now rely on ZK to > > > deliver > > > > > the > > > > > > notifications for changes to the '/features' ZK node. > > > > > > > > > > > > > 105. kafka-features.sh: Instead of using update/delete, perhaps > > > it's > > > > > > better > > > > > > > to use enable/disable? > > > > > > > > > > > > (Kowshik): For delete, yes, I have changed it so that we instead > > call > > > > it > > > > > > 'disable'. > > > > > > However for 'update', it can now also refer to either an upgrade > > or a > > > > > > forced downgrade. > > > > > > Therefore, I have left it the way it is, just calling it as just > > > > > 'update'. > > > > > > > > > > > > > > > > > > Cheers, > > > > > > Kowshik > > > > > > > > > > > > On Tue, Mar 31, 2020 at 6:51 PM Jun Rao <j...@confluent.io> > wrote: > > > > > > > > > > > > > Hi, Kowshik, > > > > > > > > > > > > > > Thanks for the KIP. Looks good overall. A few comments below. > > > > > > > > > > > > > > 100. UpdateFeaturesRequest/UpdateFeaturesResponse > > > > > > > 100.1 Since this request waits for responses from brokers, > should > > > we > > > > > add > > > > > > a > > > > > > > timeout in the request (like createTopicRequest)? > > > > > > > 100.2 The response schema is a bit weird. Typically, the > response > > > > just > > > > > > > shows an error code and an error message, instead of echoing > the > > > > > request. > > > > > > > 100.3 Should we add a separate request to list/describe the > > > existing > > > > > > > features? > > > > > > > 100.4 We are mixing ADD_OR_UPDATE and DELETE in a single > request. > > > For > > > > > > > DELETE, the version field doesn't make sense. So, I guess the > > > broker > > > > > just > > > > > > > ignores this? An alternative way is to have a separate > > > > > > > DeleteFeaturesRequest > > > > > > > 100.5 In UpdateFeaturesResponse, we have "The monotonically > > > > increasing > > > > > > > version of the metadata for finalized features." I am wondering > > why > > > > the > > > > > > > ordering is important? > > > > > > > 100.6 Could you specify the required ACL for this new request? > > > > > > > > > > > > > > 101. For the broker registration ZK node, should we bump up the > > > > version > > > > > > in > > > > > > > the json? > > > > > > > > > > > > > > 102. For the /features ZK node, not sure if we need the epoch > > > field. > > > > > Each > > > > > > > ZK node has an internal version field that is incremented on > > every > > > > > > update. > > > > > > > > > > > > > > 103. "Enabling the actual semantics of a feature version > > > cluster-wide > > > > > is > > > > > > > left to the discretion of the logic implementing the feature > (ex: > > > can > > > > > be > > > > > > > done via dynamic broker config)." Does that mean the broker > > > > > registration > > > > > > ZK > > > > > > > node will be updated dynamically when this happens? > > > > > > > > > > > > > > 104. UpdateMetadataRequest > > > > > > > 104.1 It would be useful to describe when the feature metadata > is > > > > > > included > > > > > > > in the request. My understanding is that it's only included if > > (1) > > > > > there > > > > > > is > > > > > > > a change to the finalized feature; (2) broker restart; (3) > > > controller > > > > > > > failover. > > > > > > > 104.2 The new fields have the following versions. Why are the > > > > versions > > > > > 3+ > > > > > > > when the top version is bumped to 6? > > > > > > > "fields": [ > > > > > > > {"name": "Name", "type": "string", "versions": "3+", > > > > > > > "about": "The name of the feature."}, > > > > > > > {"name": "Version", "type": "int64", "versions": > "3+", > > > > > > > "about": "The finalized version for the feature."} > > > > > > > ] > > > > > > > > > > > > > > 105. kafka-features.sh: Instead of using update/delete, perhaps > > > it's > > > > > > better > > > > > > > to use enable/disable? > > > > > > > > > > > > > > Jun > > > > > > > > > > > > > > On Tue, Mar 31, 2020 at 5:29 PM Kowshik Prakasam < > > > > > kpraka...@confluent.io > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > Hey Boyang, > > > > > > > > > > > > > > > > Thanks for the great feedback! I have updated the KIP based > on > > > your > > > > > > > > feedback. > > > > > > > > Please find my response below for your comments, look for > > > sentences > > > > > > > > starting > > > > > > > > with "(Kowshik)" below. > > > > > > > > > > > > > > > > > > > > > > > > > 1. "When is it safe for the brokers to begin handling EOS > > > > traffic" > > > > > > > could > > > > > > > > be > > > > > > > > > converted as "When is it safe for the brokers to start > > serving > > > > new > > > > > > > > > Exactly-Once(EOS) semantics" since EOS is not explained > > earlier > > > > in > > > > > > the > > > > > > > > > context. > > > > > > > > > > > > > > > > (Kowshik): Great point! Done. > > > > > > > > > > > > > > > > > 2. In the *Explanation *section, the metadata version > number > > > part > > > > > > > seems a > > > > > > > > > bit blurred. Could you point a reference to later section > > that > > > we > > > > > > going > > > > > > > > to > > > > > > > > > store it in Zookeeper and update it every time when there > is > > a > > > > > > feature > > > > > > > > > change? > > > > > > > > > > > > > > > > (Kowshik): Great point! Done. I've added a reference in the > > KIP. > > > > > > > > > > > > > > > > > > > > > > > > > 3. For the feature downgrade, although it's a Non-goal of > the > > > > KIP, > > > > > > for > > > > > > > > > features such as group coordinator semantics, there is no > > legal > > > > > > > scenario > > > > > > > > to > > > > > > > > > perform a downgrade at all. So having downgrade door open > is > > > > pretty > > > > > > > > > error-prone as human faults happen all the time. I'm > assuming > > > as > > > > > new > > > > > > > > > features are implemented, it's not very hard to add a flag > > > during > > > > > > > feature > > > > > > > > > creation to indicate whether this feature is > "downgradable". > > > > Could > > > > > > you > > > > > > > > > explain a bit more on the extra engineering effort for > > shipping > > > > > this > > > > > > > KIP > > > > > > > > > with downgrade protection in place? > > > > > > > > > > > > > > > > (Kowshik): Great point! I'd agree and disagree here. While I > > > agree > > > > > that > > > > > > > > accidental > > > > > > > > downgrades can cause problems, I also think sometimes > > downgrades > > > > > should > > > > > > > > be allowed for emergency reasons (not all downgrades cause > > > issues). > > > > > > > > It is just subjective to the feature being downgraded. > > > > > > > > > > > > > > > > To be more strict about feature version downgrades, I have > > > modified > > > > > the > > > > > > > KIP > > > > > > > > proposing that we mandate a `--force-downgrade` flag be used > in > > > the > > > > > > > > UPDATE_FEATURES api > > > > > > > > and the tooling, whenever the human is downgrading a > finalized > > > > > feature > > > > > > > > version. > > > > > > > > Hopefully this should cover the requirement, until we find > the > > > need > > > > > for > > > > > > > > advanced downgrade support. > > > > > > > > > > > > > > > > > 4. "Each broker’s supported dictionary of feature versions > > will > > > > be > > > > > > > > defined > > > > > > > > > in the broker code." So this means in order to restrict a > > > certain > > > > > > > > feature, > > > > > > > > > we need to start the broker first and then send a feature > > > gating > > > > > > > request > > > > > > > > > immediately, which introduces a time gap and the > > > > intended-to-close > > > > > > > > feature > > > > > > > > > could actually serve request during this phase. Do you > think > > we > > > > > > should > > > > > > > > also > > > > > > > > > support configurations as well so that admin user could > > freely > > > > roll > > > > > > up > > > > > > > a > > > > > > > > > cluster with all nodes complying the same feature gating, > > > without > > > > > > > > worrying > > > > > > > > > about the turnaround time to propagate the message only > after > > > the > > > > > > > cluster > > > > > > > > > starts up? > > > > > > > > > > > > > > > > (Kowshik): This is a great point/question. One of the > > > expectations > > > > > out > > > > > > of > > > > > > > > this KIP, which is > > > > > > > > already followed in the broker, is the following. > > > > > > > > - Imagine at time T1 the broker starts up and registers it’s > > > > > presence > > > > > > in > > > > > > > > ZK, > > > > > > > > along with advertising it’s supported features. > > > > > > > > - Imagine at a future time T2 the broker receives the > > > > > > > > UpdateMetadataRequest > > > > > > > > from the controller, which contains the latest finalized > > > > features > > > > > as > > > > > > > > seen by > > > > > > > > the controller. The broker validates this data against > it’s > > > > > > supported > > > > > > > > features to > > > > > > > > make sure there is no mismatch (it will shutdown if there > is > > > an > > > > > > > > incompatibility). > > > > > > > > > > > > > > > > It is expected that during the time between the 2 events T1 > and > > > T2, > > > > > the > > > > > > > > broker is > > > > > > > > almost a silent entity in the cluster. It does not add any > > value > > > to > > > > > the > > > > > > > > cluster, or carry > > > > > > > > out any important broker activities. By “important”, I mean > it > > is > > > > not > > > > > > > doing > > > > > > > > mutations > > > > > > > > on it’s persistence, not mutating critical in-memory state, > > won’t > > > > be > > > > > > > > serving > > > > > > > > produce/fetch requests. Note it doesn’t even know it’s > assigned > > > > > > > partitions > > > > > > > > until > > > > > > > > it receives UpdateMetadataRequest from controller. Anything > the > > > > > broker > > > > > > is > > > > > > > > doing up > > > > > > > > until this point is not damaging/useful. > > > > > > > > > > > > > > > > I’ve clarified the above in the KIP, see this new section: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features#KIP-584:Versioningschemeforfeatures-Incompatiblebrokerlifetime > > > > > > > > . > > > > > > > > > > > > > > > > > 5. "adding a new Feature, updating or deleting an existing > > > > > Feature", > > > > > > > may > > > > > > > > be > > > > > > > > > I misunderstood something, I thought the features are > defined > > > in > > > > > > broker > > > > > > > > > code, so admin could not really create a new feature? > > > > > > > > > > > > > > > > (Kowshik): Great point! You understood this right. Here > adding > > a > > > > > > feature > > > > > > > > means we are > > > > > > > > adding a cluster-wide finalized *max* version for a feature > > that > > > > was > > > > > > > > previously never finalized. > > > > > > > > I have clarified this in the KIP now. > > > > > > > > > > > > > > > > > 6. I think we need a separate error code like > > > > > > > FEATURE_UPDATE_IN_PROGRESS > > > > > > > > to > > > > > > > > > reject a concurrent feature update request. > > > > > > > > > > > > > > > > (Kowshik): Great point! I have modified the KIP adding the > > above > > > > (see > > > > > > > > 'Tooling support -> Admin API changes'). > > > > > > > > > > > > > > > > > 7. I think we haven't discussed the alternative solution to > > > pass > > > > > the > > > > > > > > > feature information through Zookeeper. Is that mentioned in > > the > > > > KIP > > > > > > to > > > > > > > > > justify why using UpdateMetadata is more favorable? > > > > > > > > > > > > > > > > (Kowshik): Nice question! The broker reads finalized feature > > info > > > > > > stored > > > > > > > in > > > > > > > > ZK, > > > > > > > > only during startup when it does a validation. When serving > > > > > > > > `ApiVersionsRequest`, the > > > > > > > > broker does not read this info from ZK directly. I'd imagine > > the > > > > risk > > > > > > is > > > > > > > > that it can increase > > > > > > > > the ZK read QPS which can be a bottleneck for the system. > > Today, > > > in > > > > > > Kafka > > > > > > > > we use the > > > > > > > > controller to fan out ZK updates to brokers and we want to > > stick > > > to > > > > > > that > > > > > > > > pattern to avoid > > > > > > > > the ZK read bottleneck when serving `ApiVersionsRequest`. > > > > > > > > > > > > > > > > > 8. I was under the impression that user could configure a > > range > > > > of > > > > > > > > > supported versions, what's the trade-off for allowing > single > > > > > > finalized > > > > > > > > > version only? > > > > > > > > > > > > > > > > (Kowshik): Great question! The finalized version of a feature > > > > > basically > > > > > > > > refers to > > > > > > > > the cluster-wide finalized feature "maximum" version. For > > > example, > > > > if > > > > > > the > > > > > > > > 'group_coordinator' feature > > > > > > > > has the finalized version set to 10, then, it means that > > > > cluster-wide > > > > > > all > > > > > > > > versions upto v10 are > > > > > > > > supported for this feature. However, note that if some > version > > > (ex: > > > > > v0) > > > > > > > > gets deprecated > > > > > > > > for this feature, then we don’t convey that using this scheme > > > (also > > > > > > > > supporting deprecation is a non-goal). > > > > > > > > > > > > > > > > (Kowshik): I’ve now modified the KIP at all points, refering > to > > > > > > finalized > > > > > > > > feature "maximum" versions. > > > > > > > > > > > > > > > > > 9. One minor syntax fix: Note that here the "client" here > may > > > be > > > > a > > > > > > > > producer > > > > > > > > > > > > > > > > (Kowshik): Great point! Done. > > > > > > > > > > > > > > > > > > > > > > > > Cheers, > > > > > > > > Kowshik > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Mar 31, 2020 at 1:17 PM Boyang Chen < > > > > > > reluctanthero...@gmail.com> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hey Kowshik, > > > > > > > > > > > > > > > > > > thanks for the revised KIP. Got a couple of questions: > > > > > > > > > > > > > > > > > > 1. "When is it safe for the brokers to begin handling EOS > > > > traffic" > > > > > > > could > > > > > > > > be > > > > > > > > > converted as "When is it safe for the brokers to start > > serving > > > > new > > > > > > > > > Exactly-Once(EOS) semantics" since EOS is not explained > > earlier > > > > in > > > > > > the > > > > > > > > > context. > > > > > > > > > > > > > > > > > > 2. In the *Explanation *section, the metadata version > number > > > part > > > > > > > seems a > > > > > > > > > bit blurred. Could you point a reference to later section > > that > > > we > > > > > > going > > > > > > > > to > > > > > > > > > store it in Zookeeper and update it every time when there > is > > a > > > > > > feature > > > > > > > > > change? > > > > > > > > > > > > > > > > > > 3. For the feature downgrade, although it's a Non-goal of > the > > > > KIP, > > > > > > for > > > > > > > > > features such as group coordinator semantics, there is no > > legal > > > > > > > scenario > > > > > > > > to > > > > > > > > > perform a downgrade at all. So having downgrade door open > is > > > > pretty > > > > > > > > > error-prone as human faults happen all the time. I'm > assuming > > > as > > > > > new > > > > > > > > > features are implemented, it's not very hard to add a flag > > > during > > > > > > > feature > > > > > > > > > creation to indicate whether this feature is > "downgradable". > > > > Could > > > > > > you > > > > > > > > > explain a bit more on the extra engineering effort for > > shipping > > > > > this > > > > > > > KIP > > > > > > > > > with downgrade protection in place? > > > > > > > > > > > > > > > > > > 4. "Each broker’s supported dictionary of feature versions > > will > > > > be > > > > > > > > defined > > > > > > > > > in the broker code." So this means in order to restrict a > > > certain > > > > > > > > feature, > > > > > > > > > we need to start the broker first and then send a feature > > > gating > > > > > > > request > > > > > > > > > immediately, which introduces a time gap and the > > > > intended-to-close > > > > > > > > feature > > > > > > > > > could actually serve request during this phase. Do you > think > > we > > > > > > should > > > > > > > > also > > > > > > > > > support configurations as well so that admin user could > > freely > > > > roll > > > > > > up > > > > > > > a > > > > > > > > > cluster with all nodes complying the same feature gating, > > > without > > > > > > > > worrying > > > > > > > > > about the turnaround time to propagate the message only > after > > > the > > > > > > > cluster > > > > > > > > > starts up? > > > > > > > > > > > > > > > > > > 5. "adding a new Feature, updating or deleting an existing > > > > > Feature", > > > > > > > may > > > > > > > > be > > > > > > > > > I misunderstood something, I thought the features are > defined > > > in > > > > > > broker > > > > > > > > > code, so admin could not really create a new feature? > > > > > > > > > > > > > > > > > > 6. I think we need a separate error code like > > > > > > > FEATURE_UPDATE_IN_PROGRESS > > > > > > > > to > > > > > > > > > reject a concurrent feature update request. > > > > > > > > > > > > > > > > > > 7. I think we haven't discussed the alternative solution to > > > pass > > > > > the > > > > > > > > > feature information through Zookeeper. Is that mentioned in > > the > > > > KIP > > > > > > to > > > > > > > > > justify why using UpdateMetadata is more favorable? > > > > > > > > > > > > > > > > > > 8. I was under the impression that user could configure a > > range > > > > of > > > > > > > > > supported versions, what's the trade-off for allowing > single > > > > > > finalized > > > > > > > > > version only? > > > > > > > > > > > > > > > > > > 9. One minor syntax fix: Note that here the "client" here > may > > > be > > > > a > > > > > > > > producer > > > > > > > > > > > > > > > > > > Boyang > > > > > > > > > > > > > > > > > > On Mon, Mar 30, 2020 at 4:53 PM Colin McCabe < > > > cmcc...@apache.org > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > On Thu, Mar 26, 2020, at 19:24, Kowshik Prakasam wrote: > > > > > > > > > > > Hi Colin, > > > > > > > > > > > > > > > > > > > > > > Thanks for the feedback! I've changed the KIP to > address > > > your > > > > > > > > > > > suggestions. > > > > > > > > > > > Please find below my explanation. Here is a link to KIP > > > 584: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features > > > > > > > > > > > . > > > > > > > > > > > > > > > > > > > > > > 1. '__data_version__' is the version of the finalized > > > feature > > > > > > > > metadata > > > > > > > > > > > (i.e. actual ZK node contents), while the > > > > '__schema_version__' > > > > > is > > > > > > > the > > > > > > > > > > > version of the schema of the data persisted in ZK. > These > > > > serve > > > > > > > > > different > > > > > > > > > > > purposes. '__data_version__' is is useful mainly to > > clients > > > > > > during > > > > > > > > > reads, > > > > > > > > > > > to differentiate between the 2 versions of eventually > > > > > consistent > > > > > > > > > > 'finalized > > > > > > > > > > > features' metadata (i.e. larger metadata version is > more > > > > > recent). > > > > > > > > > > > '__schema_version__' provides an additional degree of > > > > > > flexibility, > > > > > > > > > where > > > > > > > > > > if > > > > > > > > > > > we decide to change the schema for '/features' node in > ZK > > > (in > > > > > the > > > > > > > > > > future), > > > > > > > > > > > then we can manage broker roll outs suitably (i.e. > > > > > > > > > > > serialization/deserialization of the ZK data can be > > handled > > > > > > > safely). > > > > > > > > > > > > > > > > > > > > Hi Kowshik, > > > > > > > > > > > > > > > > > > > > If you're talking about a number that lets you know if > data > > > is > > > > > more > > > > > > > or > > > > > > > > > > less recent, we would typically call that an epoch, and > > not a > > > > > > > version. > > > > > > > > > For > > > > > > > > > > the ZK data structures, the word "version" is typically > > > > reserved > > > > > > for > > > > > > > > > > describing changes to the overall schema of the data that > > is > > > > > > written > > > > > > > to > > > > > > > > > > ZooKeeper. We don't even really change the "version" of > > > those > > > > > > > schemas > > > > > > > > > that > > > > > > > > > > much, since most changes are backwards-compatible. But > we > > do > > > > > > include > > > > > > > > > that > > > > > > > > > > version field just in case. > > > > > > > > > > > > > > > > > > > > I don't think we really need an epoch here, though, since > > we > > > > can > > > > > > just > > > > > > > > > look > > > > > > > > > > at the broker epoch. Whenever the broker registers, its > > > epoch > > > > > will > > > > > > > be > > > > > > > > > > greater than the previous broker epoch. And the newly > > > > registered > > > > > > > data > > > > > > > > > will > > > > > > > > > > take priority. This will be a lot simpler than adding a > > > > separate > > > > > > > epoch > > > > > > > > > > system, I think. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. Regarding admin client needing min and max > > information - > > > > you > > > > > > are > > > > > > > > > > right! > > > > > > > > > > > I've changed the KIP such that the Admin API also > allows > > > the > > > > > user > > > > > > > to > > > > > > > > > read > > > > > > > > > > > 'supported features' from a specific broker. Please > look > > at > > > > the > > > > > > > > section > > > > > > > > > > > "Admin API changes". > > > > > > > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. Regarding the use of `long` vs `Long` - it was not > > > > > deliberate. > > > > > > > > I've > > > > > > > > > > > improved the KIP to just use `long` at all places. > > > > > > > > > > > > > > > > > > > > Sounds good. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 4. Regarding kafka.admin.FeatureCommand tool - you are > > > right! > > > > > > I've > > > > > > > > > > updated > > > > > > > > > > > the KIP sketching the functionality provided by this > > tool, > > > > with > > > > > > > some > > > > > > > > > > > examples. Please look at the section "Tooling support > > > > > examples". > > > > > > > > > > > > > > > > > > > > > > Thank you! > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, Kowshik. > > > > > > > > > > > > > > > > > > > > cheers, > > > > > > > > > > Colin > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Cheers, > > > > > > > > > > > Kowshik > > > > > > > > > > > > > > > > > > > > > > On Wed, Mar 25, 2020 at 11:31 PM Colin McCabe < > > > > > > cmcc...@apache.org> > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > Thanks, Kowshik, this looks good. > > > > > > > > > > > > > > > > > > > > > > > > In the "Schema" section, do we really need both > > > > > > > __schema_version__ > > > > > > > > > and > > > > > > > > > > > > __data_version__? Can we just have a single version > > > field > > > > > > here? > > > > > > > > > > > > > > > > > > > > > > > > Shouldn't the Admin(Client) function have some way to > > get > > > > the > > > > > > min > > > > > > > > and > > > > > > > > > > max > > > > > > > > > > > > information that we're exposing as well? I guess we > > > could > > > > > have > > > > > > > > min, > > > > > > > > > > max, > > > > > > > > > > > > and current. Unrelated: is the use of Long rather > than > > > > long > > > > > > > > > deliberate > > > > > > > > > > > > here? > > > > > > > > > > > > > > > > > > > > > > > > It would be good to describe how the command line > tool > > > > > > > > > > > > kafka.admin.FeatureCommand will work. For example > the > > > > flags > > > > > > that > > > > > > > > it > > > > > > > > > > will > > > > > > > > > > > > take and the output that it will generate to STDOUT. > > > > > > > > > > > > > > > > > > > > > > > > cheers, > > > > > > > > > > > > Colin > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Mar 24, 2020, at 17:08, Kowshik Prakasam > wrote: > > > > > > > > > > > > > Hi all, > > > > > > > > > > > > > > > > > > > > > > > > > > I've opened KIP-584 > > > > > > > <https://issues.apache.org/jira/browse/KIP-584> < > > > > > > > > https://issues.apache.org/jira/browse/KIP-584 > > > > > > > > > > > > > > > > > > > > > > > which > > > > > > > > > > > > > is intended to provide a versioning scheme for > > > features. > > > > > I'd > > > > > > > like > > > > > > > > > to > > > > > > > > > > use > > > > > > > > > > > > > this thread to discuss the same. I'd appreciate any > > > > > feedback > > > > > > on > > > > > > > > > this. > > > > > > > > > > > > > Here > > > > > > > > > > > > > is a link to KIP-584 > > > > > > > <https://issues.apache.org/jira/browse/KIP-584>: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features > > > > > > > > > > > > > . > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you! > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Cheers, > > > > > > > > > > > > > Kowshik > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >