Hi Colin,

Thanks for the feedback! I have updated the KIP based on your feedback. Please find my response below.
> The discussion on ZooKeeper reads versus writes makes sense to me. The important thing to keep in mind here is that in the bridge release, all brokers can read from ZooKeeper, but only the controller writes.

(Kowshik): Great, thanks!

> Why do we need both UpdateFeaturesRequest and DeleteFeaturesRequest? It seems awkward to have "deleting" be a special case here when the general idea is that we have an RPC to change the supported feature flags. Changing the feature level from 2 to 1 doesn't seem that different from changing it from 1 to not present.

(Kowshik): Done, makes sense. I have updated the KIP to use just one API: UpdateFeaturesRequest. For the deletion case, we can just ignore the version number passed in the API (which is indicative of 'not present').

> It would be simpler to just say that a feature flag which doesn't appear in the znode is considered to be at version level 0. This will also simplify the code a lot, I think, since you won't have to keep track of tricky distinctions between "disabled" and "enabled at version 0." Then you would be able to just use an int in most places.

(Kowshik): I'm not sure I understood why we want to do it this way. If an entry for some finalized feature is absent from the '/features' node, we can alternatively treat this as a feature whose version was never finalized/enabled, or was deleted at some point. Then we can even allow for "enabled at version 0", since the {minVersion, maxVersion} range can be any valid range, not necessarily one with minVersion > 0.

> (By the way, I would propose the term "version level" for this number, since it avoids confusion with all the other meanings of the word "version" that we have in the code.)

(Kowshik): Good idea! I have updated the KIP to refer to "version level" instead of "version".

> Another thing to keep in mind is that if a request RPC is batch, the corresponding response RPC also needs to be batch. In other words, you need multiple error codes, one for each feature flag whose level you are trying to change. Unless the idea is that the whole change is a transaction that all either happens or doesn't?

(Kowshik): Yes, the whole change is a transaction. Either all provided FeatureUpdates are carried out in ZK, or none are. That's why we allow for just a single error code field, as it is easier that way. This transactional guarantee is mentioned under 'Proposed changes > New controller API'.

> Rather than FeatureUpdateType, I would just go with a boolean like "force." I'm not sure what other values we'd want to add to this later on, if it were an enum. I think the boolean is clearer.

(Kowshik): Since we have decided to go with just one API (i.e. UpdateFeaturesRequest), it is better for FeatureUpdateType to be an enum with multiple values. A FeatureUpdateType is tied to a feature, and the possible values are: ADD_OR_UPDATE, ADD_OR_UPDATE_ALLOW_DOWNGRADE, DELETE (see the sketch below).

> This ties in with my comment earlier, but for the result classes, we need methods other than just "all". Batch operations aren't usable if you can't get the result per operation.... unless the semantics are transactional and it really is just everything succeeded or everything failed.

(Kowshik): The semantics are transactional, as I explained above.
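To make the above concrete, here is a minimal sketch of what a plain FeatureUpdate class carrying the FeatureUpdateType enum could look like. The class shape and accessor names are illustrative assumptions on my part, not the KIP's final definitions:

import java.util.Objects;

public final class FeatureUpdate {

    // Possible update types tied to a feature, per the discussion above.
    public enum FeatureUpdateType {
        ADD_OR_UPDATE,
        ADD_OR_UPDATE_ALLOW_DOWNGRADE,
        DELETE
    }

    private final String featureName;
    private final FeatureUpdateType updateType;
    private final long versionLevel;

    public FeatureUpdate(String featureName, FeatureUpdateType updateType, long versionLevel) {
        this.featureName = Objects.requireNonNull(featureName);
        this.updateType = Objects.requireNonNull(updateType);
        this.versionLevel = versionLevel;
    }

    // Accessors avoid the "get" prefix, following Kafka convention.
    public String featureName() { return featureName; }

    public FeatureUpdateType updateType() { return updateType; }

    // For DELETE, this value is ignored (the feature becomes 'not present').
    public long versionLevel() { return versionLevel; }
}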
> There are a bunch of Java interfaces described like FinalizedFeature, FeatureUpdate, UpdateFeaturesResult, and so on that should just be regular concrete Java classes. In general we'd only use an interface if we wanted the caller to implement some kind of callback function. We don't make classes that are just designed to hold data into interfaces, since that just imposes extra work on callers (they have to define their own concrete class for each interface just to use the API.) There's also probably no reason to have these classes inherit from each other or have complex type relationships. One more nitpick is that Kafka generally doesn't use "get" in the function names of accessors.

(Kowshik): Done, I have changed the KIP. By 'interface', I just meant an interface in the pseudocode sense (i.e. an abstraction providing at least the specified behavior). Since that was a bit confusing, I have now renamed these to classes. I have also eliminated the type relationships.

Cheers,
Kowshik

On Fri, Apr 3, 2020 at 5:54 PM Kowshik Prakasam <kpraka...@confluent.io> wrote:

> Hi Jun,
>
> Thanks for the feedback and suggestions. Please find my response below.
>
> > 100.6 For every new request, the admin needs to control who is allowed to issue that request if security is enabled. So, we need to assign the new request a ResourceType and possible AclOperations. See https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A+Create+an+Administrative+API+for+Replica+Reassignment as an example.
>
> (Kowshik): I don't see any reference to the words ResourceType or AclOperations in the KIP. Please let me know how I can use the KIP that you linked to figure out how to set up the appropriate ResourceType and/or AclOperations.
>
> > 105. If we change delete to disable, it's better to do this consistently in request protocol and admin api as well.
>
> (Kowshik): The API shouldn't be called 'disable' when it is deleting a feature, so I've just changed the KIP to use 'delete'. I don't have a strong preference.
>
> > 110. The minVersion/maxVersion for features use int64. Currently, our release version schema is major.minor.bugfix (e.g. 2.5.0). It's possible for new features to be included in minor releases too. Should we make the feature versioning match the release versioning?
>
> (Kowshik): The release version can be mapped to a set of feature versions, and this can be done, for example, in the tool (or even external to the tool). Can you please clarify what I'm missing?
>
> > 111. "During regular operations, the data in the ZK node can be mutated only via a specific admin API served only by the controller." I am wondering why can't the controller auto finalize a feature version after all brokers are upgraded? For new users who download the latest version to build a new cluster, it's inconvenient for them to have to manually enable each feature.
>
> (Kowshik): I agree that there is a trade-off here, but it will help to decide whether the automation can be thought through in a follow-up KIP, or right now in this KIP. We may invest in automation, but we have to decide whether we should do it now or later.
>
> For the inconvenience that you mentioned, do you think the problem can be overcome by asking the cluster operator to run a bootstrap script when he/she knows that a specific AK release has been almost completely deployed in a cluster for the first time? The idea is that the bootstrap script will know how to map a specific AK release to finalized feature versions, and run the `kafka-features.sh` tool appropriately against the cluster (see the sketch below).
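> To sketch the bootstrap idea, a helper could hard-code the mapping from an AK release to the feature version levels to finalize, along the following lines. The release strings, feature names, version levels, and the FeatureBootstrap name are all made-up examples for illustration, not content from the KIP:
>
> import java.util.Map;
>
> public final class FeatureBootstrap {
>
>     // Hypothetical mapping: AK release -> feature name -> finalized version level.
>     private static final Map<String, Map<String, Long>> RELEASE_TO_FINALIZED_FEATURES =
>         Map.of(
>             "2.5.0", Map.of("group_coordinator", 1L),
>             "2.6.0", Map.of("group_coordinator", 2L, "transaction_coordinator", 1L));
>
>     public static void main(String[] args) {
>         String release = args[0]; // e.g. "2.6.0", once that release is fully deployed
>         Map<String, Long> features = RELEASE_TO_FINALIZED_FEATURES.get(release);
>         if (features == null) {
>             throw new IllegalArgumentException("Unknown release: " + release);
>         }
>         // The script would then invoke the kafka-features.sh tool against the
>         // cluster to finalize each feature at its version level.
>         features.forEach((name, versionLevel) ->
>             System.out.printf("finalize feature '%s' at version level %d%n", name, versionLevel));
>     }
> }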
> Now, coming back to your automation proposal/question. I do see the value of automated feature version finalization, but I also see that this will open up several questions and some risks, as explained below. The answers to these depend on the definition of the automation we choose to build, and how well it fits into a Kafka deployment. Basically, it can be unsafe for the controller to finalize feature version upgrades automatically, without learning about the intent of the cluster operator.
>
> 1. We would sometimes want to lock feature versions only when we have externally verified the stability of the broker binary.
> 2. Sometimes only the cluster operator knows that a cluster upgrade is complete, and new brokers are highly unlikely to join the cluster.
> 3. Only the cluster operator knows that the intent is to deploy the same version of the new broker release across the entire cluster (i.e. the latest downloaded version).
> 4. For downgrades, it appears the controller still needs some external input (such as the proposed tool) to finalize a feature version downgrade.
>
> If we have automation, that automation can end up failing in some of the cases above. Then, we need a way to declare that the cluster is "not ready" if the controller cannot automatically finalize some basic required feature version upgrades across the cluster. We need to make the cluster operator aware in such a scenario (raise an alert or the like).
>
> > 112. DeleteFeaturesResponse: It seems the apiKey should be 49 instead of 48.
>
> (Kowshik): Done.
>
> Cheers,
> Kowshik
>
> On Fri, Apr 3, 2020 at 11:24 AM Jun Rao <j...@confluent.io> wrote:
>
>> Hi, Kowshik,
>>
>> Thanks for the reply. A few more comments below.
>>
>> 100.6 For every new request, the admin needs to control who is allowed to issue that request if security is enabled. So, we need to assign the new request a ResourceType and possible AclOperations. See https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A+Create+an+Administrative+API+for+Replica+Reassignment as an example.
>>
>> 105. If we change delete to disable, it's better to do this consistently in request protocol and admin api as well.
>>
>> 110. The minVersion/maxVersion for features use int64. Currently, our release version schema is major.minor.bugfix (e.g. 2.5.0). It's possible for new features to be included in minor releases too. Should we make the feature versioning match the release versioning?
>>
>> 111. "During regular operations, the data in the ZK node can be mutated only via a specific admin API served only by the controller." I am wondering why can't the controller auto finalize a feature version after all brokers are upgraded? For new users who download the latest version to build a new cluster, it's inconvenient for them to have to manually enable each feature.
>>
>> 112. DeleteFeaturesResponse: It seems the apiKey should be 49 instead of 48.
>>
>> Jun
>>
>> On Fri, Apr 3, 2020 at 1:27 AM Kowshik Prakasam <kpraka...@confluent.io> wrote:
>>
>> > Hey Jun,
>> >
>> > Thanks a lot for the great feedback! Please note that the design has changed a little bit on the KIP, and we now propagate the finalized features metadata only via ZK watches (instead of UpdateMetadataRequest from the controller).
>> >
>> > Please find below my response to your questions/feedback, with the prefix "(Kowshik):".
>> > > 100. UpdateFeaturesRequest/UpdateFeaturesResponse
>> > > 100.1 Since this request waits for responses from brokers, should we add a timeout in the request (like createTopicRequest)?
>> >
>> > (Kowshik): Great point! Done. I have added a timeout field. Note: we no longer wait for responses from brokers, since the design has been changed so that the features information is propagated via ZK. Nevertheless, it is right to have a timeout for the request.
>> >
>> > > 100.2 The response schema is a bit weird. Typically, the response just shows an error code and an error message, instead of echoing the request.
>> >
>> > (Kowshik): Great point! Yeah, I have modified it to just return an error code and a message. Previously it was not echoing the "request"; rather, it was returning the latest set of cluster-wide finalized features (after applying the updates). But you are right, the additional info is not required, so I have removed it from the response schema.
>> >
>> > > 100.3 Should we add a separate request to list/describe the existing features?
>> >
>> > (Kowshik): This is already present in the KIP via the 'DescribeFeatures' Admin API, which, under the covers, uses the ApiVersionsRequest to list/describe the existing features. Please read the 'Tooling support' section.
>> >
>> > > 100.4 We are mixing ADD_OR_UPDATE and DELETE in a single request. For DELETE, the version field doesn't make sense. So, I guess the broker just ignores this? An alternative way is to have a separate DeleteFeaturesRequest
>> >
>> > (Kowshik): Great point! I have modified the KIP now to have 2 separate controller APIs serving these different purposes:
>> > 1. updateFeatures
>> > 2. deleteFeatures
>> >
>> > > 100.5 In UpdateFeaturesResponse, we have "The monotonically increasing version of the metadata for finalized features." I am wondering why the ordering is important?
>> >
>> > (Kowshik): In the latest KIP write-up, it is called epoch (instead of version), and it is just the ZK node version. Basically, this is the epoch for the cluster-wide finalized feature version metadata. This metadata is served to clients via the ApiVersionsResponse (for reads). We propagate updates from the '/features' ZK node to all brokers, via ZK watches set up by each broker on the '/features' node.
>> >
>> > Now here is why the ordering is important: ZK watches don't propagate at the same time. As a result, the ApiVersionsResponse is eventually consistent across brokers. This can introduce cases where clients see an older, lower epoch of the features metadata after a more recent, higher epoch was returned at a previous point in time. We expect clients to always employ the rule that the latest received higher epoch of metadata always trumps an older, smaller epoch (see the sketch below). Those clients that are external to Kafka should strongly consider discovering the latest metadata once during startup from the brokers, and if required refresh the metadata periodically (to get the latest metadata).
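>> > To illustrate this rule, here is a minimal client-side sketch (an assumed illustration, not code from the KIP): the cache accepts finalized-features metadata only when it carries a higher epoch than anything seen before, so stale ApiVersionsResponses from lagging brokers are ignored.
>> >
>> > import java.util.Map;
>> >
>> > public final class FinalizedFeaturesCache {
>> >     private long latestEpoch = -1;
>> >     private Map<String, Long> finalizedFeatures = Map.of();
>> >
>> >     // Called with the epoch and finalized features from an ApiVersionsResponse.
>> >     public synchronized void maybeUpdate(long epoch, Map<String, Long> features) {
>> >         if (epoch > latestEpoch) {
>> >             // The latest received higher epoch trumps an older, smaller epoch.
>> >             latestEpoch = epoch;
>> >             finalizedFeatures = features;
>> >         }
>> >         // Otherwise the response is stale and is ignored.
>> >     }
>> >
>> >     public synchronized Map<String, Long> finalizedFeatures() {
>> >         return finalizedFeatures;
>> >     }
>> > }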
>> > > 100.6 Could you specify the required ACL for this new request?
>> >
>> > (Kowshik): What is an ACL, and how could I find out which one to specify? Could you please provide me some pointers? I'll be glad to update the KIP once I know the next steps.
>> >
>> > > 101. For the broker registration ZK node, should we bump up the version in the json?
>> >
>> > (Kowshik): Great point! Done. I've increased the version in the broker json by 1.
>> >
>> > > 102. For the /features ZK node, not sure if we need the epoch field. Each ZK node has an internal version field that is incremented on every update.
>> >
>> > (Kowshik): Great point! Done. I'm using the ZK node version now, instead of an explicitly incremented epoch.
>> >
>> > > 103. "Enabling the actual semantics of a feature version cluster-wide is left to the discretion of the logic implementing the feature (ex: can be done via dynamic broker config)." Does that mean the broker registration ZK node will be updated dynamically when this happens?
>> >
>> > (Kowshik): Not really. The text was just conveying that a broker could "know" of a new feature version, but that does not mean the broker should have also activated the effects of the feature version. Knowing and activation are 2 separate things, and the latter can be achieved by dynamic config. I have reworded the text to make this clear to the reader.
>> >
>> > > 104. UpdateMetadataRequest
>> > > 104.1 It would be useful to describe when the feature metadata is included in the request. My understanding is that it's only included if (1) there is a change to the finalized feature; (2) broker restart; (3) controller failover.
>> > > 104.2 The new fields have the following versions. Why are the versions 3+ when the top version is bumped to 6?
>> > > "fields": [
>> > > {"name": "Name", "type": "string", "versions": "3+",
>> > > "about": "The name of the feature."},
>> > > {"name": "Version", "type": "int64", "versions": "3+",
>> > > "about": "The finalized version for the feature."}
>> > > ]
>> >
>> > (Kowshik): With the new improved design, we have completely eliminated the need to use UpdateMetadataRequest. This is because we now rely on ZK to deliver the notifications for changes to the '/features' ZK node (a rough sketch follows at the end of this message).
>> >
>> > > 105. kafka-features.sh: Instead of using update/delete, perhaps it's better to use enable/disable?
>> >
>> > (Kowshik): For delete, yes, I have changed it so that we instead call it 'disable'. However, 'update' can now refer to either an upgrade or a forced downgrade. Therefore, I have left it the way it is, calling it just 'update'.
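>> > As a rough sketch of the ZK watch mechanism mentioned in 100.5 and 104 above (an assumed shape, not the KIP's implementation), each broker could re-read the '/features' znode whenever it changes, re-registering the watch on every read; the znode's own version doubles as the metadata epoch:
>> >
>> > import org.apache.zookeeper.WatchedEvent;
>> > import org.apache.zookeeper.Watcher;
>> > import org.apache.zookeeper.ZooKeeper;
>> > import org.apache.zookeeper.data.Stat;
>> >
>> > public final class FeaturesZkWatcher implements Watcher {
>> >     private static final String FEATURES_PATH = "/features";
>> >     private final ZooKeeper zk;
>> >
>> >     public FeaturesZkWatcher(ZooKeeper zk) throws Exception {
>> >         this.zk = zk;
>> >         refresh();
>> >     }
>> >
>> >     // Reads the node and re-registers this watcher, since ZK watches are one-shot.
>> >     private void refresh() throws Exception {
>> >         Stat stat = new Stat();
>> >         byte[] data = zk.getData(FEATURES_PATH, this, stat);
>> >         onFeaturesChange(stat.getVersion(), data); // node version = metadata epoch
>> >     }
>> >
>> >     // Placeholder for deserializing and applying the finalized features metadata.
>> >     private void onFeaturesChange(int epoch, byte[] data) {
>> >         System.out.println("finalized features changed, epoch=" + epoch);
>> >     }
>> >
>> >     @Override
>> >     public void process(WatchedEvent event) {
>> >         if (event.getType() == Event.EventType.NodeDataChanged) {
>> >             try {
>> >                 refresh();
>> >             } catch (Exception e) {
>> >                 // Error handling omitted in this sketch.
>> >             }
>> >         }
>> >     }
>> > }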
>> > Cheers,
>> > Kowshik
>> >
>> > On Tue, Mar 31, 2020 at 6:51 PM Jun Rao <j...@confluent.io> wrote:
>> >
>> > > Hi, Kowshik,
>> > >
>> > > Thanks for the KIP. Looks good overall. A few comments below.
>> > >
>> > > 100. UpdateFeaturesRequest/UpdateFeaturesResponse
>> > > 100.1 Since this request waits for responses from brokers, should we add a timeout in the request (like createTopicRequest)?
>> > > 100.2 The response schema is a bit weird. Typically, the response just shows an error code and an error message, instead of echoing the request.
>> > > 100.3 Should we add a separate request to list/describe the existing features?
>> > > 100.4 We are mixing ADD_OR_UPDATE and DELETE in a single request. For DELETE, the version field doesn't make sense. So, I guess the broker just ignores this? An alternative way is to have a separate DeleteFeaturesRequest
>> > > 100.5 In UpdateFeaturesResponse, we have "The monotonically increasing version of the metadata for finalized features." I am wondering why the ordering is important?
>> > > 100.6 Could you specify the required ACL for this new request?
>> > >
>> > > 101. For the broker registration ZK node, should we bump up the version in the json?
>> > >
>> > > 102. For the /features ZK node, not sure if we need the epoch field. Each ZK node has an internal version field that is incremented on every update.
>> > >
>> > > 103. "Enabling the actual semantics of a feature version cluster-wide is left to the discretion of the logic implementing the feature (ex: can be done via dynamic broker config)." Does that mean the broker registration ZK node will be updated dynamically when this happens?
>> > >
>> > > 104. UpdateMetadataRequest
>> > > 104.1 It would be useful to describe when the feature metadata is included in the request. My understanding is that it's only included if (1) there is a change to the finalized feature; (2) broker restart; (3) controller failover.
>> > > 104.2 The new fields have the following versions. Why are the versions 3+ when the top version is bumped to 6?
>> > > "fields": [
>> > > {"name": "Name", "type": "string", "versions": "3+",
>> > > "about": "The name of the feature."},
>> > > {"name": "Version", "type": "int64", "versions": "3+",
>> > > "about": "The finalized version for the feature."}
>> > > ]
>> > >
>> > > 105. kafka-features.sh: Instead of using update/delete, perhaps it's better to use enable/disable?
>> > >
>> > > Jun
>> > >
>> > > On Tue, Mar 31, 2020 at 5:29 PM Kowshik Prakasam <kpraka...@confluent.io> wrote:
>> > >
>> > > > Hey Boyang,
>> > > >
>> > > > Thanks for the great feedback! I have updated the KIP based on your feedback. Please find my response below for your comments; look for sentences starting with "(Kowshik)" below.
>> > > >
>> > > > > 1. "When is it safe for the brokers to begin handling EOS traffic" could be converted to "When is it safe for the brokers to start serving new Exactly-Once(EOS) semantics" since EOS is not explained earlier in the context.
>> > > >
>> > > > (Kowshik): Great point! Done.
>> > > >
>> > > > > 2. In the *Explanation* section, the metadata version number part seems a bit blurred. Could you add a reference to the later section describing that we are going to store it in ZooKeeper and update it every time there is a feature change?
>> > > >
>> > > > (Kowshik): Great point! Done. I've added a reference in the KIP.
>> > > >
>> > > > > 3. For the feature downgrade, although it's a non-goal of the KIP, for features such as group coordinator semantics, there is no legal scenario to perform a downgrade at all. So having the downgrade door open is pretty error-prone, as human faults happen all the time. I'm assuming as new features are implemented, it's not very hard to add a flag during feature creation to indicate whether this feature is "downgradable".
>> > > > > Could you explain a bit more on the extra engineering effort for shipping this KIP with downgrade protection in place?
>> > > >
>> > > > (Kowshik): Great point! I'd agree and disagree here. While I agree that accidental downgrades can cause problems, I also think downgrades should sometimes be allowed for emergency reasons (not all downgrades cause issues). It depends on the feature being downgraded.
>> > > >
>> > > > To be more strict about feature version downgrades, I have modified the KIP proposing that we mandate a `--force-downgrade` flag be used in the UPDATE_FEATURES api and the tooling, whenever a human is downgrading a finalized feature version. Hopefully this should cover the requirement, until we find the need for advanced downgrade support.
>> > > >
>> > > > > 4. "Each broker's supported dictionary of feature versions will be defined in the broker code." So this means in order to restrict a certain feature, we need to start the broker first and then send a feature gating request immediately, which introduces a time gap during which the intended-to-close feature could actually serve requests. Do you think we should also support configurations, so that the admin user could freely roll out a cluster with all nodes complying with the same feature gating, without worrying about the turnaround time to propagate the message only after the cluster starts up?
>> > > >
>> > > > (Kowshik): This is a great point/question. One of the expectations out of this KIP, which is already followed in the broker, is the following.
>> > > > - Imagine at time T1 the broker starts up and registers its presence in ZK, along with advertising its supported features.
>> > > > - Imagine at a future time T2 the broker receives the UpdateMetadataRequest from the controller, which contains the latest finalized features as seen by the controller. The broker validates this data against its supported features to make sure there is no mismatch (it will shut down if there is an incompatibility; see the sketch below).
>> > > >
>> > > > It is expected that during the time between the 2 events T1 and T2, the broker is almost a silent entity in the cluster. It does not add any value to the cluster, or carry out any important broker activities. By "important", I mean it is not doing mutations on its persistence, not mutating critical in-memory state, and won't be serving produce/fetch requests. Note it doesn't even know its assigned partitions until it receives the UpdateMetadataRequest from the controller. Anything the broker is doing up until this point is neither damaging nor useful.
>> > > >
>> > > > I've clarified the above in the KIP; see this new section:
>> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features#KIP-584:Versioningschemeforfeatures-Incompatiblebrokerlifetime
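>> > > > The validation at T2 could look roughly like the following (an illustrative sketch with assumed data shapes, not the broker's actual code): the broker checks each cluster-wide finalized version level against the [minVersion, maxVersion] range it supports, and shuts down on any mismatch.
>> > > >
>> > > > import java.util.Map;
>> > > >
>> > > > public final class FeatureCompatibilityCheck {
>> > > >
>> > > >     // supported maps feature name -> {minVersion, maxVersion} advertised by this broker.
>> > > >     static boolean hasIncompatibility(Map<String, Long> finalized,
>> > > >                                       Map<String, long[]> supported) {
>> > > >         for (Map.Entry<String, Long> entry : finalized.entrySet()) {
>> > > >             long[] range = supported.get(entry.getKey());
>> > > >             if (range == null || entry.getValue() < range[0] || entry.getValue() > range[1]) {
>> > > >                 return true; // unknown feature, or finalized level outside the supported range
>> > > >             }
>> > > >         }
>> > > >         return false; // compatible: every finalized feature falls within a supported range
>> > > >     }
>> > > > }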
"adding a new Feature, updating or deleting an existing >> Feature", >> > > may >> > > > be >> > > > > I misunderstood something, I thought the features are defined in >> > broker >> > > > > code, so admin could not really create a new feature? >> > > > >> > > > (Kowshik): Great point! You understood this right. Here adding a >> > feature >> > > > means we are >> > > > adding a cluster-wide finalized *max* version for a feature that was >> > > > previously never finalized. >> > > > I have clarified this in the KIP now. >> > > > >> > > > > 6. I think we need a separate error code like >> > > FEATURE_UPDATE_IN_PROGRESS >> > > > to >> > > > > reject a concurrent feature update request. >> > > > >> > > > (Kowshik): Great point! I have modified the KIP adding the above >> (see >> > > > 'Tooling support -> Admin API changes'). >> > > > >> > > > > 7. I think we haven't discussed the alternative solution to pass >> the >> > > > > feature information through Zookeeper. Is that mentioned in the >> KIP >> > to >> > > > > justify why using UpdateMetadata is more favorable? >> > > > >> > > > (Kowshik): Nice question! The broker reads finalized feature info >> > stored >> > > in >> > > > ZK, >> > > > only during startup when it does a validation. When serving >> > > > `ApiVersionsRequest`, the >> > > > broker does not read this info from ZK directly. I'd imagine the >> risk >> > is >> > > > that it can increase >> > > > the ZK read QPS which can be a bottleneck for the system. Today, in >> > Kafka >> > > > we use the >> > > > controller to fan out ZK updates to brokers and we want to stick to >> > that >> > > > pattern to avoid >> > > > the ZK read bottleneck when serving `ApiVersionsRequest`. >> > > > >> > > > > 8. I was under the impression that user could configure a range of >> > > > > supported versions, what's the trade-off for allowing single >> > finalized >> > > > > version only? >> > > > >> > > > (Kowshik): Great question! The finalized version of a feature >> basically >> > > > refers to >> > > > the cluster-wide finalized feature "maximum" version. For example, >> if >> > the >> > > > 'group_coordinator' feature >> > > > has the finalized version set to 10, then, it means that >> cluster-wide >> > all >> > > > versions upto v10 are >> > > > supported for this feature. However, note that if some version (ex: >> v0) >> > > > gets deprecated >> > > > for this feature, then we don’t convey that using this scheme (also >> > > > supporting deprecation is a non-goal). >> > > > >> > > > (Kowshik): I’ve now modified the KIP at all points, refering to >> > finalized >> > > > feature "maximum" versions. >> > > > >> > > > > 9. One minor syntax fix: Note that here the "client" here may be a >> > > > producer >> > > > >> > > > (Kowshik): Great point! Done. >> > > > >> > > > >> > > > Cheers, >> > > > Kowshik >> > > > >> > > > >> > > > On Tue, Mar 31, 2020 at 1:17 PM Boyang Chen < >> > reluctanthero...@gmail.com> >> > > > wrote: >> > > > >> > > > > Hey Kowshik, >> > > > > >> > > > > thanks for the revised KIP. Got a couple of questions: >> > > > > >> > > > > 1. "When is it safe for the brokers to begin handling EOS traffic" >> > > could >> > > > be >> > > > > converted as "When is it safe for the brokers to start serving new >> > > > > Exactly-Once(EOS) semantics" since EOS is not explained earlier in >> > the >> > > > > context. >> > > > > >> > > > > 2. In the *Explanation *section, the metadata version number part >> > > seems a >> > > > > bit blurred. 
>> > > > Cheers,
>> > > > Kowshik
>> > > >
>> > > > On Tue, Mar 31, 2020 at 1:17 PM Boyang Chen <reluctanthero...@gmail.com> wrote:
>> > > >
>> > > > > Hey Kowshik,
>> > > > >
>> > > > > thanks for the revised KIP. Got a couple of questions:
>> > > > >
>> > > > > 1. "When is it safe for the brokers to begin handling EOS traffic" could be converted to "When is it safe for the brokers to start serving new Exactly-Once(EOS) semantics" since EOS is not explained earlier in the context.
>> > > > >
>> > > > > 2. In the *Explanation* section, the metadata version number part seems a bit blurred. Could you add a reference to the later section describing that we are going to store it in ZooKeeper and update it every time there is a feature change?
>> > > > >
>> > > > > 3. For the feature downgrade, although it's a non-goal of the KIP, for features such as group coordinator semantics, there is no legal scenario to perform a downgrade at all. So having the downgrade door open is pretty error-prone, as human faults happen all the time. I'm assuming as new features are implemented, it's not very hard to add a flag during feature creation to indicate whether this feature is "downgradable". Could you explain a bit more on the extra engineering effort for shipping this KIP with downgrade protection in place?
>> > > > >
>> > > > > 4. "Each broker's supported dictionary of feature versions will be defined in the broker code." So this means in order to restrict a certain feature, we need to start the broker first and then send a feature gating request immediately, which introduces a time gap during which the intended-to-close feature could actually serve requests. Do you think we should also support configurations, so that the admin user could freely roll out a cluster with all nodes complying with the same feature gating, without worrying about the turnaround time to propagate the message only after the cluster starts up?
>> > > > >
>> > > > > 5. "adding a new Feature, updating or deleting an existing Feature": maybe I misunderstood something; I thought the features are defined in broker code, so the admin could not really create a new feature?
>> > > > >
>> > > > > 6. I think we need a separate error code like FEATURE_UPDATE_IN_PROGRESS to reject a concurrent feature update request.
>> > > > >
>> > > > > 7. I think we haven't discussed the alternative solution of passing the feature information through ZooKeeper. Is that mentioned in the KIP to justify why using UpdateMetadata is more favorable?
>> > > > >
>> > > > > 8. I was under the impression that the user could configure a range of supported versions; what's the trade-off of allowing a single finalized version only?
>> > > > >
>> > > > > 9. One minor syntax fix: Note that here the "client" here may be a producer
>> > > > >
>> > > > > Boyang
>> > > > >
>> > > > > On Mon, Mar 30, 2020 at 4:53 PM Colin McCabe <cmcc...@apache.org> wrote:
>> > > > >
>> > > > > > On Thu, Mar 26, 2020, at 19:24, Kowshik Prakasam wrote:
>> > > > > > > Hi Colin,
>> > > > > > >
>> > > > > > > Thanks for the feedback! I've changed the KIP to address your suggestions. Please find below my explanation. Here is a link to KIP 584:
>> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features
>> > > > > > >
>> > > > > > > 1. '__data_version__' is the version of the finalized feature metadata (i.e. actual ZK node contents), while the '__schema_version__' is the version of the schema of the data persisted in ZK.
>> > > > > > > These serve different purposes. '__data_version__' is useful mainly to clients during reads, to differentiate between the 2 versions of the eventually consistent 'finalized features' metadata (i.e. the larger metadata version is more recent). '__schema_version__' provides an additional degree of flexibility, where if we decide to change the schema for the '/features' node in ZK (in the future), then we can manage broker rollouts suitably (i.e. serialization/deserialization of the ZK data can be handled safely).
>> > > > > >
>> > > > > > Hi Kowshik,
>> > > > > >
>> > > > > > If you're talking about a number that lets you know if data is more or less recent, we would typically call that an epoch, and not a version. For the ZK data structures, the word "version" is typically reserved for describing changes to the overall schema of the data that is written to ZooKeeper. We don't even really change the "version" of those schemas that much, since most changes are backwards-compatible. But we do include that version field just in case.
>> > > > > >
>> > > > > > I don't think we really need an epoch here, though, since we can just look at the broker epoch. Whenever the broker registers, its epoch will be greater than the previous broker epoch. And the newly registered data will take priority. This will be a lot simpler than adding a separate epoch system, I think.
>> > > > > >
>> > > > > > > 2. Regarding the admin client needing min and max information - you are right! I've changed the KIP such that the Admin API also allows the user to read 'supported features' from a specific broker. Please look at the section "Admin API changes".
>> > > > > >
>> > > > > > Thanks.
>> > > > > >
>> > > > > > > 3. Regarding the use of `long` vs `Long` - it was not deliberate. I've improved the KIP to just use `long` in all places.
>> > > > > >
>> > > > > > Sounds good.
>> > > > > >
>> > > > > > > 4. Regarding the kafka.admin.FeatureCommand tool - you are right! I've updated the KIP sketching the functionality provided by this tool, with some examples. Please look at the section "Tooling support examples".
>> > > > > > >
>> > > > > > > Thank you!
>> > > > > >
>> > > > > > Thanks, Kowshik.
>> > > > > >
>> > > > > > cheers,
>> > > > > > Colin
>> > > > > >
>> > > > > > > Cheers,
>> > > > > > > Kowshik
>> > > > > > >
>> > > > > > > On Wed, Mar 25, 2020 at 11:31 PM Colin McCabe <cmcc...@apache.org> wrote:
>> > > > > > >
>> > > > > > > > Thanks, Kowshik, this looks good.
>> > > > > > > >
>> > > > > > > > In the "Schema" section, do we really need both __schema_version__ and __data_version__? Can we just have a single version field here?
>> > > > > > > > Shouldn't the Admin(Client) function have some way to get the min and max information that we're exposing as well? I guess we could have min, max, and current. Unrelated: is the use of Long rather than long deliberate here?
>> > > > > > > >
>> > > > > > > > It would be good to describe how the command line tool kafka.admin.FeatureCommand will work. For example, the flags that it will take and the output that it will generate to STDOUT.
>> > > > > > > >
>> > > > > > > > cheers,
>> > > > > > > > Colin
>> > > > > > > >
>> > > > > > > > On Tue, Mar 24, 2020, at 17:08, Kowshik Prakasam wrote:
>> > > > > > > > > Hi all,
>> > > > > > > > >
>> > > > > > > > > I've opened KIP-584, which is intended to provide a versioning scheme for features. I'd like to use this thread to discuss the same. I'd appreciate any feedback on this. Here is a link to KIP-584:
>> > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features
>> > > > > > > > >
>> > > > > > > > > Thank you!
>> > > > > > > > >
>> > > > > > > > > Cheers,
>> > > > > > > > > Kowshik