After the discussion today about the clarity and flexibility of the flags vs. lists for internal topics and deleted topics, I think I will switch back to using booleans inside the topic metadata. This is a clearer representation of the intent, adds little overhead (especially since users can now request no topics), and is simpler and more flexible to evolve if we ever decide we need to represent more "states".
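For concreteness, here is roughly the shape I have in mind for the per-topic entry. This is a quick sketch only; these are not the real protocol classes, and the field names are illustrative rather than final.

```java
import java.util.List;

// Illustrative only; not the actual Kafka protocol classes. It just shows the shape of
// a per-topic metadata entry carrying the two booleans instead of separate lists.
public class TopicMetadataSketch {

    record TopicMetadata(String topic, short errorCode,
                         boolean isInternal, boolean markedForDeletion) {}

    public static void main(String[] args) {
        List<TopicMetadata> topics = List.of(
                new TopicMetadata("__consumer_offsets", (short) 0, true, false),
                new TopicMetadata("clickstream", (short) 0, false, true));

        // Clients read the flags directly instead of cross-referencing separate lists.
        for (TopicMetadata t : topics) {
            System.out.printf("%s internal=%b markedForDeletion=%b%n",
                    t.topic(), t.isInternal(), t.markedForDeletion());
        }
    }
}
```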
I will update the patch in the next day or so and post a vote if no issues are raised with that.

Thanks,
Grant

On Tue, Apr 5, 2016 at 4:05 PM, Grant Henke <ghe...@cloudera.com> wrote:

> Hi Jun,
>
> See my responses below:
>
>> 2. The issues that I was thinking of are the following. (a) Say the controller has topic deletion disabled and a topic deletion request is submitted to ZK. In this case, the controller will ignore this request. However, the broker may pick up the topic deletion marker in a transient window. (b) Suppose that a topic is deleted and then recreated immediately. It is possible for a broker to see the newly created topic and then the previous topic deletion marker in a transient window. Thinking about this a bit more, both seem to be transient, so it may not be a big concern. So, I am ok with this as long as the interim solution is not too complicated. Another thing to think through: if a topic is marked for deletion, do we still return the partition-level metadata?
>
> I am not changing anything about the metadata content, only adding a boolean based on the marked-for-deletion flag in ZooKeeper. This maintains the same approach that the topics script uses today. I do think delete improvements should be considered/reviewed, but the goal here is to allow the broker to report the value that it sees, which is the value in ZooKeeper.
>
>> 5. The issue is the following. If you have a partition with 3 replicas 4,5,6, the leader is on replica 4 and replica 5 is down. Currently, the broker will send a REPLICA_NOT_AVAILABLE error code and only replicas 4,6 in the assigned replicas. It's more intuitive to send no error code and 4,5,6 in the assigned replicas in this case.
>
> Should the list with no error code just be 4,6, since 5 is not available?
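(To make the replica example above concrete, here is the difference between the two behaviors as I read them. This is a standalone sketch with made-up names, not the actual MetadataCache code.)

```java
import java.util.List;
import java.util.Set;

// Sketch of the two behaviors being discussed for partition metadata when an assigned
// replica is down but the leader is fine. Names are made up for illustration.
public class ReplicaReportingSketch {

    static final short NONE = 0;
    static final short REPLICA_NOT_AVAILABLE = 9; // Kafka's error code for this case

    record PartitionMetadata(short errorCode, int leader, List<Integer> replicas) {}

    // Current (version 0) style: flag the partition and drop the dead replica.
    static PartitionMetadata currentBehavior(List<Integer> assigned, int leader, Set<Integer> live) {
        List<Integer> liveReplicas = assigned.stream().filter(live::contains).toList();
        short error = liveReplicas.size() < assigned.size() ? REPLICA_NOT_AVAILABLE : NONE;
        return new PartitionMetadata(error, leader, liveReplicas);
    }

    // Suggested style: if the leader itself is live, report no error and the full assignment.
    static PartitionMetadata suggestedBehavior(List<Integer> assigned, int leader, Set<Integer> live) {
        return new PartitionMetadata(NONE, leader, assigned);
    }

    public static void main(String[] args) {
        List<Integer> assigned = List.of(4, 5, 6);
        Set<Integer> live = Set.of(4, 6); // broker 5 is down, leader is 4
        System.out.println(currentBehavior(assigned, 4, live));   // error code 9, replicas [4, 6]
        System.out.println(suggestedBehavior(assigned, 4, live)); // no error, replicas [4, 5, 6]
    }
}
```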
> On Mon, Apr 4, 2016 at 1:34 PM, Jun Rao <j...@confluent.io> wrote:
>
>> Grant,
>>
>> 2. The issues that I was thinking of are the following. (a) Say the controller has topic deletion disabled and a topic deletion request is submitted to ZK. In this case, the controller will ignore this request. However, the broker may pick up the topic deletion marker in a transient window. (b) Suppose that a topic is deleted and then recreated immediately. It is possible for a broker to see the newly created topic and then the previous topic deletion marker in a transient window. Thinking about this a bit more, both seem to be transient, so it may not be a big concern. So, I am ok with this as long as the interim solution is not too complicated. Another thing to think through: if a topic is marked for deletion, do we still return the partition-level metadata?
>>
>> 3. Your explanation on controller id seems reasonable to me.
>>
>> 5. The issue is the following. If you have a partition with 3 replicas 4,5,6, the leader is on replica 4 and replica 5 is down. Currently, the broker will send a REPLICA_NOT_AVAILABLE error code and only replicas 4,6 in the assigned replicas. It's more intuitive to send no error code and 4,5,6 in the assigned replicas in this case.
>>
>> Thanks,
>>
>> Jun
>>
>> On Mon, Apr 4, 2016 at 8:33 AM, Grant Henke <ghe...@cloudera.com> wrote:
>>
>> > Hi Jun,
>> >
>> > Please see my responses below:
>> >
>> > > Hmm, I am not sure about the listener approach. It ignores configs like enable.topic.deletion and also opens the door for potential ordering issues since now there are two separate paths for propagating the metadata to the brokers.
>> >
>> > This mechanism is very similar to how deletes are tracked on the controller itself. It is also the same way ACLs are tracked on brokers in the default implementation. I am not sure I understand what ordering issue there could be. This is used to report which topics are marked for deletion, which today has no dependency on enable.topic.deletion. I agree that the delete mechanism in Kafka has a lot of room for improvement, but the goal of this change is just to enable reporting it to the user, not to fix/improve existing issues. If you have an alternate approach that does not require major changes to the controller code, I would be open to investigating it.
>> >
>> > > Could we just leave out markedForDeletion for now? In the common case, if a topic is deleted, it will only be in markedForDeletion state for a few milliseconds anyway.
>> >
>> > I don't think we should leave it out. The point of these changes is to prevent a user from needing to talk directly to ZooKeeper, and we need a way for a user to see if a topic has been marked for deletion. Given the issues with the current delete implementation, it's fairly common for a topic to remain marked for deletion for quite some time.
>> >
>> > > Yes, for those usages, it just seems a bit weird for the client to issue a MetadataRequest to get the controller info since it doesn't need any topic metadata.
>> >
>> > Why does this seem weird? The MetadataRequest is the request used to discover the cluster and metadata about that cluster, regardless of the topics you are interested in, if any. In fact, a big motivation for the change to allow requesting "no topics" is that the existing producer and consumer often want to learn about the cluster without asking for topic metadata, and today that means they request all topics.
>> >
>> > > 5. The issue is that for a client, when handling a metadata response, the natural logic is: if there is any error in the response, go to the error handling path (e.g., back off and refresh metadata); otherwise, get the leader info and initiate a request to the leader if the leader is available. If you look at the current logic in MetadataCache.getPartitionMetadata(), if an assigned replica is not alive, we will send a REPLICA_NOT_AVAILABLE error code in the response. If the client follows the above logic, it will keep doing the error handling even though there is nothing wrong with the leader. A better behavior is to simply return the list of replica ids with no error code in this case.
>> >
>> > To be sure I understand this correctly: instead of returning the complete list of replicas, including the ones that errored as unavailable, you are suggesting we drop the unavailable ones and return just the replicas with no errors, and return no error code on the partition. Is that correct?
>> >
>> > Under what scenario does the MetadataCache have a replica that is not available?
>> >
>> > Thanks,
>> > Grant
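(As an aside on the "no topics" request mentioned above, this is roughly the call pattern it enables. The types and method names below are made up for illustration and are not a real client API.)

```java
import java.util.List;

// Hypothetical sketch of what "request no topics" enables: cluster discovery (brokers
// plus controller id) without pulling metadata for every topic.
public class ClusterDiscoverySketch {

    record Broker(int id, String host, int port) {}
    record ClusterMetadata(List<Broker> brokers, int controllerId) {}

    interface MetadataClient {
        // An empty topic list would mean "no topic metadata", unlike version 0 where an
        // empty list means "all topics".
        ClusterMetadata fetchMetadata(List<String> topics);
    }

    static Broker findController(MetadataClient client) {
        ClusterMetadata cluster = client.fetchMetadata(List.of()); // no topics requested
        return cluster.brokers().stream()
                .filter(b -> b.id() == cluster.controllerId())
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("controller not in broker list"));
    }

    public static void main(String[] args) {
        MetadataClient stub = topics -> new ClusterMetadata(
                List.of(new Broker(1, "b1", 9092), new Broker(2, "b2", 9092)), 2);
        // e.g. a rolling-restart tool could restart the controller (broker 2) last.
        System.out.println("Controller: " + findController(stub));
    }
}
```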
>> > On Mon, Apr 4, 2016 at 12:25 AM, Jun Rao <j...@confluent.io> wrote:
>> >
>> > > Grant,
>> > >
>> > > 2. Hmm, I am not sure about the listener approach. It ignores configs like enable.topic.deletion and also opens the door for potential ordering issues since now there are two separate paths for propagating the metadata to the brokers. Could we just leave out markedForDeletion for now? In the common case, if a topic is deleted, it will only be in markedForDeletion state for a few milliseconds anyway.
>> > >
>> > > 3. Yes, for those usages, it just seems a bit weird for the client to issue a MetadataRequest to get the controller info since it doesn't need any topic metadata.
>> > >
>> > > 5. The issue is that for a client, when handling a metadata response, the natural logic is: if there is any error in the response, go to the error handling path (e.g., back off and refresh metadata); otherwise, get the leader info and initiate a request to the leader if the leader is available. If you look at the current logic in MetadataCache.getPartitionMetadata(), if an assigned replica is not alive, we will send a REPLICA_NOT_AVAILABLE error code in the response. If the client follows the above logic, it will keep doing the error handling even though there is nothing wrong with the leader. A better behavior is to simply return the list of replica ids with no error code in this case.
>> > >
>> > > Thanks,
>> > >
>> > > Jun
>> > >
>> > > On Sun, Apr 3, 2016 at 6:29 PM, Grant Henke <ghe...@cloudera.com> wrote:
>> > >
>> > > > Responding to a few of the other comments:
>> > > >
>> > > > > It seems that you propagated the topic deletion marker by having the ReplicaManager read from ZK directly. It seems that it would be simpler/more consistent if the controller propagates that information directly through UpdateMetadataRequest.
>> > > >
>> > > > I was told that I should not try to modify controller logic with the KIP-4 changes. It was indicated that a larger controller rewrite and testing effort was planned, and those changes should be considered then. Since marking a topic for deletion doesn't flow through the controller, and therefore the UpdateMetadataRequest, it would take quite a bit of change. We would need to trigger a new UpdateMetadataRequest every time a topic is marked for deletion.
>> > > >
>> > > > Instead I added a listener to maintain a cache of the topic deletion znodes in the ReplicaManager, where the existing UpdateMetadataRequests are handled. This would make it easy to swap out later once the data is part of that request, and it has minimal impact in the meantime.
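(For anyone curious what that listener/cache looks like in practice, here is a rough standalone sketch using the plain ZooKeeper client and assuming the usual /admin/delete_topics path. The actual patch lives inside the ReplicaManager, so treat this only as an illustration of the shape of the approach.)

```java
import java.util.Collections;
import java.util.Set;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Watch the children of /admin/delete_topics and keep a local view of which topics
// are currently marked for deletion.
public class DeleteMarkerCache implements Watcher {

    private static final String DELETE_TOPICS_PATH = "/admin/delete_topics";

    private final ZooKeeper zk;
    private volatile Set<String> markedForDeletion = Collections.emptySet();

    public DeleteMarkerCache(ZooKeeper zk) throws KeeperException, InterruptedException {
        this.zk = zk;
        refresh(); // initial load; also registers the child watch (path assumed to exist)
    }

    private void refresh() throws KeeperException, InterruptedException {
        // getChildren re-registers this watcher, so notifications keep arriving after each change.
        markedForDeletion = Set.copyOf(zk.getChildren(DELETE_TOPICS_PATH, this));
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
            try {
                refresh();
            } catch (KeeperException e) {
                // a real implementation would log and retry here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Used when building metadata responses to set the markedForDeletion boolean.
    public boolean isMarkedForDeletion(String topic) {
        return markedForDeletion.contains(topic);
    }
}
```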
>> > > > > Could you add a description of how the controller id will be used in the client?
>> > > >
>> > > > I will add it to the wiki. Today metrics are the only way to access this piece of data, and it is useful information about the cluster for many reasons. Having programmatic access to identify the controller is helpful for automation. For example, it can be used during rolling-restart logic to shut down the controller last and prevent multiple failovers. Beyond automation, it can be leveraged in KIP-4 to route admin requests to the controller broker.
>> > > >
>> > > > > We had a weird semantic in version 0 of MetadataRequest. If a replica is not live, but the leader is live, we return a ReplicaNotAvailableException error code in the partition metadata. This makes it a bit confusing for the client to parse, since it has to first check whether the leader is available or not before checking the error code. We were thinking of changing that behavior the next time we bump up the version of MetadataRequest. Now that time has come, could you include that in the proposal?
>> > > >
>> > > > I am not sure I completely follow the issue and the requested change. Could you point me to the discussion?
>> > > >
>> > > > Thank you,
>> > > > Grant
>> > > >
>> > > > On Sun, Apr 3, 2016 at 8:09 PM, Grant Henke <ghe...@cloudera.com> wrote:
>> > > >
>> > > > > Hi Jun and Ismael,
>> > > > >
>> > > > > Initially I had 2 booleans used to indicate if a topic was internal and if a topic was marked for deletion. To save space on large deployments, Ismael suggested I break out the internal topics and deleted topics into their own lists. The idea was that instead of 2 bytes added per topic, in the general case the lists would be empty. Even in those lists I still only return topics that were requested; in fact, on the client side they are just used to translate back to booleans. I do prefer the booleans from an expressiveness standpoint but was not strongly opinionated on the structure.
>> > > > >
>> > > > > Thank you,
>> > > > > Grant
>> > > > >
>> > > > > On Sun, Apr 3, 2016 at 8:00 PM, Ismael Juma <ism...@juma.me.uk> wrote:
>> > > > >
>> > > > >> Hi Jun,
>> > > > >>
>> > > > >> A couple of comments inline.
>> > > > >>
>> > > > >> On Sun, Apr 3, 2016 at 5:55 PM, Jun Rao <j...@confluent.io> wrote:
>> > > > >>
>> > > > >> > 1. It seems a bit weird to return just a list of internal topics without the corresponding metadata. It also seems a bit weird to return the internal topics even if the client doesn't ask for them.
>> > > > >>
>> > > > >> Good point.
>> > > > >>
>> > > > >> > Would it be better to just add a flag in topic_metadata to indicate whether it's an internal topic or not, and only include the internal topics when they are asked for (or when all topics are requested)?
>> > > > >>
>> > > > >> The disadvantage of this is that we are adding one byte per topic even though we have a very small number of internal topics (currently a single internal topic). It seems a bit wasteful, particularly when using regex subscriptions (since we have to retrieve all topics in that case).
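(To put rough numbers on the size concern above: at two one-byte flags per topic, a cluster with, say, 10,000 topics adds about 20 KB to a full-metadata response. With the new option to request no topics, clients that only need cluster information avoid even that.)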
>> > > > >> > 2. A similar comment on topics_marked_for_deletion. Would it be better to only return them when asked for and just return a new TopicDeleted error code in topic_metadata?
>> > > > >>
>> > > > >> I agree that this seems better.
>> > > > >>
>> > > > >> Ismael

--
Grant Henke
Software Engineer | Cloudera
gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke