After the discussion today about the clarity and flexibility of the flags vs. lists for internal topics and deleted topics, I think I will switch back to using booleans inside the topic metadata. This is a clearer representation of the intent, adds little overhead (especially since users can now request no topics), and is simpler and more flexible to evolve if we ever decide we need to represent more "states".
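For concreteness, here is roughly the shape I have in mind for the per-topic entry. This is a quick sketch only; these are not the real protocol classes, and the field names are illustrative rather than final.

```java
import java.util.List;

// Illustrative only; not the actual Kafka protocol classes. It just shows the shape of
// a per-topic metadata entry carrying the two booleans instead of separate lists.
public class TopicMetadataSketch {

    record TopicMetadata(String topic, short errorCode,
                         boolean isInternal, boolean markedForDeletion) {}

    public static void main(String[] args) {
        List<TopicMetadata> topics = List.of(
                new TopicMetadata("__consumer_offsets", (short) 0, true, false),
                new TopicMetadata("clickstream", (short) 0, false, true));

        // Clients read the flags directly instead of cross-referencing separate lists.
        for (TopicMetadata t : topics) {
            System.out.printf("%s internal=%b markedForDeletion=%b%n",
                    t.topic(), t.isInternal(), t.markedForDeletion());
        }
    }
}
```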
I will update the patch in the next day or so and post a vote if no issues are raised with that.

Thanks,
Grant

On Tue, Apr 5, 2016 at 4:05 PM, Grant Henke <ghe...@cloudera.com> wrote:

> Hi Jun,
>
> See my responses below:
>
>> 2. The issues that I was thinking of are the following. (a) Say the controller has topic deletion disabled and a topic deletion request is submitted to ZK. In this case, the controller will ignore this request. However, the broker may pick up the topic deletion marker in a transient window. (b) Suppose that a topic is deleted and then recreated immediately. It is possible for a broker to see the newly created topic and then the previous topic deletion marker in a transient window. Thinking about this a bit more, both seem to be transient, so it may not be a big concern. So, I am ok with this as long as the interim solution is not too complicated. Another thing to think through: if a topic is marked for deletion, do we still return the partition-level metadata?
>
> I am not changing anything about the metadata content, only adding a boolean based on the marked-for-deletion flag in ZooKeeper. This maintains the same approach that the topics script uses today. I do think delete improvements should be considered/reviewed, but the goal here is to allow the broker to report the value that it sees, which is the value in ZooKeeper.
>
>> 5. The issue is the following. If you have a partition with 3 replicas 4,5,6, the leader is on replica 4 and replica 5 is down. Currently, the broker will send a REPLICA_NOT_AVAILABLE error code and only replicas 4,6 in the assigned replicas. It's more intuitive to send no error code and 4,5,6 in the assigned replicas in this case.
>
> Should the list with no error code just be 4,6, since 5 is not available?
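(To make the replica example above concrete, here is the difference between the two behaviors as I read them. This is a standalone sketch with made-up names, not the actual MetadataCache code.)

```java
import java.util.List;
import java.util.Set;

// Sketch of the two behaviors being discussed for partition metadata when an assigned
// replica is down but the leader is fine. Names are made up for illustration.
public class ReplicaReportingSketch {

    static final short NONE = 0;
    static final short REPLICA_NOT_AVAILABLE = 9; // Kafka's error code for this case

    record PartitionMetadata(short errorCode, int leader, List<Integer> replicas) {}

    // Current (version 0) style: flag the partition and drop the dead replica.
    static PartitionMetadata currentBehavior(List<Integer> assigned, int leader, Set<Integer> live) {
        List<Integer> liveReplicas = assigned.stream().filter(live::contains).toList();
        short error = liveReplicas.size() < assigned.size() ? REPLICA_NOT_AVAILABLE : NONE;
        return new PartitionMetadata(error, leader, liveReplicas);
    }

    // Suggested style: if the leader itself is live, report no error and the full assignment.
    static PartitionMetadata suggestedBehavior(List<Integer> assigned, int leader, Set<Integer> live) {
        return new PartitionMetadata(NONE, leader, assigned);
    }

    public static void main(String[] args) {
        List<Integer> assigned = List.of(4, 5, 6);
        Set<Integer> live = Set.of(4, 6); // broker 5 is down, leader is 4
        System.out.println(currentBehavior(assigned, 4, live));   // error code 9, replicas [4, 6]
        System.out.println(suggestedBehavior(assigned, 4, live)); // no error, replicas [4, 5, 6]
    }
}
```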
> On Mon, Apr 4, 2016 at 1:34 PM, Jun Rao <j...@confluent.io> wrote:
>
>> Grant,
>>
>> 2. The issues that I was thinking of are the following. (a) Say the controller has topic deletion disabled and a topic deletion request is submitted to ZK. In this case, the controller will ignore this request. However, the broker may pick up the topic deletion marker in a transient window. (b) Suppose that a topic is deleted and then recreated immediately. It is possible for a broker to see the newly created topic and then the previous topic deletion marker in a transient window. Thinking about this a bit more, both seem to be transient, so it may not be a big concern. So, I am ok with this as long as the interim solution is not too complicated. Another thing to think through: if a topic is marked for deletion, do we still return the partition-level metadata?
>>
>> 3. Your explanation on controller id seems reasonable to me.
>>
>> 5. The issue is the following. If you have a partition with 3 replicas 4,5,6, the leader is on replica 4 and replica 5 is down. Currently, the broker will send a REPLICA_NOT_AVAILABLE error code and only replicas 4,6 in the assigned replicas. It's more intuitive to send no error code and 4,5,6 in the assigned replicas in this case.
>>
>> Thanks,
>>
>> Jun
>>
>> On Mon, Apr 4, 2016 at 8:33 AM, Grant Henke <ghe...@cloudera.com> wrote:
>>
>> > Hi Jun,
>> >
>> > Please see my responses below:
>> >
>> > > Hmm, I am not sure about the listener approach. It ignores configs like enable.topic.deletion and also opens the door for potential ordering issues since now there are two separate paths for propagating the metadata to the brokers.
>> >
>> > This mechanism is very similar to how deletes are tracked on the controller itself. It is also the same way ACLs are tracked on brokers in the default implementation. I am not sure I understand what ordering issue there could be. This is used to report which topics are marked for deletion, which today has no dependency on enable.topic.deletion. I agree that the delete mechanism in Kafka has a lot of room for improvement, but the goal of this change is just to enable reporting it to the user, not to fix/improve existing issues. If you have an alternate approach that does not require major changes to the controller code, I would be open to investigating it.
>> >
>> > > Could we just leave out markedForDeletion for now? In the common case, if a topic is deleted, it will only be in markedForDeletion state for a few milliseconds anyway.
>> >
>> > I don't think we should leave it out. The point of these changes is to prevent a user from needing to talk directly to ZooKeeper, and we need a way for a user to see if a topic has been marked for deletion. Given the issues with the current delete implementation, it's fairly common for a topic to remain marked for deletion for quite some time.
>> >
>> > > Yes, for those usages, it just seems a bit weird for the client to issue a MetadataRequest to get the controller info since it doesn't need any topic metadata.
>> >
>> > Why does this seem weird? The MetadataRequest is the request used to discover the cluster and metadata about that cluster, regardless of the topics you are interested in, if any. In fact, a big motivation for the change to allow requesting "no topics" is that the existing producer and consumer often want to learn about the cluster without asking for topic metadata, and today that means they request all topics.
>> >
>> > > 5. The issue is that for a client, when handling a metadata response, the natural logic is: if there is any error in the response, go to the error handling path (e.g., back off and refresh metadata); otherwise, get the leader info and initiate a request to the leader if the leader is available. If you look at the current logic in MetadataCache.getPartitionMetadata(), if an assigned replica is not alive, we will send a REPLICA_NOT_AVAILABLE error code in the response. If the client follows the above logic, it will keep doing the error handling even though there is nothing wrong with the leader. A better behavior is to simply return the list of replica ids with no error code in this case.
>> >
>> > To be sure I understand this correctly: instead of returning the complete list of replicas, including the ones that errored as unavailable, you are suggesting we drop the unavailable ones and return just the replicas with no errors, and return no error code on the partition. Is that correct?
>> >
>> > Under what scenario does the MetadataCache have a replica that is not available?
>> >
>> > Thanks,
>> > Grant
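(As an aside on the "no topics" request mentioned above, this is roughly the call pattern it enables. The types and method names below are made up for illustration and are not a real client API.)

```java
import java.util.List;

// Hypothetical sketch of what "request no topics" enables: cluster discovery (brokers
// plus controller id) without pulling metadata for every topic.
public class ClusterDiscoverySketch {

    record Broker(int id, String host, int port) {}
    record ClusterMetadata(List<Broker> brokers, int controllerId) {}

    interface MetadataClient {
        // An empty topic list would mean "no topic metadata", unlike version 0 where an
        // empty list means "all topics".
        ClusterMetadata fetchMetadata(List<String> topics);
    }

    static Broker findController(MetadataClient client) {
        ClusterMetadata cluster = client.fetchMetadata(List.of()); // no topics requested
        return cluster.brokers().stream()
                .filter(b -> b.id() == cluster.controllerId())
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("controller not in broker list"));
    }

    public static void main(String[] args) {
        MetadataClient stub = topics -> new ClusterMetadata(
                List.of(new Broker(1, "b1", 9092), new Broker(2, "b2", 9092)), 2);
        // e.g. a rolling-restart tool could restart the controller (broker 2) last.
        System.out.println("Controller: " + findController(stub));
    }
}
```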
>> > On Mon, Apr 4, 2016 at 12:25 AM, Jun Rao <j...@confluent.io> wrote:
>> >
>> > > Grant,
>> > >
>> > > 2. Hmm, I am not sure about the listener approach. It ignores configs like enable.topic.deletion and also opens the door for potential ordering issues since now there are two separate paths for propagating the metadata to the brokers. Could we just leave out markedForDeletion for now? In the common case, if a topic is deleted, it will only be in markedForDeletion state for a few milliseconds anyway.
>> > >
>> > > 3. Yes, for those usages, it just seems a bit weird for the client to issue a MetadataRequest to get the controller info since it doesn't need any topic metadata.
>> > >
>> > > 5. The issue is that for a client, when handling a metadata response, the natural logic is: if there is any error in the response, go to the error handling path (e.g., back off and refresh metadata); otherwise, get the leader info and initiate a request to the leader if the leader is available. If you look at the current logic in MetadataCache.getPartitionMetadata(), if an assigned replica is not alive, we will send a REPLICA_NOT_AVAILABLE error code in the response. If the client follows the above logic, it will keep doing the error handling even though there is nothing wrong with the leader. A better behavior is to simply return the list of replica ids with no error code in this case.
>> > >
>> > > Thanks,
>> > >
>> > > Jun
>> > >
>> > > On Sun, Apr 3, 2016 at 6:29 PM, Grant Henke <ghe...@cloudera.com> wrote:
>> > >
>> > > > Responding to a few of the other comments:
>> > > >
>> > > > > It seems that you propagated the topic deletion marker by having the ReplicaManager read from ZK directly. It seems that it would be simpler/more consistent if the controller propagates that information directly through UpdateMetadataRequest.
>> > > >
>> > > > I was told that I should not try to modify controller logic with the KIP-4 changes. It was indicated that a larger controller rewrite and testing effort was planned, and those changes should be considered then. Since marking a topic for deletion doesn't flow through the controller, and therefore the UpdateMetadataRequest, it would take quite a bit of change. We would need to trigger a new UpdateMetadataRequest every time a topic is marked for deletion.
>> > > >
>> > > > Instead I added a listener to maintain a cache of the topic deletion znodes in the ReplicaManager, where the existing UpdateMetadataRequests are handled. This would make it easy to swap out later once the data is part of that request, and it has minimal impact in the meantime.
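(For anyone curious what that listener/cache looks like in practice, here is a rough standalone sketch using the plain ZooKeeper client and assuming the usual /admin/delete_topics path. The actual patch lives inside the ReplicaManager, so treat this only as an illustration of the shape of the approach.)

```java
import java.util.Collections;
import java.util.Set;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Watch the children of /admin/delete_topics and keep a local view of which topics
// are currently marked for deletion.
public class DeleteMarkerCache implements Watcher {

    private static final String DELETE_TOPICS_PATH = "/admin/delete_topics";

    private final ZooKeeper zk;
    private volatile Set<String> markedForDeletion = Collections.emptySet();

    public DeleteMarkerCache(ZooKeeper zk) throws KeeperException, InterruptedException {
        this.zk = zk;
        refresh(); // initial load; also registers the child watch (path assumed to exist)
    }

    private void refresh() throws KeeperException, InterruptedException {
        // getChildren re-registers this watcher, so notifications keep arriving after each change.
        markedForDeletion = Set.copyOf(zk.getChildren(DELETE_TOPICS_PATH, this));
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
            try {
                refresh();
            } catch (KeeperException e) {
                // a real implementation would log and retry here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Used when building metadata responses to set the markedForDeletion boolean.
    public boolean isMarkedForDeletion(String topic) {
        return markedForDeletion.contains(topic);
    }
}
```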
>> > > > > Could you add a description of how the controller id will be used in the client?
>> > > >
>> > > > I will add it to the wiki. Today metrics are the only way to access this piece of data, and it is useful information about the cluster for many reasons. Having programmatic access to identify the controller is helpful for automation. For example, it can be used during rolling-restart logic to shut down the controller last and prevent multiple failovers. Beyond automation, it can be leveraged in KIP-4 to route admin requests to the controller broker.
>> > > >
>> > > > > We had a weird semantic in version 0 of MetadataRequest. If a replica is not live, but the leader is live, we return a ReplicaNotAvailableException error code in the partition metadata. This makes it a bit confusing for the client to parse, since it has to first check whether the leader is available or not before checking the error code. We were thinking of changing that behavior the next time we bump up the version of MetadataRequest. Now that time has come, could you include that in the proposal?
>> > > >
>> > > > I am not sure I completely follow the issue and the requested change. Could you point me to the discussion?
>> > > >
>> > > > Thank you,
>> > > > Grant
>> > > >
>> > > > On Sun, Apr 3, 2016 at 8:09 PM, Grant Henke <ghe...@cloudera.com> wrote:
>> > > >
>> > > > > Hi Jun and Ismael,
>> > > > >
>> > > > > Initially I had 2 booleans used to indicate if a topic was internal and if a topic was marked for deletion. To save space on large deployments, Ismael suggested I break out the internal topics and deleted topics into their own lists. The idea was that instead of 2 bytes added per topic, in the general case the lists would be empty. Even in those lists I still only return topics that were requested; in fact, on the client side they are just used to translate back to booleans. I do prefer the booleans from an expressiveness standpoint but was not strongly opinionated on the structure.
>> > > > >
>> > > > > Thank you,
>> > > > > Grant
>> > > > >
>> > > > > On Sun, Apr 3, 2016 at 8:00 PM, Ismael Juma <ism...@juma.me.uk> wrote:
>> > > > >
>> > > > >> Hi Jun,
>> > > > >>
>> > > > >> A couple of comments inline.
>> > > > >>
>> > > > >> On Sun, Apr 3, 2016 at 5:55 PM, Jun Rao <j...@confluent.io> wrote:
>> > > > >>
>> > > > >> > 1. It seems a bit weird to return just a list of internal topics without the corresponding metadata. It also seems a bit weird to return the internal topics even if the client doesn't ask for them.
>> > > > >>
>> > > > >> Good point.
>> > > > >>
>> > > > >> > Would it be better to just add a flag in topic_metadata to indicate whether it's an internal topic or not, and only include the internal topics when they are asked for (or when all topics are requested)?
>> > > > >>
>> > > > >> The disadvantage of this is that we are adding one byte per topic even though we have a very small number of internal topics (currently a single internal topic). It seems a bit wasteful, particularly when using regex subscriptions (since we have to retrieve all topics in that case).
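(To put rough numbers on the size concern above: at two one-byte flags per topic, a cluster with, say, 10,000 topics adds about 20 KB to a full-metadata response. With the new option to request no topics, clients that only need cluster information avoid even that.)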
>> > > > >> > 2. A similar comment on topics_marked_for_deletion. Would it be better to only return them when asked for and just return a new TopicDeleted error code in topic_metadata?
>> > > > >>
>> > > > >> I agree that this seems better.
>> > > > >>
>> > > > >> Ismael

--
Grant Henke
Software Engineer | Cloudera
gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke