Re: [VOTE] KIP-4 Metadata Schema

Grant Henke Thu, 07 Apr 2016 08:38:28 -0700

I am testing the marked for deletion flag in the metadata and ran into some
challenges.


It turns out that as soon as a topic is marked for deletion it may be
purged from the metadata cache. This means that Metadata responses
can't/don't return the topic. Though the topic may still exist if its not
ready to be completely deleted or is in the process of being deleted.

This poses a challenge because a user would have no way to tell if a topic
still exists, and is marked for deletion, other than to try and recreate it
and see a failure. I could change the logic to no longer purge a message
from the cache until its completely deleted, but I am not sure if that
would impact the clients in some way negatively. Does anyone have enough
background to say?

I will dig into this a bit more today, but wanted to throw this out there
for some early feedback.

Thank you,
Grant


On Tue, Apr 5, 2016 at 8:02 PM, Jun Rao <[email protected]> wrote:

> 5. You will return no error and 4,5,6 as replicas. The response also
> includes a list of live brokers. So the client can figure out 5 is not live
> directly w/o relying on the error code.
>
> Thanks,
>
> Jun
>
> On Tue, Apr 5, 2016 at 5:05 PM, Grant Henke <[email protected]> wrote:
>
> > Hi Jun,
> >
> > See my responses below:
> >
> > 2. The issues that I was thinking are the following. (a) Say the
> controller
> > > has topic deletion disabled and a topic deletion request is submitted
> to
> > > ZK. In this case, the controller will ignore this request. However, the
> > > broker may pick up the topic deletion marker in a transient window. (b)
> > > Suppose that a topic is deleted and then recreated immediately. It is
> > > possible for a broker to see the newly created topic and then the
> > previous
> > > topic deletion marker in a transient window. Thinking about this a bit
> > > more. Both seem to be transient. So, it may not be a big concern. So, I
> > am
> > > ok with this as long as the interim solution is not too complicated.
> > > Another thing to think through. If a topic is marked for deletion, do
> we
> > > still return the partition level metadata?
> >
> >
> > I am not changing anything about the metadata content, only adding a
> > boolean based on the marked for deletion flag in zookeeper. This is
> > maintaining the same method that the topics script does today. I do think
> > delete improvements should be considered/reviewed. The goal here is to
> > allow the broker to report the value that its sees, which is the value in
> > zookeeper.
> >
> > 5. The issue is the following. If you have a partition with 3 replicas
> > > 4,5,6, leader is on replica 4 and replica 5 is down. Currently, the
> > broker
> > > will send a REPLICA_NOT_AVAILABLE error code and only replicas 4,6 in
> the
> > > assigned replicas. It's more intuitive to send no error code and 4,5,6
> in
> > > the assigned replicas in this case.
> >
> >
> > Should the list with no error code just be 4,6 since 5 is not available?
> >
> >
> > On Mon, Apr 4, 2016 at 1:34 PM, Jun Rao <[email protected]> wrote:
> >
> > > Grant,
> > >
> > > 2. The issues that I was thinking are the following. (a) Say the
> > controller
> > > has topic deletion disabled and a topic deletion request is submitted
> to
> > > ZK. In this case, the controller will ignore this request. However, the
> > > broker may pick up the topic deletion marker in a transient window. (b)
> > > Suppose that a topic is deleted and then recreated immediately. It is
> > > possible for a broker to see the newly created topic and then the
> > previous
> > > topic deletion marker in a transient window. Thinking about this a bit
> > > more. Both seem to be transient. So, it may not be a big concern. So, I
> > am
> > > ok with this as long as the interim solution is not too complicated.
> > > Another thing to think through. If a topic is marked for deletion, do
> we
> > > still return the partition level metadata?
> > >
> > > 3. Your explanation on controller id seems reasonable to me.
> > >
> > > 5. The issue is the following. If you have a partition with 3 replicas
> > > 4,5,6, leader is on replica 4 and replica 5 is down. Currently, the
> > broker
> > > will send a REPLICA_NOT_AVAILABLE error code and only replicas 4,6 in
> the
> > > assigned replicas. It's more intuitive to send no error code and 4,5,6
> in
> > > the assigned replicas in this case.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Mon, Apr 4, 2016 at 8:33 AM, Grant Henke <[email protected]>
> wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > Please See my responses below:
> > > >
> > > > Hmm, I am not sure about the listener approach. It ignores configs
> like
> > > > > enable.topic.deletion and also opens the door for potential
> ordering
> > > > issues
> > > > > since now there are two separate paths for propagating the metadata
> > to
> > > > the
> > > > > brokers.
> > > >
> > > >
> > > > This mechanism is very similar to how deletes are tracked on the
> > > controller
> > > > itself. It is also the same way ACLs are tracked on brokers in the
> > > default
> > > > implementation. I am not sure I understand what ordering issue there
> > > could
> > > > be. This is used to report what topics are marked for deletion, which
> > > today
> > > > has no dependency on enable.topic.deletion. I agree that the delete
> > > > mechanism in Kafka has a lot of room for improvement, but the goal in
> > > this
> > > > change is just to enable reporting it to the user, not to fix/improve
> > > > existing issues. If you have an alternate approach that does not
> > require
> > > > major changes to the controller code, I would be open to investigate
> > it.
> > > >
> > > > Could we just leave out markedForDeletion for now? In the common
> > > > > case, if a topic is deleted, it will only be in markedForDeletion
> > state
> > > > for
> > > > > a few milli seconds anyway.
> > > >
> > > >
> > > > I don't think we should leave it out. The point of these changes is
> to
> > > > prevent a user from needing to talk directly to zookeeper. We need a
> > way
> > > > for a user to see if a topic has been marked for deletion. Given the
> > > issues
> > > > with the current delete implementation, its fairly common for a topic
> > to
> > > > remain marked as deleted for quite some time.
> > > >
> > > > Yes, for those usage, it just seems it's a bit weird for the client
> to
> > > > > issue a MetadataRequest to get the controller info since it doesn't
> > > need
> > > > > any topic metadata.
> > > >
> > > >
> > > > Why does this seam weird? The MetadataRequest is the request used to
> > > > discover the cluster and metadata about that cluster regardless of
> the
> > > > topics you are interested in, if any. In fact, a big motivation for
> the
> > > > change to allow requesting "no topics" is because the existing
> producer
> > > and
> > > > consumer often want to learn about the cluster without asking for
> topic
> > > > metadata and today that means that they request all topics.
> > > >
> > > > 5. The issue is that for a client, when handling a metadata response,
> > the
> > > > > natural logic is if there is any error in the response, go to the
> > error
> > > > > handling path (e.g., back off and refresh metadata). Otherwise, get
> > the
> > > > > leader info and initiate a request to the leader if leader is
> > > available.
> > > > If
> > > > > you look at the current logic in
> > MetadataCache.getPartitionMetadata(),
> > > if
> > > > > an assigned replica is not alive, we will send a
> > REPLICA_NOT_AVAILABLE
> > > > > error code in the response. If the client follows the above logic,
> it
> > > > will
> > > > > keep doing the error handling even though there is nothing wrong
> with
> > > the
> > > > > leader. A better behavior is to simply return the list of replica
> ids
> > > > with
> > > > > no error code in this case.
> > > >
> > > >
> > > > To be sure I understand this correctly. Instead of returning the
> > complete
> > > > list of replicas, including the ones that errored as unavailable. You
> > are
> > > > suggesting to drop the unavailable ones and return just the replicas
> > with
> > > > no-errors and return no error code on the partition. Is that correct?
> > > >
> > > > Under what scenario does the MetadataCache have a replica that is not
> > > > available?
> > > >
> > > > Thanks,
> > > > Grant
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Mon, Apr 4, 2016 at 12:25 AM, Jun Rao <[email protected]> wrote:
> > > >
> > > > > Grant,
> > > > >
> > > > > 2. Hmm, I am not sure about the listener approach. It ignores
> configs
> > > > like
> > > > > enable.topic.deletion and also opens the door for potential
> ordering
> > > > issues
> > > > > since now there are two separate paths for propagating the metadata
> > to
> > > > the
> > > > > brokers. Could we just leave out markedForDeletion for now? In the
> > > common
> > > > > case, if a topic is deleted, it will only be in markedForDeletion
> > state
> > > > for
> > > > > a few milli seconds anyway.
> > > > >
> > > > > 3. Yes, for those usage, it just seems it's a bit weird for the
> > client
> > > to
> > > > > issue a MetadataRequest to get the controller info since it doesn't
> > > need
> > > > > any topic metadata.
> > > > >
> > > > > 5. The issue is that for a client, when handling a metadata
> response,
> > > the
> > > > > natural logic is if there is any error in the response, go to the
> > error
> > > > > handling path (e.g., back off and refresh metadata). Otherwise, get
> > the
> > > > > leader info and initiate a request to the leader if leader is
> > > available.
> > > > If
> > > > > you look at the current logic in
> > MetadataCache.getPartitionMetadata(),
> > > if
> > > > > an assigned replica is not alive, we will send a
> > REPLICA_NOT_AVAILABLE
> > > > > error code in the response. If the client follows the above logic,
> it
> > > > will
> > > > > keep doing the error handling even though there is nothing wrong
> with
> > > the
> > > > > leader. A better behavior is to simply return the list of replica
> ids
> > > > with
> > > > > no error code in this case.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > >
> > > > >
> > > > > On Sun, Apr 3, 2016 at 6:29 PM, Grant Henke <[email protected]>
> > > wrote:
> > > > >
> > > > > > Responding to a few of the other comments:
> > > > > >
> > > > > > it seems that you propagated
> > > > > > > the topic deletion marker by having the replicaManager read
> from
> > ZK
> > > > > > > directly. It seems that it would be simpler/consistent if the
> > > > > controller
> > > > > > > propagates that information directly through UpdateMetaRequest.
> > > > > >
> > > > > >
> > > > > > I was told that I should not try and modify controller logic with
> > > KIP-4
> > > > > > changes. It was indicated that a larger controller rewrite and
> > > testing
> > > > > was
> > > > > > planned and those changes should be considered then. Since
> marking
> > a
> > > > > topic
> > > > > > for deletion doesn't flow through the controller and therefore
> the
> > > > > > UpdateMetadataRequest,
> > > > > > it would take quite a bit of change. We would need to trigger a
> new
> > > > > > UpdateMetadataRequest
> > > > > > every time a new topic is marked for deletion.
> > > > > >
> > > > > > Instead I added a listener to maintain a cache of the topic
> > deletion
> > > > > znodes
> > > > > > in the ReplicaManager where the existing UpdateMetadataRequests
> are
> > > > > > handled. This would make it easy to swap out later once the data
> > is a
> > > > > part
> > > > > > of that request and have minimal impact in the mean time.
> > > > > >
> > > > > >
> > > > > > > Could you add a description on how controller id will be used
> in
> > > the
> > > > > > > client?
> > > > > >
> > > > > >
> > > > > > I will add it to the wiki. Today metrics are the only way to
> access
> > > > this
> > > > > > piece of data. It is useful information about the cluster for
> many
> > > > > reasons.
> > > > > > Having programatic access to identify the controller is helpful
> for
> > > > > > automation. For example, It can be used during rolling restart
> > logic
> > > to
> > > > > > shutdown the controller last to prevent multiple fail overs.
> Beyond
> > > > > > automation, it can be leveraged in KIP-4 to route admin requests
> to
> > > the
> > > > > > controller broker.
> > > > > >
> > > > > > We had a weird semantic in version 0 of MetadataRequest. If a
> > replica
> > > > is
> > > > > > > not live, but the leader is live, we return an
> > > > > > > error ReplicaNotAvailableException in the partition metadata.
> > This
> > > > > makes
> > > > > > it
> > > > > > > a bit confusing for the client to parse since it has to first
> > check
> > > > > > whether
> > > > > > > leader is available or not before error code checking. We were
> > > > thinking
> > > > > > of
> > > > > > > changing that behavior the next time we bump up the version of
> > > > > > > MetadataRequest.
> > > > > > > Now that time has come, could you include that in the proposal?
> > > > > >
> > > > > >
> > > > > > I am not sure I completely follow the issue and requested change.
> > > Could
> > > > > you
> > > > > > point me to the discussion?
> > > > > >
> > > > > > Thank you,
> > > > > > Grant
> > > > > >
> > > > > > On Sun, Apr 3, 2016 at 8:09 PM, Grant Henke <[email protected]
> >
> > > > wrote:
> > > > > >
> > > > > > > Hi Jun and Ismael,
> > > > > > >
> > > > > > > Initially I had 2 booleans used to indicate if a topic was
> > internal
> > > > and
> > > > > > if
> > > > > > > a topic was marked for deletion. To save space on large
> > > deployments,
> > > > > > Ismael
> > > > > > > suggested I break out the internal topics and deleted topics
> into
> > > > their
> > > > > > own
> > > > > > > lists. The idea was that instead of 2 bytes added per topic, in
> > the
> > > > > > general
> > > > > > > case the lists would be empty. Even in those lists I still only
> > > > return
> > > > > > > topics that were requested. In fact on the client side they are
> > > just
> > > > > > > utilized to translate back to booleans. I do prefer the
> booleans
> > > from
> > > > > an
> > > > > > > expressiveness standpoint but was not strongly opinionated on
> the
> > > > > > > structure.
> > > > > > >
> > > > > > > Thank you,
> > > > > > > Grant
> > > > > > >
> > > > > > > On Sun, Apr 3, 2016 at 8:00 PM, Ismael Juma <[email protected]
> >
> > > > wrote:
> > > > > > >
> > > > > > >> Hi Jun,
> > > > > > >>
> > > > > > >> A couple of comments inline.
> > > > > > >>
> > > > > > >> On Sun, Apr 3, 2016 at 5:55 PM, Jun Rao <[email protected]>
> > wrote:
> > > > > > >>
> > > > > > >> > 1. It seems a bit weird to return just a list of internal
> > topics
> > > > w/o
> > > > > > the
> > > > > > >> > corresponding metadata. It also seems a bit weird to return
> > the
> > > > > > internal
> > > > > > >> > topics even if the client doesn't ask for it.
> > > > > > >>
> > > > > > >>
> > > > > > >> Good point.
> > > > > > >>
> > > > > > >>
> > > > > > >> > Would it be better to just
> > > > > > >> > add a flag in topic_metadata to indicate whether it's an
> > > internal
> > > > > > topic
> > > > > > >> or
> > > > > > >> > not, and only include the internal topics when thy are asked
> > (or
> > > > all
> > > > > > >> topics
> > > > > > >> > are requested) for?
> > > > > > >> >
> > > > > > >>
> > > > > > >> The disadvantage of this is that we are adding one byte per
> > topic
> > > > even
> > > > > > >> though we have a very small number of internal topics
> > (currently a
> > > > > > single
> > > > > > >> internal topic). It seems a bit wasteful and particularly so
> > when
> > > > > using
> > > > > > >> regex subscriptions (since we have to retrieve all topics in
> > that
> > > > > case).
> > > > > > >>
> > > > > > >> 2. A similar comment on topics_marked_for_deletion. Would it
> be
> > > > better
> > > > > > to
> > > > > > >> > only return them when asked for and just return a new
> > > TopicDeleted
> > > > > > error
> > > > > > >> > code in topic_metadata?
> > > > > > >>
> > > > > > >>
> > > > > > >> I agree that this seems better.
> > > > > > >>
> > > > > > >> Ismael
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Grant Henke
> > > > > > > Software Engineer | Cloudera
> > > > > > > [email protected] | twitter.com/gchenke |
> > > > linkedin.com/in/granthenke
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Grant Henke
> > > > > > Software Engineer | Cloudera
> > > > > > [email protected] | twitter.com/gchenke |
> > > linkedin.com/in/granthenke
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Grant Henke
> > > > Software Engineer | Cloudera
> > > > [email protected] | twitter.com/gchenke |
> linkedin.com/in/granthenke
> > > >
> > >
> >
> >
> >
> > --
> > Grant Henke
> > Software Engineer | Cloudera
> > [email protected] | twitter.com/gchenke | linkedin.com/in/granthenke
> >
>



-- 
Grant Henke
Software Engineer | Cloudera
[email protected] | twitter.com/gchenke | linkedin.com/in/granthenke

Re: [VOTE] KIP-4 Metadata Schema

Reply via email to