I agree. Also, is this behavior a good one? It seems kind of hacky to return both an error code and a result, no?

-Jay
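[Editorial aside: to make the "error code plus a result" shape concrete, here is a minimal sketch of how a client might parse a partition entry from a TopicMetadataResponse and treat ReplicaNotAvailable as non-fatal when a leader is still present. The `PartitionMetadata`-style object and helper are hypothetical, not a real kafka-python API; the numeric error codes match the protocol guide.]

    # Hedged sketch: 'partition' is a hypothetical parsed partition-metadata
    # object with .error_code, .leader, and .partition_id fields. It is not a
    # real kafka-python class. Error code values are from the protocol guide.
    NO_ERROR = 0
    LEADER_NOT_AVAILABLE = 5
    REPLICA_NOT_AVAILABLE = 9

    class MetadataUnavailableError(Exception):
        pass

    def usable_leader(partition):
        """Return the leader broker id, tolerating the 'error code plus
        result' shape of the metadata response."""
        if partition.error_code in (NO_ERROR, REPLICA_NOT_AVAILABLE):
            # ReplicaNotAvailable only means some replica is offline; the
            # leader field is still populated, so produce/fetch can continue.
            return partition.leader
        if partition.error_code == LEADER_NOT_AVAILABLE:
            # No leader yet (e.g. mid-election): back off and refresh metadata.
            raise MetadataUnavailableError(
                "no leader for partition %d yet" % partition.partition_id)
        raise MetadataUnavailableError(
            "unexpected metadata error code %d" % partition.error_code)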
On Wed, Jan 14, 2015 at 6:35 PM, Dana Powers <dana.pow...@rd.io> wrote:

> Thanks -- I see that this was more of a bug in 0.8.1 than a regression in 0.8.2. But I do think the 0.8.2 bug fix to the metadata cache means that the very common scenario of a single broker failure (and subsequent partition leadership change) will now return error codes in the MetadataResponse -- different from 0.8.1 -- and those errors may cause pain to some users if the client doesn't know how to handle them. The "fix" for users is to upgrade client code (or verify that existing client code handles this well) before upgrading to 0.8.2 in a production environment.
>
> What would be really useful for the non-java community is a list or specification of what error codes should be expected for each API response (here the MetadataResponse), along with perhaps even some context-related notes on what they mean. As it stands, the protocol document leaves all of the ErrorCode documentation until the end and doesn't give any context about *when* to handle each error:
>
> https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-ErrorCodes
>
> I would volunteer to go into the wiki and help with that effort, but I also feel that protocol document changes deserve a stricter review process -- maybe the KIP process mentioned separately on the dev-list. Maybe the protocol document itself should be versioned and released with core.
>
> Nonetheless, right now the error-handling part of managing clients is fairly ad-hoc, and I think we should work to tighten that process up.
>
> Dana Powers
> Rdio, Inc.
> dana.pow...@rd.io
> rdio.com/people/dpkp/
>
>
> On Wed, Jan 14, 2015 at 5:43 PM, Jun Rao <j...@confluent.io> wrote:
>
> > Hi, Dana,
> >
> > Thanks for reporting this. I investigated this a bit more. What you observed is the following: a client gets a partition-level error code of ReplicaNotAvailableError in a TopicMetadataResponse when one of the replicas is offline. The short story is that this behavior can already happen in 0.8.1, although the probability of it showing up in 0.8.1 is lower than in 0.8.2.
> >
> > Currently, when sending a topic metadata response, the broker only includes replicas (in either the isr or the assigned replica set) that are alive. To indicate that a replica is missing, we set the partition-level error code to ReplicaNotAvailableError. In most cases, the client probably just cares about the leader in the response. However, this error code could be useful for some other clients (e.g., for building admin tools). Since our java/scala producer/consumer clients (both 0.8.1 and 0.8.2) only care about the leader, they ignore the error code. That's why they are not affected by this behavior. The reason this behavior doesn't show up as often in 0.8.1 as in 0.8.2 is that in 0.8.1 we had a bug such that dead brokers are never removed from the metadata cache on the broker. That bug has since been fixed in 0.8.2. To reproduce the behavior in 0.8.1, you can do the following: (1) start 2 brokers, (2) create a topic with 1 partition and 2 replicas, (3) bring down both brokers, (4) restart only 1 broker, (5) issue a TopicMetadataRequest on that topic, (6) you should see the ReplicaNotAvailableError code.
> >
> > So, technically, this is not a regression from 0.8.1. I agree that we should have documented this behavior more clearly. Really sorry about that.
> >
> > Thanks,
> >
> > Jun
> >
> > On Wed, Jan 14, 2015 at 1:14 PM, Dana Powers <dana.pow...@rd.io> wrote:
> >
> > > Overall the 0.8.2.0 release candidate looks really good.
> > >
> > > All of the kafka-python integration tests pass as they do with prior servers, except one... When testing recovery from a broker failure / leader switch, we now see a ReplicaNotAvailableError in the broker metadata / PartitionMetadata, which we do not see in the same test against previous servers. I understand from the discussion around KAFKA-1609 and KAFKA-1649 that this behavior is expected and that clients should ignore the error (or at least treat it as non-critical). But strictly speaking this is a behavior change and could cause client issues. Indeed, anyone using older versions of kafka-python against this release candidate will get bad failures on leader switch (exactly when you don't want bad client failures!). It may be our fault for not handling this in kafka-python, but at the least I think this needs to be flagged as a possible issue for 3rd-party clients. Also, KAFKA-1649 doesn't look like it was ever actually resolved... The protocol document does not mention anything about clients ignoring this error code.
> > >
> > > Dana Powers
> > > Rdio, Inc.
> > > dana.pow...@rd.io
> > > rdio.com/people/dpkp/
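
[Editorial aside: as a starting point for the per-API error-code list Dana asks for above, a client library could keep an explicit table of the codes it expects from a metadata response and how to react to each. The classification below is only an illustration drawn from this thread and the protocol guide's error-code list, not an official specification; the "reaction" wording is an interpretation, and the helper name is made up.]

    # Illustrative only: topic- or partition-level error codes a client might
    # see in a TopicMetadataResponse, with a suggested reaction. Numeric
    # values match the protocol guide; the reactions are not normative.
    METADATA_ERRORS = {
        0: ("NoError", "use the metadata as-is"),
        3: ("UnknownTopicOrPartition", "topic does not exist; surface to the caller"),
        5: ("LeaderNotAvailable", "leader election in progress; back off and refresh metadata"),
        9: ("ReplicaNotAvailable", "a replica is offline; safe to ignore if you only need the leader"),
    }

    def describe_metadata_error(code):
        name, reaction = METADATA_ERRORS.get(
            code, ("Unknown", "treat as fatal and refresh metadata"))
        return "%s (%d): %s" % (name, code, reaction)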