I agree. Also, is this behavior a good one? It seems kind of hacky to return both an error code and a result, no?

-Jay
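[Editorial aside: to make the "error code plus a result" shape concrete, here is a minimal sketch of how a client might parse a partition entry from a TopicMetadataResponse and treat ReplicaNotAvailable as non-fatal when a leader is still present. The `PartitionMetadata`-style object and helper are hypothetical, not a real kafka-python API; the numeric error codes match the protocol guide.]

    # Hedged sketch: 'partition' is a hypothetical parsed partition-metadata
    # object with .error_code, .leader, and .partition_id fields. It is not a
    # real kafka-python class. Error code values are from the protocol guide.
    NO_ERROR = 0
    LEADER_NOT_AVAILABLE = 5
    REPLICA_NOT_AVAILABLE = 9

    class MetadataUnavailableError(Exception):
        pass

    def usable_leader(partition):
        """Return the leader broker id, tolerating the 'error code plus
        result' shape of the metadata response."""
        if partition.error_code in (NO_ERROR, REPLICA_NOT_AVAILABLE):
            # ReplicaNotAvailable only means some replica is offline; the
            # leader field is still populated, so produce/fetch can continue.
            return partition.leader
        if partition.error_code == LEADER_NOT_AVAILABLE:
            # No leader yet (e.g. mid-election): back off and refresh metadata.
            raise MetadataUnavailableError(
                "no leader for partition %d yet" % partition.partition_id)
        raise MetadataUnavailableError(
            "unexpected metadata error code %d" % partition.error_code)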
On Wed, Jan 14, 2015 at 6:35 PM, Dana Powers <dana.pow...@rd.io> wrote:

> Thanks -- I see that this was more of a bug in 0.8.1 than a regression in 0.8.2. But I do think the 0.8.2 bug fix to the metadata cache means that the very common scenario of a single broker failure (and subsequent partition leadership change) will now return error codes in the MetadataResponse -- different from 0.8.1 -- and those errors may cause pain to some users if the client doesn't know how to handle them. The "fix" for users is to upgrade client code (or verify that existing client code handles this well) before upgrading to 0.8.2 in a production environment.
>
> What would be really useful for the non-java community is a list or specification of what error codes should be expected for each API response (here the MetadataResponse), along with perhaps even some context-related notes on what they mean. As it stands, the protocol document leaves all of the ErrorCode documentation until the end and doesn't give any context about *when* to handle each error:
>
> https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-ErrorCodes
>
> I would volunteer to go into the wiki and help with that effort, but I also feel that protocol document changes deserve a stricter review process -- maybe the KIP process mentioned separately on the dev-list. Maybe the protocol document itself should be versioned and released with core.
>
> Nonetheless, right now the error-handling part of managing clients is fairly ad-hoc, and I think we should work to tighten that process up.
>
> Dana Powers
> Rdio, Inc.
> dana.pow...@rd.io
> rdio.com/people/dpkp/
>
>
> On Wed, Jan 14, 2015 at 5:43 PM, Jun Rao <j...@confluent.io> wrote:
>
> > Hi, Dana,
> >
> > Thanks for reporting this. I investigated this a bit more. What you observed is the following: a client gets a partition-level error code of ReplicaNotAvailableError in a TopicMetadataResponse when one of the replicas is offline. The short story is that this behavior can already happen in 0.8.1, although the probability of it showing up in 0.8.1 is lower than in 0.8.2.
> >
> > Currently, when sending a topic metadata response, the broker only includes replicas (in either the isr or the assigned replica set) that are alive. To indicate that a replica is missing, we set the partition-level error code to ReplicaNotAvailableError. In most cases, the client probably just cares about the leader in the response. However, this error code could be useful for some other clients (e.g., for building admin tools). Since our java/scala producer/consumer clients (both 0.8.1 and 0.8.2) only care about the leader, they ignore the error code. That's why they are not affected by this behavior. The reason this behavior doesn't show up as often in 0.8.1 as in 0.8.2 is that in 0.8.1 we had a bug such that dead brokers are never removed from the metadata cache on the broker. That bug has since been fixed in 0.8.2. To reproduce the behavior in 0.8.1, you can do the following: (1) start 2 brokers, (2) create a topic with 1 partition and 2 replicas, (3) bring down both brokers, (4) restart only 1 broker, (5) issue a TopicMetadataRequest on that topic, (6) you should see the ReplicaNotAvailableError code.
> >
> > So, technically, this is not a regression from 0.8.1. I agree that we should have documented this behavior more clearly. Really sorry about that.
> >
> > Thanks,
> >
> > Jun
> >
> > On Wed, Jan 14, 2015 at 1:14 PM, Dana Powers <dana.pow...@rd.io> wrote:
> >
> > > Overall the 0.8.2.0 release candidate looks really good.
> > >
> > > All of the kafka-python integration tests pass as they do with prior servers, except one... When testing recovery from a broker failure / leader switch, we now see a ReplicaNotAvailableError in the broker metadata / PartitionMetadata, which we do not see in the same test against previous servers. I understand from the discussion around KAFKA-1609 and KAFKA-1649 that this behavior is expected and that clients should ignore the error (or at least treat it as non-critical). But strictly speaking this is a behavior change and could cause client issues. Indeed, anyone using older versions of kafka-python against this release candidate will get bad failures on leader switch (exactly when you don't want bad client failures!). It may be our fault for not handling this in kafka-python, but at the least I think this needs to be flagged as a possible issue for 3rd-party clients. Also, KAFKA-1649 doesn't look like it was ever actually resolved... The protocol document does not mention anything about clients ignoring this error code.
> > >
> > > Dana Powers
> > > Rdio, Inc.
> > > dana.pow...@rd.io
> > > rdio.com/people/dpkp/
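
[Editorial aside: as a starting point for the per-API error-code list Dana asks for above, a client library could keep an explicit table of the codes it expects from a metadata response and how to react to each. The classification below is only an illustration drawn from this thread and the protocol guide's error-code list, not an official specification; the "reaction" wording is an interpretation, and the helper name is made up.]

    # Illustrative only: topic- or partition-level error codes a client might
    # see in a TopicMetadataResponse, with a suggested reaction. Numeric
    # values match the protocol guide; the reactions are not normative.
    METADATA_ERRORS = {
        0: ("NoError", "use the metadata as-is"),
        3: ("UnknownTopicOrPartition", "topic does not exist; surface to the caller"),
        5: ("LeaderNotAvailable", "leader election in progress; back off and refresh metadata"),
        9: ("ReplicaNotAvailable", "a replica is offline; safe to ignore if you only need the leader"),
    }

    def describe_metadata_error(code):
        name, reaction = METADATA_ERRORS.get(
            code, ("Unknown", "treat as fatal and refresh metadata"))
        return "%s (%d): %s" % (name, code, reaction)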