Thanks -- I see that this was more of a bug in 0.8.1 than a regression in 0.8.2. But I do think the 0.8.2 bug fix to the metadata cache means that the very common scenario of a single broker failure (and subsequent partition leadership change) will now return error codes in the MetadataResponse -- different from 0.8.1 -- and those errors may cause pain to some users if the client doesn't know how to handle them. The "fix" for users is to upgrade client code (or verify that existing client code handles this well) before upgrading to 0.8.2 in a production environment.
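To make "handles this well" concrete, here is the kind of classification I have in mind. This is only a rough sketch, not kafka-python code: the dict layout and the helper name are made up to stand in for a decoded MetadataResponse, and only the numeric error codes (0, 5, 9) come from the protocol document.

NO_ERROR = 0
LEADER_NOT_AVAILABLE = 5    # retriable: a leader election is in progress
REPLICA_NOT_AVAILABLE = 9   # informational: a replica in the assigned set is offline

def usable_leaders(metadata_response):
    """Return {(topic, partition): leader} for partitions we can still use."""
    leaders = {}
    for topic in metadata_response['topics']:
        for part in topic['partitions']:
            code = part['error_code']
            if code in (NO_ERROR, REPLICA_NOT_AVAILABLE):
                # ReplicaNotAvailable still carries a valid leader, so a
                # producer/consumer that only needs the leader can proceed.
                leaders[(topic['name'], part['id'])] = part['leader']
            elif code == LEADER_NOT_AVAILABLE:
                continue  # refresh metadata and retry later
            else:
                raise RuntimeError('unexpected metadata error %d for %s-%d'
                                   % (code, topic['name'], part['id']))
    return leaders

The point is just that ReplicaNotAvailable (9) has to be explicitly treated as non-fatal, and nothing in the protocol document currently tells a client author to do that.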
What would be really useful for the non-java community is a list or specification of which error codes should be expected for each API response (here the MetadataResponse), along with perhaps even some context-related notes on what they mean. As it stands, the protocol document leaves all of the ErrorCode documentation until the end and doesn't give any context about *when* to handle each error:
https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-ErrorCodes

I would volunteer to go into the wiki and help with that effort, but I also feel that protocol document changes deserve a stricter review process. Maybe the KIP process mentioned separately on the dev-list would be appropriate, or maybe the protocol document itself should be versioned and released with core Kafka. Nonetheless, right now the error-handling part of managing clients is fairly ad hoc, and I think we should work to tighten that process up.

Dana Powers
Rdio, Inc.
dana.pow...@rd.io
rdio.com/people/dpkp/

On Wed, Jan 14, 2015 at 5:43 PM, Jun Rao <j...@confluent.io> wrote:

> Hi, Dana,
>
> Thanks for reporting this. I investigated this a bit more. What you observed is the following: a client getting a partition-level error code of ReplicaNotAvailableError in a TopicMetadataResponse when one of the replicas is offline. The short story is that this behavior can already happen in 0.8.1, although the probability of it showing up in 0.8.1 is lower than in 0.8.2.
>
> Currently, when sending a topic metadata response, the broker only includes replicas (in either the isr or the assigned replica set) that are alive. To indicate that a replica is missing, we set the partition-level error code to ReplicaNotAvailableError. In most cases, the client probably just cares about the leader in the response. However, this error code could be useful for some other clients (e.g., building admin tools). Since our java/scala producer/consumer clients (both 0.8.1 and 0.8.2) only care about the leader, they ignore the error code. That's why they are not affected by this behavior. The reason this behavior doesn't show up as often in 0.8.1 as in 0.8.2 is that 0.8.1 had a bug such that dead brokers were never removed from the metadata cache on the broker. That bug has since been fixed in 0.8.2. To reproduce the behavior in 0.8.1, you can do the following: (1) start 2 brokers, (2) create a topic with 1 partition and 2 replicas, (3) bring down both brokers, (4) restart only 1 broker, (5) issue a TopicMetadataRequest on that topic, (6) you should see the ReplicaNotAvailableError code.
>
> So, technically, this is not a regression from 0.8.1. I agree that we should have documented this behavior more clearly. Really sorry about that.
>
> Thanks,
>
> Jun
>
> On Wed, Jan 14, 2015 at 1:14 PM, Dana Powers <dana.pow...@rd.io> wrote:
>
> > Overall, the 0.8.2.0 release candidate looks really good.
> >
> > All of the kafka-python integration tests pass as they do with prior servers, except one... When testing recovery from a broker failure / leader switch, we now see a ReplicaNotAvailableError in the broker metadata / PartitionMetadata, which we do not see in the same test against previous servers. I understand from the discussion around KAFKA-1609 and KAFKA-1649 that this behavior is expected and that clients should ignore the error (or at least treat it as non-critical). But strictly speaking this is a behavior change and could cause client issues. Indeed, anyone using older versions of kafka-python against this release candidate will get bad failures on leader switch (exactly when you don't want bad client failures!). It may be that it is our fault for not handling this in kafka-python, but at the least I think this needs to be flagged as a possible issue for 3rd-party clients. Also, KAFKA-1649 doesn't look like it was ever actually resolved... The protocol document does not mention anything about clients ignoring this error code.
> >
> > Dana Powers
> > Rdio, Inc.
> > dana.pow...@rd.io
> > rdio.com/people/dpkp/
>
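P.S. For anyone who wants to see the scenario Jun describes above, the client-side check is tiny. The cluster steps (start two brokers, create a topic with one partition and two replicas, stop both brokers, restart only one) happen outside the script, and fetch_topic_metadata is a placeholder for whatever call your client library uses to issue a TopicMetadataRequest and decode it into the same illustrative layout as the sketch earlier in this mail:

REPLICA_NOT_AVAILABLE = 9

def check_partition_errors(fetch_topic_metadata, topic):
    # Step 5 of Jun's recipe: issue a TopicMetadataRequest for the topic.
    response = fetch_topic_metadata(topic)
    # Step 6: with one replica still offline, expect error_code 9 on the
    # partition even though the leader field is populated and usable.
    for t in response['topics']:
        for part in t['partitions']:
            print('%s-%d leader=%s error_code=%d' %
                  (t['name'], part['id'], part['leader'], part['error_code']))
            if part['error_code'] == REPLICA_NOT_AVAILABLE:
                print('  replica offline; ignorable if you only need the leader')

On 0.8.2 the same error code shows up after a plain single-broker failure, which is exactly the case I'm worried about for existing clients.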