Hi, Dana,

Thanks for reporting this. I investigated this a bit more. What you observed is the following: a client gets a partition-level error code of ReplicaNotAvailableError in a TopicMetadataResponse when one of the replicas is offline. The short story is that this behavior can already happen in 0.8.1, although the probability of it showing up in 0.8.1 is lower than in 0.8.2.
Currently, when sending a topic metadata response, the broker only includes replicas (in either the isr or the assigned replica set) that are alive. To indicate that a replica is missing, we set the partition-level error code to ReplicaNotAvailableError. In most cases, the client probably just cares about the leader in the response. However, this error code could be useful for some other clients (e.g., for building admin tools). Since our java/scala producer/consumer clients (both 0.8.1 and 0.8.2) only care about the leader, they ignore the error code. That's why they are not affected by this behavior.

The reason this behavior doesn't show up as often in 0.8.1 as in 0.8.2 is that 0.8.1 had a bug such that dead brokers were never removed from the metadata cache on the broker. That bug has since been fixed in 0.8.2. To reproduce the behavior in 0.8.1, you can do the following: (1) start 2 brokers, (2) create a topic with 1 partition and 2 replicas, (3) bring down both brokers, (4) restart only 1 broker, (5) issue a TopicMetadataRequest on that topic, (6) you should see the ReplicaNotAvailableError code.

So, technically, this is not a regression from 0.8.1. I agree that we should have documented this behavior more clearly. Really sorry about that.

Thanks,

Jun

On Wed, Jan 14, 2015 at 1:14 PM, Dana Powers <dana.pow...@rd.io> wrote:
> Overall the 0.8.2.0 release candidate looks really good.
>
> All of the kafka-python integration tests pass as they do w/ prior servers,
> except one... When testing recovery from a broker failure / leader switch,
> we now see a ReplicaNotAvailableError in broker metadata /
> PartitionMetadata, which we do not see in the same test against previous
> servers. I understand from discussion around KAFKA-1609 and KAFKA-1649
> that this behavior is expected and that clients should ignore the error (or
> at least treat it as non-critical). But strictly speaking this is a
> behavior change and could cause client issues. Indeed, anyone using older
> versions of kafka-python against this release candidate will get bad
> failures on leader switch (exactly when you don't want bad client
> failures!). It may be that it is our fault for not handling this in
> kafka-python, but at the least I think this needs to be flagged as a
> possible issue for 3rd party clients. Also KAFKA-1649 doesn't look like it
> was ever actually resolved... The protocol document does not mention
> anything about clients ignoring this error code.
>
> Dana Powers
> Rdio, Inc.
> dana.pow...@rd.io
> rdio.com/people/dpkp/
>
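[For reference, a minimal sketch of the client-side handling Jun describes: ignore a partition-level ReplicaNotAvailableError as long as the metadata still carries a usable leader. This is illustrative Python in the spirit of kafka-python, not its actual code; the PartitionMetadata structure, field names, and the usable_leader helper are assumptions made for the example. The error-code values (0, 5, 9) are the ones defined in the Kafka protocol.]

from collections import namedtuple

# Hypothetical stand-in for a decoded partition entry in a TopicMetadataResponse.
PartitionMetadata = namedtuple(
    'PartitionMetadata',
    ['topic', 'partition', 'leader', 'replicas', 'isr', 'error_code'])

NO_ERROR = 0
LEADER_NOT_AVAILABLE = 5
REPLICA_NOT_AVAILABLE = 9   # a replica was omitted from the response because it is offline


def usable_leader(pmeta):
    """Return the leader broker id if the partition metadata is usable, else None.

    A producer/consumer that only needs the leader can treat
    REPLICA_NOT_AVAILABLE as non-fatal, since the leader field is still valid.
    """
    if pmeta.error_code in (NO_ERROR, REPLICA_NOT_AVAILABLE):
        return pmeta.leader if pmeta.leader != -1 else None
    # Other partition-level errors (e.g. LeaderNotAvailable) mean the metadata
    # cannot be used yet; the client should refresh metadata and retry.
    return None


if __name__ == '__main__':
    # Example: a 2-replica partition whose second replica is offline. The broker
    # omits the dead replica and sets error_code 9, but the leader (broker 1)
    # is still perfectly usable.
    pm = PartitionMetadata('my-topic', 0, leader=1,
                           replicas=(1,), isr=(1,), error_code=9)
    print(usable_leader(pm))  # -> 1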