We had 2 nodes in a 4-node cluster die this weekend, sadly. Fortunately there was no critical data on these machines yet.
The cluster is running 0.8.1.1, with a replication factor of 2 for 2 topics, each with 20 partitions. For the sake of discussion, assume that nodes A and B are still up, and C and D are now down.

As expected, partitions that had one replica on a good node (A or B) and one on a bad node (C or D) had their ISR shrink to just 1 node (A or B).

Roughly 1/6 of the partitions had both of their replicas on the 2 bad nodes, C and D. For these, I was expecting the ISR to show up as empty and the partition to be unavailable. However, that's not what I'm seeing. When running TopicCommand --describe, I see that the ISR still shows 1 replica, on node D (D was the second node to go down). Producers are also still periodically trying to produce to node D (but failing and retrying to one of the good nodes).

So it seems the cluster's metadata still thinks that node D is up and serving the partitions that were only replicated on C and D. However, for partitions that were on A and D, or B and D, D is not shown as being in the ISR.

Is this correct? Should the cluster continue showing the last node to have been alive for a partition as still in the ISR?

Jason
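
For reference, the "roughly 1/6" figure above is what you would expect if replica pairs were spread evenly over all possible pairs of the 4 brokers. The sketch below only enumerates those pairs; the even spread is an assumption made for illustration, not a description of how Kafka actually assigns replicas, and the broker names are just the A-D labels used in this message.

```python
from itertools import combinations

# Broker labels as used in this message; C and D are the failed nodes.
brokers = ["A", "B", "C", "D"]
failed = {"C", "D"}

# Every possible replica pair for a replication factor of 2.
# Assumption: partitions are spread evenly over these pairs.
pairs = list(combinations(brokers, 2))

# Both replicas on failed brokers: no live replica remains.
fully_lost = [p for p in pairs if set(p) <= failed]
# Exactly one replica on a failed broker: ISR shrinks to the survivor.
partially_lost = [p for p in pairs if len(set(p) & failed) == 1]
# No replica on a failed broker: ISR is unchanged.
unaffected = [p for p in pairs if not set(p) & failed]

print("possible replica pairs:", len(pairs))
print("both replicas lost:    ", fully_lost)
print("one replica lost:      ", partially_lost)
print("no replica lost:       ", unaffected)
```

Under that assumption, 1 of the 6 pairs (C, D) loses both replicas, 4 of the 6 lose one, and 1 of the 6 (A, B) is untouched.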