Two nodes in our 4-node cluster died this weekend, sadly.
Fortunately there was no critical data on these machines yet.

The cluster is running Kafka 0.8.1.1, with a replication factor of 2 for 2
topics, each with 20 partitions.

For the sake of discussion, assume that nodes A and B are still up, and C and D
are now down.

As expected, partitions that had one replica on a good node (A or B) and
one on a bad node (C or D) had their ISR shrink to just 1 node (A or B).

Roughly 1/6 of the partitions had their 2 replicas on the 2 bad nodes, C
and D.  For these, I was expecting the ISR to show up as empty, and the
partition unavailable.
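
For what it's worth, the 1/6 figure is just what you'd expect if replica
pairs are spread evenly over all possible node pairs (that even spread is my
assumption here, not something I've verified against the actual assignment).
A quick sketch of the arithmetic, using the hypothetical node names from
above:

```python
from itertools import combinations

# Hypothetical node names from the discussion above.
nodes = ["A", "B", "C", "D"]
down = {"C", "D"}

# Every possible placement of 2 replicas on a 4-node cluster.
pairs = list(combinations(nodes, 2))


def classify(pairs, down):
    """Count replica pairs by how many of their nodes are down."""
    counts = {"both_up": 0, "one_down": 0, "both_down": 0}
    for pair in pairs:
        n_down = sum(1 for node in pair if node in down)
        key = ("both_up", "one_down", "both_down")[n_down]
        counts[key] += 1
    return counts


counts = classify(pairs, down)

# Assumes partitions are spread evenly across all node pairs.
fraction_both_down = counts["both_down"] / len(pairs)

print(counts)
print(fraction_both_down)
```

Of the 6 possible pairs, only (C, D) has both replicas on dead nodes, which
matches the roughly 1/6 of partitions I'm seeing in this state.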

However, that's not what I'm seeing.  When running TopicCommand --describe,
I see that the ISR still shows 1 replica, on node D (D was the second node
to go down).

Also, producers are still periodically trying to produce to node D (but
failing and retrying against one of the good nodes).

So it seems the cluster's metadata still thinks that node D is up
and serving the partitions that were only replicated on C and D.  However,
for partitions that were on A and D, or B and D, D is not shown as being in
the ISR.

Is this the expected behavior?  Should the cluster continue showing the last
node to have been alive for a partition as still in the ISR?

Jason
