Hi, We have a topic with min.insync.replicas = 2 where each partition is replicated to 3 nodes. We write to it using acks=all.
We experienced a network malfunction, where leader node 1 could not reach replica 2 and 3, and vice versa. Nodes 2 and 3 could reach each other. The controller broker could reach all nodes, and external services could reach all nodes. What we saw was the ISR degrade to only node 1. Looking at the code, I see the ISR shrink when a replica has not caught up to the leader's LEO and it hasn't fetched for a while. My guess is the leader had messages that weren't yet replicated by the other nodes. Shouldn't min.insync.replicas = 2 and acks=all prevent the ISR shrinking to this size, since new writes should not be accepted unless they are replicated by at least 2 nodes?