Hello,

We experienced a network partition in our Kafka cluster that left one broker unreachable from the other brokers, although it could still reach our ZooKeeper cluster. When this happened, a number of topic-partitions shrank their ISR to just the impaired broker itself, halting progress on those partitions. Since we had to take the broker instance offline and provision a replacement, those partitions were unavailable until the replacement instance came up and resumed acting as that broker.
However, reviewing our broker and producer settings, I'm not sure why it was possible for the leader to accept writes that could not be replicated to the followers. Our topics use min.insync.replicas=2 and our producers use acks=all. In this scenario, with the changes not being replicated to the other followers, I'd expect the records to have failed to be written. We are, however, on an older version of Kafka (2.6.1), so I'm curious whether later versions have improved this behavior.

Some relevant logs:

[Partition my-topic-one-120 broker=7] Shrinking ISR from 7,8,9 to 7. Leader: (highWatermark: 82153383556, endOffset: 82153383565). Out of sync replicas: (brokerId: 8, endOffset: 82153383556) (brokerId: 9, endOffset: 82153383561).

[Partition my-topic-one-120 broker=7] ISR updated to [7] and zkVersion updated to [1367]

[ReplicaFetcher replicaId=9, leaderId=7, fetcherId=5] Error in response for fetch request (type=FetchRequest, replicaId=9, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={my-topic-one-120=(fetchOffset=75987953095, logStartOffset=75983970457, maxBytes=1048576, currentLeaderEpoch=Optional[772]), my-topic-one-84=(fetchOffset=87734453342, logStartOffset=87730882175, maxBytes=1048576, currentLeaderEpoch=Optional[776]), my-topic-one-108=(fetchOffset=72037212609, logStartOffset=72034727231, maxBytes=1048576, currentLeaderEpoch=Optional[776]), my-topic-one-72=(fetchOffset=83006080094, logStartOffset=83002240584, maxBytes=1048576, currentLeaderEpoch=Optional[768]), my-topic-one-96=(fetchOffset=79250375295, logStartOffset=79246320254, maxBytes=1048576, currentLeaderEpoch=Optional[763])}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=965270777, epoch=725379656), rackId=)

[Controller id=13 epoch=611] Controller 13 epoch 611 failed to change state for partition my-topic-one-120 from OnlinePartition to OnlinePartition
kafka.common.StateChangeFailedException: Failed to elect leader for partition my-topic-one-120 under strategy ControlledShutdownPartitionLeaderElectionStrategy

(later)

kafka.common.StateChangeFailedException: Failed to elect leader for partition my-topic-one-120 under strategy OfflinePartitionLeaderElectionStrategy(false)

Configuration for this topic:

Topic: my-topic-one PartitionCount: 250 ReplicationFactor: 3 Configs: min.insync.replicas=2,segment.bytes=536870912,retention.ms=1800000,unclean.leader.election.enable=false

Outside of this topic, a topic with a replication factor of 5 was also impacted, as was the __consumer_offsets topic, which we have set to a replication factor of 5.

[Partition my-topic-two-204 broker=7] Shrinking ISR from 10,9,7,11,8 to 7. Leader: (highWatermark: 86218167, endOffset: 86218170). Out of sync replicas: (brokerId: 10, endOffset: 86218167) (brokerId: 9, endOffset: 86218167) (brokerId: 11, endOffset: 86218167) (brokerId: 8, endOffset: 86218167).

Configuration:

Topic: my-topic-two PartitionCount: 500 ReplicationFactor: 5 Configs: min.insync.replicas=2,segment.jitter.ms=3600000,cleanup.policy=compact,segment.bytes=1048576,max.compaction.lag.ms=9000000,min.compaction.lag.ms=4500000,unclean.leader.election.enable=false,delete.retention.ms=86400000,segment.ms=21600000

[Partition __consumer_offsets-18 broker=7] Shrinking ISR from 10,9,7,11,8 to 7. Leader: (highWatermark: 4387657484, endOffset: 4387657485). Out of sync replicas: (brokerId: 9, endOffset: 4387657484) (brokerId: 8, endOffset: 4387657484) (brokerId: 10, endOffset: 4387657484) (brokerId: 11, endOffset: 4387657484).

Configuration:

Topic: __consumer_offsets PartitionCount: 50 ReplicationFactor: 5 Configs: compression.type=producer,min.insync.replicas=2,cleanup.policy=compact,segment.bytes=104857600,unclean.leader.election.enable=false

Other configurations:

zookeeper.connection.timeout.ms=6000
replica.lag.time.max.ms=8000
zookeeper.session.timeout.ms=6000
Producer request.timeout.ms=8500
Producer linger.ms=10
Producer delivery.timeout.ms=38510
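To make the expectation concrete, here is a minimal sketch of a producer configured with the settings listed above (plain Java client; the bootstrap address and the String serializers are illustrative placeholders, not our actual code). My understanding is that with acks=all on the producer and min.insync.replicas=2 on the topic, a send like this should fail with NotEnoughReplicasException once the ISR shrinks to the leader alone, rather than being committed by the leader by itself:

import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.errors.NotEnoughReplicasException;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Settings from above: wait for acknowledgement from all in-sync replicas.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "8500");
        props.put(ProducerConfig.LINGER_MS_CONFIG, "10");
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "38510");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                new ProducerRecord<>("my-topic-one", "key", "value");
            try {
                // With min.insync.replicas=2 on the topic, we expect this to fail
                // once the ISR has shrunk below 2 rather than be committed.
                RecordMetadata metadata = producer.send(record).get();
                System.out.println("committed at offset " + metadata.offset());
            } catch (ExecutionException e) {
                if (e.getCause() instanceof NotEnoughReplicasException) {
                    System.err.println("rejected: ISR below min.insync.replicas");
                } else {
                    throw e;
                }
            }
        }
    }
}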
I saw a similar issue described in KAFKA-8702 <https://issues.apache.org/jira/browse/KAFKA-8702>, but I did not see a resolution there. Any help with this would be appreciated. Thank you!
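P.S. For anyone trying to reproduce or monitor this, the per-partition leader and ISR can be inspected with a short AdminClient snippet along these lines (the bootstrap address is a placeholder, and the topic name is just one of the affected topics from above):

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class IsrCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin
                .describeTopics(Collections.singletonList("my-topic-one"))
                .all()
                .get()
                .get("my-topic-one");

            // Print leader and ISR for each partition, e.g. to spot an ISR of
            // size 1 on a topic that has min.insync.replicas=2.
            description.partitions().forEach(p ->
                System.out.printf("partition=%d leader=%s isr=%s%n",
                    p.partition(), p.leader(), p.isr()));
        }
    }
}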