[ https://issues.apache.org/jira/browse/KAFKA-12686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Konstantine Karantasis resolved KAFKA-12686.
--------------------------------------------
    Resolution: Fixed

> Race condition in AlterIsr response handling
> --------------------------------------------
>
>                 Key: KAFKA-12686
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12686
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.7.0, 2.8.0
>            Reporter: David Arthur
>            Assignee: David Arthur
>            Priority: Minor
>             Fix For: 3.0.0
>
>
> In Partition.scala, there is a race condition between the handling of an
> AlterIsrResponse and a LeaderAndIsrRequest. This is a pretty rare scenario
> and would involve the AlterIsrResponse being delayed for some time, but it is
> possible. This was observed in a test environment when lots of ISR and
> leadership changes were happening due to broker restarts.
>
> When the leader handles the LeaderAndIsr, it calls Partition#makeLeader, which
> overwrites the {{isrState}} variable and clears the pending ISR items via
> {{AlterIsrManager#clearPending(TopicPartition)}}.
>
> The bug is that AlterIsrManager does not check its inflight state before
> clearing pending items. The way AlterIsrManager is designed, it retains
> inflight items in the pending items collection until the response is
> processed (to allow for retries). The result is that an inflight item is
> inadvertently removed from this collection.
>
> Since the inflight item is cleared from the collection, AlterIsrManager
> allows for new AlterIsrItem-s to be enqueued for this partition even though
> it has an inflight AlterIsrItem. By allowing an update to be enqueued,
> Partition will transition its {{isrState}} to one of the inflight states
> (PendingIsrExpand, PendingIsrShrink, etc). Once the inflight partition's
> response is handled, it will fail to update the {{isrState}} due to detecting
> changes since the request was sent (which is by design). However, after the
> response callback is run, AlterIsrManager will clear the partitions that it
> saw in the response from the unsent items collection. This includes the newly
> added (and unsent) update.
>
> The result is that Partition has an "inflight" isrState but AlterIsrManager
> does not have an unsent item for this partition. This prevents any further
> ISR updates on the partition until the next leader election (when
> {{isrState}} is reset).
>
> If this bug is encountered, the workaround is to force a leader election,
> which will reset the partition's state.
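
The sequence in the report can be replayed with a small toy model. The following is a minimal sketch in Python, not Kafka's Scala code: `ToyAlterIsrManager`, `ToyPartition`, the `version` counter, and the `check_inflight` flag are all hypothetical names introduced for illustration. The guard it shows is only one reading of "check its inflight state before clearing pending items" and is not necessarily the change that shipped in 3.0.0.

```python
class ToyAlterIsrManager:
    """Toy stand-in for AlterIsrManager (hypothetical, heavily simplified)."""

    def __init__(self, check_inflight):
        self.check_inflight = check_inflight  # hypothetical guard switch
        self.unsent = {}    # name -> (partition, version); kept until the response
        self.inflight = {}  # snapshot of the items in the outstanding request

    def enqueue(self, partition):
        if partition.name in self.unsent:
            return False  # an item already exists for this partition
        self.unsent[partition.name] = (partition, partition.version)
        return True

    def send(self):
        # Items stay in `unsent` while in flight (to allow for retries).
        self.inflight = dict(self.unsent)

    def clear_pending(self, name):
        if self.check_inflight and name in self.inflight:
            return  # guarded variant: leave in-flight items alone
        self.unsent.pop(name, None)

    def handle_response(self):
        for name, (partition, sent_version) in self.inflight.items():
            partition.on_alter_isr_response(sent_version)
            # Clears whatever is stored under this name, even a newer item.
            self.unsent.pop(name, None)
        self.inflight = {}


class ToyPartition:
    """Toy stand-in for Partition; `version` models 'state changed since send'."""

    def __init__(self, name, manager):
        self.name = name
        self.manager = manager
        self.isr_state = "Committed"
        self.version = 0

    def make_leader(self):
        # Models handling a LeaderAndIsrRequest: reset state, clear pending.
        self.version += 1
        self.isr_state = "Committed"
        self.manager.clear_pending(self.name)

    def propose_isr_change(self):
        if self.isr_state != "Committed":
            return False  # an update is already considered in flight
        if not self.manager.enqueue(self):
            return False
        self.isr_state = "Pending"
        return True

    def on_alter_isr_response(self, sent_version):
        if sent_version != self.version:
            return  # stale response: ignored by design
        self.isr_state = "Committed"


def run_race(check_inflight):
    manager = ToyAlterIsrManager(check_inflight)
    partition = ToyPartition("topic-0", manager)
    partition.propose_isr_change()  # first update is enqueued
    manager.send()                  # request goes out, response is delayed
    partition.make_leader()         # LeaderAndIsr arrives in the meantime
    partition.propose_isr_change()  # second update attempt
    manager.handle_response()       # delayed response finally arrives
    return partition, manager


stuck, stuck_manager = run_race(check_inflight=False)
ok, ok_manager = run_race(check_inflight=True)
```

In the unguarded run, `stuck.isr_state` stays `"Pending"` while the manager holds no item for the partition, so `propose_isr_change()` keeps returning `False` until `make_leader()` is called again, which mirrors the workaround described in the report. In the guarded run the second enqueue is rejected, the state stays `"Committed"`, and a later proposal succeeds.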