[ 
https://issues.apache.org/jira/browse/KAFKA-12686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantine Karantasis resolved KAFKA-12686.
--------------------------------------------
    Resolution: Fixed

> Race condition in AlterIsr response handling
> --------------------------------------------
>
>                 Key: KAFKA-12686
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12686
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.7.0, 2.8.0
>            Reporter: David Arthur
>            Assignee: David Arthur
>            Priority: Minor
>             Fix For: 3.0.0
>
>
> In Partition.scala, there is a race condition between the handling of an 
> AlterIsrResponse and a LeaderAndIsrRequest. This is a pretty rare scenario 
> and would involve the AlterIsrResponse being delayed for some time, but it is 
> possible. This was observed in a test environment when lots of ISR and 
> leadership changes were happening due to broker restarts.
> When the leader handles the LeaderAndIsrRequest, it calls 
> Partition#makeLeader, which overwrites the {{isrState}} variable and clears 
> the pending ISR items via {{AlterIsrManager#clearPending(TopicPartition)}}. 
> The bug is that AlterIsrManager does not check its inflight state before 
> clearing pending items. The way AlterIsrManager is designed, it retains 
> inflight items in the pending items collection until the response is 
> processed (to allow for retries). The result is that an inflight item is 
> inadvertently removed from this collection.
> Since the inflight item is cleared from the collection, AlterIsrManager 
> allows new AlterIsrItem instances to be enqueued for this partition even though 
> it has an inflight AlterIsrItem. By allowing an update to be enqueued, 
> Partition will transition its {{isrState}} to one of the inflight states 
> (PendingIsrExpand, PendingIsrShrink, etc.). Once the response for the inflight 
> item is handled, Partition will not update {{isrState}} because it detects 
> changes since the request was sent (which is by design). However, after the 
> response callback is run, AlterIsrManager will clear the partitions that it 
> saw in the response from the unsent items collection. This includes the newly 
> added (and unsent) update.
> The result is that Partition has an "inflight" isrState but AlterIsrManager 
> does not have an unsent item for this partition. This prevents any further 
> ISR updates on the partition until the next leader election (when 
> {{isrState}} is reset).
> If this bug is encountered, the workaround is to force a leader election 
> which will reset the partition's state.
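
The sequence described above can be reproduced in a small, single-threaded
model. The sketch below is not Kafka code: AlterIsrManagerSketch,
AlterIsrRaceSketch, the string partition key, and the guardInflight flag are
all hypothetical names introduced for illustration. It only assumes what the
report states: inflight items stay in the pending collection until the
response is processed, clearPending does not check the inflight state, and
response handling removes every partition seen in the response. The guarded
variant shows one possible mitigation (skip clearing while an item is
inflight); it is not claimed to be the change that was merged for this issue.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-in for AlterIsrManager (not the Kafka class).
class AlterIsrManagerSketch {
    // Holds unsent AND inflight items until the response is processed.
    private final Map<String, String> pending = new HashMap<>();
    // Partitions included in the request that is currently inflight.
    private final Set<String> inflight = new HashSet<>();
    // Assumption for illustration: when true, clearPending respects inflight items.
    private final boolean guardInflight;

    AlterIsrManagerSketch(boolean guardInflight) {
        this.guardInflight = guardInflight;
    }

    // Returns false if the partition already has a pending item.
    boolean enqueue(String partition, String item) {
        return pending.putIfAbsent(partition, item) == null;
    }

    // Simulates sending a request containing everything currently pending.
    void send() {
        inflight.addAll(pending.keySet());
    }

    // Called when the leader handles a LeaderAndIsrRequest (makeLeader).
    void clearPending(String partition) {
        if (guardInflight && inflight.contains(partition)) {
            return; // mitigation: leave the inflight item in place
        }
        pending.remove(partition); // buggy path: may drop an inflight item
    }

    // Simulates response handling: clears every partition seen in the response.
    void handleResponse() {
        for (String partition : inflight) {
            pending.remove(partition);
        }
        inflight.clear();
    }

    boolean hasPending(String partition) {
        return pending.containsKey(partition);
    }
}

class AlterIsrRaceSketch {
    // Returns true if the partition ends up stuck: it believes an ISR update
    // is pending, but the manager no longer holds an item for it.
    static boolean partitionGetsStuck(boolean guardInflight) {
        AlterIsrManagerSketch manager = new AlterIsrManagerSketch(guardInflight);
        String partition = "topic-0";

        manager.enqueue(partition, "shrink-1"); // first ISR update
        manager.send();                         // request sent, response delayed

        manager.clearPending(partition);        // LeaderAndIsrRequest arrives
        boolean partitionPending = false;       // makeLeader resets isrState

        // A new ISR change; the partition enters a pending state only if
        // the manager accepts the item.
        if (manager.enqueue(partition, "expand-2")) {
            partitionPending = true;
        }

        // The delayed response arrives. The partition ignores it as stale,
        // but the manager still clears the partition from its collection.
        manager.handleResponse();

        return partitionPending && !manager.hasPending(partition);
    }

    public static void main(String[] args) {
        System.out.println("unguarded stuck: " + partitionGetsStuck(false));
        System.out.println("guarded stuck: " + partitionGetsStuck(true));
    }
}
```

In this model the unguarded run ends with the partition believing an update
is pending while the manager holds nothing for it, which matches the stuck
state described in the report. With the guard, the second enqueue is rejected
because the inflight item is still present, so the partition never enters a
pending state that nothing will resolve.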



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
