Calvin Liu created KAFKA-20041:
----------------------------------
Summary: Stuck ISR expansion due to partition reassignment completion race
Key: KAFKA-20041
URL: https://issues.apache.org/jira/browse/KAFKA-20041
Project: Kafka
Issue Type: Bug
Reporter: Calvin Liu
Assignee: Calvin Liu
An ISR expansion from [0,1] to [0,1,2] is stuck on the leader. It cannot
complete because the replica set has changed from [0,1,2] to [0,1,3], so
broker 2 is no longer a replica. The AlterPartition request for the expansion
fails with INVALID_REQUEST, but the PendingExpandIsr state is never cleared,
which blocks future ISR expansion.
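To make the stuck state concrete, here is a minimal toy model in Java. It is
not Kafka code: the class `StuckIsrModel` and its fields are hypothetical
stand-ins, and the only behavior it borrows from the report is that a pending
expansion suppresses new ones and that the error response leaves it in place.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model of the leader's view of one partition (not Kafka code).
class StuckIsrModel {
    // State after the reassignment [0,1,2] -> [0,1,3] has completed.
    final Set<Integer> replicas = new TreeSet<>(List.of(0, 1, 3));
    final Set<Integer> isr = new TreeSet<>(List.of(0, 1));
    // Broker id of the in-flight expansion, or null when nothing is pending.
    Integer pendingExpand = null;

    // Mirrors the `!partitionState.isInflight` guard: no new change is
    // proposed while an earlier one is still pending.
    boolean tryExpand(int brokerId) {
        if (pendingExpand != null) {
            return false;
        }
        pendingExpand = brokerId;
        return true;
    }

    // Models the reported behavior: the INVALID_REQUEST response is handled
    // without clearing the pending state.
    void onInvalidRequest() {
        // intentionally leaves pendingExpand untouched
    }

    public static void main(String[] args) {
        StuckIsrModel partition = new StuckIsrModel();
        System.out.println(partition.tryExpand(2)); // true: stale expansion is accepted
        partition.onInvalidRequest();
        System.out.println(partition.tryExpand(3)); // false: broker 3 can never join
    }
}
```

In this model only a restart (constructing a fresh `StuckIsrModel`) resets
`pendingExpand`, which matches the observation that a leader restart unblocks
the partition.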
The root cause is a rare race between ISR expansion and partition
reassignment.
{code:java}
private def maybeExpandIsr(followerReplica: Replica): Unit = {
  val needsIsrUpdate = !partitionState.isInflight &&
    canAddReplicaToIsr(followerReplica.brokerId) &&
    inReadLock(leaderIsrUpdateLock) {
      needsExpandIsr(followerReplica)
    }
  if (needsIsrUpdate) {
    val alterIsrUpdateOpt = inWriteLock(leaderIsrUpdateLock) {
      // check if this replica needs to be added to the ISR
      partitionState match {
        case currentState: CommittedPartitionState if needsExpandIsr(followerReplica) =>
          Some(prepareIsrExpand(currentState, followerReplica.brokerId))
        case _ =>
          None
      }
    }
    // ...
{code}
The partition is expanding its ISR and enters `maybeExpandIsr`. Before this
thread acquires the `leaderIsrUpdateLock`, the partition reassignment
completes and the partition applies the update, so it now has the latest
partition epochs. The thread then enters the lock and prepares the ISR
expansion. Because the code trusts the caller, it does not verify that the
ISR candidate replica is still in the partition's replica set. The partition
therefore creates an invalid ISR update (wrong replica) that carries valid
epochs. In the end, the partition receives an INVALID_REQUEST error but does
not clear the PendingExpandIsr state, which prevents any future ISR update.
The partition is unblocked after a leader restart.
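The sequence above suggests two possible mitigations: re-check replica-set
membership while holding the lock, and clear the pending state on a
non-retriable error. The Java sketch below illustrates both under stated
assumptions. The class `GuardedIsrExpansion` and all of its members are
hypothetical and do not correspond to Kafka's `Partition` class; this is not
necessarily the fix adopted for this ticket.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch (not Kafka code) of a guarded ISR expansion.
class GuardedIsrExpansion {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private Set<Integer> replicas = new TreeSet<>(List.of(0, 1, 2));
    private final Set<Integer> isr = new TreeSet<>(List.of(0, 1));
    private Integer pendingExpand = null; // null means nothing is in flight

    // A reassignment completing on another thread swaps the replica set.
    void completeReassignment(List<Integer> newReplicas) {
        lock.writeLock().lock();
        try {
            replicas = new TreeSet<>(newReplicas);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Re-validates the candidate under the write lock instead of trusting a
    // check the caller made before acquiring it.
    boolean prepareExpand(int brokerId) {
        lock.writeLock().lock();
        try {
            if (pendingExpand != null
                    || !replicas.contains(brokerId)
                    || isr.contains(brokerId)) {
                return false;
            }
            pendingExpand = brokerId;
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Drops the pending state when the request is rejected as invalid, so a
    // later expansion can be proposed.
    void onInvalidRequest() {
        lock.writeLock().lock();
        try {
            pendingExpand = null;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

With this guard, calling `completeReassignment(List.of(0, 1, 3))` before
`prepareExpand(2)` makes the stale expansion return `false` without creating
pending state, and a later `prepareExpand(3)` can proceed.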