[ https://issues.apache.org/jira/browse/KAFKA-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chia-Ping Tsai resolved KAFKA-18084.
------------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

[~adixitconfluent] thanks for all your contributions!!!

> Null and leaked AcquisitionLockTimerTask causes hanging AcknowledgeRequest and corrupted state of batch
> -------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-18084
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18084
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Chia-Ping Tsai
>            Assignee: Abhinav Dixit
>            Priority: Blocker
>             Fix For: 4.0.0
>
>
> I noticed some critical issues while reading the share-related code.
> 1) `SharePartition#rollbackOrProcessStateUpdates` does not hold the write lock when updating state, so it can result in a race condition. Note that the `DefaultStatePersister` uses an internal thread [1] to complete those callbacks.
> 2) `SharePartition#acquire` does not honor the rollback state [2][3]. This causes two issues.
> 2.1) Leaked `acquisitionLockTimeoutTask`: `SharePartition#acquire` creates a new `acquisitionLockTimeoutTask` for the "available" batch. However, the available batch in transition already has an `acquisitionLockTimeoutTask`, so the leaked `acquisitionLockTimeoutTask` will corrupt the state later ...
> 2.2) Null `acquisitionLockTimeoutTask` in an "acquired" batch: this can be reproduced by the following sequence.
> - The batch is in transition: the current state is `AVAILABLE` and the rollback state is `ACQUIRED`.
> - `SharePartition#rollbackOrProcessStateUpdates` is still processing the RPC, so it has not yet called `InFlightState#completeStateTransition`.
> - `SharePartition#acquire` assumes the batch is available, so it changes the state from `AVAILABLE` to `ACQUIRED` and creates a new `acquisitionLockTimeoutTask` (see 2.1).
> - `SharePartition#rollbackOrProcessStateUpdates` completes the RPC: it commits the state and cancels the `acquisitionLockTimeoutTask`. That means the batch is in `ACQUIRED` but does not have an `acquisitionLockTimeoutTask`.
> - The next AcknowledgeRequest tries to update the state to `ACKNOWLEDGED`, but it hits an NPE on `acquisitionLockTimeoutTask` [4], and the request then hangs until it times out.
>
>
> [0] https://github.com/apache/kafka/blob/654ebe10f4a5c31e449b2a2ef6c284254ed7dceb/core/src/main/java/kafka/server/share/SharePartition.java#L1649
> [1] https://github.com/apache/kafka/blob/trunk/share/src/main/java/org/apache/kafka/server/share/persister/PersisterStateManager.java#L80
> [2] https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L665
> [3] https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L646
> [4] https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L1663



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
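
For readers following the reproduction steps in 2.2, the two requirements in the report (update state only under the write lock, and refuse to acquire a batch whose transition is still pending) can be illustrated with a minimal sketch. This is not the Kafka code or the actual fix: `InFlightBatchSketch`, `startTransition`, `completeTransition`, `tryAcquire`, and `hasTimeoutTask` are hypothetical names, and a plain `Object` stands in for the `acquisitionLockTimeoutTask`.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical, simplified model of one in-flight batch; not Kafka's SharePartition. */
public class InFlightBatchSketch {
    enum State { AVAILABLE, ACQUIRED, ACKNOWLEDGED }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private State state = State.ACQUIRED;
    private State rollbackState = null;          // non-null while a state update is pending
    private Object timeoutTask = new Object();   // stand-in for acquisitionLockTimeoutTask

    /** Start a transition whose outcome is decided later by an asynchronous callback. */
    void startTransition(State next) {
        lock.writeLock().lock();
        try {
            if (rollbackState != null) {
                throw new IllegalStateException("transition already in progress");
            }
            rollbackState = state;
            state = next;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Callback path: commit or roll back, always under the write lock (point 1). */
    void completeTransition(boolean success) {
        lock.writeLock().lock();
        try {
            if (rollbackState == null) {
                return; // nothing pending
            }
            if (success) {
                if (state != State.ACQUIRED) {
                    timeoutTask = null; // timer is dropped once the batch leaves ACQUIRED
                }
            } else {
                state = rollbackState;  // restore the state held before the transition
            }
            rollbackState = null;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Acquire only when AVAILABLE and no transition is pending (point 2). */
    boolean tryAcquire() {
        lock.writeLock().lock();
        try {
            if (state != State.AVAILABLE || rollbackState != null) {
                return false; // honoring the rollback state avoids 2.1 and 2.2
            }
            state = State.ACQUIRED;
            timeoutTask = new Object();
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    State state() {
        lock.readLock().lock();
        try {
            return state;
        } finally {
            lock.readLock().unlock();
        }
    }

    boolean hasTimeoutTask() {
        lock.readLock().lock();
        try {
            return timeoutTask != null;
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        InFlightBatchSketch batch = new InFlightBatchSketch();
        batch.startTransition(State.AVAILABLE);     // state update pending
        System.out.println(batch.tryAcquire());     // false: transition not yet completed
        batch.completeTransition(true);             // callback commits the update
        System.out.println(batch.tryAcquire());     // true
        System.out.println(batch.hasTimeoutTask()); // true
    }
}
```

In this sketch an acquire attempt during the pending transition is rejected, so no second timer is created, and a batch that ends up `ACQUIRED` always carries a timer. How the real `SharePartition` enforces this is determined by the fix attached to the issue, not by this example.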