Chia-Ping Tsai created KAFKA-18084:
--------------------------------------

             Summary: Null and leaked AcquisitionLockTimerTask causes hanging AcknowledgeRequest and corrupted state of batch
                 Key: KAFKA-18084
                 URL: https://issues.apache.org/jira/browse/KAFKA-18084
             Project: Kafka
          Issue Type: Sub-task
            Reporter: Chia-Ping Tsai
            Assignee: Chia-Ping Tsai

I noticed some critical issues while reading the share-related code:

1) `SharePartition#rollbackOrProcessStateUpdates` does not hold the write lock while updating state, so it can result in a race condition. Note that `DefaultStatePersister` uses an internal thread [1] to complete those callbacks.

2) `SharePartition#acquire` does not honor the rollback state [2][3]. This causes two issues.

2.1) Leaked `acquisitionLockTimeoutTask`: `SharePartition#acquire` creates a new `acquisitionLockTimeoutTask` for the "available" batch. However, an available batch in transition already has an `acquisitionLockTimeoutTask`, so the leaked `acquisitionLockTimeoutTask` will corrupt the state later ...

2.2) Null `acquisitionLockTimeoutTask` in an "acquired" batch: this can be reproduced by the following sequence.
- the batch is in transition
- the current state is `AVAILABLE` and the rollback state is `ACQUIRED`
- `SharePartition#rollbackOrProcessStateUpdates` is still processing the RPC, so it has not yet called `InFlightState#completeStateTransition`
- `SharePartition#acquire` assumes the batch is available, so it changes the state from `AVAILABLE` to `ACQUIRED` and creates a new `acquisitionLockTimeoutTask` (see 2.1)
- `SharePartition#rollbackOrProcessStateUpdates` completes the RPC
- it commits the state and cancels the `acquisitionLockTimeoutTask`
- that means the batch is in `ACQUIRED` but has no `acquisitionLockTimeoutTask`
- the next AcknowledgeRequest tries to update the state to `ACKNOWLEDGED`, but it hits an NPE on `acquisitionLockTimeoutTask` [4], and the request then hangs until it times out

[0] https://github.com/apache/kafka/blob/654ebe10f4a5c31e449b2a2ef6c284254ed7dceb/core/src/main/java/kafka/server/share/SharePartition.java#L1649
[1] https://github.com/apache/kafka/blob/trunk/share/src/main/java/org/apache/kafka/server/share/persister/PersisterStateManager.java#L80
[2] https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L665
[3] https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L646
[4] https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L1663

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
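
To illustrate the two points in the description (state updates made under the write lock, and `acquire` refusing a batch that still has a rollback state), here is a minimal, self-contained sketch. It is not the Kafka code and not the actual fix: `InFlightBatchSketch`, `TimeoutTask`, `startTransition`, `tryAcquire` and `completeTransition` are hypothetical names invented for this example, and the state machine is reduced to the three states mentioned above.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified model of one in-flight batch; not Kafka's InFlightState.
final class InFlightBatchSketch {
    enum State { AVAILABLE, ACQUIRED, ACKNOWLEDGED }

    // Stand-in for the acquisition lock timeout task; it only tracks cancellation.
    static final class TimeoutTask {
        private boolean cancelled;
        void cancel() { cancelled = true; }
        boolean isCancelled() { return cancelled; }
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private State state = State.ACQUIRED;
    private TimeoutTask timeoutTask = new TimeoutTask();
    private State rollbackState;      // non-null while a transition is pending
    private TimeoutTask rollbackTask; // task to restore if the transition is rolled back

    // Tentatively move to newState while a (simulated) persister RPC is pending.
    void startTransition(State newState) {
        lock.writeLock().lock();
        try {
            if (rollbackState != null) {
                throw new IllegalStateException("transition already in flight");
            }
            rollbackState = state;
            rollbackTask = timeoutTask;
            state = newState;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Acquire only if the batch is AVAILABLE and no transition is pending.
    boolean tryAcquire() {
        lock.writeLock().lock();
        try {
            if (rollbackState != null || state != State.AVAILABLE) {
                return false; // honor the rollback state: never create a second task
            }
            state = State.ACQUIRED;
            timeoutTask = new TimeoutTask();
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Commit or roll back the pending transition, also under the write lock.
    void completeTransition(boolean commit) {
        lock.writeLock().lock();
        try {
            if (rollbackState == null) {
                return; // nothing pending
            }
            if (commit) {
                if (rollbackTask != null) {
                    rollbackTask.cancel(); // the old acquisition lock is no longer needed
                }
                timeoutTask = null;
            } else {
                state = rollbackState;
                timeoutTask = rollbackTask;
            }
            rollbackState = null;
            rollbackTask = null;
        } finally {
            lock.writeLock().unlock();
        }
    }

    State state() {
        lock.readLock().lock();
        try { return state; } finally { lock.readLock().unlock(); }
    }

    boolean hasActiveTimeoutTask() {
        lock.readLock().lock();
        try {
            return timeoutTask != null && !timeoutTask.isCancelled();
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        InFlightBatchSketch batch = new InFlightBatchSketch();
        batch.startTransition(State.AVAILABLE);
        System.out.println("acquire during transition: " + batch.tryAcquire());
        batch.completeTransition(true);
        System.out.println("acquire after commit: " + batch.tryAcquire());
    }
}
```

In this sketch an acquire attempt made while the transition is pending is rejected, so the commit path can only ever cancel the task that belonged to the previous acquisition, and a batch in `ACQUIRED` always carries a live timeout task. Whether the real code should reject, wait, or retry in that situation is a design decision this example does not try to settle.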