Chia-Ping Tsai created KAFKA-18084:
--------------------------------------

             Summary: Null and leaked AcquisitionLockTimerTask causes hanging 
AcknowledgeRequest and corrupted state of batch
                 Key: KAFKA-18084
                 URL: https://issues.apache.org/jira/browse/KAFKA-18084
             Project: Kafka
          Issue Type: Sub-task
            Reporter: Chia-Ping Tsai
            Assignee: Chia-Ping Tsai


I noticed some critical issues while reading the share-related code.

1)
`SharePartition#rollbackOrProcessStateUpdates` does not hold the write lock while 
updating state, so it could result in a race condition. Note that the 
`DefaultStatePersister` uses an internal thread [1] to complete those callbacks.
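
To illustrate point 1, here is a minimal sketch of a callback that takes the write lock before it mutates state. `LockedStateSketch` and its methods are hypothetical stand-ins, not the real `SharePartition` code, and the `CompletableFuture` merely plays the role of the persister's internal thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified stand-in for SharePartition; not the real Kafka class.
class LockedStateSketch {
    enum State { AVAILABLE, ACQUIRED, ACKNOWLEDGED }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private State state = State.ACQUIRED;

    // Runs on the thread that completes the persister future, so it must
    // take the write lock itself before touching shared state.
    void rollbackOrProcess(State target, Throwable error) {
        lock.writeLock().lock();
        try {
            if (error == null) {
                state = target;   // commit; on error the previous state is kept
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    State state() {
        lock.readLock().lock();
        try {
            return state;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Simulates a persister write that completes on another thread.
    static State demo() {
        LockedStateSketch partition = new LockedStateSketch();
        CompletableFuture<Void> persisted = CompletableFuture.runAsync(() -> { });
        persisted
            .whenComplete((ignored, error) ->
                partition.rollbackOrProcess(State.ACKNOWLEDGED, error))
            .join();
        return partition.state();
    }
}
```

Taking the lock inside the callback, rather than around the call that schedules it, matters because the callback may run on a different thread from the one that started the update.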

2) 

`SharePartition#acquire` does not honor the rollback state [2][3]. This causes 
two issues.

2.1) leaked `acquisitionLockTimeoutTask` - `SharePartition#acquire` creates a new 
`acquisitionLockTimeoutTask` for the "available" batch. However, an available 
batch in transition already has an `acquisitionLockTimeoutTask`, so the leaked 
`acquisitionLockTimeoutTask` will corrupt the state later ...

2.2) null `acquisitionLockTimeoutTask` in an "acquired" batch - this can be 
reproduced by the following sequence:
- the batch is in transition - the current state is `AVAILABLE` and the rollback 
state is `ACQUIRED`
- `SharePartition#rollbackOrProcessStateUpdates` is still processing the RPC, so 
it has not yet called `InFlightState#completeStateTransition`
- `SharePartition#acquire` assumes the batch is available, so it changes the 
state from `AVAILABLE` to `ACQUIRED` and creates a new 
`acquisitionLockTimeoutTask` (see 2.1)
- `SharePartition#rollbackOrProcessStateUpdates` completes the RPC - it commits 
the state and cancels the `acquisitionLockTimeoutTask` - which means the batch 
is in `ACQUIRED` but does not have an `acquisitionLockTimeoutTask`
- the next AcknowledgeRequest tries to update the state to `ACKNOWLEDGED`, but 
it hits an NPE on the null `acquisitionLockTimeoutTask` [4], and then the 
request hangs until it times out
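
The sequence in 2.2 can be replayed single-threaded with a simplified model. Everything in this sketch is hypothetical: `InFlightBatchSketch` only mirrors the names used above, it is not the real `SharePartition`/`InFlightState` code, and `acquireHonoringRollback` is one possible guard, not necessarily the fix that will be committed.

```java
import java.util.TimerTask;

// Simplified, hypothetical model of one in-flight batch.
class InFlightBatchSketch {
    enum State { AVAILABLE, ACQUIRED, ACKNOWLEDGED }

    State state = State.ACQUIRED;
    State rollbackState;                  // non-null while a transition is pending
    TimerTask timeoutTask = newTask();    // timer of the original acquisition

    static TimerTask newTask() {
        return new TimerTask() {
            @Override public void run() { }
        };
    }

    // Begin a transition whose persister write has not completed yet.
    void startTransition(State next) {
        rollbackState = state;
        state = next;
    }

    // Behaviour described in 2): only the current state is checked.
    boolean acquireIgnoringRollback() {
        if (state != State.AVAILABLE) return false;
        state = State.ACQUIRED;
        timeoutTask = newTask();          // the earlier task is overwritten and leaks
        return true;
    }

    // Assumed guard: skip batches that are still in transition.
    boolean acquireHonoringRollback() {
        if (rollbackState != null) return false;
        return acquireIgnoringRollback();
    }

    // Persister write succeeded: commit the state and cancel the timer.
    void commitTransition() {
        rollbackState = null;
        if (timeoutTask != null) timeoutTask.cancel();
        timeoutTask = null;
    }

    // Acknowledge path: assumes an ACQUIRED batch always has a timer.
    void acknowledge() {
        timeoutTask.cancel();             // throws NPE if the timer was cleared
        state = State.ACKNOWLEDGED;
    }

    // Replays the steps of 2.2; returns true if acknowledge() hits the NPE.
    static boolean replay(boolean honorRollback) {
        InFlightBatchSketch batch = new InFlightBatchSketch();
        batch.startTransition(State.AVAILABLE);
        if (honorRollback) {
            batch.acquireHonoringRollback();
        } else {
            batch.acquireIgnoringRollback();
        }
        batch.commitTransition();
        if (batch.state != State.ACQUIRED) return false;  // nothing to acknowledge
        try {
            batch.acknowledge();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }
}
```

In this model `replay(false)` ends with an `ACQUIRED` batch that has no timer, so the acknowledge step throws, while `replay(true)` leaves the batch `AVAILABLE` after the commit and never reaches that path.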
 

 

[0] 
https://github.com/apache/kafka/blob/654ebe10f4a5c31e449b2a2ef6c284254ed7dceb/core/src/main/java/kafka/server/share/SharePartition.java#L1649

[1] 
https://github.com/apache/kafka/blob/trunk/share/src/main/java/org/apache/kafka/server/share/persister/PersisterStateManager.java#L80

[2] 
https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L665

[3] 
https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L646

[4] 
https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L1663



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
