[jira] [Commented] (KAFKA-18084) Null and leaked AcquisitionLockTimerTask causes hanging AcknowledgeRequest and corrupted state of batch

Andrew Schofield (Jira) Mon, 25 Nov 2024 10:47:06 -0800


    [ 
https://issues.apache.org/jira/browse/KAFKA-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900991#comment-17900991
 ]


Andrew Schofield commented on KAFKA-18084:
------------------------------------------

Hi [~chia7712], thanks for the review.

I think you are correct in your analysis that the execution of the logic when 
the persister operation completes does not hold the write lock as it performs 
some of the state transitions.

I wonder whether there's a similar kind of problem in 
`SharePartition#maybeInitialize` where part of the logic executes when the 
persister operation completes, and it doesn't obtain the lock to complete the 
initialization.
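
For illustration only, here is a minimal sketch of what taking the write lock 
in the completion path could look like. The class, enum and method names are 
made up for this example and are not the actual `SharePartition` code; the 
only real APIs used are `CompletableFuture` and `ReentrantReadWriteLock`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a share partition; not the real Kafka class.
class LockedCompletionSketch {
    enum State { AVAILABLE, ACQUIRED, ACKNOWLEDGED }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private State state = State.ACQUIRED;

    // Starts a transition whose outcome depends on an asynchronous persister write.
    CompletableFuture<Void> acknowledge(CompletableFuture<Boolean> persisterResult) {
        // The callback may run on the persister's own thread, so it takes the
        // write lock itself instead of relying on a lock held by the caller.
        return persisterResult.thenAccept(success -> {
            lock.writeLock().lock();
            try {
                // Commit on success, otherwise roll back to the previous state.
                state = success ? State.ACKNOWLEDGED : State.ACQUIRED;
            } finally {
                lock.writeLock().unlock();
            }
        });
    }

    // Readers take the read lock so they never observe a half-applied update.
    State state() {
        lock.readLock().lock();
        try {
            return state;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

The same shape would apply to any logic that runs when the persister future 
completes, including the initialization path mentioned above.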

> Null and leaked AcquisitionLockTimerTask causes hanging AcknowledgeRequest 
> and corrupted state of batch
> -------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-18084
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18084
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Chia-Ping Tsai
>            Assignee: Chia-Ping Tsai
>            Priority: Blocker
>
> I noticed some critical issues while reading the share-related code:
> 1)
> `SharePartition#rollbackOrProcessStateUpdates` does not hold the write lock 
> while updating state, so it could result in a race condition. Note that 
> `DefaultStatePersister` uses an internal thread [1] to complete those 
> callbacks.
> 2) 
> `SharePartition#acquire` does not honor the rollback state [2][3]. This 
> causes two issues.
> 2.1) Leaked `acquisitionLockTimeoutTask` - `SharePartition#acquire` creates a 
> new `acquisitionLockTimeoutTask` for the "available" batch; however, the 
> available batch in transition already has an `acquisitionLockTimeoutTask`, so 
> the leaked `acquisitionLockTimeoutTask` will corrupt the state later ...
> 2.2) Null `acquisitionLockTimeoutTask` in an "acquired" batch - this can be 
> reproduced by the following sequence:
> - the batch is in transition - current state is `AVAILABLE` and rollback 
> state is `ACQUIRED`
> - `SharePartition#rollbackOrProcessStateUpdates` is still processing the RPC, 
> so it has not yet called `InFlightState#completeStateTransition`
> - `SharePartition#acquire` assumes the batch is available, so it changes the 
> state from `AVAILABLE` to `ACQUIRED` and creates a new 
> `acquisitionLockTimeoutTask` (see 2.1)
> - `SharePartition#rollbackOrProcessStateUpdates` completes the RPC - it 
> commits the state and cancels the `acquisitionLockTimeoutTask` - which means 
> the batch is in `ACQUIRED` but does not have an `acquisitionLockTimeoutTask` 
> - the next AcknowledgeRequest tries to update the state to `ACKNOWLEDGED` but 
> it hits an NPE on `acquisitionLockTimeoutTask` [4], and the request then 
> hangs until it times out
>  
>  
> [0] 
> https://github.com/apache/kafka/blob/654ebe10f4a5c31e449b2a2ef6c284254ed7dceb/core/src/main/java/kafka/server/share/SharePartition.java#L1649
> [1] 
> https://github.com/apache/kafka/blob/trunk/share/src/main/java/org/apache/kafka/server/share/persister/PersisterStateManager.java#L80
> [2] 
> https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L665
> [3] 
> https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L646
> [4] 
> https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L1663
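
The sequence in 2.2 of the quoted description can be replayed with a small, 
single-threaded model. This is a simplified sketch under assumptions: the 
class, fields and the `honorRollback` flag are hypothetical and only mimic the 
described behaviour; they are not the Kafka implementation. With the flag off, 
the batch ends up `ACQUIRED` with no timer task; with it on, `acquire` skips a 
batch that is still in transition.

```java
// Simplified, hypothetical model of one in-flight batch; not the Kafka classes.
class RollbackAwareAcquireSketch {
    enum State { AVAILABLE, ACQUIRED }

    State state = State.ACQUIRED;
    State rollbackState = null;        // non-null while a persister write is pending
    Object timeoutTask = new Object(); // stands in for the acquisition lock timer task

    // Step 1: a release starts; the state changes before the write is confirmed.
    void startRelease() {
        rollbackState = state;
        state = State.AVAILABLE;
    }

    // Step 2: acquire. With honorRollback=false it looks only at the current state.
    boolean acquire(boolean honorRollback) {
        if (state != State.AVAILABLE) return false;
        if (honorRollback && rollbackState != null) return false; // still in transition
        state = State.ACQUIRED;
        timeoutTask = new Object(); // replaces (leaks) the earlier task
        return true;
    }

    // Step 3: the persister write completes; commit and drop the timer task.
    void completeRelease() {
        rollbackState = null;
        timeoutTask = null;
    }
}
```

Calling `startRelease()`, `acquire(false)` and `completeRelease()` in that 
order leaves `state == ACQUIRED` and `timeoutTask == null`, which is the 
condition under which a later acknowledge would dereference a null task.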



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
