[jira] [Commented] (KAFKA-18084) Null and leaked AcquisitionLockTimerTask causes hanging AcknowledgeRequest and corrupted state of batch

Abhinav Dixit (Jira) Tue, 26 Nov 2024 22:56:05 -0800


    [ 
https://issues.apache.org/jira/browse/KAFKA-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17901385#comment-17901385
 ]


Abhinav Dixit commented on KAFKA-18084:
---------------------------------------

[~chia7712], thanks for pointing out these issues.

Regarding issue 2
{code:java}
`SharePartition#acquire` does not honor the rollback state{code}
The change to take rollbackState into account should also be made at 
[https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L1228]
 , right?
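
To make the question concrete, here is a minimal sketch of what "honoring the rollback state" could look like. It is a simplified model and not the actual SharePartition code: `BatchState`, `TimeoutTask`, `startTransition`, `completeTransition` and `tryAcquire` are hypothetical names, and I am assuming that a non-null rollback state is what marks a batch as being in transition.

```java
/** Hypothetical, simplified model; NOT the real SharePartition/InFlightState code. */
final class AcquireGuardSketch {

    enum RecordState { AVAILABLE, ACQUIRED, ACKNOWLEDGED }

    /** Stand-in for the acquisition lock timeout task. */
    static final class TimeoutTask {
        private boolean cancelled;
        void cancel() { cancelled = true; }
        boolean isCancelled() { return cancelled; }
    }

    static final class BatchState {
        RecordState state = RecordState.ACQUIRED;
        // Assumption: non-null while a persister write is still in flight.
        RecordState rollbackState;
        TimeoutTask timeoutTask = new TimeoutTask();

        /** Begin a transition, remembering the state to restore on failure. */
        void startTransition(RecordState next) {
            rollbackState = state;
            state = next;
        }

        /** Finish the transition: commit drops the old timer, failure rolls back. */
        void completeTransition(boolean commit) {
            if (commit) {
                timeoutTask.cancel();
                timeoutTask = null;
            } else {
                state = rollbackState;
            }
            rollbackState = null;
        }

        /** Acquire only if the batch is AVAILABLE and no transition is pending. */
        boolean tryAcquire() {
            if (state != RecordState.AVAILABLE || rollbackState != null) {
                return false; // still in transition: do not create a second timer
            }
            state = RecordState.ACQUIRED;
            timeoutTask = new TimeoutTask();
            return true;
        }
    }

    public static void main(String[] args) {
        BatchState batch = new BatchState();
        batch.startTransition(RecordState.AVAILABLE);
        System.out.println("acquire during transition: " + batch.tryAcquire());
        batch.completeTransition(true);
        System.out.println("acquire after commit: " + batch.tryAcquire());
    }
}
```

With a guard like this, the acquire path never creates a second timer task for a batch whose transition has not completed, which is the precondition for both 2.1 and 2.2.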

PS - I can pick up this ticket as well, once everyone agrees on the above 
changes [~schofielaj] [~apoorvmittal10] 
cc - [~frankvicky] 

> Null and leaked AcquisitionLockTimerTask causes hanging AcknowledgeRequest 
> and corrupted state of batch
> -------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-18084
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18084
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Chia-Ping Tsai
>            Assignee: Chia-Ping Tsai
>            Priority: Blocker
>
> I noticed some critical issues while reading the share-related code:
> 1)
> `SharePartition#rollbackOrProcessStateUpdates` does not hold the write lock 
> while updating state, so it could result in a race condition. Note that 
> `DefaultStatePersister` uses an internal thread [1] to complete those 
> callbacks.
> 2) 
> `SharePartition#acquire` does not honor the rollback state [2][3]. This 
> causes two issues.
> 2.1) leaked `acquisitionLockTimeoutTask` - `SharePartition#acquire` creates a 
> new `acquisitionLockTimeoutTask` for the "available" batch. However, the 
> available batch in transition already has an `acquisitionLockTimeoutTask`, so 
> the leaked `acquisitionLockTimeoutTask` will corrupt the state later ...
> 2.2) null `acquisitionLockTimeoutTask` in an "acquired" batch - this can be 
> reproduced by the following sequence.
> - the batch is in transition - current state is `AVAILABLE` and rollback 
> state is `ACQUIRED`
> - `SharePartition#rollbackOrProcessStateUpdates` is still processing the RPC, 
> so it has not yet called `InFlightState#completeStateTransition`
> - `SharePartition#acquire` assumes the batch is available, so it changes the 
> state from `AVAILABLE` to `ACQUIRED` and creates a new 
> `acquisitionLockTimeoutTask` (see 2.1)
> - `SharePartition#rollbackOrProcessStateUpdates` completes the RPC - it commits 
> the state and cancels the `acquisitionLockTimeoutTask` - that means the batch 
> is in `ACQUIRED` but does not have an `acquisitionLockTimeoutTask` 
> - the next AcknowledgeRequest tries to update the state to `ACKNOWLEDGED`, but 
> it hits an NPE on the null `acquisitionLockTimeoutTask` [4], and the request 
> then hangs until it times out
>  
>  
> [0] 
> https://github.com/apache/kafka/blob/654ebe10f4a5c31e449b2a2ef6c284254ed7dceb/core/src/main/java/kafka/server/share/SharePartition.java#L1649
> [1] 
> https://github.com/apache/kafka/blob/trunk/share/src/main/java/org/apache/kafka/server/share/persister/PersisterStateManager.java#L80
> [2] 
> https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L665
> [3] 
> https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L646
> [4] 
> https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L1663
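
For issue 1 in the description above, one possible shape of the fix is to take the write lock inside the completion callback, since that callback may run on the persister's thread. The sketch below only illustrates that locking pattern under stated assumptions: `WriteLockCallbackSketch`, `updateState`, `state` and `inTransition` are hypothetical names, the persister is simulated with a plain `CompletableFuture<Boolean>`, and states are plain strings. It is not the actual SharePartition implementation.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch of guarding a callback-driven state update with a write lock. */
final class WriteLockCallbackSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private String state = "ACQUIRED";
    private String rollbackState; // non-null while the simulated write is in flight

    /** Starts a transition and finishes it once the simulated persister result arrives. */
    CompletableFuture<Void> updateState(String next, CompletableFuture<Boolean> persisterResult) {
        lock.writeLock().lock();
        try {
            rollbackState = state;
            state = next;
        } finally {
            lock.writeLock().unlock();
        }
        // The callback may run on another thread, so it takes the write lock itself.
        // (Exceptional completion is not handled in this sketch.)
        return persisterResult.thenAccept(success -> {
            lock.writeLock().lock();
            try {
                if (!success) {
                    state = rollbackState; // roll back on a failed write
                }
                rollbackState = null;      // the transition is complete either way
            } finally {
                lock.writeLock().unlock();
            }
        });
    }

    String state() {
        lock.readLock().lock();
        try {
            return state;
        } finally {
            lock.readLock().unlock();
        }
    }

    boolean inTransition() {
        lock.readLock().lock();
        try {
            return rollbackState != null;
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        WriteLockCallbackSketch partition = new WriteLockCallbackSketch();
        // Simulate a persister write that fails on a different thread.
        CompletableFuture<Boolean> persisterResult = CompletableFuture.supplyAsync(() -> false);
        partition.updateState("AVAILABLE", persisterResult).join();
        System.out.println("state after failed write: " + partition.state());
    }
}
```

Readers such as the acquire path would take the same lock, so they would observe either the state before the callback ran or the state after it, never a half-applied update.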



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
