[
https://issues.apache.org/jira/browse/KAFKA-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17901385#comment-17901385
]
Abhinav Dixit commented on KAFKA-18084:
---------------------------------------
[~chia7712], thanks for pointing out these issues.
Regarding issue 2
{code:java}
`SharePartition#acquire` does not honor the rollback state{code}
the change for considering rollbackState should also be made in
[https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L1228]
, right?
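To make the question concrete, here is a minimal sketch of what "considering the rollback state" in acquire could look like. It is not the actual `SharePartition` code: `RollbackAwareAcquireSketch`, `BatchState`, `hasOngoingTransition`, `tryAcquire`, `release` and `onPersisterResult` are hypothetical names, the timer task is modelled as a plain `Object`, and the whole thing only assumes that a non-null rollback state means a persister call is still in flight.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified model; none of these names are taken from Kafka.
class RollbackAwareAcquireSketch {

    enum RecordState { AVAILABLE, ACQUIRED, ACKNOWLEDGED }

    // Stand-in for the per-batch in-flight state.
    static final class BatchState {
        RecordState state = RecordState.ACQUIRED;
        // Assumption: non-null while a state transition awaits the persister result.
        RecordState rollbackState = null;
        // Stand-in for the acquisition lock timer task of an acquired batch.
        Object acquisitionLockTimeoutTask = new Object();

        boolean hasOngoingTransition() {
            return rollbackState != null;
        }

        void startTransition(RecordState target) {
            rollbackState = state;
            state = target;
        }

        void completeTransition(boolean commit) {
            if (commit) {
                // Committed: the batch left ACQUIRED, so its timer is dropped.
                acquisitionLockTimeoutTask = null;
            } else {
                // Rolled back: restore the previous state, keep the existing timer.
                state = rollbackState;
            }
            rollbackState = null;
        }
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    final BatchState batch = new BatchState();

    // Acquire only if the batch is AVAILABLE and no transition is in flight.
    boolean tryAcquire() {
        lock.writeLock().lock();
        try {
            if (batch.state != RecordState.AVAILABLE || batch.hasOngoingTransition()) {
                return false;
            }
            batch.state = RecordState.ACQUIRED;
            batch.acquisitionLockTimeoutTask = new Object();
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Start moving the batch to AVAILABLE, pending the persister result.
    void release() {
        lock.writeLock().lock();
        try {
            batch.startTransition(RecordState.AVAILABLE);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Persister callback: commit or roll back, also under the write lock (issue 1).
    void onPersisterResult(boolean commit) {
        lock.writeLock().lock();
        try {
            batch.completeTransition(commit);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

In this sketch, calling `release()` and then `tryAcquire()` before `onPersisterResult(...)` returns false, so no second timer task is created for a batch that is still in transition, and a later commit cannot leave an acquired batch without one.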
PS - I can pick up this ticket as well, once everyone agrees on the above
changes [~schofielaj] [~apoorvmittal10]
cc - [~frankvicky]
> Null and leaked AcquisitionLockTimerTask causes hanging AcknowledgeRequest
> and corrupted state of batch
> -------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-18084
> URL: https://issues.apache.org/jira/browse/KAFKA-18084
> Project: Kafka
> Issue Type: Sub-task
> Reporter: Chia-Ping Tsai
> Assignee: Chia-Ping Tsai
> Priority: Blocker
>
> I noticed some critical issues while reading the share-related code:
> 1)
> `SharePartition#rollbackOrProcessStateUpdates` does not hold the write lock
> while updating state, so it could result in a race condition. Note that the
> `DefaultStatePersister` uses an internal thread [1] to complete those
> callbacks.
> 2)
> `SharePartition#acquire` does not honor the rollback state [2][3]. This
> causes two issues.
> 2.1) leaked `acquisitionLockTimeoutTask` - `SharePartition#acquire` creates a
> new `acquisitionLockTimeoutTask` for the "available" batch. However, the
> available batch in transition already has an `acquisitionLockTimeoutTask`, so
> the leaked `acquisitionLockTimeoutTask` will corrupt the state later ...
> 2.2) null `acquisitionLockTimeoutTask` in an "acquired" batch - this can be
> reproduced by the following sequence.
> - the batch is in transition - current state is `AVAILABLE` and rollback
> state is `ACQUIRED`
> - `SharePartition#rollbackOrProcessStateUpdates` is still processing the RPC,
> so it has not yet called `InFlightState#completeStateTransition`
> - `SharePartition#acquire` assumes the batch is available, so it changes the
> state from `AVAILABLE` to `ACQUIRED` and creates a new
> `acquisitionLockTimeoutTask` (see 2.1)
> - `SharePartition#rollbackOrProcessStateUpdates` completes the RPC - it commits
> the state and cancels the `acquisitionLockTimeoutTask` - that means the batch
> is in `ACQUIRED` but it does not have an `acquisitionLockTimeoutTask`
> - the next AcknowledgeRequest tries to update the state to `ACKNOWLEDGED` but
> it encounters an NPE on `acquisitionLockTimeoutTask` [4], and then the request
> hangs until timeout
>
>
> [0]
> https://github.com/apache/kafka/blob/654ebe10f4a5c31e449b2a2ef6c284254ed7dceb/core/src/main/java/kafka/server/share/SharePartition.java#L1649
> [1]
> https://github.com/apache/kafka/blob/trunk/share/src/main/java/org/apache/kafka/server/share/persister/PersisterStateManager.java#L80
> [2]
> https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L665
> [3]
> https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L646
> [4]
> https://github.com/apache/kafka/blob/trunk/core/src/main/java/kafka/server/share/SharePartition.java#L1663
--
This message was sent by Atlassian Jira
(v8.20.10#820010)