hudeqi created KAFKA-14824:
------------------------------
Summary: ReplicaAlterLogDirsThread may cause serious disk usage in
case of unknown exception
Key: KAFKA-14824
URL: https://issues.apache.org/jira/browse/KAFKA-14824
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 3.3.2
Reporter: hudeqi
For ReplicaAlterLogDirsThread, if a partition is marked as failed because of an
unknown exception and fetching for that partition is suspended, the pause on
log cleanup for the partition needs to be lifted. Otherwise, disk usage grows
unexpectedly and can become a serious problem.
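To make the concern concrete, here is a minimal toy model of the bookkeeping
involved. It is only a sketch: the class and method names (CleanupPauseSketch,
startMove, markFailedWithoutResume, markFailedAndResume) are hypothetical and
are not Kafka's real classes or API. It assumes that cleanup is paused when a
log dir move starts and compares leaving the pause in place with lifting it
when the partition is marked as failed.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only: hypothetical names, not Kafka's actual implementation.
final class CleanupPauseSketch {
    // Partitions whose log cleanup is currently paused.
    private final Set<String> cleanupPaused = new HashSet<>();
    // Partitions the log dir mover is still fetching.
    private final Set<String> fetching = new HashSet<>();
    // Partitions that were marked as failed after an unexpected exception.
    private final Set<String> failed = new HashSet<>();

    // Assumption: starting a log dir move pauses cleanup and starts fetching.
    void startMove(String partition) {
        cleanupPaused.add(partition);
        fetching.add(partition);
    }

    // Behaviour described in the report: fetching stops, the pause stays.
    void markFailedWithoutResume(String partition) {
        fetching.remove(partition);
        failed.add(partition);
    }

    // Behaviour the report asks for: fetching stops and the pause is lifted.
    void markFailedAndResume(String partition) {
        markFailedWithoutResume(partition);
        cleanupPaused.remove(partition);
    }

    boolean isCleanupPaused(String partition) {
        return cleanupPaused.contains(partition);
    }

    boolean isFetching(String partition) {
        return fetching.contains(partition);
    }

    boolean isFailed(String partition) {
        return failed.contains(partition);
    }

    public static void main(String[] args) {
        CleanupPauseSketch stuck = new CleanupPauseSketch();
        stuck.startMove("topic-0");
        stuck.markFailedWithoutResume("topic-0");
        System.out.println("pause kept:   cleanup paused = "
                + stuck.isCleanupPaused("topic-0"));

        CleanupPauseSketch resumed = new CleanupPauseSketch();
        resumed.startMove("topic-0");
        resumed.markFailedAndResume("topic-0");
        System.out.println("pause lifted: cleanup paused = "
                + resumed.isCleanupPaused("topic-0"));
    }
}
```

In the first case the partition is neither fetched nor cleaned, which is the
state in which the log can only grow.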
For example, we hit the following case in a production environment (running
Kafka 2.5.1). A log dir rebalance was performed on the broker that is the
leader of this partition. After the future log was created successfully,
ReplicaAlterLogDirsThread started fetching and, because the first fetch was out
of range, reset and truncated to the leader's log start offset. At the same
time, the partition leader was processing a leaderAndIsrRequest and the leader
epoch was updated, so ReplicaAlterLogDirsThread hit FENCED_LEADER_EPOCH and the
'partitionStates' entry of the partition was cleaned up. Concurrently, the
thread processing the leaderAndIsrRequest was executing the logic that adds the
partition to ReplicaAlterLogDirsThread. Here, the offset set by
InitialFetchState is the hw (high watermark) of the leader. When
ReplicaAlterLogDirsThread then ran processFetchRequest, it threw
"java.lang.IllegalStateException: Offset mismatch for the future replica
anti_fraud.data_collector.anticrawler_live-54: fetched offset = 4979659327, log
end offset = 4918576434.". As a result, ReplicaAlterLogDirsThread no longer
fetched the partition, and because log cleanup for the partition was still
paused, the disk usage of the corresponding broker grew without bound, causing
serious problems.
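The failing check can be illustrated with a small standalone sketch. This is
not Kafka's code: OffsetMismatchSketch and checkFetchOffset are hypothetical
names, and the sketch only assumes, based on the exception text quoted in this
report, that the fetched offset must equal the future log's end offset. The
message format and the numbers are the ones from the report.

```java
// Standalone illustration; hypothetical names, not Kafka's actual implementation.
final class OffsetMismatchSketch {

    // Assumption: fetched data is appended only if it starts exactly at the
    // future log's end offset; otherwise an IllegalStateException is thrown.
    static void checkFetchOffset(String partition, long fetchOffset, long logEndOffset) {
        if (fetchOffset != logEndOffset) {
            throw new IllegalStateException(String.format(
                    "Offset mismatch for the future replica %s: "
                            + "fetched offset = %d, log end offset = %d.",
                    partition, fetchOffset, logEndOffset));
        }
    }

    public static void main(String[] args) {
        String partition = "anti_fraud.data_collector.anticrawler_live-54";
        // Fetch offset taken from the leader's hw, as described in the report.
        long fetchOffset = 4979659327L;
        // End offset of the future log after the earlier truncation.
        long futureLogEndOffset = 4918576434L;
        try {
            checkFetchOffset(partition, fetchOffset, futureLogEndOffset);
        } catch (IllegalStateException e) {
            // At this point the partition would be marked as failed.
            System.out.println(e.getMessage());
        }
    }
}
```

The two offsets differ because they come from different places: one from the
leader's hw at the time the partition was re-added, the other from the future
log that had just been truncated.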
I found that trunk has already fixed, in KAFKA-9087, the bug that can cause
ReplicaAlterLogDirsThread to hit the "Offset mismatch" error and stop fetching.
However, I don't know whether other unknown exceptions could still occur and,
with the current logic, lead to the same disk cleanup failure.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)