Ethan Rose created HDDS-11136:
---------------------------------
Summary: Some containers affected by HDDS-8129 may still be in the
DELETING state incorrectly
Key: HDDS-11136
URL: https://issues.apache.org/jira/browse/HDDS-11136
Project: Apache Ozone
Issue Type: Bug
Components: Ozone Datanode, SCM
Reporter: Ethan Rose
Assignee: Siddhant Sangwan
The bug described in HDDS-8129 would cause containers to have block counts
lower than their correct value. In versions of the code before the issue was
fixed, this could cause the block count to reach zero too early, so SCM would
move the containers to DELETING state, issue delete commands to datanodes, and
move containers to DELETED when the replicas were gone. However, between the
datanodes sending a heartbeat that reported a zero block count and SCM sending
back delete commands, the block deleting service could run and make the
container's block count negative on the datanode. In this case, when the
datanode gets the delete command, it rejects it, even in the old version
before the fixes, because the counter is [not equal to
zero|https://github.com/apache/ozone/blob/08263b44ce1422711e1fa70797bf349e4bb3f56b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java#L1181]
(this code link is to a version before the deletion path was fixed).
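
The race above can be illustrated with a minimal sketch. It is not the code in KeyValueHandler; the class and method names are hypothetical, and the counter values are illustrative. It only shows why a counter that has gone negative fails a "block count equals zero" check:

```java
/** Hypothetical stand-in for the datanode-side emptiness check; not Ozone's actual code. */
final class ReplicaDeleteCheck {
    private ReplicaDeleteCheck() { }

    /** Assumption: a replica is treated as deletable only when its counter is exactly zero. */
    static boolean acceptsDelete(long blockCount) {
        return blockCount == 0;
    }

    /** Replays the race described above with illustrative counter values. */
    static boolean raceOutcome() {
        long blockCount = 0;  // undercounted value already reported to SCM in a heartbeat
        blockCount -= 1;      // block deleting service runs before the delete command arrives
        return acceptsDelete(blockCount); // counter is now -1, so the command is rejected
    }
}
```

Under these assumptions, SCM has already moved the container to DELETING based on the zero it saw, while the datanode refuses every delete command it receives afterwards.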
These containers are stuck: SCM's state for them is DELETING and it keeps
resending delete commands, but datanodes reject the deletion and the container
may still have valid data. Containers that entered this state in old versions
have remained in this state indefinitely, even after the fixes. This is because
the delete commands are being sent based on SCM's DELETING state for the
container, not the status of its block content as reported by datanodes after
the fixes. The fixes prevent containers from moving from CLOSED to DELETING
incorrectly but do nothing for containers already in that state.
Since DELETING containers are not processed by the replication manager, we need
a way for SCM to move their state back to CLOSED if the datanode rejects the
deletion to fully recover from the effects of HDDS-8129.
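
One possible shape for that recovery is sketched below. This is only an illustration of the proposed transition, not a design for the fix: the class, enum, and method names are hypothetical, and using a non-zero reported block count as the sign that a datanode rejected the delete is an assumption.

```java
import java.util.List;

/** Hypothetical sketch of the proposed SCM-side recovery; none of these names are Ozone APIs. */
final class DeletingRecoverySketch {
    private DeletingRecoverySketch() { }

    /** Simplified subset of the container lifecycle states named in this issue. */
    enum State { CLOSED, DELETING, DELETED }

    /**
     * Decides the next SCM state from the block counts that remaining replicas report.
     * Assumption: a non-zero reported count stands in for "the datanode rejected the delete".
     */
    static State reconcile(State current, List<Long> replicaBlockCounts) {
        if (current != State.DELETING) {
            return current; // only DELETING containers are reconsidered
        }
        if (replicaBlockCounts.isEmpty()) {
            return State.DELETED; // all replicas are gone, so deletion completed
        }
        for (long count : replicaBlockCounts) {
            if (count != 0) {
                return State.CLOSED; // a replica may still hold data, so stop deleting
            }
        }
        return State.DELETING; // replicas look empty, so keep sending delete commands
    }
}
```

Returning the container to CLOSED in this sketch is what would make it visible to the replication manager again.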