Ethan Rose created HDDS-11136:
---------------------------------

             Summary: Some containers affected by HDDS-8129 may still be in the DELETING state incorrectly
                 Key: HDDS-11136
                 URL: https://issues.apache.org/jira/browse/HDDS-11136
             Project: Apache Ozone
          Issue Type: Bug
          Components: Ozone Datanode, SCM
            Reporter: Ethan Rose
            Assignee: Siddhant Sangwan


The bug described in HDDS-8129 would cause containers to have block counts 
lower than their correct value. In versions of the code before the issue was 
fixed, this could cause the block count to reach zero too early, so SCM would 
move the containers to DELETING state, issue delete commands to datanodes, and 
move containers to DELETED when the replicas were gone. However, in the window 
between a datanode sending a heartbeat with a zero block count and SCM sending 
back delete commands, the block deleting service could run and make the 
container's block count negative on the datanode. In this case, when the 
datanode gets the delete command, it will reject it, even in the old version 
before the fixes, because the counter is [not equal to 
zero|https://github.com/apache/ozone/blob/08263b44ce1422711e1fa70797bf349e4bb3f56b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java#L1181]
 (this code link is to a version before the deletion path was fixed).
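
The race described above can be illustrated with a small, self-contained 
sketch. It is not the Ozone code: the class ReplicaSketch and its methods are 
hypothetical names. The only behavior taken from the description is that a 
delete command is rejected whenever the replica's block counter is not equal 
to zero.

```java
// Hypothetical model of one container replica on a datanode.
// Names are illustrative only; this is not the KeyValueHandler implementation.
final class ReplicaSketch {
    private long blockCount;

    ReplicaSketch(long blockCount) {
        this.blockCount = blockCount;
    }

    // Value the datanode would report to SCM in a heartbeat.
    long reportedBlockCount() {
        return blockCount;
    }

    // Models the block deleting service decrementing the counter for the
    // blocks it removed. The counter is allowed to drop below zero, as in
    // the scenario described in this issue.
    void blockDeletingServiceRemoved(long deletedBlocks) {
        blockCount -= deletedBlocks;
    }

    // Models the datanode's check on a delete command: the container is
    // only deleted when the counter is exactly zero.
    boolean acceptsDeleteCommand() {
        return blockCount == 0;
    }
}
```

Starting from a counter of 1, one decrement makes the heartbeat report 0. A 
further decrement before the delete command arrives leaves the counter at -1, 
so acceptsDeleteCommand() returns false and the replica is kept.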

These containers are stuck: SCM's state for them is DELETING and it keeps 
resending delete commands, but datanodes reject the deletion and the containers 
may still hold valid data. Containers that entered this state in old versions 
have remained in this state indefinitely, even after the fixes. This is because 
the delete commands are being sent based on SCM's DELETING state for the 
container, not the status of its block content as reported by datanodes after 
the fixes. The fixes prevent containers from moving from CLOSED to DELETING 
incorrectly but do nothing for containers already in that state.

Since DELETING containers are not processed by the replication manager, fully 
recovering from the effects of HDDS-8129 requires a way for SCM to move their 
state back to CLOSED when a datanode rejects the deletion.
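
One possible shape of such a transition is sketched below. This is an 
assumption for illustration, not a design taken from this issue: the State 
enum contains only the three states named above, and 
ScmStateSketch.onDeleteReply is a hypothetical hook for SCM handling a 
datanode's reply to a delete command.

```java
// Hypothetical sketch of the proposed recovery transition in SCM.
// Only the state names CLOSED, DELETING and DELETED come from the issue text.
final class ScmStateSketch {
    enum State { CLOSED, DELETING, DELETED }

    // Returns the next SCM state after a datanode replies to a delete command.
    // rejectedAsNonEmpty is true when the datanode refused the deletion
    // because its block counter was not equal to zero.
    static State onDeleteReply(State current, boolean rejectedAsNonEmpty) {
        if (current == State.DELETING && rejectedAsNonEmpty) {
            // Leave DELETING so the container is no longer skipped
            // by the replication manager.
            return State.CLOSED;
        }
        // Any other combination leaves the state unchanged in this sketch.
        return current;
    }
}
```

The sketch ignores how replies from several replicas of the same container 
would be reconciled, which an actual change would have to decide.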




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

