[ https://issues.apache.org/jira/browse/HDDS-12127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Attila Doroszlai resolved HDDS-12127.
-------------------------------------
    Fix Version/s: 2.0.0
       Resolution: Fixed

> RM should not expire pending deletes, but retry instead.
> --------------------------------------------------------
>
>                 Key: HDDS-12127
>                 URL: https://issues.apache.org/jira/browse/HDDS-12127
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>
> When RM schedules a delete of a container on a datanode, it should keep track
> of that delete until either:
> 1. An ICR / FCR is received which confirms the container is removed.
> 2. The datanode goes dead.
> Currently, RM expires the delete attempt after 10 minutes. On retry it should
> resend the command to the same datanode, but it may not, for example because
> of HDDS-12115 or other scenarios that cause the datanode ordering to change.
> With this change, the expiry still occurs and the command can get dropped on
> the datanode, but the ContainerReplicaPendingOps expiry thread no longer
> removes the pending delete from the pending list. Instead it triggers a
> notification to RM, which then resends the same command with a new deadline
> until it has been confirmed as successful. RM will subscribe to the
> notifications from ContainerReplicaPendingOps and re-run any expired delete
> commands.
> This is to combat a recent problem we experienced where a delete command hung
> for a very long time and RM issued new deletes to other DNs, resulting in all
> replicas of a container getting removed unexpectedly.
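
The retry behaviour described in the issue can be illustrated with a small, self-contained sketch. This is not the Ozone implementation: every class and method name below (`PendingDeleteSketch`, `PendingOps`, `ExpiryListener`, `ResendingManager`) is hypothetical, time is passed in explicitly instead of being read from a clock, and "sending" a command only records a string. It is meant to show one idea from the description: an expired delete stays in the pending list, and a subscriber resends it to the same datanode with a new deadline until it is confirmed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of the retry-on-expiry idea; not Ozone code. */
public class PendingDeleteSketch {

  /** One delete that has been sent to a datanode but not yet confirmed. */
  static final class PendingDelete {
    final long containerId;
    final String datanode;
    long deadlineMillis;

    PendingDelete(long containerId, String datanode, long deadlineMillis) {
      this.containerId = containerId;
      this.datanode = datanode;
      this.deadlineMillis = deadlineMillis;
    }
  }

  /** Callback invoked when a pending delete passes its deadline. */
  interface ExpiryListener {
    void onExpired(PendingDelete op, long nowMillis);
  }

  /** Stand-in for the pending-ops tracker (assumed shape, for illustration). */
  static final class PendingOps {
    private final List<PendingDelete> pending = new ArrayList<>();
    private final List<ExpiryListener> listeners = new ArrayList<>();

    void subscribe(ExpiryListener listener) {
      listeners.add(listener);
    }

    void add(PendingDelete op) {
      pending.add(op);
    }

    /** Expiry pass: expired deletes are NOT removed, only reported. */
    void processExpired(long nowMillis) {
      for (PendingDelete op : pending) {
        if (op.deadlineMillis <= nowMillis) {
          for (ExpiryListener listener : listeners) {
            listener.onExpired(op, nowMillis);
          }
        }
      }
    }

    /** Called once a report confirms removal, or the datanode goes dead. */
    void complete(long containerId, String datanode) {
      pending.removeIf(op ->
          op.containerId == containerId && op.datanode.equals(datanode));
    }

    int pendingCount() {
      return pending.size();
    }
  }

  /** Stand-in for RM: resends an expired delete to the same datanode. */
  static final class ResendingManager implements ExpiryListener {
    final List<String> sent = new ArrayList<>();
    private final PendingOps ops;
    private final long timeoutMillis;

    ResendingManager(PendingOps ops, long timeoutMillis) {
      this.ops = ops;
      this.timeoutMillis = timeoutMillis;
      ops.subscribe(this);
    }

    void scheduleDelete(long containerId, String datanode, long nowMillis) {
      ops.add(new PendingDelete(containerId, datanode, nowMillis + timeoutMillis));
      send(containerId, datanode);
    }

    @Override
    public void onExpired(PendingDelete op, long nowMillis) {
      // Same container, same datanode, fresh deadline.
      op.deadlineMillis = nowMillis + timeoutMillis;
      send(op.containerId, op.datanode);
    }

    private void send(long containerId, String datanode) {
      sent.add("delete container " + containerId + " on " + datanode);
    }
  }

  public static void main(String[] args) {
    PendingOps ops = new PendingOps();
    ResendingManager rm = new ResendingManager(ops, 600_000L); // 10 minutes
    rm.scheduleDelete(1L, "dn1", 0L);
    ops.processExpired(600_000L);   // deadline reached: resend to dn1
    ops.complete(1L, "dn1");        // removal confirmed: stop tracking
    System.out.println(rm.sent);
  }
}
```

Because the pending entry is never dropped on expiry, a caller that consults the pending list before choosing a replica to delete would still see the outstanding delete on the original datanode, which is the property the issue relies on to avoid deleting replicas on other datanodes.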