sodonnel opened a new pull request, #7746: URL: https://github.com/apache/ozone/pull/7746
## What changes were proposed in this pull request?

When RM schedules a delete of a container on a datanode, it should keep track of that delete until either:

1. An ICR / FCR is received which confirms the container is removed.
2. The datanode goes dead.

Currently, RM expires the delete attempt after 10 minutes. On retry it should resend the command to the same datanode, but it may not, for example because of [HDDS-12115](https://issues.apache.org/jira/browse/HDDS-12115) or in other scenarios that cause the datanode ordering to change.

With this change, the expiry still occurs and the command can still get dropped on the datanode, but the ContainerReplicaPendingOps expiry thread no longer removes the pending delete from the pending list. Instead, it triggers a notification to RM, which then resends the same command with a new deadline until the delete has been confirmed as successful. RM subscribes to the notifications from ContainerReplicaPendingOps and re-runs any expired delete commands.

This is to combat a recent problem we experienced where a delete command hung for a very long time and RM issued new deletes to other DNs, resulting in all replicas of a container getting removed unexpectedly.

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-12127

## How was this patch tested?

Various unit tests modified and added. Manually tested that the deletes are resent in docker.
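
For readers who want a concrete picture of the expiry-and-resend flow described in this PR, here is a minimal sketch. It is written in Python for brevity and is not the Ozone code: the class names `PendingOps` and `ReplicationManager`, all method names, and the callback signature are hypothetical stand-ins, and only the 10-minute timeout and the overall behavior (keep the pending delete on expiry, notify RM, resend to the same datanode with a new deadline) come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

DELETE_TIMEOUT_SECONDS = 600  # the 10-minute expiry mentioned in the description


@dataclass
class PendingDelete:
    container_id: int
    datanode: str
    deadline: float


class PendingOps:
    """Hypothetical stand-in for ContainerReplicaPendingOps."""

    def __init__(self) -> None:
        self._pending: Dict[Tuple[int, str], PendingDelete] = {}
        self._subscribers: List[Callable[[PendingDelete, float], None]] = []

    def subscribe(self, callback: Callable[[PendingDelete, float], None]) -> None:
        self._subscribers.append(callback)

    def schedule_delete(self, container_id: int, datanode: str, deadline: float) -> None:
        # Re-scheduling the same (container, datanode) pair replaces the deadline.
        self._pending[(container_id, datanode)] = PendingDelete(container_id, datanode, deadline)

    def complete_delete(self, container_id: int, datanode: str) -> None:
        # Assumed to be called when a report confirms the replica is gone.
        self._pending.pop((container_id, datanode), None)

    def datanode_dead(self, datanode: str) -> None:
        # Assumed to be called when the datanode is declared dead.
        for key in [k for k in self._pending if k[1] == datanode]:
            del self._pending[key]

    def expire(self, now: float) -> None:
        # Expired deletes are NOT removed; subscribers are notified instead.
        expired = [op for op in self._pending.values() if op.deadline <= now]
        for op in expired:
            for callback in self._subscribers:
                callback(op, now)

    def pending(self) -> List[Tuple[int, str, float]]:
        return sorted((op.container_id, op.datanode, op.deadline) for op in self._pending.values())


class ReplicationManager:
    """Hypothetical stand-in for RM; records the commands it would send."""

    def __init__(self, pending_ops: PendingOps, timeout: float = DELETE_TIMEOUT_SECONDS) -> None:
        self.pending_ops = pending_ops
        self.timeout = timeout
        self.sent: List[Tuple[int, str, float]] = []
        pending_ops.subscribe(self.on_expired_delete)

    def send_delete(self, container_id: int, datanode: str, now: float) -> None:
        deadline = now + self.timeout
        self.sent.append((container_id, datanode, deadline))
        self.pending_ops.schedule_delete(container_id, datanode, deadline)

    def on_expired_delete(self, op: PendingDelete, now: float) -> None:
        # Resend to the SAME datanode with a fresh deadline.
        self.send_delete(op.container_id, op.datanode, now)
```

In this sketch, a delete stays in the pending list across any number of expiries and is only dropped by `complete_delete` or `datanode_dead`, so a replacement delete is never issued to a different datanode while the first one is still unconfirmed.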
