chungen0126 opened a new pull request, #8643:
URL: https://github.com/apache/ozone/pull/8643

   ## What changes were proposed in this pull request?
   
   This PR fixes two intermittent failures in `TestDecommissionAndMaintenance`:
   ```
        at org.apache.ozone.test.GenericTestUtils.waitFor(GenericTestUtils.java:116)
        at org.apache.hadoop.ozone.MiniOzoneClusterImpl.waitForClusterToBeReady(MiniOzoneClusterImpl.java:166)
        at org.apache.hadoop.ozone.MiniOzoneClusterImpl.restartStorageContainerManager(MiniOzoneClusterImpl.java:291)
        at org.apache.hadoop.hdds.scm.node.TestDecommissionAndMaintenance.testDecommissioningNodesCompleteDecommissionOnSCMRestart(TestDecommissionAndMaintenance.java:279)
        at java.base/java.lang.reflect.Method.invoke(Method.java:580)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
   ```
   ```
   org.apache.hadoop.hdds.scm.node.TestDecommissionAndMaintenance.testSingleNodeWithOpenPipelineCanGotoMaintenance -- Time elapsed: 18.06 s <<< ERROR!
   java.lang.NullPointerException: Cannot invoke "org.apache.hadoop.ozone.container.common.helpers.DatanodeIdYaml$DatanodeDetailsYaml.getUuid()" because "datanodeDetailsYaml" is null
        at org.apache.hadoop.ozone.container.common.helpers.DatanodeIdYaml.readDatanodeIdFile(DatanodeIdYaml.java:91)
        at org.apache.hadoop.ozone.container.common.helpers.ContainerUtils.readDatanodeDetailsFrom(ContainerUtils.java:177)
        at org.apache.hadoop.ozone.HddsDatanodeService.initializeDatanodeDetails(HddsDatanodeService.java:428)
        at org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:223)
        at org.apache.hadoop.ozone.MiniOzoneClusterImpl.startHddsDatanode(MiniOzoneClusterImpl.java:424)
        at org.apache.hadoop.ozone.MiniOzoneClusterImpl.restartHddsDatanode(MiniOzoneClusterImpl.java:362)
        at org.apache.hadoop.ozone.MiniOzoneClusterImpl.restartHddsDatanode(MiniOzoneClusterImpl.java:372)
        at org.apache.hadoop.hdds.scm.node.TestDecommissionAndMaintenance.testSingleNodeWithOpenPipelineCanGotoMaintenance(TestDecommissionAndMaintenance.java:462)
        at java.base/java.lang.reflect.Method.invoke(Method.java:580)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
   ```
   
   
   The root causes of the failures:
   1. The first failure started after commit [c7117dc](https://github.com/apache/ozone/commit/c7117dcc1731a5f8a82fc6f06b99bda9cd6e01c0), which accidentally added a call to `ecContainerDNsMap.clear()`.
   2. The second failure occurs because we currently update `DatanodeDetails` in memory before persisting it to disk. If the persist step fails, the in-memory state has already changed, and SCM may receive this incorrect state via heartbeat.
   SCM might then consider the datanode to be `IN_MAINTENANCE` based on the heartbeat, even though that state was never persisted. If the user then shuts down the datanode, thinking it is safe, the datanode may fail to restart because its persisted datanode details are missing.
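
To illustrate the ordering issue in the second root cause, here is a minimal, self-contained sketch of a "persist first, then update memory" pattern. It is not the code in this PR: `PersistThenUpdateSketch`, `OpState`, `setState`, and the `datanode.id` file name are hypothetical stand-ins, and the plain-text file format is an assumption made only to keep the example runnable.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical stand-in for a datanode state holder; not actual Ozone code. */
public class PersistThenUpdateSketch {

  /** Simplified operational states, assumed for illustration only. */
  enum OpState { IN_SERVICE, IN_MAINTENANCE }

  private final Path idFile;
  // This is the value a heartbeat would report in this sketch.
  private volatile OpState inMemoryState = OpState.IN_SERVICE;

  PersistThenUpdateSketch(Path idFile) {
    this.idFile = idFile;
  }

  OpState getInMemoryState() {
    return inMemoryState;
  }

  /** Persists the new state first; memory changes only if the write succeeded. */
  void setState(OpState newState) throws IOException {
    // Write to a sibling temp file, then move it over the real file, so a
    // failed write never leaves a half-written file at the final path.
    Path tmp = idFile.resolveSibling(idFile.getFileName() + ".tmp");
    Files.write(tmp, newState.name().getBytes(StandardCharsets.UTF_8));
    Files.move(tmp, idFile, StandardCopyOption.REPLACE_EXISTING);
    // Reached only when both disk steps succeeded.
    inMemoryState = newState;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("dn-sketch");

    PersistThenUpdateSketch healthy =
        new PersistThenUpdateSketch(dir.resolve("datanode.id"));
    healthy.setState(OpState.IN_MAINTENANCE);
    System.out.println("healthy: " + healthy.getInMemoryState());

    // The parent directory does not exist, so the persist step fails.
    PersistThenUpdateSketch broken =
        new PersistThenUpdateSketch(dir.resolve("missing").resolve("datanode.id"));
    try {
      broken.setState(OpState.IN_MAINTENANCE);
    } catch (IOException e) {
      System.out.println("persist failed, state is still: "
          + broken.getInMemoryState());
    }
  }
}
```

With this ordering, a failed write leaves the in-memory state untouched, so nothing derived from it (such as a heartbeat) can report a state that was never written to disk.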
   
   What does this PR change?
   
   1. Remove `ecContainerDNsMap.clear()` in `ECContainerSafeModeRule#initializeRule`.
   2. Improve state consistency in the datanode, so that a failed persist no longer leaves a changed in-memory `DatanodeDetails` to be reported to SCM.
   
   
   ## What is the link to the Apache JIRA?
   
   https://issues.apache.org/jira/browse/HDDS-12843
   
   ## How was this patch tested?
   CI:
   https://github.com/chungen0126/ozone/actions/runs/15698436934
   
   Passed 20x50 after change:
   https://github.com/chungen0126/ozone/actions/runs/15697148311
   

