chungen0126 opened a new pull request, #8643:
URL: https://github.com/apache/ozone/pull/8643
## What changes were proposed in this pull request?
This PR fixes two intermittent failures in `TestDecommissionAndMaintenance`:
```
at org.apache.ozone.test.GenericTestUtils.waitFor(GenericTestUtils.java:116)
at org.apache.hadoop.ozone.MiniOzoneClusterImpl.waitForClusterToBeReady(MiniOzoneClusterImpl.java:166)
at org.apache.hadoop.ozone.MiniOzoneClusterImpl.restartStorageContainerManager(MiniOzoneClusterImpl.java:291)
at org.apache.hadoop.hdds.scm.node.TestDecommissionAndMaintenance.testDecommissioningNodesCompleteDecommissionOnSCMRestart(TestDecommissionAndMaintenance.java:279)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
```
```
org.apache.hadoop.hdds.scm.node.TestDecommissionAndMaintenance.testSingleNodeWithOpenPipelineCanGotoMaintenance -- Time elapsed: 18.06 s <<< ERROR!
java.lang.NullPointerException: Cannot invoke "org.apache.hadoop.ozone.container.common.helpers.DatanodeIdYaml$DatanodeDetailsYaml.getUuid()" because "datanodeDetailsYaml" is null
at org.apache.hadoop.ozone.container.common.helpers.DatanodeIdYaml.readDatanodeIdFile(DatanodeIdYaml.java:91)
at org.apache.hadoop.ozone.container.common.helpers.ContainerUtils.readDatanodeDetailsFrom(ContainerUtils.java:177)
at org.apache.hadoop.ozone.HddsDatanodeService.initializeDatanodeDetails(HddsDatanodeService.java:428)
at org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:223)
at org.apache.hadoop.ozone.MiniOzoneClusterImpl.startHddsDatanode(MiniOzoneClusterImpl.java:424)
at org.apache.hadoop.ozone.MiniOzoneClusterImpl.restartHddsDatanode(MiniOzoneClusterImpl.java:362)
at org.apache.hadoop.ozone.MiniOzoneClusterImpl.restartHddsDatanode(MiniOzoneClusterImpl.java:372)
at org.apache.hadoop.hdds.scm.node.TestDecommissionAndMaintenance.testSingleNodeWithOpenPipelineCanGotoMaintenance(TestDecommissionAndMaintenance.java:462)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
```
The root causes of the failures:
1. The first failure was introduced by
[c7117dc](https://github.com/apache/ozone/commit/c7117dcc1731a5f8a82fc6f06b99bda9cd6e01c0),
which accidentally added a call to `ecContainerDNsMap.clear()`.
2. The second failure occurs because we currently update DatanodeDetails in
memory before persisting it to disk. If the persist step fails, the in-memory
state has already changed, and SCM may receive this incorrect state via
heartbeat. SCM might then consider a datanode to be IN_MAINTENANCE because of
the heartbeat, even though that state was never persisted. If the user then
shuts down the datanode, thinking it is safe, the datanode may fail to restart
because its persisted datanode details are missing.
What does this PR change?
1. Remove `ecContainerDNsMap.clear()` in
`ECContainerSafeModeRule#initializeRule`.
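2. Address the ordering problem described in root cause 2, so that the
in-memory DatanodeDetails no longer diverges from the persisted copy. This
improves state consistency in the datanode.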
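
The ordering hazard behind the second failure can be illustrated with a minimal, self-contained sketch. It is not the actual Ozone code: `PersistBeforeUpdateSketch`, `OpState`, `DetailsStore`, and both update methods are hypothetical names invented for illustration, and the "disk" is simulated by a store that always fails. The sketch only shows why persisting first keeps the state reported via heartbeat consistent with what is on disk.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical illustration only; none of these names exist in Ozone. */
public class PersistBeforeUpdateSketch {

  /** Stand-in for the operational states mentioned in the description. */
  public enum OpState { IN_SERVICE, IN_MAINTENANCE }

  /** Stand-in for writing datanode details to disk. */
  public interface DetailsStore {
    void persist(OpState state) throws IOException;
  }

  // State that would be reported to SCM via heartbeat.
  private final AtomicReference<OpState> inMemory =
      new AtomicReference<>(OpState.IN_SERVICE);
  private final DetailsStore store;

  public PersistBeforeUpdateSketch(DetailsStore store) {
    this.store = store;
  }

  /** Hazardous ordering: memory is changed even if the write then fails. */
  public void updateMemoryFirst(OpState next) throws IOException {
    inMemory.set(next);
    store.persist(next);
  }

  /** Safer ordering: memory is changed only after the write succeeds. */
  public void updatePersistFirst(OpState next) throws IOException {
    store.persist(next);
    inMemory.set(next);
  }

  public OpState reportedState() {
    return inMemory.get();
  }

  public static void main(String[] args) {
    // Assumption: the persist step always fails, to simulate a disk error.
    DetailsStore failing = state -> {
      throw new IOException("simulated persist failure");
    };

    PersistBeforeUpdateSketch memoryFirst = new PersistBeforeUpdateSketch(failing);
    try {
      memoryFirst.updateMemoryFirst(OpState.IN_MAINTENANCE);
    } catch (IOException expected) {
      // The write failed, but the in-memory state has already changed.
    }
    System.out.println("memory-first reports: " + memoryFirst.reportedState());

    PersistBeforeUpdateSketch persistFirst = new PersistBeforeUpdateSketch(failing);
    try {
      persistFirst.updatePersistFirst(OpState.IN_MAINTENANCE);
    } catch (IOException expected) {
      // The write failed, so the in-memory state was never touched.
    }
    System.out.println("persist-first reports: " + persistFirst.reportedState());
  }
}
```

With the failing store, the memory-first variant ends up reporting `IN_MAINTENANCE` although nothing was written, while the persist-first variant keeps reporting `IN_SERVICE`.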
## What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-12843
## How was this patch tested?
CI:
https://github.com/chungen0126/ozone/actions/runs/15698436934
Passed 20x50 after change:
https://github.com/chungen0126/ozone/actions/runs/15697148311
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]