[
https://issues.apache.org/jira/browse/YARN-11934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Nauroth resolved YARN-11934.
----------------------------------
Fix Version/s: 3.5.0
Hadoop Flags: Reviewed
Target Version/s: 3.5.0 (was: 3.5.0, 3.5.1)
Resolution: Fixed
> Fix testComponentHealthThresholdMonitor race condition
> ------------------------------------------------------
>
> Key: YARN-11934
> URL: https://issues.apache.org/jira/browse/YARN-11934
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn-native-services
> Affects Versions: 3.5.0
> Reporter: Shilun Fan
> Assignee: Shilun Fan
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.0
>
>
> *Problem*
> The TestYarnNativeServices.testComponentHealthThresholdMonitor test fails
> intermittently with the following error:
> {code:java}
> [INFO] Running org.apache.hadoop.yarn.service.TestYarnNativeServices
> [ERROR] Tests run: 16, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 953.6 s <<< FAILURE! -- in org.apache.hadoop.yarn.service.TestYarnNativeServices
> [ERROR] org.apache.hadoop.yarn.service.TestYarnNativeServices.testComponentHealthThresholdMonitor -- Time elapsed: 72.65 s <<< FAILURE!
> org.opentest4j.AssertionFailedError: Service should not be in a stable state. It should throw a timeout exception.
> at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:38)
> at org.junit.jupiter.api.Assertions.fail(Assertions.java:138)
> at org.apache.hadoop.yarn.service.TestYarnNativeServices.testComponentHealthThresholdMonitor(TestYarnNativeServices.java:799)
> at java.base/java.lang.reflect.Method.invoke(Method.java:569)
> at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
> at java.base/java.util.ArrayList.forEach(ArrayList.java:1511){code}
>
> *Root Cause*
> The test has a race condition after calling `flexByRestService()`. The test
> expects `waitForServiceToBeStable()` to time out (because anti-affinity
> prevents the 4th container from being allocated), but instead it returns
> immediately.
>
> The issue occurs because:
> 1. `flexByRestService()` is called to change the number of containers
> 2. `waitForServiceToBeStable()` is called immediately after
> 3. If the flex operation hasn't taken effect yet, the service is still in the
> old STABLE state
> 4. `waitForServiceToBeStable()` returns immediately instead of waiting and
> timing out as expected
>
> *Solution*
> Introduce a new helper method `waitForServiceToLeaveStable()` that ensures
> the service state has transitioned away from STABLE before proceeding with
> subsequent assertions. This guarantees the flex operation has taken effect.
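
For readers without the patch at hand, here is a rough sketch of the kind of polling guard the description refers to. It is not the code from the actual fix: the `State` enum, the `Supplier`-based signature, the timeout and poll parameters, and the simulated flex thread are all assumptions made so the example is self-contained and runs without a YARN cluster.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

class LeaveStableSketch {
    // Hypothetical stand-in for the service state reported by the cluster.
    enum State { STABLE, STARTED }

    /**
     * Polls the supplied state until it is no longer STABLE.
     * Throws TimeoutException if the state is still STABLE at the deadline.
     * (Signature is an assumption; the real helper may differ.)
     */
    static void waitForServiceToLeaveStable(Supplier<State> currentState,
                                            long timeoutMs, long pollMs)
            throws TimeoutException, InterruptedException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (currentState.get() == State.STABLE) {
            if (System.nanoTime() >= deadline) {
                throw new TimeoutException(
                    "Service still STABLE after " + timeoutMs + " ms");
            }
            Thread.sleep(pollMs);
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicReference<State> state = new AtomicReference<>(State.STABLE);

        // Simulates a flex request whose effect becomes visible a little
        // later, which is the window in which the race occurs.
        Thread flex = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            state.set(State.STARTED);
        });
        flex.start();

        // A stability check placed here could still observe the old state.
        System.out.println("State right after flex request: " + state.get());

        // The guard blocks until the flex is visible, so later assertions
        // are evaluated against the post-flex state.
        waitForServiceToLeaveStable(state::get, 5_000, 20);
        System.out.println("State after guard: " + state.get());
        flex.join();
    }
}
```

In the test, a guard of this shape would sit between `flexByRestService()` and `waitForServiceToBeStable()`, so that the expected timeout is measured against the post-flex state rather than the stale STABLE one.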
--
This message was sent by Atlassian Jira
(v8.20.10#820010)