Shilun Fan created YARN-11934:
---------------------------------

             Summary: Fix testComponentHealthThresholdMonitor race condition
                 Key: YARN-11934
                 URL: https://issues.apache.org/jira/browse/YARN-11934
             Project: Hadoop YARN
          Issue Type: Bug
          Components: yarn-native-services
    Affects Versions: 3.5.0
            Reporter: Shilun Fan
            Assignee: Shilun Fan


*Problem*
The test `TestYarnNativeServices.testComponentHealthThresholdMonitor` fails
intermittently with the following error:
{code:java}
[INFO] Running org.apache.hadoop.yarn.service.TestYarnNativeServices
[ERROR] Tests run: 16, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 953.6 s <<< FAILURE! -- in org.apache.hadoop.yarn.service.TestYarnNativeServices
[ERROR] org.apache.hadoop.yarn.service.TestYarnNativeServices.testComponentHealthThresholdMonitor -- Time elapsed: 72.65 s <<< FAILURE!
org.opentest4j.AssertionFailedError: Service should not be in a stable state. It should throw a timeout exception.
        at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:38)
        at org.junit.jupiter.api.Assertions.fail(Assertions.java:138)
        at org.apache.hadoop.yarn.service.TestYarnNativeServices.testComponentHealthThresholdMonitor(TestYarnNativeServices.java:799)
        at java.base/java.lang.reflect.Method.invoke(Method.java:569)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1511){code}
 
*Root Cause*
The test has a race condition after calling `flexByRestService()`. It expects
`waitForServiceToBeStable()` to time out (because anti-affinity prevents the
4th container from being allocated), but the call returns immediately instead.
 
The issue occurs because:
1. `flexByRestService()` is called to change the number of containers
2. `waitForServiceToBeStable()` is called immediately after
3. If the flex operation hasn't taken effect yet, the service is still in the 
old STABLE state
4. `waitForServiceToBeStable()` returns immediately instead of waiting and 
timing out as expected
 
*Solution*
Introduce a new helper method `waitForServiceToLeaveStable()` that ensures the 
service/component state has transitioned away from STABLE before proceeding 
with subsequent assertions. This guarantees the flex operation has taken effect.
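
A minimal sketch of what such a helper could look like, assuming a simple polling loop. This is not the actual patch: the `State` enum, the `Supplier`-based signature, and the timeout and poll parameters are hypothetical stand-ins for whatever the test reads from the service client.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class LeaveStableSketch {

  /** Hypothetical stand-in for the service state reported by the cluster. */
  public enum State { STABLE, NOT_STABLE }

  /**
   * Polls the supplied state until it is no longer STABLE.
   * Throws TimeoutException if the state is still STABLE at the deadline.
   * All parameters are illustrative assumptions, not the real test API.
   */
  public static void waitForServiceToLeaveStable(Supplier<State> currentState,
      long timeoutMillis, long pollMillis)
      throws TimeoutException, InterruptedException {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (currentState.get() == State.STABLE) {
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException(
            "Service is still STABLE after " + timeoutMillis + " ms");
      }
      Thread.sleep(pollMillis);
    }
  }
}
```

In the test, a helper like this would sit between `flexByRestService()` and `waitForServiceToBeStable()`, so that the second wait can no longer observe the stale STABLE state left over from before the flex request.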



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

