[ 
https://issues.apache.org/jira/browse/YARN-11934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved YARN-11934.
----------------------------------
       Fix Version/s: 3.5.0
        Hadoop Flags: Reviewed
    Target Version/s: 3.5.0  (was: 3.5.0, 3.5.1)
          Resolution: Fixed

> Fix testComponentHealthThresholdMonitor race condition
> ------------------------------------------------------
>
>                 Key: YARN-11934
>                 URL: https://issues.apache.org/jira/browse/YARN-11934
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn-native-services
>    Affects Versions: 3.5.0
>            Reporter: Shilun Fan
>            Assignee: Shilun Fan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
>
> *Problem*
> TestYarnNativeServices.testComponentHealthThresholdMonitor test fails 
> intermittently with the following error:
> {code:java}
> [INFO] Running org.apache.hadoop.yarn.service.TestYarnNativeServices
> [ERROR] Tests run: 16, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 953.6 s <<< FAILURE! -- in org.apache.hadoop.yarn.service.TestYarnNativeServices
> [ERROR] org.apache.hadoop.yarn.service.TestYarnNativeServices.testComponentHealthThresholdMonitor -- Time elapsed: 72.65 s <<< FAILURE!
> org.opentest4j.AssertionFailedError: Service should not be in a stable state. It should throw a timeout exception.
>       at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:38)
>       at org.junit.jupiter.api.Assertions.fail(Assertions.java:138)
>       at org.apache.hadoop.yarn.service.TestYarnNativeServices.testComponentHealthThresholdMonitor(TestYarnNativeServices.java:799)
>       at java.base/java.lang.reflect.Method.invoke(Method.java:569)
>       at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
>       at java.base/java.util.ArrayList.forEach(ArrayList.java:1511){code}
>  
> *Root Cause*
> The test has a race condition after calling `flexByRestService()`. The test 
> expects `waitForServiceToBeStable()` to time out (because anti-affinity 
> prevents the 4th container from being allocated), but instead it returns 
> immediately.
>  
> The issue occurs because:
> 1. `flexByRestService()` is called to change the number of containers
> 2. `waitForServiceToBeStable()` is called immediately after
> 3. If the flex operation hasn't taken effect yet, the service is still in the 
> old STABLE state
> 4. `waitForServiceToBeStable()` returns immediately instead of waiting and 
> timing out as expected
>  
> *Solution*
> Introduce a new helper method `waitForServiceToLeaveStable()` that ensures 
> the service state has transitioned away from STABLE before proceeding with 
> subsequent assertions. This guarantees the flex operation has taken effect.
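
For illustration only, here is a minimal sketch of what a helper like `waitForServiceToLeaveStable()` might look like. It is not the code from the actual patch: the `State` enum, the `Supplier`-based signature, the timeout and poll parameters, and the use of `IllegalStateException` are all assumptions made to keep the example self-contained, with no Hadoop dependencies. The idea is simply to poll until the observed state is no longer STABLE, so that a later wait-for-stable call cannot succeed against the stale pre-flex state.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

class LeaveStableSketch {
    // Hypothetical stand-in for the service states; not the real YARN enum.
    enum State { STABLE, FLEXING }

    // Polls until the supplied state is no longer STABLE, or fails after timeoutMs.
    // Signature and exception type are assumptions made for this sketch.
    static void waitForServiceToLeaveStable(Supplier<State> currentState,
                                            long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (currentState.get() == State.STABLE) {
            if (System.nanoTime() - deadline >= 0) {
                throw new IllegalStateException(
                    "Service still STABLE after " + timeoutMs + " ms");
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicReference<State> state = new AtomicReference<>(State.STABLE);
        // Simulate a flex request that takes effect asynchronously after 50 ms.
        Thread flex = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException ignored) {
                // Nothing to clean up in this simulation.
            }
            state.set(State.FLEXING);
        });
        flex.start();
        // Returns only once the simulated flex has moved the state off STABLE.
        waitForServiceToLeaveStable(state::get, 5_000, 10);
        System.out.println(state.get());
        flex.join();
    }
}
```

In the test described above, a call of this kind would sit between `flexByRestService()` and `waitForServiceToBeStable()`, so the expected timeout is measured against the post-flex state.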



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
