Shilun Fan created YARN-11932:
---------------------------------

             Summary: Fix TestYarnFederationWithFairScheduler timeout caused by 
shared NodeLabel storage
                 Key: YARN-11932
                 URL: https://issues.apache.org/jira/browse/YARN-11932
             Project: Hadoop YARN
          Issue Type: Bug
          Components: router
    Affects Versions: 3.5.1
            Reporter: Shilun Fan
            Assignee: Shilun Fan


*Problem*
 
TestYarnFederationWithFairScheduler#testMetricsInfo intermittently times out 
during test execution.
 
The root cause is that multiple test subclusters share the same NodeLabel 
storage directory ({{/tmp/hadoop-yarn-$USER/node-labels}}) by default. When 
tests run sequentially, residual editlog entries containing "delete default 
label" operations from previous tests cause the ResourceManager to fail during 
startup recovery with the error:
{code:java}
Node label=default to be removed doesn't existed in cluster node labels collection
{code}
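The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Hadoop's actual recovery code: the class name, the "ADD"/"REMOVE" entry format, and the replay method are hypothetical, and the only detail taken from this issue is the error text. It shows why a leftover "remove" entry breaks a ResourceManager that starts with no knowledge of that label.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: a hypothetical stand-in for editlog replay during recovery.
public class StaleEditLogSketch {

  // Replays "ADD <label>" / "REMOVE <label>" entries against an initially
  // empty label collection, failing when a removed label is unknown.
  static Set<String> replay(List<String> editLog) throws IOException {
    Set<String> labels = new HashSet<>();
    for (String entry : editLog) {
      String[] parts = entry.split(" ", 2);
      String op = parts[0];
      String label = parts[1];
      if (op.equals("ADD")) {
        labels.add(label);
      } else if (op.equals("REMOVE") && !labels.remove(label)) {
        // Error text copied from the message quoted in this issue.
        throw new IOException("Node label=" + label
            + " to be removed doesn't existed in cluster node labels collection");
      }
    }
    return labels;
  }

  public static void main(String[] args) {
    // A leftover entry from an earlier test, found in the shared directory.
    List<String> residual = Arrays.asList("REMOVE default");
    try {
      replay(residual);
    } catch (IOException e) {
      System.out.println("recovery failed: " + e.getMessage());
    }
  }
}
```

In this model a log that starts from an empty directory never contains an unmatched removal, which is the property the fix below relies on.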
*Solution*
 
Set an isolated NodeLabel storage directory for each subcluster startup to 
avoid reusing old editlog files.
 
In {{TestMockSubCluster.java}}, configure a unique directory per subcluster 
using:
* GenericTestUtils.getTestDir() to create test-specific directories
* Directory naming pattern: {{node-labels-{subClusterId}-{timestamp}}}
* Configuration key: {{YarnConfiguration.FS_NODE_LABELS_STORE_ROOT_DIR}}
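
The steps listed above might look roughly like the following. This is a JDK-only sketch, not the actual patch: a plain Map stands in for the YARN Configuration object, the testRoot argument stands in for the directory returned by GenericTestUtils.getTestDir(), and the key string is assumed to be the value of YarnConfiguration.FS_NODE_LABELS_STORE_ROOT_DIR. The class and method names are hypothetical.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; in the real test the value would be set on a YARN
// Configuration via YarnConfiguration.FS_NODE_LABELS_STORE_ROOT_DIR.
public class NodeLabelDirSketch {

  // Assumed string value of YarnConfiguration.FS_NODE_LABELS_STORE_ROOT_DIR.
  static final String FS_NODE_LABELS_STORE_ROOT_DIR =
      "yarn.node-labels.fs-store.root-dir";

  // Builds the directory name following node-labels-{subClusterId}-{timestamp}.
  static String nodeLabelDirName(String subClusterId, long timestamp) {
    return "node-labels-" + subClusterId + "-" + timestamp;
  }

  // Returns a configuration map that points the node-label store at a
  // directory unique to this subcluster startup.
  static Map<String, String> isolatedConf(File testRoot, String subClusterId,
      long timestamp) {
    File dir = new File(testRoot, nodeLabelDirName(subClusterId, timestamp));
    Map<String, String> conf = new HashMap<>();
    conf.put(FS_NODE_LABELS_STORE_ROOT_DIR, dir.getAbsolutePath());
    return conf;
  }

  public static void main(String[] args) {
    // Stand-in for the directory GenericTestUtils.getTestDir() would return.
    File testRoot = new File(System.getProperty("java.io.tmpdir"), "test-dir");
    Map<String, String> conf =
        isolatedConf(testRoot, "SC-1", System.currentTimeMillis());
    System.out.println(conf.get(FS_NODE_LABELS_STORE_ROOT_DIR));
  }
}
```

Because the timestamp is part of the name, two startups of the same subcluster also get different directories, so no earlier editlog is ever replayed.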
 
*Test Results*
 
After the fix, all 38 tests in TestYarnFederationWithFairScheduler pass 
successfully:
* Tests run: 38, Failures: 0, Errors: 0, Skipped: 0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
