ayushtkn commented on code in PR #7857: URL: https://github.com/apache/hadoop/pull/7857#discussion_r2260938054
########## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml: ########## @@ -4650,6 +4650,34 @@ <value></value> </property> + <property> + <description> + Sets the maximum duration for executions of the discovery binary defined in + yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables. If + the binary takes longer than this amount of time to run, then the process + is aborted. Discovery may be attempted again, depending on + yarn.nodemanager.resource-plugins.gpu.discovery-max-errors. + </description> + <name>yarn.nodemanager.resource-plugins.gpu.discovery-timeout</name> + <value>10000ms</value> Review Comment: any reason for not using 10s? ########## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/TestGpuDiscoverer.java: ########## @@ -297,6 +297,36 @@ public void testGetGpuDeviceInformationFaultyNvidiaSmiScriptConsecutiveRun() assertNotNull(discoverer.getGpusUsableByYarn()); } + @Test + public void testGetGpuDeviceInformationDisableMaxErrors() + throws YarnException, IOException { + Configuration conf = new Configuration(false); + // A negative value should disable max errors enforcement. + conf.setInt(YarnConfiguration.NM_GPU_DISCOVERY_MAX_ERRORS, -1); + + File fakeBinary = createFakeNvidiaSmiScriptAsRunnableFile( + this::createFaultyNvidiaSmiScript); + + GpuDiscoverer discoverer = creatediscovererWithGpuPathDefined(conf); + assertEquals(fakeBinary.getAbsolutePath(), + discoverer.getPathOfGpuBinary()); + assertNull(discoverer.getEnvironmentToRunCommand().get(PATH)); + + final String terminateMsg = "Failed to execute GPU device " + + "detection script (" + fakeBinary.getAbsolutePath() + ") for 10 times"; + final String msg = "Failed to execute GPU device detection script"; + + // The default max errors is 10. Verify that it keeps going for an 11th try. + for (int i = 0; i < 11; ++i) { Review Comment: I changed this 11 to 15 & still the test doesn't fail for me, can you check once? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org