[ https://issues.apache.org/jira/browse/FLINK-27667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555532#comment-17555532 ]
Biao Geng edited comment on FLINK-27667 at 6/18/22 6:36 AM:
------------------------------------------------------------

Hi [~ferenc-csaky] [~chesnay],

I did some investigation that I hope helps improve the stability of this test.

In {{YARNHighAvailabilityITCase}}, we currently override the {{teardown()}} method of {{YarnTestBase}} (see the [code|https://github.com/apache/flink/blob/3ae4c6f5a48105d00807e8ce02e70d4c092cbf40/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNHighAvailabilityITCase.java#L126] here). As a result, only {{YARNHighAvailabilityITCase}}'s {{teardown()}} is executed after all HA tests finish.

This can lead to a race condition. {{YARNHighAvailabilityITCase}} relies on the {{@TempDir protected static File tmp}} defined in {{YarnTestBase}}: when the mini YARN cluster is launched with {{RawLocalFileSystem}}, this directory is the parent of each YARN application's staging dir. According to JUnit 5's [doc|https://junit.org/junit5/docs/5.4.1/api/org/junit/jupiter/api/io/TempDir.html], the TempDir is deleted recursively once the test class has finished execution. But because the {{teardown()}} method of the base class is not executed, there is no guarantee about when the mini YARN cluster is cleaned up (e.g. when a staging dir like {{/tmp/junit1681458499635252469/.flink/application_1652775626514_0003}} is deleted). So when JUnit walks the TempDir to delete it and sees the staging dir, YARN's cleanup method may delete that staging dir before JUnit does, which causes the exception {{java.io.IOException: Failed to delete temp directory}}.

I added a call to the teardown method of the base class (see the [code|https://github.com/bgeng777/flink/commit/c4a4c8c8d4d1dafaa2875dd8d88133e38a78d438] here) and manually triggered the test 5 times (tests [1|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=119&view=results], [2|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=120&view=results], [3|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=121&view=results], [4|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=122&view=results], [5|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=123&view=results]) in my own Azure pipeline. All of them passed the misc tests.

One open question: I am not sure why this test was stable under JUnit 4. Maybe the {{@TempDir}} annotation behaves somewhat differently.

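
The teardown issue described in the comment can be illustrated with a minimal plain-Java sketch. It does not use JUnit or Flink classes: {{BaseTest}}, {{HidingTest}}, {{DelegatingTest}} and the recorded step names are hypothetical stand-ins, and it assumes {{teardown()}} is a static method (JUnit 5 requires {{@AfterAll}} methods to be static under the default per-method lifecycle). A static method with the same signature in a subclass hides the base one, so the base cleanup only runs if the subclass calls it explicitly.

```java
import java.util.ArrayList;
import java.util.List;

public class TeardownHidingSketch {

    // Records which cleanup steps ran, in order.
    public static final List<String> calls = new ArrayList<>();

    // Hypothetical stand-in for YarnTestBase.
    public static class BaseTest {
        public static void teardown() {
            // In the real base class this is where the mini YARN cluster
            // (and with it the staging dirs) would be shut down.
            calls.add("base: stop mini YARN cluster");
        }
    }

    // Hypothetical stand-in for the current behaviour: the subclass method
    // hides BaseTest.teardown(), so the base cleanup never runs.
    public static class HidingTest extends BaseTest {
        public static void teardown() {
            calls.add("ha-specific cleanup");
        }
    }

    // Hypothetical stand-in for the proposed fix: run the subclass cleanup,
    // then always delegate to the base class.
    public static class DelegatingTest extends BaseTest {
        public static void teardown() {
            try {
                calls.add("ha-specific cleanup");
            } finally {
                BaseTest.teardown();
            }
        }
    }

    public static void main(String[] args) {
        HidingTest.teardown();
        System.out.println(calls); // prints [ha-specific cleanup]

        calls.clear();
        DelegatingTest.teardown();
        System.out.println(calls); // prints [ha-specific cleanup, base: stop mini YARN cluster]
    }
}
```

The {{try}}/{{finally}} is a design choice of this sketch, not something taken from the linked commit: it keeps the base cleanup running even if the subclass cleanup throws.
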
> YARNHighAvailabilityITCase fails with "Failed to delete temp directory /tmp/junit1681"
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-27667
>                 URL: https://issues.apache.org/jira/browse/FLINK-27667
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.16.0
>            Reporter: Martijn Visser
>            Assignee: Ferenc Csaky
>            Priority: Critical
>              Labels: pull-request-available, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=35733&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461&l=29208
>
> {code:bash}
> May 17 08:36:22 [INFO] Results:
> May 17 08:36:22 [INFO]
> May 17 08:36:22 [ERROR] Errors:
> May 17 08:36:22 [ERROR] YARNHighAvailabilityITCase » IO Failed to delete temp directory /tmp/junit1681...
> May 17 08:36:22 [INFO]
> May 17 08:36:22 [ERROR] Tests run: 28, Failures: 0, Errors: 1, Skipped: 0
> May 17 08:36:22 [INFO]
> {code}

--
This message was sent by Atlassian Jira (v8.20.7#820007)