Grace Grimwood created FLINK-35571:
--------------------------------------

Summary: ProfilingServiceTest.testRollingDeletion intermittently fails due to improper test isolation
Key: FLINK-35571
URL: https://issues.apache.org/jira/browse/FLINK-35571
Project: Flink
Issue Type: Bug
Components: Tests
Environment: *Git revision:*
{code:bash}
$ git show
commit b8d527166e095653ae3ff5c0431bf27297efe229 (HEAD -> master)
{code}
*Java info:*
{code:bash}
$ java -version
openjdk version "17.0.11" 2024-04-16
OpenJDK Runtime Environment Temurin-17.0.11+9 (build 17.0.11+9)
OpenJDK 64-Bit Server VM Temurin-17.0.11+9 (build 17.0.11+9, mixed mode)
{code}
{code:bash}
$ sdk current
Using:
java: 17.0.11-tem
maven: 3.8.6
scala: 2.12.19
{code}
*OS info:*
{code:bash}
$ uname -av
Darwin MacBook-Pro 23.5.0 Darwin Kernel Version 23.5.0: Wed May 1 20:14:38 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6020 arm64
{code}
*Hardware info:*
{code:bash}
$ sysctl -a | grep -e 'machdep\.cpu\.brand_string\:' -e 'machdep\.cpu\.core_count\:' -e 'hw\.memsize\:'
hw.memsize: 34359738368
machdep.cpu.core_count: 12
machdep.cpu.brand_string: Apple M2 Pro
{code}
Reporter: Grace Grimwood
Attachments: 20240612_181148_mvn-clean-package_flink-runtime_also-make.log

*Symptom:*

The test *{{ProfilingServiceTest.testRollingDeletion}}* fails with the following error:
{code:java}
[ERROR] Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 25.32 s <<< FAILURE! -- in org.apache.flink.runtime.util.profiler.ProfilingServiceTest
[ERROR] org.apache.flink.runtime.util.profiler.ProfilingServiceTest.testRollingDeletion -- Time elapsed: 9.264 s <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <3> but was: <6>
	at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
	at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
	at org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:145)
	at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:531)
	at org.apache.flink.runtime.util.profiler.ProfilingServiceTest.verifyRollingDeletionWorks(ProfilingServiceTest.java:175)
	at org.apache.flink.runtime.util.profiler.ProfilingServiceTest.testRollingDeletion(ProfilingServiceTest.java:117)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at java.base/java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:194)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
{code}
The number of extra files found varies from failure to failure.

*Cause:*

Many of the tests in *{{ProfilingServiceTest}}* rely on a specific configuration of the *{{ProfilingService}}* instance, but *{{ProfilingService.getInstance}}* does not check whether an existing instance's config matches the provided config before returning it. Because of this, and because JUnit does not guarantee a specific ordering of tests (unless they are specifically annotated), it is possible for these tests to receive an instance that does not behave in the expected way and therefore fail.
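The reuse behaviour described under *Cause* can be illustrated with a minimal sketch. {{ServiceSketch}} below is a hypothetical stand-in, not the real *{{ProfilingService}}*: it assumes the config is reduced to a single result directory string, and the directory values are made up for illustration.

```java
import java.util.Objects;

// Hypothetical, simplified stand-in for the singleton described above.
// It is NOT the real ProfilingService; names and fields are illustrative only.
final class SingletonReuseSketch {

    static final class ServiceSketch implements AutoCloseable {
        private static ServiceSketch instance;
        private final String resultDir;

        private ServiceSketch(String resultDir) {
            this.resultDir = Objects.requireNonNull(resultDir);
        }

        // Returns the live instance if there is one, without comparing its
        // configuration against the one that was requested.
        static synchronized ServiceSketch getInstance(String resultDir) {
            if (instance == null) {
                instance = new ServiceSketch(resultDir);
            }
            return instance;
        }

        String getResultDir() {
            return resultDir;
        }

        // Clears the shared reference so the next getInstance call builds a new one.
        @Override
        public void close() {
            synchronized (ServiceSketch.class) {
                if (instance == this) {
                    instance = null;
                }
            }
        }
    }

    public static void main(String[] args) {
        // An earlier caller creates the instance with its own directory.
        ServiceSketch earlier = ServiceSketch.getInstance("/tmp");
        // A later caller asks for a different directory but gets the same object.
        ServiceSketch later = ServiceSketch.getInstance("/tmp/junit123");
        System.out.println(later == earlier);       // true
        System.out.println(later.getResultDir());   // /tmp
        earlier.close();
    }
}
```

In this sketch the second caller silently receives the first caller's directory, which mirrors the mismatch the issue reports.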
*Analysis:*

In troubleshooting the test failure, we added an extra assertion to *{{ProfilingServiceTest.setUp}}* to validate that the directory being written to was correct:
{code:java}
Assertions.assertEquals(tempDir.toString(), profilingService.getProfilingResultDir());
{code}
That assertion produced the following failure:
{code:java}
org.opentest4j.AssertionFailedError: expected: </var/folders/sh/5vx5kpkd5dn_pfdptn1s9rvc0000gn/T/junit9871405123519368112> but was: </var/folders/sh/5vx5kpkd5dn_pfdptn1s9rvc0000gn/T/>
{code}
This failure shows that the *{{ProfilingService}}* returned by *{{ProfilingService.getInstance}}* in the setup is not using the test's temporary directory. It therefore cannot be the instance this test class expects, because it has the wrong config.

The static method *{{ProfilingService.getInstance}}* reuses any existing instance of *{{ProfilingService}}* before it creates a new one, and disregards any differences in config in doing so. If another test instantiates a *{{ProfilingService}}* with different config first and does not close it, that previous instance is provided to *{{ProfilingServiceTest}}* rather than the new instance those tests seem to expect. This only happens with the first test run in this class, as the teardown method run after every test explicitly closes the existing *{{ProfilingService}}* instance.

Specifically, in the test failures I have observed, it seems that if *{{ProfilingServiceTest.testRollingDeletion}}* is run _before_ any other *{{ProfilingServiceTest}}* tests but _after_ the test methods in *{{JobIntermediateDatasetReuseTest}}* (or any other tests that create a *{{TaskExecutor}}* via a *{{MiniCluster}}*), it will fail.
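Why only the first test in the class is affected can be shown with a small simulation. Everything below is a hypothetical sketch: {{Service}}, {{runTest}} and {{simulate}} are illustrative names, the directories are made up, and the setup and teardown steps are modelled as plain method calls rather than JUnit callbacks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the test lifecycle described above.
// None of these classes are Flink or JUnit classes.
final class TestOrderSketch {

    // Singleton that reuses any live instance regardless of the requested dir.
    static final class Service implements AutoCloseable {
        private static Service instance;
        final String dir;

        private Service(String dir) {
            this.dir = dir;
        }

        static synchronized Service getInstance(String dir) {
            if (instance == null) {
                instance = new Service(dir);
            }
            return instance;
        }

        @Override
        public void close() {
            synchronized (Service.class) {
                if (instance == this) {
                    instance = null;
                }
            }
        }
    }

    // Models setUp, the directory check, and tearDown for one test method.
    // Returns true when the service uses the test's own temp directory.
    static boolean runTest(String tempDir) {
        Service service = Service.getInstance(tempDir);   // setUp
        try {
            return service.dir.equals(tempDir);           // directory check
        } finally {
            service.close();                              // tearDown closes whatever it was given
        }
    }

    // Runs two test methods, optionally after an earlier test that created
    // an instance with a default directory and never closed it.
    static List<Boolean> simulate(boolean earlierTestLeaksInstance) {
        if (earlierTestLeaksInstance) {
            Service.getInstance("/tmp");                  // leaked, never closed
        }
        List<Boolean> results = new ArrayList<>();
        results.add(runTest("/tmp/junit-1"));
        results.add(runTest("/tmp/junit-2"));
        return results;
    }

    public static void main(String[] args) {
        System.out.println(simulate(true));   // [false, true]
        System.out.println(simulate(false));  // [true, true]
    }
}
```

With a leaked instance, the first simulated test sees the wrong directory and its teardown then clears the leak, so the second test gets a fresh instance.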
From what I've been able to gather, *{{TaskExecutor}}* calls *{{ProfilingService.getInstance}}* with default config and holds on to that instance internally, but does not close that *{{ProfilingService}}* instance when the *{{TaskExecutor}}* instance is itself closed. This means that instance is sometimes still around when *{{ProfilingServiceTest.setUp}}* is run, so it gets passed to *{{ProfilingServiceTest.testRollingDeletion}}*, at which point that test fails because it incorrectly assumes that it has a new *{{ProfilingService}}* instance with a clean directory configured.

Logs are attached, produced with the following command:
{code:bash}
mvn clean package -Denforcer.skip -Dcheckstyle.skip -Drat.skip=true -pl :flink-runtime
{code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)