Grace Grimwood created FLINK-35571:
--------------------------------------

             Summary: ProfilingServiceTest.testRollingDeletion intermittently fails due to improper test isolation
                 Key: FLINK-35571
                 URL: https://issues.apache.org/jira/browse/FLINK-35571
             Project: Flink
          Issue Type: Bug
          Components: Tests
         Environment: *Git revision:*
{code:bash}
$ git show
commit b8d527166e095653ae3ff5c0431bf27297efe229 (HEAD -> master)
{code}

*Java info:*
{code:bash}
$ java -version
openjdk version "17.0.11" 2024-04-16
OpenJDK Runtime Environment Temurin-17.0.11+9 (build 17.0.11+9)
OpenJDK 64-Bit Server VM Temurin-17.0.11+9 (build 17.0.11+9, mixed mode)
{code}

{code:bash}
$ sdk current
Using:
java: 17.0.11-tem
maven: 3.8.6
scala: 2.12.19
{code}

*OS info:*
{code:bash}
$ uname -av
Darwin MacBook-Pro 23.5.0 Darwin Kernel Version 23.5.0: Wed May  1 20:14:38 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6020 arm64
{code}

*Hardware info:*
{code:bash}
$ sysctl -a | grep -e 'machdep\.cpu\.brand_string\:' -e 'machdep\.cpu\.core_count\:' -e 'hw\.memsize\:'
hw.memsize: 34359738368
machdep.cpu.core_count: 12
machdep.cpu.brand_string: Apple M2 Pro
{code}
            Reporter: Grace Grimwood
         Attachments: 20240612_181148_mvn-clean-package_flink-runtime_also-make.log

*Symptom:*
The test *{{ProfilingServiceTest.testRollingDeletion}}* fails with the 
following error:
{code:java}
[ERROR] Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 25.32 s <<< FAILURE! -- in org.apache.flink.runtime.util.profiler.ProfilingServiceTest
[ERROR] org.apache.flink.runtime.util.profiler.ProfilingServiceTest.testRollingDeletion -- Time elapsed: 9.264 s <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <3> but was: <6>
        at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
        at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
        at org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
        at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
        at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:145)
        at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:531)
        at org.apache.flink.runtime.util.profiler.ProfilingServiceTest.verifyRollingDeletionWorks(ProfilingServiceTest.java:175)
        at org.apache.flink.runtime.util.profiler.ProfilingServiceTest.testRollingDeletion(ProfilingServiceTest.java:117)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at java.base/java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:194)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
        at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
{code}
The number of extra files found varies from failure to failure.

*Cause:*
Many of the tests in *{{ProfilingServiceTest}}* rely on a specific 
configuration of the *{{ProfilingService}}* instance, but 
*{{ProfilingService.getInstance}}* does not check whether an existing 
instance's config matches the provided config before returning it. Because of 
this, and because JUnit does not guarantee a specific ordering of tests (unless 
they are specifically annotated), it is possible for these tests to receive an 
instance that does not behave in the expected way and therefore fail.
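
As a minimal sketch of the behaviour described above: the class below is a hypothetical stand-in ({{SketchService}} and {{SingletonLeakDemo}} are invented names, and the directory strings are placeholders), not the real *{{ProfilingService}}*. It only illustrates how a singleton accessor that ignores the requested config can hand a caller an instance configured by someone else.

```java
/** Hypothetical stand-in for a config-ignoring singleton; NOT Flink's ProfilingService. */
final class SketchService implements AutoCloseable {
    private static SketchService instance;
    private final String resultDir;

    private SketchService(String resultDir) {
        this.resultDir = resultDir;
    }

    /** Returns the cached instance if one exists, without comparing resultDir. */
    static synchronized SketchService getInstance(String resultDir) {
        if (instance == null) {
            instance = new SketchService(resultDir);
        }
        return instance;
    }

    String getResultDir() {
        return resultDir;
    }

    /** Clears the cached instance so the next getInstance call creates a new one. */
    @Override
    public void close() {
        synchronized (SketchService.class) {
            if (instance == this) {
                instance = null;
            }
        }
    }
}

final class SingletonLeakDemo {
    /** Simulates an earlier caller leaving an instance open before the test's setup runs. */
    static String dirSeenAfterLeak() {
        // Earlier caller: default directory, never closed before the next request.
        SketchService leaked = SketchService.getInstance("/tmp");
        // Test setup: asks for its own temp directory but receives the leaked instance.
        SketchService underTest = SketchService.getInstance("/tmp/junit-dir");
        String dir = underTest.getResultDir();
        leaked.close(); // clean up so the demo can be run repeatedly
        return dir;
    }

    public static void main(String[] args) {
        System.out.println(dirSeenAfterLeak());
    }
}
```

In this sketch the second caller ends up with the first caller's directory, which is the same shape of mismatch as the one described in the analysis.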

*Analysis:*
In troubleshooting the test failure, we tried adding an extra assertion to 
*{{ProfilingServiceTest.setUp}}* to verify that the directory being written to 
was correct:
{code:java}
Assertions.assertEquals(tempDir.toString(), profilingService.getProfilingResultDir());
{code}
That assert produced the following failure:
{code:java}
org.opentest4j.AssertionFailedError: expected: </var/folders/sh/5vx5kpkd5dn_pfdptn1s9rvc0000gn/T/junit9871405123519368112> but was: </var/folders/sh/5vx5kpkd5dn_pfdptn1s9rvc0000gn/T/>
{code}
This failure shows that the *{{ProfilingService}}* returned by 
*{{ProfilingService.getInstance}}* in the setup is not using the correct 
directory, and therefore cannot be the correct instance for this test class 
because it has the wrong config.

This is because the static method *{{ProfilingService.getInstance}}* reuses any 
existing instance of *{{ProfilingService}}* before creating a new one, and 
disregards any differences in config when doing so. If another test 
instantiates a *{{ProfilingService}}* with a different config first and does 
not close it, that earlier instance is handed to *{{ProfilingServiceTest}}* 
instead of the new instance those tests seem to expect. This only affects the 
first test run in this class, as the teardown method run after every test 
explicitly closes the existing *{{ProfilingService}}* instance.

Specifically, in the test failures I have observed, it seems that 
*{{ProfilingServiceTest.testRollingDeletion}}* fails if it is run _before_ any 
other *{{ProfilingServiceTest}}* tests but _after_ the test methods in 
*{{JobIntermediateDatasetReuseTest}}* (or any other tests that create a 
*{{TaskExecutor}}* via a *{{MiniCluster}}*). From what I've been able to 
gather, *{{TaskExecutor}}* calls *{{ProfilingService.getInstance}}* with the 
default config and holds on to that instance internally, but does not close it 
when the *{{TaskExecutor}}* itself is closed. That instance is therefore 
sometimes still around when *{{ProfilingServiceTest.setUp}}* runs, and it gets 
passed to *{{ProfilingServiceTest.testRollingDeletion}}*. The test then fails 
because it incorrectly assumes that it has a new *{{ProfilingService}}* 
instance with a clean directory configured.
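
One possible direction, sketched here purely as an illustration and not as the proposed fix, would be for the accessor to compare the requested config against the cached instance and replace the cached instance when they differ. Everything below is hypothetical ({{ConfigAwareService}}, {{ConfigAwareDemo}} and the directory strings are invented for the example) and does not reflect the real *{{ProfilingService}}* API.

```java
import java.util.Objects;

/** Hypothetical stand-in; names and behaviour are assumptions, NOT Flink's ProfilingService. */
final class ConfigAwareService implements AutoCloseable {
    private static ConfigAwareService instance;
    private final String resultDir;

    private ConfigAwareService(String resultDir) {
        this.resultDir = resultDir;
    }

    /** Reuses the cached instance only when its directory matches the requested one. */
    static synchronized ConfigAwareService getInstance(String resultDir) {
        if (instance != null && !Objects.equals(instance.resultDir, resultDir)) {
            // Cached instance was created with a different config: discard it.
            instance.close();
        }
        if (instance == null) {
            instance = new ConfigAwareService(resultDir);
        }
        return instance;
    }

    String getResultDir() {
        return resultDir;
    }

    /** Clears the cached instance if this object is the one currently cached. */
    @Override
    public void close() {
        synchronized (ConfigAwareService.class) {
            if (instance == this) {
                instance = null;
            }
        }
    }
}

final class ConfigAwareDemo {
    /** An earlier caller leaves an instance open; the later caller still gets its own config. */
    static String dirAfterStaleInstance() {
        ConfigAwareService.getInstance("/tmp"); // earlier caller, never closed
        try (ConfigAwareService fresh = ConfigAwareService.getInstance("/tmp/junit-dir")) {
            return fresh.getResultDir();
        }
    }

    public static void main(String[] args) {
        System.out.println(dirAfterStaleInstance());
    }
}
```

The trade-off in this sketch is that replacing the cached instance closes it underneath whoever still holds a reference to it, so it may be less suitable than having the owner of the earlier instance close it, or having the test setup reset any pre-existing instance.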


Logs are attached, produced with the following command:
{code:bash}
mvn clean package -Denforcer.skip -Dcheckstyle.skip -Drat.skip=true -pl :flink-runtime
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
