Hi devs, I created a ticket FLINK-35571 - ProfilingServiceTest.testRollingDeletion intermittently fails due to improper test isolation[1] yesterday, and my colleagues and I were hoping to get some feedback on how best for us to fix it. We'd like to raise a PR to fix it, just want to be sure we understand what we're doing and aren't breaking anything in doing so.
To summarise the issue briefly: this test fails sometimes with too many files in the tempDir used by the ProfilingService instance. This happens because the instance retrieved by ProfilingServiceTest.setUp is sometimes an old instance with different configuration that was created as part of other tests outside this test class. There's more detail on the ticket, but that's the gist. My colleagues and I have been looking into this, and we've got a handful of ideas for things we could change here. In test code: 1. Add a call to close any existing instances of ProfilingService in ProfilingServiceTest.setUp before instantiating its own 2. Modify ProfilingServiceTest.testRollingDeletion so that it is able to handle being given a previously-used instance of ProfilingService and any files it might leave laying around 3. Modify ProfilingServiceTest.testProfilingConfigurationWorkingAsExpected (or add another test) so that it covers cases where there is an existing instance with different config 4. Add a close method to TestingMiniCluster which calls close on any existing ProfilingService (prevents tests using TaskExecutor via TestingMiniCluster from leaking unclosed instances of ProfilingService) In application code: 1. Add a check in ProfilingService.getInstance to ensure the instance returned is correctly configured 2. Make ProfilingService.close synchronize on ProfilingService.class to ensure it cannot race with getInstance 3. Modify ProfilingService so it is no longer a global singleton We aren't familiar enough with these components of Flink to know what the implications are for the issue or our suggested fixes. For example, we don't think that allowing multiple instances of ProfilingService to exist would cause issues with the application code (especially if AsyncProfiler was kept as a singleton), but we don't know for certain because we're not very familiar with how this all fits together. We would appreciate any input anyone has on this. [1] https://issues.apache.org/jira/browse/FLINK-35571 Thanks, Grace -- Grace Grimwood She/They Senior Software Engineer Red Hat <https://www.redhat.com/> <https://www.redhat.com/>