Hi devs,

I created a ticket FLINK-35571 - ProfilingServiceTest.testRollingDeletion
intermittently fails due to improper test isolation[1] yesterday, and my
colleagues and I were hoping to get some feedback on how best for us to fix
it. We'd like to raise a PR to fix it, just want to be sure we understand
what we're doing and aren't breaking anything in doing so.

To summarise the issue briefly: this test fails sometimes with too many
files in the tempDir used by the ProfilingService instance. This happens
because the instance retrieved by ProfilingServiceTest.setUp is sometimes
an old instance with different configuration that was created as part of
other tests outside this test class. There's more detail on the ticket, but
that's the gist.

My colleagues and I have been looking into this, and we've got a handful of
ideas for things we could change here.
In test code:
  1. Add a call to close any existing instances of ProfilingService in
ProfilingServiceTest.setUp before instantiating its own
  2. Modify ProfilingServiceTest.testRollingDeletion so that it is able to
handle being given a previously-used instance of ProfilingService and any
files it might leave laying around
  3. Modify
ProfilingServiceTest.testProfilingConfigurationWorkingAsExpected (or add
another test) so that it covers cases where there is an existing instance
with different config
  4. Add a close method to TestingMiniCluster which calls close on any
existing ProfilingService (prevents tests using TaskExecutor via
TestingMiniCluster from leaking unclosed instances of ProfilingService)

In application code:
  1. Add a check in ProfilingService.getInstance to ensure the instance
returned is correctly configured
  2. Make ProfilingService.close synchronize on ProfilingService.class to
ensure it cannot race with getInstance
  3. Modify ProfilingService so it is no longer a global singleton

We aren't familiar enough with these components of Flink to know what the
implications are for the issue or our suggested fixes. For example, we
don't think that allowing multiple instances of ProfilingService to exist
would cause issues with the application code (especially if AsyncProfiler
was kept as a singleton), but we don't know for certain because we're not
very familiar with how this all fits together. We would appreciate any
input anyone has on this.

[1] https://issues.apache.org/jira/browse/FLINK-35571

Thanks,
Grace

-- 

Grace Grimwood

She/They

Senior Software Engineer

Red Hat <https://www.redhat.com/>

<https://www.redhat.com/>

Reply via email to