[ https://issues.apache.org/jira/browse/FLINK-22891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390241#comment-17390241 ]
Yangze Guo commented on FLINK-22891:
------------------------------------

After a deeper investigation, I think the root cause is that {{ScheduledFuture#isDone}} can return a false negative, and thus we miss scheduling a {{checkResourceRequirements}}. The core logic is located in {{FutureTask#run}}.
{code:java}
public void run() {
    if (state != NEW ||
        !UNSAFE.compareAndSwapObject(this, runnerOffset,
                                     null, Thread.currentThread()))
        return;
    try {
        Callable<V> c = callable;
        if (c != null && state == NEW) {
            V result;
            boolean ran;
            try {
                result = c.call();
                ran = true;
            } catch (Throwable ex) {
                result = null;
                ran = false;
                setException(ex);
            }
            if (ran)
                set(result);
        }
    } finally {
        // runner must be non-null until state is settled to
        // prevent concurrent calls to run()
        runner = null;
        // state must be re-read after nulling runner to prevent
        // leaked interrupts
        int s = state;
        if (s >= INTERRUPTING)
            handlePossibleCancellationInterrupt(s);
    }
}
{code}
{{ScheduledFuture#isDone}} returns true only after {{set(result)}} has executed. However, if we call {{isDone}} after {{result = c.call()}} has returned but before {{set(result)}}, it observes an intermediate state: the task body has already finished, yet the future still reports that it is not done. As a result, we do not schedule another {{checkResourceRequirements}} as expected.

One possible solution is to replace the {{ScheduledFuture}} with a {{CompletableFuture}} and complete it at the end of {{checkResourceRequirements}}.
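
To make the proposed direction concrete, here is a minimal sketch, not the actual Flink change. The class {{RequirementsCheckScheduler}}, its methods, and the use of a lock are all hypothetical; the lock only stands in for whatever serializes the check and the {{isDone}} query in the real code. The point is that the "done" signal is set by our own code as the last step of the check, so a caller can never observe "check finished but future not done".

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the scheduling logic; names are illustrative only. */
class RequirementsCheckScheduler {
    private final ScheduledExecutorService executor;
    private final long delayMillis;
    private final Runnable check; // plays the role of checkResourceRequirements

    // Completed explicitly at the end of the check, instead of relying on
    // the ScheduledFuture returned by the executor.
    private CompletableFuture<Void> pendingCheck = CompletableFuture.completedFuture(null);

    RequirementsCheckScheduler(ScheduledExecutorService executor, long delayMillis, Runnable check) {
        this.executor = executor;
        this.delayMillis = delayMillis;
        this.check = check;
    }

    /** Schedules a check unless one is still pending; returns true if it scheduled one. */
    synchronized boolean scheduleCheckIfNoneIsPending() {
        if (!pendingCheck.isDone()) {
            return false; // a check is already scheduled or running
        }
        CompletableFuture<Void> thisCheck = new CompletableFuture<>();
        pendingCheck = thisCheck;
        Runnable task = () -> {
            // Assumption: the check and the isDone query are serialized
            // (here by this lock), so the check body and the completion
            // are seen as a single step by callers.
            synchronized (this) {
                try {
                    check.run();
                } finally {
                    thisCheck.complete(null);
                }
            }
        };
        executor.schedule(task, delayMillis, TimeUnit.MILLISECONDS);
        return true;
    }

    /** Exposes the current future so callers can wait for the pending check. */
    synchronized CompletableFuture<Void> pendingCheck() {
        return pendingCheck;
    }
}
```

With this shape, a second call to {{scheduleCheckIfNoneIsPending}} while a check is pending is skipped, and a call made after the check has finished always schedules a new one. Whether a lock or a single-threaded executor provides the serialization is a design choice that this sketch leaves open.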

> FineGrainedSlotManagerDefaultResourceAllocationStrategyITCase fails on azure
> ----------------------------------------------------------------------------
>
>                 Key: FLINK-22891
>                 URL: https://issues.apache.org/jira/browse/FLINK-22891
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.0, 1.13.1
>            Reporter: Dawid Wysakowicz
>            Assignee: Yangze Guo
>            Priority: Major
>              Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=18700&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=05b74a19-4ee4-5036-c46f-ada307df6cf0&l=8660
> {code}
> Jun 05 21:16:00 [ERROR] Tests run: 11, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.24 s <<< FAILURE! - in org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerDefaultResourceAllocationStrategyITCase
> Jun 05 21:16:00 [ERROR] testResourceCanBeAllocatedForDifferentJobWithDeclarationBeforeSlotFree(org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerDefaultResourceAllocationStrategyITCase) Time elapsed: 5.015 s <<< ERROR!
> Jun 05 21:16:00 java.util.concurrent.TimeoutException
> Jun 05 21:16:00 	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
> Jun 05 21:16:00 	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
> Jun 05 21:16:00 	at org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTestBase.assertFutureCompleteAndReturn(FineGrainedSlotManagerTestBase.java:121)
> Jun 05 21:16:00 	at org.apache.flink.runtime.resourcemanager.slotmanager.AbstractFineGrainedSlotManagerITCase$4.lambda$new$4(AbstractFineGrainedSlotManagerITCase.java:374)
> Jun 05 21:16:00 	at org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTestBase$Context.runTest(FineGrainedSlotManagerTestBase.java:212)
> Jun 05 21:16:00 	at org.apache.flink.runtime.resourcemanager.slotmanager.AbstractFineGrainedSlotManagerITCase$4.<init>(AbstractFineGrainedSlotManagerITCase.java:310)
> Jun 05 21:16:00 	at org.apache.flink.runtime.resourcemanager.slotmanager.AbstractFineGrainedSlotManagerITCase.testResourceCanBeAllocatedForDifferentJobAfterFree(AbstractFineGrainedSlotManagerITCase.java:308)
> Jun 05 21:16:00 	at org.apache.flink.runtime.resourcemanager.slotmanager.AbstractFineGrainedSlotManagerITCase.testResourceCanBeAllocatedForDifferentJobWithDeclarationBeforeSlotFree(AbstractFineGrainedSlotManagerITCase.java:262)
> Jun 05 21:16:00 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> Jun 05 21:16:00 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Jun 05 21:16:00 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Jun 05 21:16:00 	at java.lang.reflect.Method.invoke(Method.java:498)
> Jun 05 21:16:00 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> Jun 05 21:16:00 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> Jun 05 21:16:00 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> Jun 05 21:16:00 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> Jun 05 21:16:00 	at org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> Jun 05 21:16:00 	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
> Jun 05 21:16:00 	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> Jun 05 21:16:00 	at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
> Jun 05 21:16:00 	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
> Jun 05 21:16:00 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
> Jun 05 21:16:00 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
> Jun 05 21:16:00 	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> Jun 05 21:16:00 	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> Jun 05 21:16:00 	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> Jun 05 21:16:00 	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> Jun 05 21:16:00 	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> Jun 05 21:16:00 	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> Jun 05 21:16:00 	at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> Jun 05 21:16:00 	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> Jun 05 21:16:00 	at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> Jun 05 21:16:00 	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> Jun 05 21:16:00 	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> Jun 05 21:16:00 	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> Jun 05 21:16:00 	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> Jun 05 21:16:00 	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> Jun 05 21:16:00 	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> Jun 05 21:16:00
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)