[ 
https://issues.apache.org/jira/browse/FLINK-22891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390241#comment-17390241
 ] 

Yangze Guo commented on FLINK-22891:
------------------------------------

After a deeper investigation, I think the root cause is that 
{{ScheduledFuture#isDone}} can return a false negative, so we miss 
scheduling a {{checkResourceRequirements}}.
The core logic is located in {{FutureTask#run}}.

{code:java}
public void run() {
    if (state != NEW ||
        !UNSAFE.compareAndSwapObject(this, runnerOffset,
                                        null, Thread.currentThread()))
        return;
    try {
        Callable<V> c = callable;
        if (c != null && state == NEW) {
            V result;
            boolean ran;
            try {
                result = c.call();
                ran = true;
            } catch (Throwable ex) {
                result = null;
                ran = false;
                setException(ex);
            }
            if (ran)
                set(result);
        }
    } finally {
        // runner must be non-null until state is settled to
        // prevent concurrent calls to run()
        runner = null;
        // state must be re-read after nulling runner to prevent
        // leaked interrupts
        int s = state;
        if (s >= INTERRUPTING)
            handlePossibleCancellationInterrupt(s);
    }
}
{code}

On the normal path, {{ScheduledFuture#isDone}} returns true only after 
{{set(result)}} has executed. However, if we call {{isDone}} after 
{{result = c.call()}} has returned but before {{set(result)}} runs, it 
observes an intermediate state, still returns false, and we do not schedule 
another {{checkResourceRequirements}} as expected.
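
The window can be made visible with a plain {{FutureTask}}. This is a minimal 
sketch, not Flink code: the class and method names are made up for 
illustration. It asks the task for {{isDone}} on the last line of its own 
callable, i.e. when the work is finished but {{set(result)}} has not run yet. 
Another thread calling {{isDone}} at that moment would see the same value.

```java
import java.util.concurrent.FutureTask;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; only FutureTask itself comes from the JDK.
final class IsDoneWindowDemo {

    /**
     * Returns two observations of isDone():
     * [0] taken on the last line of the callable body,
     * [1] taken after run() has returned.
     */
    static boolean[] observe() {
        AtomicReference<FutureTask<Void>> self = new AtomicReference<>();
        boolean[] seenInside = new boolean[1];

        FutureTask<Void> task = new FutureTask<>(() -> {
            // The actual work would already be finished at this point,
            // but FutureTask#run has not reached set(result) yet.
            seenInside[0] = self.get().isDone();
            return null;
        });
        self.set(task);

        // Run synchronously on the calling thread to keep the demo deterministic.
        task.run();

        return new boolean[] {seenInside[0], task.isDone()};
    }

    public static void main(String[] args) {
        boolean[] result = observe();
        System.out.println("isDone at end of callable: " + result[0]);
        System.out.println("isDone after run():        " + result[1]);
    }
}
```

The first observation is false because the state only leaves {{NEW}} inside 
{{set(result)}}, which is what makes a check based on {{isDone}} unreliable 
here.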

One possible solution is to replace the {{ScheduledFuture}} with a 
{{CompletableFuture}} and complete it at the end of 
{{checkResourceRequirements}}.
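
A rough sketch of that idea follows. It is not the actual patch: the class 
{{RequirementsCheckSketch}}, the method {{triggerCheck}} and the counter are 
assumptions made for illustration, and the delay used by the real scheduling 
is left out. The point is only that the future is completed explicitly as the 
last step of the check, so {{isDone}} no longer depends on the internal state 
of {{FutureTask}}.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

// Hypothetical sketch; all names here are illustrative, not Flink's API.
final class RequirementsCheckSketch {

    private final Executor mainThreadExecutor;
    private int checksRun;

    // Starts completed so that the first trigger schedules a check.
    private CompletableFuture<Void> checkFuture =
            CompletableFuture.completedFuture(null);

    RequirementsCheckSketch(Executor mainThreadExecutor) {
        this.mainThreadExecutor = mainThreadExecutor;
    }

    /**
     * Schedules a check unless one is still pending.
     * Assumed to be called only from the main thread.
     */
    void triggerCheck() {
        if (checkFuture.isDone()) {
            CompletableFuture<Void> future = new CompletableFuture<>();
            checkFuture = future;
            mainThreadExecutor.execute(() -> {
                try {
                    checkResourceRequirements();
                } finally {
                    // Completed explicitly at the very end of the check.
                    future.complete(null);
                }
            });
        }
    }

    private void checkResourceRequirements() {
        // Placeholder for the real requirement check.
        checksRun++;
    }

    int checksRun() {
        return checksRun;
    }
}
```

With an executor that just queues the runnables, two calls to 
{{triggerCheck}} in a row enqueue a single check, and a call made after that 
check has run enqueues a new one.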

> FineGrainedSlotManagerDefaultResourceAllocationStrategyITCase fails on azure
> ----------------------------------------------------------------------------
>
>                 Key: FLINK-22891
>                 URL: https://issues.apache.org/jira/browse/FLINK-22891
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.0, 1.13.1
>            Reporter: Dawid Wysakowicz
>            Assignee: Yangze Guo
>            Priority: Major
>              Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=18700&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=05b74a19-4ee4-5036-c46f-ada307df6cf0&l=8660
> {code}
> Jun 05 21:16:00 [ERROR] Tests run: 11, Failures: 0, Errors: 1, Skipped: 0, 
> Time elapsed: 6.24 s <<< FAILURE! - in 
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerDefaultResourceAllocationStrategyITCase
> Jun 05 21:16:00 [ERROR] 
> testResourceCanBeAllocatedForDifferentJobWithDeclarationBeforeSlotFree(org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerDefaultResourceAllocationStrategyITCase)
>   Time elapsed: 5.015 s  <<< ERROR!
> Jun 05 21:16:00 java.util.concurrent.TimeoutException
> Jun 05 21:16:00       at 
> java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
> Jun 05 21:16:00       at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
> Jun 05 21:16:00       at 
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTestBase.assertFutureCompleteAndReturn(FineGrainedSlotManagerTestBase.java:121)
> Jun 05 21:16:00       at 
> org.apache.flink.runtime.resourcemanager.slotmanager.AbstractFineGrainedSlotManagerITCase$4.lambda$new$4(AbstractFineGrainedSlotManagerITCase.java:374)
> Jun 05 21:16:00       at 
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTestBase$Context.runTest(FineGrainedSlotManagerTestBase.java:212)
> Jun 05 21:16:00       at 
> org.apache.flink.runtime.resourcemanager.slotmanager.AbstractFineGrainedSlotManagerITCase$4.<init>(AbstractFineGrainedSlotManagerITCase.java:310)
> Jun 05 21:16:00       at 
> org.apache.flink.runtime.resourcemanager.slotmanager.AbstractFineGrainedSlotManagerITCase.testResourceCanBeAllocatedForDifferentJobAfterFree(AbstractFineGrainedSlotManagerITCase.java:308)
> Jun 05 21:16:00       at 
> org.apache.flink.runtime.resourcemanager.slotmanager.AbstractFineGrainedSlotManagerITCase.testResourceCanBeAllocatedForDifferentJobWithDeclarationBeforeSlotFree(AbstractFineGrainedSlotManagerITCase.java:262)
> Jun 05 21:16:00       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> Jun 05 21:16:00       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Jun 05 21:16:00       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Jun 05 21:16:00       at java.lang.reflect.Method.invoke(Method.java:498)
> Jun 05 21:16:00       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> Jun 05 21:16:00       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> Jun 05 21:16:00       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> Jun 05 21:16:00       at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> Jun 05 21:16:00       at 
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> Jun 05 21:16:00       at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
> Jun 05 21:16:00       at 
> org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> Jun 05 21:16:00       at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
> Jun 05 21:16:00       at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
> Jun 05 21:16:00       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
> Jun 05 21:16:00       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
> Jun 05 21:16:00       at 
> org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> Jun 05 21:16:00       at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> Jun 05 21:16:00       at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> Jun 05 21:16:00       at 
> org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> Jun 05 21:16:00       at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> Jun 05 21:16:00       at 
> org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> Jun 05 21:16:00       at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> Jun 05 21:16:00       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> Jun 05 21:16:00       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> Jun 05 21:16:00       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> Jun 05 21:16:00       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> Jun 05 21:16:00       at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> Jun 05 21:16:00       at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> Jun 05 21:16:00       at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> Jun 05 21:16:00       at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> Jun 05 21:16:00 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
