[ https://issues.apache.org/jira/browse/FLINK-37701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17959556#comment-17959556 ]
Aleksandr Iushmanov commented on FLINK-37701:
---------------------------------------------

I can see 2 problems breaking this test.
1. The execution graph, for some reason, goes through the `Cancelling -> Cancelled` states before job resubmission (which doesn't match my expectations based on the docs, see !screenshot-1.png!). Going through a `terminal` state `nulls` the checkpoint coordinator, hence the StateSizeEstimate class can completely ignore the last checkpoint.
2. The test job doesn't have any `keyedManagedState`, so the `StateSizeEstimate` scorer gives 0. This way matching key group allocations get the same score of 0 as non-matching ones, which leads to random slot allocation.

I have raised this PR as a discussion starter. [~roman], please let me know what you think.
https://github.com/apache/flink/pull/26663

> The testRecoverLocallyFromProcessCrashWithWorkingDirectory test failed of azure cron adaptive scheduler pipeline
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-37701
>                 URL: https://issues.apache.org/jira/browse/FLINK-37701
>             Project: Flink
>          Issue Type: Bug
>          Components: Build System / Azure Pipelines, Build System / CI
>    Affects Versions: 2.1.0
>            Reporter: dalongliu
>            Assignee: Aleksandr Iushmanov
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.1.0
>
>         Attachments: screenshot-1.png
>
>
> The detail:
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=67293&view=logs&j=8fd9202e-fd17-5b26-353c-ac1ff76c8f28&t=ea7cf968-e585-52cb-e0fc-f48de023a7ca
> {code:java}
> Apr 20 03:21:57 03:21:57.387 [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.77 s <<< FAILURE! -- in org.apache.flink.test.recovery.LocalRecoveryITCase
> Apr 20 03:21:57 03:21:57.387 [ERROR] org.apache.flink.test.recovery.LocalRecoveryITCase.testRecoverLocallyFromProcessCrashWithWorkingDirectory -- Time elapsed: 17.74 s <<< FAILURE!
> Apr 20 03:21:57 org.opentest4j.AssertionFailedError: [The task was deployed to AllocationID(bb6371bf3fe9fbcb2ee329893e802fde) but it should have been deployed to AllocationID(5100f7baf1dea42453fd9b1c17d6d732) for local recovery., The task was deployed to AllocationID(e357fcd5041e52b7e647ca463cfe471a) but it should have been deployed to AllocationID(bb6371bf3fe9fbcb2ee329893e802fde) for local recovery., The task was deployed to AllocationID(5100f7baf1dea42453fd9b1c17d6d732) but it should have been deployed to AllocationID(e357fcd5041e52b7e647ca463cfe471a) for local recovery.] ==> expected: <true> but was: <false>
> Apr 20 03:21:57 	at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
> Apr 20 03:21:57 	at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
> Apr 20 03:21:57 	at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
> Apr 20 03:21:57 	at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
> Apr 20 03:21:57 	at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:214)
> Apr 20 03:21:57 	at org.apache.flink.test.recovery.LocalRecoveryITCase.testRecoverLocallyFromProcessCrashWithWorkingDirectory(LocalRecoveryITCase.java:119)
> Apr 20 03:21:57 	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
> Apr 20 03:21:57 	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
> Apr 20 03:21:57 	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
> Apr 20 03:21:57 	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
> Apr 20 03:21:57 	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
> Apr 20 03:21:57 	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
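
Illustrative note on the second problem described in the comment (a scorer that returns 0 for every slot): the sketch below is a simplified, self-contained model of score-based slot matching and is not Flink's actual code. The class `SlotScoringSketch`, its methods, and the slot names are hypothetical; the only assumption carried over from the comment is that a slot's score is derived from the keyed managed state it holds for the task. With no keyed state, every slot ties at 0, so whichever slot gets picked among the tied ones has no relation to local recovery.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Simplified model of score-based slot matching; hypothetical names, not Flink's classes. */
public class SlotScoringSketch {

    /** Score of a slot: bytes of keyed managed state it holds locally (0 if none is known). */
    public static long score(Map<String, Long> localStateBytesBySlot, String slot) {
        return localStateBytesBySlot.getOrDefault(slot, 0L);
    }

    /** Returns every slot sharing the highest score; more than one entry means a tie. */
    public static List<String> bestSlots(Map<String, Long> localStateBytesBySlot, List<String> slots) {
        long best = Long.MIN_VALUE;
        List<String> winners = new ArrayList<>();
        for (String slot : slots) {
            long s = score(localStateBytesBySlot, slot);
            if (s > best) {
                // Strictly better score: forget earlier candidates.
                best = s;
                winners.clear();
            }
            if (s == best) {
                winners.add(slot);
            }
        }
        return winners;
    }

    public static void main(String[] args) {
        List<String> slots = List.of("slot-a", "slot-b", "slot-c");

        // Job with keyed state: the slot holding the state is the unique best match.
        System.out.println(bestSlots(Map.of("slot-b", 1024L), slots)); // [slot-b]

        // Job without keyed state: all scores are 0, so all slots tie.
        System.out.println(bestSlots(Map.of(), slots)); // [slot-a, slot-b, slot-c]
    }
}
```

In the second call the matching allocation cannot be told apart from the others, which is consistent with the shuffled AllocationIDs reported in the assertion message of the quoted log.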