[ https://issues.apache.org/jira/browse/FLINK-37701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17959512#comment-17959512 ]
Aleksandr Iushmanov commented on FLINK-37701:
---------------------------------------------

I was able to reproduce it in IntelliJ when running with the adaptive scheduler enabled (VM options: `-Dflink.tests.enable-adaptive-scheduler=true`).

From what I can see, `StateLocalitySlotAssigner` is used, but it doesn't receive correct state size estimates. I could trace it to `StateSizeEstimates#fromGraph`, which attempts to retrieve the last checkpoint data. The problem is that `checkpointCoordinator` is already `null` at this point because the job graph has reached a terminal state (the job was cancelled). I will look into other ways to provide checkpoint data there.

> The testRecoverLocallyFromProcessCrashWithWorkingDirectory test failed of azure cron adaptive scheduler pipeline
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-37701
>                 URL: https://issues.apache.org/jira/browse/FLINK-37701
>             Project: Flink
>          Issue Type: Bug
>          Components: Build System / Azure Pipelines, Build System / CI
>    Affects Versions: 2.1.0
>            Reporter: dalongliu
>            Assignee: Aleksandr Iushmanov
>            Priority: Major
>             Fix For: 2.1.0
>
>
> The detail: https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=67293&view=logs&j=8fd9202e-fd17-5b26-353c-ac1ff76c8f28&t=ea7cf968-e585-52cb-e0fc-f48de023a7ca
> {code:java}
> Apr 20 03:21:57 03:21:57.387 [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.77 s <<< FAILURE! -- in org.apache.flink.test.recovery.LocalRecoveryITCase
> Apr 20 03:21:57 03:21:57.387 [ERROR] org.apache.flink.test.recovery.LocalRecoveryITCase.testRecoverLocallyFromProcessCrashWithWorkingDirectory -- Time elapsed: 17.74 s <<< FAILURE!
> Apr 20 03:21:57 org.opentest4j.AssertionFailedError: [The task was deployed to AllocationID(bb6371bf3fe9fbcb2ee329893e802fde) but it should have been deployed to AllocationID(5100f7baf1dea42453fd9b1c17d6d732) for local recovery., The task was deployed to AllocationID(e357fcd5041e52b7e647ca463cfe471a) but it should have been deployed to AllocationID(bb6371bf3fe9fbcb2ee329893e802fde) for local recovery., The task was deployed to AllocationID(5100f7baf1dea42453fd9b1c17d6d732) but it should have been deployed to AllocationID(e357fcd5041e52b7e647ca463cfe471a) for local recovery.] ==> expected: <true> but was: <false>
> Apr 20 03:21:57 	at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
> Apr 20 03:21:57 	at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
> Apr 20 03:21:57 	at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
> Apr 20 03:21:57 	at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
> Apr 20 03:21:57 	at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:214)
> Apr 20 03:21:57 	at org.apache.flink.test.recovery.LocalRecoveryITCase.testRecoverLocallyFromProcessCrashWithWorkingDirectory(LocalRecoveryITCase.java:119)
> Apr 20 03:21:57 	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
> Apr 20 03:21:57 	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
> Apr 20 03:21:57 	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
> Apr 20 03:21:57 	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
> Apr 20 03:21:57 	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
> Apr 20 03:21:57 	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
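
The failure mode described in the comment (state size estimates that come back empty once the checkpoint source is `null`, so the assigner has nothing to base a locality preference on) can be sketched in isolation. The following is a minimal illustration only, not Flink code: `CheckpointSource`, `estimatesFrom`, and `assign` are hypothetical names invented for this sketch, and the real `StateLocalitySlotAssigner` logic is likely more involved.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class StateSizeEstimateSketch {

    /** Hypothetical stand-in for whatever holds the latest checkpoint's per-task state sizes. */
    static final class CheckpointSource {
        private final Map<String, Long> stateSizeByTask;

        CheckpointSource(Map<String, Long> stateSizeByTask) {
            this.stateSizeByTask = new HashMap<>(stateSizeByTask);
        }

        Map<String, Long> latestStateSizes() {
            return Collections.unmodifiableMap(stateSizeByTask);
        }
    }

    /** Returns per-task state sizes, or an empty map when the source is already gone (null). */
    static Map<String, Long> estimatesFrom(CheckpointSource source) {
        return Optional.ofNullable(source)
                .map(CheckpointSource::latestStateSizes)
                .orElse(Collections.emptyMap());
    }

    /** Prefers the previous slot only if a positive estimate exists; otherwise takes the first free slot. */
    static String assign(String task, String previousSlot, String firstFreeSlot,
                         Map<String, Long> estimates) {
        long size = estimates.getOrDefault(task, 0L);
        return size > 0 ? previousSlot : firstFreeSlot;
    }

    public static void main(String[] args) {
        CheckpointSource live = new CheckpointSource(Map.of("task-0", 1024L));

        // With checkpoint data available, the task goes back to its previous slot.
        System.out.println(assign("task-0", "slot-A", "slot-B", estimatesFrom(live)));

        // With a null source (as after the job reached a terminal state), locality is lost.
        System.out.println(assign("task-0", "slot-A", "slot-B", estimatesFrom(null)));
    }
}
```

Under these assumptions the first call prints `slot-A` and the second prints `slot-B`, which mirrors the assertion in the test log: tasks end up on an allocation other than the one holding their local state.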