[ 
https://issues.apache.org/jira/browse/FLINK-37701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17959556#comment-17959556
 ] 

Aleksandr Iushmanov commented on FLINK-37701:
---------------------------------------------

I can see two problems breaking this test.
1. For some reason the execution graph goes through the `Cancelling -> 
Cancelled` states before the job is resubmitted (which doesn't match my 
expectations based on the docs !screenshot-1.png! ). Passing through a terminal 
state nulls the checkpoint coordinator, so the `StateSizeEstimate` class can 
completely ignore the last checkpoint. 
2. The test job doesn't have any `keyedManagedState`, so the 
`StateSizeEstimate` scorer gives 0. As a result, matching key group allocations 
get the same score (0) as non-matching ones, which leads to random slot 
allocation.
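
To illustrate the second point, here is a minimal sketch of how a score based 
on state size can collapse into a tie. The class name, method signature, and 
key group numbers are all hypothetical and are not taken from Flink's actual 
StateSizeEstimate or slot-matching code. The sketch only assumes that a 
candidate slot is scored by the estimated keyed state size of the key groups it 
held before the restart.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; not Flink's real scoring implementation. */
class LocalRecoveryScoreSketch {

    /**
     * Sums the estimated state size of every requested key group that the
     * candidate slot already held before the restart.
     */
    static long score(
            List<Integer> previousKeyGroups,
            List<Integer> requestedKeyGroups,
            Map<Integer, Long> stateSizeByKeyGroup) {
        long total = 0;
        for (int keyGroup : requestedKeyGroups) {
            if (previousKeyGroups.contains(keyGroup)) {
                // A key group with no known state contributes 0.
                total += stateSizeByKeyGroup.getOrDefault(keyGroup, 0L);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> requested = List.of(0, 1);
        List<Integer> matchingSlot = List.of(0, 1);
        List<Integer> otherSlot = List.of(2, 3);

        // No keyed state (or no checkpoint visible): both slots tie at 0.
        Map<Integer, Long> noState = Map.of();
        System.out.println(score(matchingSlot, requested, noState)); // 0
        System.out.println(score(otherSlot, requested, noState));    // 0

        // With keyed state, the matching slot wins.
        Map<Integer, Long> withState = Map.of(0, 100L, 1, 50L);
        System.out.println(score(matchingSlot, requested, withState)); // 150
        System.out.println(score(otherSlot, requested, withState));    // 0
    }
}
```

With a tie at 0, whichever slot the allocator happens to pick first is used, 
which would be consistent with the shuffled AllocationIDs in the assertion 
message quoted below.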

I have raised this PR as a discussion starter. [~roman], please let me know 
what you think.
https://github.com/apache/flink/pull/26663

> The  testRecoverLocallyFromProcessCrashWithWorkingDirectory test failed of 
> azure cron adaptive scheduler pipeline
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-37701
>                 URL: https://issues.apache.org/jira/browse/FLINK-37701
>             Project: Flink
>          Issue Type: Bug
>          Components: Build System / Azure Pipelines, Build System / CI
>    Affects Versions: 2.1.0
>            Reporter: dalongliu
>            Assignee: Aleksandr Iushmanov
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.1.0
>
>         Attachments: screenshot-1.png
>
>
> The detail: 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=67293&view=logs&j=8fd9202e-fd17-5b26-353c-ac1ff76c8f28&t=ea7cf968-e585-52cb-e0fc-f48de023a7ca
> {code:java}
> Apr 20 03:21:57 03:21:57.387 [ERROR] Tests run: 1, Failures: 1, Errors: 0, 
> Skipped: 0, Time elapsed: 17.77 s <<< FAILURE! -- in 
> org.apache.flink.test.recovery.LocalRecoveryITCase
> Apr 20 03:21:57 03:21:57.387 [ERROR] 
> org.apache.flink.test.recovery.LocalRecoveryITCase.testRecoverLocallyFromProcessCrashWithWorkingDirectory
>  -- Time elapsed: 17.74 s <<< FAILURE!
> Apr 20 03:21:57 org.opentest4j.AssertionFailedError: [The task was deployed 
> to AllocationID(bb6371bf3fe9fbcb2ee329893e802fde) but it should have been 
> deployed to AllocationID(5100f7baf1dea42453fd9b1c17d6d732) for local 
> recovery., The task was deployed to 
> AllocationID(e357fcd5041e52b7e647ca463cfe471a) but it should have been 
> deployed to AllocationID(bb6371bf3fe9fbcb2ee329893e802fde) for local 
> recovery., The task was deployed to 
> AllocationID(5100f7baf1dea42453fd9b1c17d6d732) but it should have been 
> deployed to AllocationID(e357fcd5041e52b7e647ca463cfe471a) for local 
> recovery.] ==> expected: <true> but was: <false>
> Apr 20 03:21:57       at 
> org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
> Apr 20 03:21:57       at 
> org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
> Apr 20 03:21:57       at 
> org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
> Apr 20 03:21:57       at 
> org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
> Apr 20 03:21:57       at 
> org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:214)
> Apr 20 03:21:57       at 
> org.apache.flink.test.recovery.LocalRecoveryITCase.testRecoverLocallyFromProcessCrashWithWorkingDirectory(LocalRecoveryITCase.java:119)
> Apr 20 03:21:57       at 
> java.base/java.lang.reflect.Method.invoke(Method.java:568)
> Apr 20 03:21:57       at 
> java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
> Apr 20 03:21:57       at 
> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
> Apr 20 03:21:57       at 
> java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
> Apr 20 03:21:57       at 
> java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
> Apr 20 03:21:57       at 
> java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
