[jira] [Commented] (FLINK-37701) The testRecoverLocallyFromProcessCrashWithWorkingDirectory test failed of azure cron adaptive scheduler pipeline

Aleksandr Iushmanov (Jira) Tue, 10 Jun 2025 08:49:13 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-37701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17959512#comment-17959512
 ]


Aleksandr Iushmanov commented on FLINK-37701:
---------------------------------------------

I was able to reproduce it in IntelliJ when running with the adaptive 
scheduler enabled (VM options: `-Dflink.tests.enable-adaptive-scheduler=true`). 

From what I can see, `StateLocalitySlotAssigner` is used, but it doesn't 
receive correct state size estimates. I traced this to 
`StateSizeEstimates#fromGraph`, which attempts to retrieve the last checkpoint's 
data. The problem is that `checkpointCoordinator` is already `null` at this point 
because the job graph has reached a terminal state (the job was cancelled). I will 
look into other ways to provide checkpoint data there.
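
A minimal sketch of the failure mode described above, not Flink's actual code: the class `StateSizeEstimatesSketch`, the interface `CheckpointSource`, and the methods `estimates` and `pickSlot` are all hypothetical names made up for illustration. It assumes that a missing checkpoint source yields empty estimates, and that a locality-aware assigner with empty estimates cannot tell the slots apart and so falls back to an arbitrary one.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class StateSizeEstimatesSketch {

    /** Hypothetical stand-in for the component that knows the last checkpoint. */
    interface CheckpointSource {
        /** State size in bytes, keyed by the allocation that previously held it. */
        Map<String, Long> stateSizeByAllocation();
    }

    /** Returns empty estimates when the source is null (assumed: job already terminal). */
    static Map<String, Long> estimates(CheckpointSource source) {
        return Optional.ofNullable(source)
                .map(CheckpointSource::stateSizeByAllocation)
                .orElse(Map.of());
    }

    /** Picks the candidate holding the most local state; ties go to the first candidate. */
    static String pickSlot(List<String> candidates, Map<String, Long> estimates) {
        String best = candidates.get(0);
        for (String slot : candidates) {
            if (estimates.getOrDefault(slot, 0L) > estimates.getOrDefault(best, 0L)) {
                best = slot;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> slots = List.of("slot-a", "slot-b");
        CheckpointSource live = () -> Map.of("slot-b", 1024L);

        // With checkpoint data, the slot that holds local state wins.
        System.out.println(pickSlot(slots, estimates(live)));   // prints "slot-b"
        // Without it, all slots look equal and the first candidate is chosen.
        System.out.println(pickSlot(slots, estimates(null)));   // prints "slot-a"
    }
}
```

In this sketch the second call picks `slot-a` even though the state lives on `slot-b`, which is the kind of mismatch the test's assertion reports.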

> The testRecoverLocallyFromProcessCrashWithWorkingDirectory test failed of 
> azure cron adaptive scheduler pipeline
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-37701
>                 URL: https://issues.apache.org/jira/browse/FLINK-37701
>             Project: Flink
>          Issue Type: Bug
>          Components: Build System / Azure Pipelines, Build System / CI
>    Affects Versions: 2.1.0
>            Reporter: dalongliu
>            Assignee: Aleksandr Iushmanov
>            Priority: Major
>             Fix For: 2.1.0
>
>
> The detail: 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=67293&view=logs&j=8fd9202e-fd17-5b26-353c-ac1ff76c8f28&t=ea7cf968-e585-52cb-e0fc-f48de023a7ca
> {code:java}
> Apr 20 03:21:57 03:21:57.387 [ERROR] Tests run: 1, Failures: 1, Errors: 0, 
> Skipped: 0, Time elapsed: 17.77 s <<< FAILURE! -- in 
> org.apache.flink.test.recovery.LocalRecoveryITCase
> Apr 20 03:21:57 03:21:57.387 [ERROR] 
> org.apache.flink.test.recovery.LocalRecoveryITCase.testRecoverLocallyFromProcessCrashWithWorkingDirectory
>  -- Time elapsed: 17.74 s <<< FAILURE!
> Apr 20 03:21:57 org.opentest4j.AssertionFailedError: [The task was deployed 
> to AllocationID(bb6371bf3fe9fbcb2ee329893e802fde) but it should have been 
> deployed to AllocationID(5100f7baf1dea42453fd9b1c17d6d732) for local 
> recovery., The task was deployed to 
> AllocationID(e357fcd5041e52b7e647ca463cfe471a) but it should have been 
> deployed to AllocationID(bb6371bf3fe9fbcb2ee329893e802fde) for local 
> recovery., The task was deployed to 
> AllocationID(5100f7baf1dea42453fd9b1c17d6d732) but it should have been 
> deployed to AllocationID(e357fcd5041e52b7e647ca463cfe471a) for local 
> recovery.] ==> expected: <true> but was: <false>
> Apr 20 03:21:57       at 
> org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
> Apr 20 03:21:57       at 
> org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
> Apr 20 03:21:57       at 
> org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
> Apr 20 03:21:57       at 
> org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
> Apr 20 03:21:57       at 
> org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:214)
> Apr 20 03:21:57       at 
> org.apache.flink.test.recovery.LocalRecoveryITCase.testRecoverLocallyFromProcessCrashWithWorkingDirectory(LocalRecoveryITCase.java:119)
> Apr 20 03:21:57       at 
> java.base/java.lang.reflect.Method.invoke(Method.java:568)
> Apr 20 03:21:57       at 
> java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
> Apr 20 03:21:57       at 
> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
> Apr 20 03:21:57       at 
> java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
> Apr 20 03:21:57       at 
> java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
> Apr 20 03:21:57       at 
> java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
