[jira] [Commented] (FLINK-36295) AdaptiveSchedulerClusterITCase. testCheckpointStatsPersistedAcrossRescale failed with

Matthias Pohl (Jira) Wed, 18 Sep 2024 04:06:03 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-36295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882674#comment-17882674
 ]


Matthias Pohl commented on FLINK-36295:
---------------------------------------

Several things were identified during the investigation so far:
* The test itself can have different behavior (with the same outcome) depending 
on whether the rescale is triggered by the StateTransitionManager or the 
failure due to missing TM slots. The test should be hardened accordingly to 
ensure consistent behavior (waiting for desired resources to be met to allow 
for the TM removal)
* The final wait in the test (see [line 
202|https://github.com/apache/flink/blob/db682b9869dcedce1873c6ee5252b0e351373a69/flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerClusterITCase.java#L202])
 can cause inconsistent behavior if the job ran with sufficient resources (i.e. 
2 slots allocated) in the first place. In that case, the wait wouldn't delay 
the test code execution (this was actually the case for failed test run and 
might contribute to the behavior I summarize in the last item in this list)
* There is a strange observation in the logs where the StateTransitionManager 
is transitioning into Stabilizing phase with "Desired resources are not met" 
log message indicating that there are not enough resources available for 
immediate state transitioning. But the 4 task slots become available earlier 
(indicated by the {{DefaultDeclarativeSlotPool}} with "Acquired new resources; 
new total acquired resources: 
ResourceCounter{resources={ResourceProfile{UNKNOWN}=4}}")
* The actual test failure is caused by the getHistory() returning an empty list 
even though the checkpoint snapshot returned a checkpoint count larger than 0 
earlier in the test. We checked that the count and the history are actually 
synchronously modified. So, the diff shouldn't happen.

> AdaptiveSchedulerClusterITCase. testCheckpointStatsPersistedAcrossRescale 
> failed with 
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-36295
>                 URL: https://issues.apache.org/jira/browse/FLINK-36295
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 2.0-preview
>            Reporter: Matthias Pohl
>            Assignee: Matthias Pohl
>            Priority: Blocker
>              Labels: test-stability
>         Attachments: 
> FLINK-36295.failure.62156.20240916.1.logs-cron_jdk17-test_cron_jdk17_core-1726454552.log,
>  FLINK-36295.failure.with-revert.debug.log
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=62156&view=logs&j=675bf62c-8558-587e-2555-dcad13acefb5&t=5878eed3-cc1e-5b12-1ed0-9e7139ce0992&l=10234
> {code}
> Sep 16 03:06:30 03:06:30.168 [ERROR] Tests run: 3, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 5.275 s <<< FAILURE! -- in 
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase
> Sep 16 03:06:30 03:06:30.168 [ERROR] 
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.testCheckpointStatsPersistedAcrossRescale
>  -- Time elapsed: 0.676 s <<< ERROR!
> Sep 16 03:06:30 java.lang.IndexOutOfBoundsException: Index: -1
> Sep 16 03:06:30       at 
> java.base/java.util.Collections$EmptyList.get(Collections.java:4586)
> Sep 16 03:06:30       at 
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.testCheckpointStatsPersistedAcrossRescale(AdaptiveSchedulerClusterITCase.java:214)
> Sep 16 03:06:30       at 
> java.base/java.lang.reflect.Method.invoke(Method.java:568)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:194)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
> Sep 16 03:06:30
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36295) AdaptiveSchedulerClusterITCase. testCheckpointStatsPersistedAcrossRescale failed with

Reply via email to