[ 
https://issues.apache.org/jira/browse/FLINK-36295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882730#comment-17882730
 ] 

Matthias Pohl commented on FLINK-36295:
---------------------------------------

I guess I found the reason why we do not retrieve an exception history in the 
failed test run:
1. The job is started and reaches WaitingForResources state with only one TM 
being available (2 slots, i.e. only sufficient resources are met).
2. The 2nd TM is not added while being in WaitingForResources state which makes 
the StateTransitionManager (STM) trigger the state transition into 
CreatingExecutionGraph state.
3. While creating the ExecutionGraph, the 2nd TM becomes available. The 
ExecutionGraph creation is on-going already and doesn't consider the newly 
added slots.
4. The AdaptiveScheduler (AS) reaches Executing state where the onChange and 
onTrigger events are initiated which triggers the STM's onChange and onTrigger 
event. This events do not consider the newly added TM slots, yet, because of 
FLINK-36279 (only free slots are considered but not the ones that are already 
allocated to the job). Hence, we see that the desired resources are not met. 
The STM changes into `Stabilized` Phase and waits for a new onTrigger (which 
would be a new checkpoint)
5. The job is running with parallelism of 2 until the checkpoint is triggered. 
That makes the STM trigger the rescale cancelling the two subtasks.
6. While the job is restarting, one TM is stopped by the test code.
7. The AS transitions into CreateExecutionGraph state right away from 
Restarting state (FLINK-36013) while one TM is still in the process of stopping.
8. The ExecutionGraph is now picked up with a parallelism of 4 (because the 
slots of the TM that is subject to shutdown are still available)
9. At the end of the CreateExecutionGraph state, a transitioning to 
WaitingForResources state is performed because of 2 slots being gone.
10. The job reaches Executing state with parallelism of 2.

> AdaptiveSchedulerClusterITCase. testCheckpointStatsPersistedAcrossRescale 
> failed with 
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-36295
>                 URL: https://issues.apache.org/jira/browse/FLINK-36295
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 2.0-preview
>            Reporter: Matthias Pohl
>            Assignee: Matthias Pohl
>            Priority: Blocker
>              Labels: test-stability
>         Attachments: 
> FLINK-36295.failure.62156.20240916.1.logs-cron_jdk17-test_cron_jdk17_core-1726454552.log,
>  FLINK-36295.failure.with-revert.debug.log
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=62156&view=logs&j=675bf62c-8558-587e-2555-dcad13acefb5&t=5878eed3-cc1e-5b12-1ed0-9e7139ce0992&l=10234
> {code}
> Sep 16 03:06:30 03:06:30.168 [ERROR] Tests run: 3, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 5.275 s <<< FAILURE! -- in 
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase
> Sep 16 03:06:30 03:06:30.168 [ERROR] 
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.testCheckpointStatsPersistedAcrossRescale
>  -- Time elapsed: 0.676 s <<< ERROR!
> Sep 16 03:06:30 java.lang.IndexOutOfBoundsException: Index: -1
> Sep 16 03:06:30       at 
> java.base/java.util.Collections$EmptyList.get(Collections.java:4586)
> Sep 16 03:06:30       at 
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.testCheckpointStatsPersistedAcrossRescale(AdaptiveSchedulerClusterITCase.java:214)
> Sep 16 03:06:30       at 
> java.base/java.lang.reflect.Method.invoke(Method.java:568)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:194)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
> Sep 16 03:06:30
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to