[jira] [Commented] (FLINK-36295) AdaptiveSchedulerClusterITCase. testCheckpointStatsPersistedAcrossRescale failed with

Matthias Pohl (Jira) Tue, 17 Sep 2024 08:20:01 -0700


    [ https://issues.apache.org/jira/browse/FLINK-36295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882444#comment-17882444 ]


Matthias Pohl commented on FLINK-36295:
---------------------------------------

The issue is that the job was cancelled while the first checkpoint was still in progress:
{code}
03:06:29,931 [    Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 1 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1726455989919 for job 114e80eadd48937fb4bd8725fda8e141.
03:06:29,974 [flink-pekko.actor.default-dispatcher-10] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job Unnamed job (114e80eadd48937fb4bd8725fda8e141) switched from state RUNNING to CANCELLING.
03:06:29,976 [jobmanager-io-thread-6] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 1 for job 114e80eadd48937fb4bd8725fda8e141 (0 bytes, checkpointDuration=46 ms, finalizationTime=11 ms).
03:06:29,980 [flink-pekko.actor.default-dispatcher-10] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - jobVertex (1/2) (06b80ec858cac71bdb9835cfc22776cd_a3dfb9d871d93dfded58ac249c5b6076_0_0) switched from RUNNING to CANCELING.
03:06:29,980 [flink-pekko.actor.default-dispatcher-10] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - jobVertex (2/2) (06b80ec858cac71bdb9835cfc22776cd_a3dfb9d871d93dfded58ac249c5b6076_1_0) switched from RUNNING to CANCELING.
{code}

The job is therefore cancelled without any restart, which is why no checkpoint 
is returned as part of the {{CheckpointStatsSnapshot}}. It is not yet clear why 
the job was cancelled.
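
For illustration, the reported {{IndexOutOfBoundsException: Index: -1}} thrown from {{Collections$EmptyList.get}} is what one would expect if the test reads the last entry of an empty checkpoint list. The following is a minimal sketch under that assumption; the class and the helpers {{lastUnchecked}} and {{lastIfPresent}} are hypothetical and not taken from the Flink code base or the actual test:

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-alone sketch; not the actual test code.
public class LatestCheckpointLookup {

    // Assumed failing pattern: take the last element without checking
    // whether the list is empty. For an empty list the index is -1.
    static <T> T lastUnchecked(List<T> items) {
        return items.get(items.size() - 1);
    }

    // Guarded variant: reports "no element" instead of throwing.
    static <T> Optional<T> lastIfPresent(List<T> items) {
        return items.isEmpty()
                ? Optional.empty()
                : Optional.of(items.get(items.size() - 1));
    }

    public static void main(String[] args) {
        // Stands in for a stats snapshot that contains no checkpoints.
        List<Long> noCheckpoints = Collections.emptyList();

        try {
            lastUnchecked(noCheckpoints);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unchecked lookup failed: " + e.getMessage());
        }

        System.out.println(
                "guarded lookup found a checkpoint: "
                        + lastIfPresent(noCheckpoints).isPresent());
    }
}
```

A guard like this would only turn the exception into a clearer assertion failure. It would not explain why the job was cancelled in the first place.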

> AdaptiveSchedulerClusterITCase. testCheckpointStatsPersistedAcrossRescale 
> failed with 
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-36295
>                 URL: https://issues.apache.org/jira/browse/FLINK-36295
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 2.0-preview
>            Reporter: Matthias Pohl
>            Priority: Critical
>              Labels: test-stability
>         Attachments: 
> FLINK-36295.failure.62156.20240916.1.logs-cron_jdk17-test_cron_jdk17_core-1726454552.log
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=62156&view=logs&j=675bf62c-8558-587e-2555-dcad13acefb5&t=5878eed3-cc1e-5b12-1ed0-9e7139ce0992&l=10234
> {code}
> Sep 16 03:06:30 03:06:30.168 [ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 5.275 s <<< FAILURE! -- in org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase
> Sep 16 03:06:30 03:06:30.168 [ERROR] org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.testCheckpointStatsPersistedAcrossRescale -- Time elapsed: 0.676 s <<< ERROR!
> Sep 16 03:06:30 java.lang.IndexOutOfBoundsException: Index: -1
> Sep 16 03:06:30       at java.base/java.util.Collections$EmptyList.get(Collections.java:4586)
> Sep 16 03:06:30       at org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.testCheckpointStatsPersistedAcrossRescale(AdaptiveSchedulerClusterITCase.java:214)
> Sep 16 03:06:30       at java.base/java.lang.reflect.Method.invoke(Method.java:568)
> Sep 16 03:06:30       at java.base/java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:194)
> Sep 16 03:06:30       at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
> Sep 16 03:06:30       at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
> Sep 16 03:06:30       at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
> Sep 16 03:06:30       at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
> Sep 16 03:06:30       at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
> Sep 16 03:06:30
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
