[ https://issues.apache.org/jira/browse/FLINK-36295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882444#comment-17882444 ]
Matthias Pohl commented on FLINK-36295:
---------------------------------------

The issue is that the job was cancelled while the first checkpoint was being created:
{code}
03:06:29,931 [ Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 1 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1726455989919 for job 114e80eadd48937fb4bd8725fda8e141.
03:06:29,974 [flink-pekko.actor.default-dispatcher-10] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job Unnamed job (114e80eadd48937fb4bd8725fda8e141) switched from state RUNNING to CANCELLING.
03:06:29,976 [jobmanager-io-thread-6] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 1 for job 114e80eadd48937fb4bd8725fda8e141 (0 bytes, checkpointDuration=46 ms, finalizationTime=11 ms).
03:06:29,980 [flink-pekko.actor.default-dispatcher-10] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - jobVertex (1/2) (06b80ec858cac71bdb9835cfc22776cd_a3dfb9d871d93dfded58ac249c5b6076_0_0) switched from RUNNING to CANCELING.
03:06:29,980 [flink-pekko.actor.default-dispatcher-10] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - jobVertex (2/2) (06b80ec858cac71bdb9835cfc22776cd_a3dfb9d871d93dfded58ac249c5b6076_1_0) switched from RUNNING to CANCELING.
{code}
The job is cancelled without any restart, which is why we do not see any checkpoint being returned as part of the {{CheckpointStatsSnapshot}}. It is not yet clear why the job was cancelled.

> AdaptiveSchedulerClusterITCase.testCheckpointStatsPersistedAcrossRescale failed with
> --------------------------------------------------------------------------------------
>
> Key: FLINK-36295
> URL: https://issues.apache.org/jira/browse/FLINK-36295
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 2.0-preview
> Reporter: Matthias Pohl
> Priority: Critical
> Labels: test-stability
> Attachments: FLINK-36295.failure.62156.20240916.1.logs-cron_jdk17-test_cron_jdk17_core-1726454552.log
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=62156&view=logs&j=675bf62c-8558-587e-2555-dcad13acefb5&t=5878eed3-cc1e-5b12-1ed0-9e7139ce0992&l=10234
> {code}
> Sep 16 03:06:30 03:06:30.168 [ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 5.275 s <<< FAILURE! -- in org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase
> Sep 16 03:06:30 03:06:30.168 [ERROR] org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.testCheckpointStatsPersistedAcrossRescale -- Time elapsed: 0.676 s <<< ERROR!
> Sep 16 03:06:30 java.lang.IndexOutOfBoundsException: Index: -1
> Sep 16 03:06:30 	at java.base/java.util.Collections$EmptyList.get(Collections.java:4586)
> Sep 16 03:06:30 	at org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.testCheckpointStatsPersistedAcrossRescale(AdaptiveSchedulerClusterITCase.java:214)
> Sep 16 03:06:30 	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
> Sep 16 03:06:30 	at java.base/java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:194)
> Sep 16 03:06:30 	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
> Sep 16 03:06:30 	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
> Sep 16 03:06:30 	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
> Sep 16 03:06:30 	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
> Sep 16 03:06:30 	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
> Sep 16 03:06:30
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
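
For reference, the {{Index: -1}} raised from {{Collections$EmptyList.get}} is what a "last element" lookup produces on an empty list. The standalone sketch below reproduces that symptom and shows a guarded variant. It is an illustration only: it assumes the test reads the last checkpoint entry via {{get(size() - 1)}}, which the stack trace suggests but does not confirm, and {{LastElementSketch}} and {{lastOrEmpty}} are hypothetical names that are not part of Flink.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

public class LastElementSketch {

    // Hypothetical helper (not a Flink API): returns the last element,
    // or Optional.empty() when the list has no entries.
    static <T> Optional<T> lastOrEmpty(List<T> items) {
        return items.isEmpty()
                ? Optional.empty()
                : Optional.of(items.get(items.size() - 1));
    }

    public static void main(String[] args) {
        // Stand-in for a checkpoint history that contains no checkpoints.
        List<Long> checkpointIds = Collections.emptyList();

        try {
            // size() - 1 evaluates to -1 for an empty list, so get() throws.
            checkpointIds.get(checkpointIds.size() - 1);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unguarded lookup failed: " + e.getMessage());
        }

        // The guarded lookup reports the missing checkpoint instead of throwing.
        System.out.println("guarded lookup present: " + lastOrEmpty(checkpointIds).isPresent());
    }
}
```

A guard of this kind would only turn the exception into a clearer assertion failure; it would not explain why the job was cancelled in the first place.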