[ https://issues.apache.org/jira/browse/FLINK-35499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthias Pohl updated FLINK-35499: ---------------------------------- Attachment: FLINK-35499.AdaptiveScheduler.log > EventTimeWindowCheckpointingITCase times out due to Checkpoint expired before > completing > ---------------------------------------------------------------------------------------- > > Key: FLINK-35499 > URL: https://issues.apache.org/jira/browse/FLINK-35499 > Project: Flink > Issue Type: Bug > Affects Versions: 1.20.0 > Reporter: Ryan Skraba > Priority: Critical > Labels: test-stability > Attachments: FLINK-35499.AdaptiveScheduler.log > > > * 1.20 AdaptiveScheduler / Test (module: tests) > https://github.com/apache/flink/actions/runs/9311892945/job/25632037990#step:10:8702 > * 1.20 Default (Java 8) / Test (module: tests) > https://github.com/apache/flink/actions/runs/9275522134/job/25520829730#step:10:8264 > Going into the logs, we see the following error occurs: > {code:java} > ================================================================================ > Test testTumblingTimeWindow[statebackend type =ROCKSDB_INCREMENTAL, > buffersPerChannel = > 2](org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase) is > running. > -------------------------------------------------------------------------------- > <...> > 20:24:23,562 [ Checkpoint Timer] INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering > checkpoint 22 (type=CheckpointType{name='Checkpoint', > sharingFilesStrategy=FORWARD_BACKWARD}) @ 1716927863562 for job > 15d0a663cb415b09b9a68ccc40640c6d. > 20:24:23,609 [jobmanager-io-thread-2] INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed > checkpoint 22 for job 15d0a663cb415b09b9a68ccc40640c6d (2349132 bytes, > checkpointDuration=43 ms, finalizationTime=4 ms). > 20:24:23,610 [ Checkpoint Timer] INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering > checkpoint 23 (type=CheckpointType{name='Checkpoint', > sharingFilesStrategy=FORWARD_BACKWARD}) @ 1716927863610 for job > 15d0a663cb415b09b9a68ccc40640c6d. > 20:24:23,620 [jobmanager-io-thread-2] WARN > org.apache.flink.runtime.jobmaster.JobMaster [] - Error while > processing AcknowledgeCheckpoint message > java.lang.IllegalStateException: Attempt to reference unknown state: > a9a90973-4ee5-384f-acef-58a7c7560920 > at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:193) > ~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.state.SharedStateRegistryImpl.registerReference(SharedStateRegistryImpl.java:97) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.state.SharedStateRegistry.registerReference(SharedStateRegistry.java:53) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.state.IncrementalRemoteKeyedStateHandle.registerSharedStates(IncrementalRemoteKeyedStateHandle.java:289) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.checkpoint.OperatorSubtaskState.registerSharedState(OperatorSubtaskState.java:243) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.checkpoint.OperatorSubtaskState.registerSharedStates(OperatorSubtaskState.java:226) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.checkpoint.TaskStateSnapshot.registerSharedStates(TaskStateSnapshot.java:193) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1245) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$2(ExecutionGraphHandler.java:109) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$4(ExecutionGraphHandler.java:139) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64) > ~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [?:1.8.0_392] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [?:1.8.0_392] > at java.lang.Thread.run(Thread.java:750) [?:1.8.0_392] > 20:24:23,663 [Source: Custom Source (1/1)#1] INFO > org.apache.flink.runtime.taskmanager.Task [] - Source: > Custom Source (1/1)#1 > (bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1) > switched from RUNNING to FINISHED. > 20:24:23,663 [Source: Custom Source (1/1)#1] INFO > org.apache.flink.runtime.taskmanager.Task [] - Freeing > task resources for Source: Custom Source (1/1)#1 > (bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1). > 20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - > Un-registering task and sending final execution state FINISHED to JobManager > for task Source: Custom Source (1/1)#1 > bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1. > 20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: > Custom Source (1/1) > (bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1) > switched from RUNNING to FINISHED. > 20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO > org.apache.flink.runtime.jobmaster.JobMaster [] - Triggering > a manual checkpoint for job 15d0a663cb415b09b9a68ccc40640c6d. > 20:24:23,664 [TumblingEventTimeWindows (3/4)#1] INFO > org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend [] - Closed > RocksDB State Backend. Cleaning up RocksDB working directory > /tmp/junit4600599408056781950/junit4118530363439384731/job_15d0a663cb415b09b9a68ccc40640c6d_op_WindowOperator_0a448493b4782967b150582570326227__3_4__uuid_f9b935a1-38cc-4cb3-92f0-97bf257d499d. > 20:24:23,665 [TumblingEventTimeWindows (3/4)#1] INFO > org.apache.flink.runtime.taskmanager.Task [] - > TumblingEventTimeWindows (3/4)#1 > (bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1) > switched from RUNNING to FINISHED. > 20:24:23,665 [TumblingEventTimeWindows (3/4)#1] INFO > org.apache.flink.runtime.taskmanager.Task [] - Freeing > task resources for TumblingEventTimeWindows (3/4)#1 > (bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1). > 20:24:23,665 [flink-pekko.actor.default-dispatcher-8] INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - > Un-registering task and sending final execution state FINISHED to JobManager > for task TumblingEventTimeWindows (3/4)#1 > bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1. > 20:24:23,665 [flink-pekko.actor.default-dispatcher-8] INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph [] - > TumblingEventTimeWindows (3/4) > (bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1) > switched from RUNNING to FINISHED. > 20:24:23,665 [flink-pekko.actor.default-dispatcher-10] INFO > org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager > [] - Received resource requirements from job > 15d0a663cb415b09b9a68ccc40640c6d: > [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, > numberOfRequiredSlots=3}] > 20:26:01,189 [flink-pekko.actor.default-dispatcher-16] INFO > org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - > Releasing slot [27b0503b2930b17041ffd3635dfb819f]. > 20:26:01,190 [flink-pekko.actor.default-dispatcher-16] INFO > org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot > TaskSlot(index:3, state:ACTIVE, resource profile: > ResourceProfile{taskHeapMemory=512.000gb (549755813888 bytes), > taskOffHeapMemory=512.000gb (549755813888 bytes), managedMemory=64.000mb > (67108864 bytes), networkMemory=32.000mb (33554432 bytes)}, allocationId: > 27b0503b2930b17041ffd3635dfb819f, jobId: 15d0a663cb415b09b9a68ccc40640c6d). > 20:26:01,205 [flink-pekko.actor.default-dispatcher-16] INFO > org.apache.flink.runtime.resourcemanager.slotmanager.DefaultSlotStatusSyncer > [] - Freeing slot 27b0503b2930b17041ffd3635dfb819f. > 20:34:23,610 [ Checkpoint Timer] INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint > 23 of job 15d0a663cb415b09b9a68ccc40640c6d expired before completing. > 20:34:23,611 [ Checkpoint Timer] WARN > org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to > trigger or complete checkpoint 23 for job 15d0a663cb415b09b9a68ccc40640c6d. > (0 consecutive failed attempts so far) > org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint expired > before completing. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2346) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64) > ~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > [?:1.8.0_392] > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > [?:1.8.0_392] > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > [?:1.8.0_392] > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > [?:1.8.0_392] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [?:1.8.0_392] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [?:1.8.0_392] > at java.lang.Thread.run(Thread.java:750) [?:1.8.0_392] > 20:34:23,611 [ Checkpoint Timer] INFO > org.apache.flink.runtime.checkpoint.CheckpointRequestDecider [] - checkpoint > request time in queue: 599948 > 20:34:23,612 [flink-pekko.actor.default-dispatcher-130] INFO > org.apache.flink.runtime.jobmaster.JobMaster [] - Trying to > recover from a global failure. > org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable > failure threshold. The latest checkpoint failed due to Checkpoint expired > before completing., view the Checkpoint History tab or the Job Manager log to > find out why continuous checkpoints failed. > at > org.apache.flink.runtime.checkpoint.CheckpointFailureManager.checkFailureAgainstCounter(CheckpointFailureManager.java:212) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:169) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:122) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2281) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2260) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$1200(CheckpointCoordinator.java:102) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2346) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64) > ~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > ~[?:1.8.0_392] > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ~[?:1.8.0_392] > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > ~[?:1.8.0_392] > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > ~[?:1.8.0_392] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > ~[?:1.8.0_392] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > ~[?:1.8.0_392] > at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_392] > 20:34:23,613 [flink-pekko.actor.default-dispatcher-130] INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job > Tumbling Window Test (15d0a663cb415b09b9a68ccc40640c6d) switched from state > RUNNING to FAILING. > org.apache.flink.runtime.JobException: Recovery is suppressed by > FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=1, > backoffTimeMS=0) > at > org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:219) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.handleFailureAndReport(ExecutionFailureHandler.java:166) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.getGlobalFailureHandlingResult(ExecutionFailureHandler.java:140) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.scheduler.DefaultScheduler.handleGlobalFailure(DefaultScheduler.java:324) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyGlobalFailure(UpdateSchedulerNgOnInternalFailuresListener.java:57) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.failGlobal(DefaultExecutionGraph.java:1092) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.executiongraph.DefaultExecutionGraph$1.lambda$failJob$0(DefaultExecutionGraph.java:477) > ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:460) > ~[flink-rpc-akka44fe192f-1cb4-4526-9c09-a5d6b8d70d87.jar:1.20-SNAPSHOT] > at > org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) > ~[flink-rpc-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > <...> > {code} > Notably: {{org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint > tolerable failure threshold. The latest checkpoint failed due to Checkpoint > expired before completing., view the Checkpoint History tab or the Job > Manager log to find out why continuous checkpoints failed.}} -- This message was sent by Atlassian Jira (v8.20.10#820010)