hidataplus opened a new issue, #7339: URL: https://github.com/apache/seatunnel/issues/7339
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22bug%22) and found no similar issues.

### What happened

After upgrading to 2.3.6, we found that checkpoints fail with a timeout.

### SeaTunnel Version

2.3.6

### SeaTunnel Config

```conf
seatunnel:
  engine:
    history-job-expire-minutes: 1440
    backup-count: 1
    queue-type: blockingqueue
    print-execution-info-interval: 60
    print-job-metrics-info-interval: 60
    classloader-cache-mode: false
    slot-service:
      dynamic-slot: True
    checkpoint:
      interval: 10000
      timeout: 600000
      max-concurrent: 5
      tolerable-failure: 2
      storage:
        type: hdfs
        max-retained: 3
        plugin-config:
          namespace: /seatunnel/checkpoint_snapshot
          storage.type: hdfs
          fs.defaultFS: hdfs://datanode01:8020 # Ensure that the directory has written permission
```

### Running Command

```shell
bin/seatunnel.sh --config $SEATUNNEL_HOME/config/v2.batch.config.template
```

### Error Exception

```log
2024-08-08 03:35:57,530 INFO [.s.e.s.c.CheckpointCoordinator] [seatunnel-coordinator-service-2] - checkpoint is enabled, start schedule trigger pending checkpoint.
2024-08-08 03:35:57,614 INFO [o.a.s.a.e.LoggingEventHandler ] [hz.main.generic-operation.thread-0] - log event: EnumeratorOpenEvent(createdTime=1723088157612, jobId=873771488035995649, eventType=LIFECYCLE_ENUMERATOR_OPEN)
2024-08-08 03:36:07,565 INFO [.s.e.s.c.CheckpointCoordinator] [seatunnel-coordinator-service-2] - wait checkpoint completed: 1
2024-08-08 03:36:44,995 INFO [o.a.s.e.s.CoordinatorService ] [pool-4-thread-1] - [datanode01]:5801 [seatunnel] [5.1]
***********************************************
CoordinatorService Thread Pool Status
***********************************************
activeCount : 1
corePoolSize : 0
maximumPoolSize : 2147483647
poolSize : 4
completedTaskCount : 16
taskCount : 17
***********************************************
2024-08-08 03:36:44,996 INFO [o.a.s.e.s.CoordinatorService ] [pool-4-thread-1] - [datanode01]:5801 [seatunnel] [5.1]
***********************************************
Job info detail
***********************************************
createdJobCount : 0
scheduledJobCount : 0
runningJobCount : 1
failingJobCount : 0
failedJobCount : 0
cancellingJobCount : 0
canceledJobCount : 0
finishedJobCount : 0
***********************************************
2024-08-08 03:37:07,578 INFO [.s.e.s.c.CheckpointCoordinator] [checkpoint-coordinator-1/873771488035995649] - timeout checkpoint: 873771488035995649/1/1, CHECKPOINT_TYPE
2024-08-08 03:37:07,580 INFO [.s.e.s.c.CheckpointCoordinator] [checkpoint-coordinator-1/873771488035995649] - start clean pending checkpoint cause Checkpoint expired before completing. Please increase checkpoint timeout in the seatunnel.yaml or jobConfig env.
2024-08-08 03:37:07,580 ERROR [.s.e.s.c.CheckpointCoordinator] [seatunnel-coordinator-service-4] - trigger checkpoint failed
org.apache.seatunnel.engine.server.checkpoint.CheckpointException: Checkpoint expired before completing. Please increase checkpoint timeout in the seatunnel.yaml or jobConfig env.
    at org.apache.seatunnel.engine.server.checkpoint.PendingCheckpoint.abortCheckpoint(PendingCheckpoint.java:176) ~[seatunnel-starter.jar:2.3.6]
    at org.apache.seatunnel.engine.server.checkpoint.CheckpointCoordinator.lambda$cleanPendingCheckpoint$20(CheckpointCoordinator.java:780) ~[seatunnel-starter.jar:2.3.6]
    at java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4707) ~[?:1.8.0_342]
    at org.apache.seatunnel.engine.server.checkpoint.CheckpointCoordinator.cleanPendingCheckpoint(CheckpointCoordinator.java:778) ~[seatunnel-starter.jar:2.3.6]
    at org.apache.seatunnel.engine.server.checkpoint.CheckpointCoordinator.handleCoordinatorError(CheckpointCoordinator.java:285) ~[seatunnel-starter.jar:2.3.6]
    at org.apache.seatunnel.engine.server.checkpoint.CheckpointCoordinator.lambda$null$9(CheckpointCoordinator.java:658) ~[seatunnel-starter.jar:2.3.6]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_342]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_342]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_342]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_342]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_342]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_342]
    at java.lang.Thread.run(Thread.java:750) [?:1.8.0_342]
2024-08-08 03:37:07,582 INFO [.s.e.s.c.CheckpointCoordinator] [seatunnel-coordinator-service-4] - start clean pending checkpoint cause CheckpointCoordinator inside have error.
2024-08-08 03:37:07,583 INFO [.s.e.s.c.CheckpointCoordinator] [checkpoint-coordinator-1/873771488035995649] - Turn checkpoint_state_873771488035995649_1 state from null to FAILED
2024-08-08 03:37:07,593 INFO [.s.e.s.c.CheckpointCoordinator] [seatunnel-coordinator-service-4] - Turn checkpoint_state_873771488035995649_1 state from FAILED to FAILED
2024-08-08 03:37:07,594 WARN [o.a.s.e.s.d.p.SubPlan ] [checkpoint-coordinator-1/873771488035995649] - Job SeaTunnel_Job (873771488035995649), Pipeline: [(1/1)] checkpoint have error, cancel the pipeline
2024-08-08 03:37:07,600 WARN [o.a.s.e.s.d.p.SubPlan ] [seatunnel-coordinator-service-4] - Job SeaTunnel_Job (873771488035995649), Pipeline: [(1/1)] checkpoint have error, cancel the pipeline
2024-08-08 03:37:07,606 INFO [o.a.s.e.s.d.p.SubPlan ] [checkpoint-coordinator-1/873771488035995649] - Job SeaTunnel_Job (873771488035995649), Pipeline: [(1/1)] turned from state RUNNING to CANCELING.
2024-08-08 03:37:07,606 INFO [o.a.s.e.s.d.p.PhysicalVertex ] [checkpoint-coordinator-1/873771488035995649] - Job SeaTunnel_Job (873771488035995649), Pipeline: [(1/1)], task: [pipeline-1 [Source[0]-FakeSource]-SplitEnumerator (1/1)] state process is start
2024-08-08 03:37:07,617 INFO [o.a.s.e.s.d.p.PhysicalVertex ] [checkpoint-coordinator-1/873771488035995649] - Job SeaTunnel_Job (873771488035995649), Pipeline: [(1/1)], task: [pipeline-1 [Source[0]-FakeSource]-SplitEnumerator (1/1)] turned from state RUNNING to CANCELING.
2024-08-08 03:37:07,619 WARN [c.h.i.s.t.TcpServerConnection ] [checkpoint-coordinator-1/873771488035995649] - [datanode01]:5801 [seatunnel] [5.1] Connection[id=4, /192.168.2.1:5801->/192.168.2.2:48179, qualifier=null, endpoint=[datanode02]:5802, remoteUuid=70999fc0-47f5-40a3-9c53-5003385646e5, alive=false, connectionType=MEMBER, planeIndex=0] closed. Reason: Exception in Connection[id=4, /192.168.2.1:5801->/192.168.2.2:48179, qualifier=null, endpoint=[datanode02]:5802, remoteUuid=70999fc0-47f5-40a3-9c53-5003385646e5, alive=true, connectionType=MEMBER, planeIndex=0], thread=checkpoint-coordinator-1/873771488035995649
java.nio.channels.ClosedByInterruptException: null
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202) ~[?:1.8.0_342]
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:477) ~[?:1.8.0_342]
    at com.hazelcast.internal.networking.nio.NioOutboundPipeline.flushToSocket(NioOutboundPipeline.java:439) ~[seatunnel-starter.jar:2.3.6]
    at com.hazelcast.internal.networking.nio.NioOutboundPipeline.process(NioOutboundPipeline.java:324) ~[seatunnel-starter.jar:2.3.6]
    at com.hazelcast.internal.networking.nio.NioOutboundPipeline.executePipeline(NioOutboundPipeline.java:240) ~[seatunnel-starter.jar:2.3.6]
    at com.hazelcast.internal.networking.nio.NioOutboundPipeline.write(NioOutboundPipeline.java:218) ~[seatunnel-starter.jar:2.3.6]
    at com.hazelcast.internal.networking.nio.NioChannel.write(NioChannel.java:79) ~[seatunnel-starter.jar:2.3.6]
    at com.hazelcast.internal.server.tcp.TcpServerConnection.write(TcpServerConnection.java:222) ~[seatunnel-starter.jar:2.3.6]
    at com.hazelcast.spi.impl.operationservice.impl.OutboundOperationHandler.send(OutboundOperationHandler.java:59) ~[seatunnel-starter.jar:2.3.6]
    at com.hazelcast.spi.impl.operationservice.impl.Invocation.doInvokeRemote(Invocation.java:612) ~[seatunnel-starter.jar:2.3.6]
    at com.hazelcast.spi.impl.operationservice.impl.Invocation.doInvoke(Invocation.java:582) ~[seatunnel-starter.jar:2.3.6]
    at com.hazelcast.spi.impl.operationservice.impl.Invocation.invoke0(Invocation.java:541) ~[seatunnel-starter.jar:2.3.6]
    at com.hazelcast.spi.impl.operationservice.impl.Invocation.invoke(Invocation.java:241) ~[seatunnel-starter.jar:2.3.6]
    at com.hazelcast.spi.impl.operationservice.impl.InvocationBuilderImpl.invoke(InvocationBuilderImpl.java:61) ~[seatunnel-starter.jar:2.3.6]
    at org.apache.seatunnel.engine.server.dag.physical.PhysicalVertex.checkTaskGroupIsExecuting(PhysicalVertex.java:255) ~[seatunnel-starter.jar:2.3.6]
    at org.apache.seatunnel.engine.server.dag.physical.PhysicalVertex.noticeTaskExecutionServiceCancel(PhysicalVertex.java:418) ~[seatunnel-starter.jar:2.3.6]
    at org.apache.seatunnel.engine.server.dag.physical.PhysicalVertex.stateProcess(PhysicalVertex.java:580) ~[seatunnel-starter.jar:2.3.6]
    at org.apache.seatunnel.engine.server.dag.physical.PhysicalVertex.updateTaskState(PhysicalVertex.java:400) ~[seatunnel-starter.jar:2.3.6]
    at org.apache.seatunnel.engine.server.dag.physical.PhysicalVertex.cancel(PhysicalVertex.java:411) ~[seatunnel-starter.jar:2.3.6]
    at org.apache.seatunnel.engine.server.dag.physical.SubPlan.lambda$stateProcess$21(SubPlan.java:656) ~[seatunnel-starter.jar:2.3.6]
    at java.util.ArrayList.forEach(ArrayList.java:1259) ~[?:1.8.0_342]
    at org.apache.seatunnel.engine.server.dag.physical.SubPlan.stateProcess(SubPlan.java:653) ~[seatunnel-starter.jar:2.3.6]
    at org.apache.seatunnel.engine.server.dag.physical.SubPlan.updatePipelineState(SubPlan.java:376) ~[seatunnel-starter.jar:2.3.6]
    at org.apache.seatunnel.engine.server.dag.physical.SubPlan.handleCheckpointError(SubPlan.java:594) ~[seatunnel-starter.jar:2.3.6]
    at org.apache.seatunnel.engine.server.master.JobMaster.lambda$handleCheckpointError$3(JobMaster.java:395) ~[seatunnel-starter.jar:2.3.6]
    at java.util.ArrayList.forEach(ArrayList.java:1259) ~[?:1.8.0_342]
    at org.apache.seatunnel.engine.server.master.JobMaster.handleCheckpointError(JobMaster.java:392) ~[seatunnel-starter.jar:2.3.6]
    at org.apache.seatunnel.engine.server.checkpoint.CheckpointManager.handleCheckpointError(CheckpointManager.java:176) ~[seatunnel-starter.jar:2.3.6]
    at org.apache.seatunnel.engine.server.checkpoint.CheckpointCoordinator.handleCoordinatorError(CheckpointCoordinator.java:290) ~[seatunnel-starter.jar:2.3.6]
    at org.apache.seatunnel.engine.server.checkpoint.CheckpointCoordinator.lambda$null$9(CheckpointCoordinator.java:658) ~[seatunnel-starter.jar:2.3.6]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_342]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_342]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_342]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_342]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_342]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_342]
    at java.lang.Thread.run(Thread.java:750) [?:1.8.0_342]
```

### Zeta or Flink or Spark Version

Zeta

### Java or Scala Version

1.8

### Screenshots

_No response_

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@seatunnel.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
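
A quick cross-check of the report above, as a minimal sketch. The two timestamps and the `checkpoint.timeout` value are copied from the log and config in this issue; the helper names `parse_log_time` and `elapsed_ms` are hypothetical and only serve the illustration. The script computes how long checkpoint 1 was pending before it was marked as timed out, so the result can be compared with the configured timeout.

```python
from datetime import datetime, timedelta

# Values copied from this issue (not measured independently).
TRIGGERED = "2024-08-08 03:36:07,565"  # "wait checkpoint completed: 1"
TIMED_OUT = "2024-08-08 03:37:07,578"  # "timeout checkpoint: 873771488035995649/1/1"
CONFIGURED_TIMEOUT_MS = 600000         # checkpoint.timeout in the posted config


def parse_log_time(stamp: str) -> datetime:
    """Parse a log timestamp of the form 'YYYY-MM-DD HH:MM:SS,mmm'."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")


def elapsed_ms(start: str, end: str) -> int:
    """Return the whole milliseconds between two log timestamps."""
    delta = parse_log_time(end) - parse_log_time(start)
    return delta // timedelta(milliseconds=1)


observed_ms = elapsed_ms(TRIGGERED, TIMED_OUT)
print(f"pending for {observed_ms} ms, configured timeout {CONFIGURED_TIMEOUT_MS} ms")
print("expired before configured timeout:", observed_ms < CONFIGURED_TIMEOUT_MS)
```

With the values from this report, the checkpoint was abandoned after roughly 60 seconds, well short of the 600000 ms in the posted config. One possible explanation, which is an assumption and not confirmed anywhere in the report, is that a different timeout is in effect, for example one set in the job config `env` that the error message itself mentions.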