hidataplus opened a new issue, #7339:
URL: https://github.com/apache/seatunnel/issues/7339

   ### Search before asking
   
   - [X] I had searched in the [issues](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22bug%22) and found no similar issues.
   
   
   ### What happened
   
   After upgrading to 2.3.6, we found that checkpoints fail with a timeout.
   
   ### SeaTunnel Version
   
   2.3.6
   
   ### SeaTunnel Config
   
   ```conf
   seatunnel:
     engine:
       history-job-expire-minutes: 1440
       backup-count: 1
       queue-type: blockingqueue
       print-execution-info-interval: 60
       print-job-metrics-info-interval: 60
       classloader-cache-mode: false
       slot-service:
         dynamic-slot: True
   
       checkpoint:
         interval: 10000
         timeout: 600000
         max-concurrent: 5
         tolerable-failure: 2
         storage:
           type: hdfs
           max-retained: 3
           plugin-config:
             namespace: /seatunnel/checkpoint_snapshot
             storage.type: hdfs
             fs.defaultFS: hdfs://datanode01:8020 # Ensure that the directory has write permission
   ```
   
   
   ### Running Command
   
   ```shell
   bin/seatunnel.sh --config $SEATUNNEL_HOME/config/v2.batch.config.template
   ```
   
   
   ### Error Exception
   
   ```log
   2024-08-08 03:35:57,530 INFO  [.s.e.s.c.CheckpointCoordinator] [seatunnel-coordinator-service-2] - checkpoint is enabled, start schedule trigger pending checkpoint.
   2024-08-08 03:35:57,614 INFO  [o.a.s.a.e.LoggingEventHandler ] [hz.main.generic-operation.thread-0] - log event: EnumeratorOpenEvent(createdTime=1723088157612, jobId=873771488035995649, eventType=LIFECYCLE_ENUMERATOR_OPEN)
   2024-08-08 03:36:07,565 INFO  [.s.e.s.c.CheckpointCoordinator] [seatunnel-coordinator-service-2] - wait checkpoint completed: 1
   2024-08-08 03:36:44,995 INFO  [o.a.s.e.s.CoordinatorService  ] [pool-4-thread-1] - [datanode01]:5801 [seatunnel] [5.1]
   ***********************************************
        CoordinatorService Thread Pool Status
   ***********************************************
   activeCount               :                   1
   corePoolSize              :                   0
   maximumPoolSize           :          2147483647
   poolSize                  :                   4
   completedTaskCount        :                  16
   taskCount                 :                  17
   ***********************************************
   
   2024-08-08 03:36:44,996 INFO  [o.a.s.e.s.CoordinatorService  ] [pool-4-thread-1] - [datanode01]:5801 [seatunnel] [5.1]
   ***********************************************
                   Job info detail
   ***********************************************
   createdJobCount           :                   0
   scheduledJobCount         :                   0
   runningJobCount           :                   1
   failingJobCount           :                   0
   failedJobCount            :                   0
   cancellingJobCount        :                   0
   canceledJobCount          :                   0
   finishedJobCount          :                   0
   ***********************************************
   
   2024-08-08 03:37:07,578 INFO  [.s.e.s.c.CheckpointCoordinator] [checkpoint-coordinator-1/873771488035995649] - timeout checkpoint: 873771488035995649/1/1, CHECKPOINT_TYPE
   2024-08-08 03:37:07,580 INFO  [.s.e.s.c.CheckpointCoordinator] [checkpoint-coordinator-1/873771488035995649] - start clean pending checkpoint cause Checkpoint expired before completing. Please increase checkpoint timeout in the seatunnel.yaml or jobConfig env.
   2024-08-08 03:37:07,580 ERROR [.s.e.s.c.CheckpointCoordinator] [seatunnel-coordinator-service-4] - trigger checkpoint failed
   org.apache.seatunnel.engine.server.checkpoint.CheckpointException: Checkpoint expired before completing. Please increase checkpoint timeout in the seatunnel.yaml or jobConfig env.
           at org.apache.seatunnel.engine.server.checkpoint.PendingCheckpoint.abortCheckpoint(PendingCheckpoint.java:176) ~[seatunnel-starter.jar:2.3.6]
           at org.apache.seatunnel.engine.server.checkpoint.CheckpointCoordinator.lambda$cleanPendingCheckpoint$20(CheckpointCoordinator.java:780) ~[seatunnel-starter.jar:2.3.6]
           at java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4707) ~[?:1.8.0_342]
           at org.apache.seatunnel.engine.server.checkpoint.CheckpointCoordinator.cleanPendingCheckpoint(CheckpointCoordinator.java:778) ~[seatunnel-starter.jar:2.3.6]
           at org.apache.seatunnel.engine.server.checkpoint.CheckpointCoordinator.handleCoordinatorError(CheckpointCoordinator.java:285) ~[seatunnel-starter.jar:2.3.6]
           at org.apache.seatunnel.engine.server.checkpoint.CheckpointCoordinator.lambda$null$9(CheckpointCoordinator.java:658) ~[seatunnel-starter.jar:2.3.6]
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_342]
           at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_342]
           at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_342]
           at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_342]
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_342]
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_342]
           at java.lang.Thread.run(Thread.java:750) [?:1.8.0_342]
   2024-08-08 03:37:07,582 INFO  [.s.e.s.c.CheckpointCoordinator] [seatunnel-coordinator-service-4] - start clean pending checkpoint cause CheckpointCoordinator inside have error.
   2024-08-08 03:37:07,583 INFO  [.s.e.s.c.CheckpointCoordinator] [checkpoint-coordinator-1/873771488035995649] - Turn checkpoint_state_873771488035995649_1 state from null to FAILED
   2024-08-08 03:37:07,593 INFO  [.s.e.s.c.CheckpointCoordinator] [seatunnel-coordinator-service-4] - Turn checkpoint_state_873771488035995649_1 state from FAILED to FAILED
   2024-08-08 03:37:07,594 WARN  [o.a.s.e.s.d.p.SubPlan         ] [checkpoint-coordinator-1/873771488035995649] - Job SeaTunnel_Job (873771488035995649), Pipeline: [(1/1)] checkpoint have error, cancel the pipeline
   2024-08-08 03:37:07,600 WARN  [o.a.s.e.s.d.p.SubPlan         ] [seatunnel-coordinator-service-4] - Job SeaTunnel_Job (873771488035995649), Pipeline: [(1/1)] checkpoint have error, cancel the pipeline
   2024-08-08 03:37:07,606 INFO  [o.a.s.e.s.d.p.SubPlan         ] [checkpoint-coordinator-1/873771488035995649] - Job SeaTunnel_Job (873771488035995649), Pipeline: [(1/1)] turned from state RUNNING to CANCELING.
   2024-08-08 03:37:07,606 INFO  [o.a.s.e.s.d.p.PhysicalVertex  ] [checkpoint-coordinator-1/873771488035995649] - Job SeaTunnel_Job (873771488035995649), Pipeline: [(1/1)], task: [pipeline-1 [Source[0]-FakeSource]-SplitEnumerator (1/1)] state process is start
   2024-08-08 03:37:07,617 INFO  [o.a.s.e.s.d.p.PhysicalVertex  ] [checkpoint-coordinator-1/873771488035995649] - Job SeaTunnel_Job (873771488035995649), Pipeline: [(1/1)], task: [pipeline-1 [Source[0]-FakeSource]-SplitEnumerator (1/1)] turned from state RUNNING to CANCELING.
   2024-08-08 03:37:07,619 WARN  [c.h.i.s.t.TcpServerConnection ] [checkpoint-coordinator-1/873771488035995649] - [datanode01]:5801 [seatunnel] [5.1] Connection[id=4, /192.168.2.1:5801->/192.168.2.2:48179, qualifier=null, endpoint=[datanode02]:5802, remoteUuid=70999fc0-47f5-40a3-9c53-5003385646e5, alive=false, connectionType=MEMBER, planeIndex=0] closed. Reason: Exception in Connection[id=4, /192.168.2.1:5801->/192.168.2.2:48179, qualifier=null, endpoint=[datanode02]:5802, remoteUuid=70999fc0-47f5-40a3-9c53-5003385646e5, alive=true, connectionType=MEMBER, planeIndex=0], thread=checkpoint-coordinator-1/873771488035995649
   java.nio.channels.ClosedByInterruptException: null
           at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202) ~[?:1.8.0_342]
           at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:477) ~[?:1.8.0_342]
           at com.hazelcast.internal.networking.nio.NioOutboundPipeline.flushToSocket(NioOutboundPipeline.java:439) ~[seatunnel-starter.jar:2.3.6]
           at com.hazelcast.internal.networking.nio.NioOutboundPipeline.process(NioOutboundPipeline.java:324) ~[seatunnel-starter.jar:2.3.6]
           at com.hazelcast.internal.networking.nio.NioOutboundPipeline.executePipeline(NioOutboundPipeline.java:240) ~[seatunnel-starter.jar:2.3.6]
           at com.hazelcast.internal.networking.nio.NioOutboundPipeline.write(NioOutboundPipeline.java:218) ~[seatunnel-starter.jar:2.3.6]
           at com.hazelcast.internal.networking.nio.NioChannel.write(NioChannel.java:79) ~[seatunnel-starter.jar:2.3.6]
           at com.hazelcast.internal.server.tcp.TcpServerConnection.write(TcpServerConnection.java:222) ~[seatunnel-starter.jar:2.3.6]
           at com.hazelcast.spi.impl.operationservice.impl.OutboundOperationHandler.send(OutboundOperationHandler.java:59) ~[seatunnel-starter.jar:2.3.6]
           at com.hazelcast.spi.impl.operationservice.impl.Invocation.doInvokeRemote(Invocation.java:612) ~[seatunnel-starter.jar:2.3.6]
           at com.hazelcast.spi.impl.operationservice.impl.Invocation.doInvoke(Invocation.java:582) ~[seatunnel-starter.jar:2.3.6]
           at com.hazelcast.spi.impl.operationservice.impl.Invocation.invoke0(Invocation.java:541) ~[seatunnel-starter.jar:2.3.6]
           at com.hazelcast.spi.impl.operationservice.impl.Invocation.invoke(Invocation.java:241) ~[seatunnel-starter.jar:2.3.6]
           at com.hazelcast.spi.impl.operationservice.impl.InvocationBuilderImpl.invoke(InvocationBuilderImpl.java:61) ~[seatunnel-starter.jar:2.3.6]
           at org.apache.seatunnel.engine.server.dag.physical.PhysicalVertex.checkTaskGroupIsExecuting(PhysicalVertex.java:255) ~[seatunnel-starter.jar:2.3.6]
           at org.apache.seatunnel.engine.server.dag.physical.PhysicalVertex.noticeTaskExecutionServiceCancel(PhysicalVertex.java:418) ~[seatunnel-starter.jar:2.3.6]
           at org.apache.seatunnel.engine.server.dag.physical.PhysicalVertex.stateProcess(PhysicalVertex.java:580) ~[seatunnel-starter.jar:2.3.6]
           at org.apache.seatunnel.engine.server.dag.physical.PhysicalVertex.updateTaskState(PhysicalVertex.java:400) ~[seatunnel-starter.jar:2.3.6]
           at org.apache.seatunnel.engine.server.dag.physical.PhysicalVertex.cancel(PhysicalVertex.java:411) ~[seatunnel-starter.jar:2.3.6]
           at org.apache.seatunnel.engine.server.dag.physical.SubPlan.lambda$stateProcess$21(SubPlan.java:656) ~[seatunnel-starter.jar:2.3.6]
           at java.util.ArrayList.forEach(ArrayList.java:1259) ~[?:1.8.0_342]
           at org.apache.seatunnel.engine.server.dag.physical.SubPlan.stateProcess(SubPlan.java:653) ~[seatunnel-starter.jar:2.3.6]
           at org.apache.seatunnel.engine.server.dag.physical.SubPlan.updatePipelineState(SubPlan.java:376) ~[seatunnel-starter.jar:2.3.6]
           at org.apache.seatunnel.engine.server.dag.physical.SubPlan.handleCheckpointError(SubPlan.java:594) ~[seatunnel-starter.jar:2.3.6]
           at org.apache.seatunnel.engine.server.master.JobMaster.lambda$handleCheckpointError$3(JobMaster.java:395) ~[seatunnel-starter.jar:2.3.6]
           at java.util.ArrayList.forEach(ArrayList.java:1259) ~[?:1.8.0_342]
           at org.apache.seatunnel.engine.server.master.JobMaster.handleCheckpointError(JobMaster.java:392) ~[seatunnel-starter.jar:2.3.6]
           at org.apache.seatunnel.engine.server.checkpoint.CheckpointManager.handleCheckpointError(CheckpointManager.java:176) ~[seatunnel-starter.jar:2.3.6]
           at org.apache.seatunnel.engine.server.checkpoint.CheckpointCoordinator.handleCoordinatorError(CheckpointCoordinator.java:290) ~[seatunnel-starter.jar:2.3.6]
           at org.apache.seatunnel.engine.server.checkpoint.CheckpointCoordinator.lambda$null$9(CheckpointCoordinator.java:658) ~[seatunnel-starter.jar:2.3.6]
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_342]
           at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_342]
           at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_342]
           at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_342]
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_342]
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_342]
           at java.lang.Thread.run(Thread.java:750) [?:1.8.0_342]
   ```
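
A rough check of the timestamps in the log above suggests the checkpoint was aborted about 60 seconds after it was triggered, although `timeout` in the `seatunnel.yaml` shown earlier is 600000 ms. The sketch below only does the timestamp arithmetic. It assumes the `wait checkpoint completed: 1` line marks the moment checkpoint 1 was triggered. The helper name `elapsed_ms` is made up for illustration and is not part of SeaTunnel.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the log lines above, e.g. "2024-08-08 03:36:07,565".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_ms(start: str, end: str) -> int:
    """Return the whole milliseconds between two log timestamps (hypothetical helper)."""
    delta = datetime.strptime(end, LOG_TIME_FORMAT) - datetime.strptime(start, LOG_TIME_FORMAT)
    return delta // timedelta(milliseconds=1)


# Assumption: this line is logged when checkpoint 1 is triggered.
triggered = "2024-08-08 03:36:07,565"  # "wait checkpoint completed: 1"
timed_out = "2024-08-08 03:37:07,578"  # "timeout checkpoint: 873771488035995649/1/1"

configured_timeout_ms = 600000  # checkpoint timeout from the seatunnel.yaml above

observed_ms = elapsed_ms(triggered, timed_out)
print(f"observed: {observed_ms} ms, configured: {configured_timeout_ms} ms")
print("expired before configured timeout:", observed_ms < configured_timeout_ms)
```

If that assumption holds, the timeout that took effect may not be the one from this `seatunnel.yaml` (for example, it could come from the job config `env`, which the error message also mentions), but the log alone does not confirm this.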
   
   
   ### Zeta or Flink or Spark Version
   
   Zeta
   
   ### Java or Scala Version
   
   1.8
   
   ### Screenshots
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


