Hello, Thanks for your suggestion. In JobManager, I've found things such like:
``` .... *2023-08-21 02:06:23,882 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 892 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1692583583878 for job ffffffff974e401c0000000000000000.2023-08-21 02:06:25,984 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 892 for job ffffffff974e401c0000000000000000 (4372 bytes, checkpointDuration=1644 ms, finalizationTime=462 ms).2023-08-21 02:06:26,277 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Triggering stop-with-savepoint for job ffffffff974e401c0000000000000000.2023-08-21 02:06:26,476 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 893 (type=SavepointType{name='Suspend Savepoint', postCheckpointAction=SUSPEND, formatType=CANONICAL}) @ 1692583586279 for job ffffffff974e401c0000000000000000.2023-08-21 02:06:26,680 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Decline checkpoint 893 by task 715b66bf6a5d097651510dab7a8ed04b_bc764cd8ddf7a0cff126f51c16239658_0_0 of job ffffffff974e401c0000000000000000 at 172.32.113.103:6122-4a7566 @ 172.32.113.103 (dataPort=44803).org.apache.flink.util.SerializedThrowable: org.apache.flink.runtime.checkpoint.CheckpointException: Task name with subtask : Source: kafka-source (1/2)#0 Failure reason: Task has failed. at org.apache.flink.runtime.taskmanager.Task.declineCheckpoint(Task.java:1395) ~[flink-dist-1.17.1.jar:1.17.1] at org.apache.flink.runtime.taskmanager.Task.lambda$triggerCheckpointBarrier$3(Task.java:1338) ~[flink-dist-1.17.1.jar:1.17.1] at java.util.concurrent.CompletableFuture.uniHandle(Unknown Source) ~[?:?] at java.util.concurrent.CompletableFuture$UniHandle.tryFire(Unknown Source) ~[?:?] at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?] at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?] at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:344) ~[flink-dist-1.17.1.jar:1.17.1]Caused by: org.apache.flink.util.SerializedThrowable: java.util.concurrent.CompletionException: org.apache.flink.streaming.connectors.kafka.internals.Handover$ClosedException at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?] at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?] at java.util.concurrent.CompletableFuture$UniCompose.tryFire(Unknown Source) ~[?:?] ... 3 more...2023-08-21 02:06:26,775 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: kafka-source (1/2) (715b66bf6a5d097651510dab7a8ed04b_bc764cd8ddf7a0cff126f51c16239658_0_0) switched from RUNNING to FAILED on 172.32.113.103:6122-4a7566 @ 172.32.113.103 (dataPort=44803).org.apache.flink.streaming.connectors.kafka.internals.Handover$ClosedException: null at org.apache.flink.streaming.connectors.kafka.internals.Handover.close(Handover.java:177) ~[?:?]...2023-08-21 02:06:26,775 WARN org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger or complete checkpoint 893 for job ffffffff974e401c0000000000000000. (0 consecutive failed attempts so far)org.apache.flink.runtime.checkpoint.CheckpointException: Task has failed. at org.apache.flink.runtime.messages.checkpoint.SerializedCheckpointException.unwrap(SerializedCheckpointException.java:51) ~[flink-dist-1.17.1.jar:1.17.1] at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveDeclineMessage(CheckpointCoordinator.java:1080) ~[flink-dist-1.17.1.jar:1.17.1] at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$declineCheckpoint$2(ExecutionGraphHandler.java:103) ~[flink-dist-1.17.1.jar:1.17.1] at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119) ~[flink-dist-1.17.1.jar:1.17.1] at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?] at java.lang.Thread.run(Unknown Source) [?:?]Caused by: org.apache.flink.util.SerializedThrowable: org.apache.flink.runtime.checkpoint.CheckpointException: Task name with subtask : Source: kafka-source (1/2)#0 Failure reason: Task has failed. at org.apache.flink.runtime.taskmanager.Task.declineCheckpoint(Task.java:1395) ~[flink-dist-1.17.1.jar:1.17.1] at org.apache.flink.runtime.taskmanager.Task.lambda$triggerCheckpointBarrier$3(Task.java:1338) ~[flink-dist-1.17.1.jar:1.17.1] at java.util.concurrent.CompletableFuture.uniHandle(Unknown Source) ~[?:?] at java.util.concurrent.CompletableFuture$UniHandle.tryFire(Unknown Source) ~[?:?] at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?] at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?] at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:344) ...2023-08-21 02:06:26,874 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - 3 tasks will be restarted to recover the failed task 715b66bf6a5d097651510dab7a8ed04b_bc764cd8ddf7a0cff126f51c16239658_0_0.2023-08-21 02:06:26,875 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job flink-test-job (ffffffff974e401c0000000000000000) switched from state RUNNING to RESTARTING.2023-08-21 02:06:26,876 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Sink: Unnamed (1/2) (715b66bf6a5d097651510dab7a8ed04b_ea632d67b7d595e5b851708ae9ad79d6_0_0) switched from RUNNING to CANCELING.2023-08-21 02:06:26,879 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - FlatMap message to object (1/2) (715b66bf6a5d097651510dab7a8ed04b_0a448493b4782967b150582570326227_0_0) switched from RUNNING to CANCELING.......2023-08-21 02:06:26,973 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - 3 tasks will be restarted to recover the failed task 715b66bf6a5d097651510dab7a8ed04b_bc764cd8ddf7a0cff126f51c16239658_1_0.2023-08-21 02:06:26,974 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Sink: Unnamed (2/2) (715b66bf6a5d097651510dab7a8ed04b_ea632d67b7d595e5b851708ae9ad79d6_1_0) switched from RUNNING to CANCELING.2023-08-21 02:06:26,974 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - FlatMap message to object (2/2) (715b66bf6a5d097651510dab7a8ed04b_0a448493b4782967b150582570326227_1_0) switched from RUNNING to CANCELING.2023-08-21 02:06:26,978 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Trying to recover from a global failure.org.apache.flink.runtime.checkpoint.CheckpointException: Task has failed. at org.apache.flink.runtime.messages.checkpoint.SerializedCheckpointException.unwrap(SerializedCheckpointException.java:51) ~[flink-dist-1.17.1.jar:1.17.1] at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveDeclineMessage(CheckpointCoordinator.java:1080) ~[flink-dist-1.17.1.jar:1.17.1] at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$declineCheckpoint$2(ExecutionGraphHandler.java:103) ~[flink-dist-1.17.1.jar:1.17.1] at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119) ~[flink-dist-1.17.1.jar:1.17.1] at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?] at java.lang.Thread.run(Unknown Source) ~[?:?]Caused by: org.apache.flink.util.SerializedThrowable: org.apache.flink.runtime.checkpoint.CheckpointException: Task name with subtask : Source: kafka-source (1/2)#0 Failure reason: Task has failed. at org.apache.flink.runtime.taskmanager.Task.declineCheckpoint(Task.java:1395) ~[flink-dist-1.17.1.jar:1.17.1]...2023-08-21 02:06:27,076 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Sink: Unnamed (2/2) (715b66bf6a5d097651510dab7a8ed04b_ea632d67b7d595e5b851708ae9ad79d6_1_0) switched from CANCELING to CANCELED.2023-08-21 02:06:27,077 INFO org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Clearing resource requirements of job ffffffff974e401c00000000000000002023-08-21 02:06:27,078 WARN org.apache.flink.runtime.jobmaster.JobMaster [] - Stop-with-savepoint transitioned from FinalState to FinalState on execution termination handling for job ffffffff974e401c0000000000000000 with some executions being in an not-finished state: [CANCELED, FAILED]2023-08-21 02:06:27,884 WARN org.apache.flink.runtime.taskmanager.TaskManagerLocation [] - No hostname could be resolved for the IP address 172.32.113.103, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.2023-08-21 02:06:27,981 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job flink-test-job (ffffffff974e401c0000000000000000) switched from state RESTARTING to RUNNING.......2023-08-21 02:06:33,369 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Sink: Unnamed (1/2) (715b66bf6a5d097651510dab7a8ed04b_ea632d67b7d595e5b851708ae9ad79d6_0_1) switched from INITIALIZING to RUNNING.2023-08-21 02:06:34,470 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Sink: Unnamed (2/2) (715b66bf6a5d097651510dab7a8ed04b_ea632d67b7d595e5b851708ae9ad79d6_1_1) switched from INITIALIZING to RUNNING.2023-08-21 02:06:38,272 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 894 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1692583598269 for job ffffffff974e401c0000000000000000.2023-08-21 02:06:39,503 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 894 for job ffffffff974e401c0000000000000000 (4372 bytes, checkpointDuration=1199 ms, finalizationTime=35 ms).* ... ``` (also attached full log) Depending on the log, it seems 'stop' via API has been failed by 'Kafka Connector' part, and the checkpoint is not being completed well. (I'm using `FlinkKafkaConsumer` instead of `KafkaSource`, for some reasons. It seems it is not removed until 1.17 version) Does someone know how to avoid this kind of issue? Thanks. 2023년 8월 20일 (일) 오후 3:36, liu ron <ron9....@gmail.com>님이 작성: > Hi, > > I think you can check the client side and JobManager side log to get more > info. > > Best, > Ron > > Dennis Jung <inylov...@gmail.com> 于2023年8月18日周五 10:41写道: > >> Hello people, >> >> I'm facing failure when I try to stop running Flink job with REST API >> 'jobs/:jobid/stop' >> >> ``` >> ... >> java.util.concurrent.CompletionException: >> org.apache.flink.runtime.checkpoint.CheckpointException: Task has failed. >> java.base/java.util.concurrent.CompletableFuture.encodeRelay(Unknown >> Source) >> ... >> ... >> >> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown >> Source) >> >> >> org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1298) >> >> >> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$null$1(ClassLoadingUtils.java:93) >> >> >> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) >> >> >> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$guardCompletionWithContextClassLoader$2(ClassLoadingUtils.java:92) >> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown >> Source) >> >> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown >> Source) >> ... >> ... >> java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown >> Source)\nCaused by: >> org.apache.flink.runtime.checkpoint.CheckpointException: Task has failed. >> >> >> org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:545) >> >> >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2147) >> >> >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveDeclineMessage(CheckpointCoordinator.java:1106) >> >> >> org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$declineCheckpoint$2(ExecutionGraphHandler.java:103) >> >> >> org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119) >> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown >> Source) >> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown >> Source) >> java.base/java.lang.Thread.run(Unknown Source)\nCaused by: >> org.apache.flink.runtime.checkpoint.CheckpointException: >> org.apache.flink.runtime.checkpoint.CheckpointException: Task name with >> subtask : Source: kafka-source (1/2)#0 Failure reason: Task has failed. >> >> org.apache.flink.runtime.taskmanager.Task.declineCheckpoint(Task.java:1395) >> >> >> org.apache.flink.runtime.taskmanager.Task.lambda$triggerCheckpointBarrier$3(Task.java:1338) >> ... >> ``` >> >> Was there some similar issue? >> >> Regards >> >> >>
flink-stop-api-failure-jobmanager
Description: Binary data