Hello,
Thanks for your suggestion.
In JobManager, I've found things such like:

```
....







































































*2023-08-21 02:06:23,882 INFO
 org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] -
Triggering checkpoint 892 (type=CheckpointType{name='Checkpoint',
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1692583583878 for job
ffffffff974e401c0000000000000000.2023-08-21 02:06:25,984 INFO
 org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] -
Completed checkpoint 892 for job ffffffff974e401c0000000000000000 (4372
bytes, checkpointDuration=1644 ms, finalizationTime=462 ms).2023-08-21
02:06:26,277 INFO  org.apache.flink.runtime.jobmaster.JobMaster
    [] - Triggering stop-with-savepoint for job
ffffffff974e401c0000000000000000.2023-08-21 02:06:26,476 INFO
 org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] -
Triggering checkpoint 893 (type=SavepointType{name='Suspend Savepoint',
postCheckpointAction=SUSPEND, formatType=CANONICAL}) @ 1692583586279 for
job ffffffff974e401c0000000000000000.2023-08-21 02:06:26,680 INFO
 org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Decline
checkpoint 893 by task
715b66bf6a5d097651510dab7a8ed04b_bc764cd8ddf7a0cff126f51c16239658_0_0 of
job ffffffff974e401c0000000000000000 at 172.32.113.103:6122-4a7566 @
172.32.113.103 (dataPort=44803).org.apache.flink.util.SerializedThrowable:
org.apache.flink.runtime.checkpoint.CheckpointException: Task name with
subtask : Source: kafka-source (1/2)#0 Failure reason: Task has failed. at
org.apache.flink.runtime.taskmanager.Task.declineCheckpoint(Task.java:1395)
~[flink-dist-1.17.1.jar:1.17.1] at
org.apache.flink.runtime.taskmanager.Task.lambda$triggerCheckpointBarrier$3(Task.java:1338)
~[flink-dist-1.17.1.jar:1.17.1] at
java.util.concurrent.CompletableFuture.uniHandle(Unknown Source) ~[?:?] at
java.util.concurrent.CompletableFuture$UniHandle.tryFire(Unknown Source)
~[?:?] at java.util.concurrent.CompletableFuture.postComplete(Unknown
Source) ~[?:?] at
java.util.concurrent.CompletableFuture.completeExceptionally(Unknown
Source) ~[?:?] at
org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:344)
~[flink-dist-1.17.1.jar:1.17.1]Caused by:
org.apache.flink.util.SerializedThrowable:
java.util.concurrent.CompletionException:
org.apache.flink.streaming.connectors.kafka.internals.Handover$ClosedException
at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source)
~[?:?] at java.util.concurrent.CompletableFuture.completeThrowable(Unknown
Source) ~[?:?] at
java.util.concurrent.CompletableFuture$UniCompose.tryFire(Unknown Source)
~[?:?] ... 3 more...2023-08-21 02:06:26,775 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
kafka-source (1/2)
(715b66bf6a5d097651510dab7a8ed04b_bc764cd8ddf7a0cff126f51c16239658_0_0)
switched from RUNNING to FAILED on 172.32.113.103:6122-4a7566 @
172.32.113.103
(dataPort=44803).org.apache.flink.streaming.connectors.kafka.internals.Handover$ClosedException:
null at
org.apache.flink.streaming.connectors.kafka.internals.Handover.close(Handover.java:177)
~[?:?]...2023-08-21 02:06:26,775 WARN
 org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed
to trigger or complete checkpoint 893 for job
ffffffff974e401c0000000000000000. (0 consecutive failed attempts so
far)org.apache.flink.runtime.checkpoint.CheckpointException: Task has
failed. at
org.apache.flink.runtime.messages.checkpoint.SerializedCheckpointException.unwrap(SerializedCheckpointException.java:51)
~[flink-dist-1.17.1.jar:1.17.1] at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveDeclineMessage(CheckpointCoordinator.java:1080)
~[flink-dist-1.17.1.jar:1.17.1] at
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$declineCheckpoint$2(ExecutionGraphHandler.java:103)
~[flink-dist-1.17.1.jar:1.17.1] at
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
~[flink-dist-1.17.1.jar:1.17.1] at
java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?] at
java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?] at
java.lang.Thread.run(Unknown Source) [?:?]Caused by:
org.apache.flink.util.SerializedThrowable:
org.apache.flink.runtime.checkpoint.CheckpointException: Task name with
subtask : Source: kafka-source (1/2)#0 Failure reason: Task has failed. at
org.apache.flink.runtime.taskmanager.Task.declineCheckpoint(Task.java:1395)
~[flink-dist-1.17.1.jar:1.17.1] at
org.apache.flink.runtime.taskmanager.Task.lambda$triggerCheckpointBarrier$3(Task.java:1338)
~[flink-dist-1.17.1.jar:1.17.1] at
java.util.concurrent.CompletableFuture.uniHandle(Unknown Source) ~[?:?] at
java.util.concurrent.CompletableFuture$UniHandle.tryFire(Unknown Source)
~[?:?] at java.util.concurrent.CompletableFuture.postComplete(Unknown
Source) ~[?:?] at
java.util.concurrent.CompletableFuture.completeExceptionally(Unknown
Source) ~[?:?] at
org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:344)
...2023-08-21 02:06:26,874 INFO
 org.apache.flink.runtime.jobmaster.JobMaster                 [] - 3 tasks
will be restarted to recover the failed task
715b66bf6a5d097651510dab7a8ed04b_bc764cd8ddf7a0cff126f51c16239658_0_0.2023-08-21
02:06:26,875 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
    [] - Job flink-test-job (ffffffff974e401c0000000000000000) switched
from state RUNNING to RESTARTING.2023-08-21 02:06:26,876 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Sink:
Unnamed (1/2)
(715b66bf6a5d097651510dab7a8ed04b_ea632d67b7d595e5b851708ae9ad79d6_0_0)
switched from RUNNING to CANCELING.2023-08-21 02:06:26,879 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - FlatMap
message to object (1/2)
(715b66bf6a5d097651510dab7a8ed04b_0a448493b4782967b150582570326227_0_0)
switched from RUNNING to CANCELING.......2023-08-21 02:06:26,973 INFO
 org.apache.flink.runtime.jobmaster.JobMaster                 [] - 3 tasks
will be restarted to recover the failed task
715b66bf6a5d097651510dab7a8ed04b_bc764cd8ddf7a0cff126f51c16239658_1_0.2023-08-21
02:06:26,974 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
    [] - Sink: Unnamed (2/2)
(715b66bf6a5d097651510dab7a8ed04b_ea632d67b7d595e5b851708ae9ad79d6_1_0)
switched from RUNNING to CANCELING.2023-08-21 02:06:26,974 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - FlatMap
message to object (2/2)
(715b66bf6a5d097651510dab7a8ed04b_0a448493b4782967b150582570326227_1_0)
switched from RUNNING to CANCELING.2023-08-21 02:06:26,978 INFO
 org.apache.flink.runtime.jobmaster.JobMaster                 [] - Trying
to recover from a global
failure.org.apache.flink.runtime.checkpoint.CheckpointException: Task has
failed. at
org.apache.flink.runtime.messages.checkpoint.SerializedCheckpointException.unwrap(SerializedCheckpointException.java:51)
~[flink-dist-1.17.1.jar:1.17.1] at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveDeclineMessage(CheckpointCoordinator.java:1080)
~[flink-dist-1.17.1.jar:1.17.1] at
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$declineCheckpoint$2(ExecutionGraphHandler.java:103)
~[flink-dist-1.17.1.jar:1.17.1] at
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
~[flink-dist-1.17.1.jar:1.17.1] at
java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?] at
java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
at java.lang.Thread.run(Unknown Source) ~[?:?]Caused by:
org.apache.flink.util.SerializedThrowable:
org.apache.flink.runtime.checkpoint.CheckpointException: Task name with
subtask : Source: kafka-source (1/2)#0 Failure reason: Task has failed. at
org.apache.flink.runtime.taskmanager.Task.declineCheckpoint(Task.java:1395)
~[flink-dist-1.17.1.jar:1.17.1]...2023-08-21 02:06:27,076 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Sink:
Unnamed (2/2)
(715b66bf6a5d097651510dab7a8ed04b_ea632d67b7d595e5b851708ae9ad79d6_1_0)
switched from CANCELING to CANCELED.2023-08-21 02:06:27,077 INFO
 org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
[] - Clearing resource requirements of job
ffffffff974e401c00000000000000002023-08-21 02:06:27,078 WARN
 org.apache.flink.runtime.jobmaster.JobMaster                 [] -
Stop-with-savepoint transitioned from FinalState to FinalState on execution
termination handling for job ffffffff974e401c0000000000000000 with some
executions being in an not-finished state: [CANCELED, FAILED]2023-08-21
02:06:27,884 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation
    [] - No hostname could be resolved for the IP address 172.32.113.103,
using IP address as host name. Local input split assignment (such as for
HDFS files) may be impacted.2023-08-21 02:06:27,981 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job
flink-test-job (ffffffff974e401c0000000000000000) switched from state
RESTARTING to RUNNING.......2023-08-21 02:06:33,369 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Sink:
Unnamed (1/2)
(715b66bf6a5d097651510dab7a8ed04b_ea632d67b7d595e5b851708ae9ad79d6_0_1)
switched from INITIALIZING to RUNNING.2023-08-21 02:06:34,470 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Sink:
Unnamed (2/2)
(715b66bf6a5d097651510dab7a8ed04b_ea632d67b7d595e5b851708ae9ad79d6_1_1)
switched from INITIALIZING to RUNNING.2023-08-21 02:06:38,272 INFO
 org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] -
Triggering checkpoint 894 (type=CheckpointType{name='Checkpoint',
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1692583598269 for job
ffffffff974e401c0000000000000000.2023-08-21 02:06:39,503 INFO
 org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] -
Completed checkpoint 894 for job ffffffff974e401c0000000000000000 (4372
bytes, checkpointDuration=1199 ms, finalizationTime=35 ms).*
...
```
(also attached full log)

Depending on the log, it seems 'stop' via API has been failed by 'Kafka
Connector' part, and the checkpoint is not being completed well.
(I'm using `FlinkKafkaConsumer` instead of `KafkaSource`, for some reasons.
It seems it is not removed until 1.17 version)

Does someone know how to avoid this kind of issue?

Thanks.


2023년 8월 20일 (일) 오후 3:36, liu ron <ron9....@gmail.com>님이 작성:

> Hi,
>
> I think you can check the client side and JobManager side log to get more
> info.
>
> Best,
> Ron
>
> Dennis Jung <inylov...@gmail.com> 于2023年8月18日周五 10:41写道:
>
>> Hello people,
>>
>> I'm facing failure when I try to stop running Flink job with REST API
>> 'jobs/:jobid/stop'
>>
>> ```
>> ...
>> java.util.concurrent.CompletionException:
>> org.apache.flink.runtime.checkpoint.CheckpointException: Task has failed.
>>  java.base/java.util.concurrent.CompletableFuture.encodeRelay(Unknown
>> Source)
>> ...
>> ...
>>  
>> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown
>> Source)
>>
>>  
>> org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1298)
>>
>>  
>> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$null$1(ClassLoadingUtils.java:93)
>>
>>  
>> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
>>
>>  
>> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$guardCompletionWithContextClassLoader$2(ClassLoadingUtils.java:92)
>>  java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown
>> Source)
>>  
>> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown
>> Source)
>> ...
>> ...
>>  java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown
>> Source)\nCaused by:
>> org.apache.flink.runtime.checkpoint.CheckpointException: Task has failed.
>>
>>  
>> org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:545)
>>
>>  
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2147)
>>
>>  
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveDeclineMessage(CheckpointCoordinator.java:1106)
>>
>>  
>> org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$declineCheckpoint$2(ExecutionGraphHandler.java:103)
>>
>>  
>> org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
>>  java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
>> Source)
>>  java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
>> Source)
>>  java.base/java.lang.Thread.run(Unknown Source)\nCaused by:
>> org.apache.flink.runtime.checkpoint.CheckpointException:
>> org.apache.flink.runtime.checkpoint.CheckpointException: Task name with
>> subtask : Source: kafka-source (1/2)#0 Failure reason: Task has failed.
>>
>>  org.apache.flink.runtime.taskmanager.Task.declineCheckpoint(Task.java:1395)
>>
>>  
>> org.apache.flink.runtime.taskmanager.Task.lambda$triggerCheckpointBarrier$3(Task.java:1338)
>> ...
>> ```
>>
>> Was there some similar issue?
>>
>> Regards
>>
>>
>>

Attachment: flink-stop-api-failure-jobmanager
Description: Binary data

Reply via email to