Thanks for the answer. I don't think this is an operator issue: the Kubernetes Deployment simply restarts the failed JobManager pod, and the restarted JobManager, finding no HA metadata, starts the job on its own from an empty state. I'm looking for a way to prevent the JobManager from exiting in case of a job error (we use an application mode cluster).
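Looking at the 1.15 documentation, would something along these lines be the right direction once we upgrade? This is only a sketch: the option names are taken from the Flink 1.15 docs, and whether these values are the right ones for our setup is my assumption.

    # Fragment of spec.flinkConfiguration in the FlinkDeployment (or flink-conf.yaml).
    # Only meaningful on Flink 1.15+; our current 1.14.4 does not have these options.
    flinkConfiguration:
      # Keep the JobManager (and its REST endpoint) running even after the
      # application job reaches a terminal state such as FAILED, instead of
      # shutting the cluster entrypoint down.
      execution.shutdown-on-application-finish: "false"
      # If the job fails before/during submission or recovery (like our
      # "There is no operator for the state ..." error), register it as FAILED
      # instead of tearing the whole cluster down; requires HA to be enabled.
      execution.submit-failed-job-on-application-error: "true"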
________________________________
From: Gyula Fóra <gyula.f...@gmail.com>
Sent: September 20, 2022, 19:49:37
To: Evgeniy Lyutikov
Cc: user@flink.apache.org
Subject: Re: JobManager restarts on job failure

The best thing for you to do would be to upgrade to Flink 1.15 and the latest operator version. In Flink 1.15 we have the option to interact with the Flink JobManager even after the job FAILED, and the operator leverages this for much more robust behaviour.

In any case, the operator should never start the job from an empty state (even if it FAILED). If you think that is happening, could you please open a JIRA ticket with the accompanying JobManager and operator logs?

Thanks
Gyula

On Tue, Sep 20, 2022 at 1:00 PM Evgeniy Lyutikov <eblyuti...@avito.ru> wrote:
Hi,
We are using Flink 1.14.4 with the Flink Kubernetes operator. Sometimes when updating a job it fails on startup, and Flink removes all HA metadata and exits the JobManager.

2022-09-14 14:54:44,534 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring job 00000000000000000000000000000000 from Checkpoint 30829 @ 1663167158684 for 00000000000000000000000000000000 located at s3p://flink-checkpoints/k8s-checkpoint-job-name/00000000000000000000000000000000/chk-30829.
2022-09-14 14:54:44,638 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job 00000000000000000000000000000000 reached terminal state FAILED.
org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster.
Caused by: java.util.concurrent.CompletionException: java.lang.IllegalStateException: There is no operator for the state 4e1d9dde287c33a35e7970cbe64a40fe
2022-09-14 14:54:44,930 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
2022-09-14 14:54:45,020 INFO  org.apache.flink.kubernetes.highavailability.KubernetesHaServices [] - Clean up the high availability data for job 00000000000000000000000000000000.
2022-09-14 14:54:45,020 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Shutting KubernetesApplicationClusterEntrypoint down with application status UNKNOWN. Diagnostics Cluster entrypoint has been closed externally..
2022-09-14 14:54:45,026 INFO  org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Shutting down rest endpoint.
2022-09-14 14:54:46,122 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Shutting down remote daemon.
2022-09-14 14:54:46,321 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Remoting shut down.

Kubernetes restarts the JobManager pod, and the new instance, not finding the HA metadata, starts the job from an empty state.
Is there some option to prevent the JobManager from exiting after the job reaches the FAILED state?
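A rough sketch of the kind of FlinkDeployment that Gyula's suggestion points to (Flink 1.15, latest operator, Kubernetes HA, last-state upgrades). All names, the image, and the paths below are placeholders rather than values from this thread:

    apiVersion: flink.apache.org/v1beta1
    kind: FlinkDeployment
    metadata:
      name: job-name                            # placeholder
    spec:
      image: my-registry/job-image:1.15         # placeholder
      flinkVersion: v1_15
      serviceAccount: flink
      flinkConfiguration:
        # Kubernetes HA so job metadata survives JobManager restarts; also
        # needed for last-state upgrades and for
        # execution.submit-failed-job-on-application-error.
        high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
        high-availability.storageDir: s3://flink-ha/job-name          # placeholder
        state.checkpoints.dir: s3p://flink-checkpoints/k8s-checkpoint-job-name
        # Plus the two execution.* options sketched earlier in this thread.
      jobManager:
        resource:
          cpu: 1
          memory: 2048m
      taskManager:
        resource:
          cpu: 2
          memory: 4096m
      job:
        jarURI: local:///opt/flink/usrlib/job.jar  # placeholder
        parallelism: 4
        upgradeMode: last-state                    # restart from the latest checkpoint on upgrades
        state: running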