Thanks for your reply I update operator to version 1.1.0 and nothing has changed
2022-08-11 11:56:48,414 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][job-namespace/job-namespace] Upgrading/Restarting running job, suspending first... 2022-08-11 11:56:48,414 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][job-namespace/job-namespace] Job is in running state, ready for upgrade with SAVEPOINT 2022-08-11 11:56:48,423 o.a.f.k.o.s.FlinkService [INFO ][job-namespace/job-namespace] Suspending job with savepoint. 2022-08-11 11:56:48,480 o.a.f.k.o.r.ReconciliationUtils [WARN ][job-namespace/job-namespace] Attempt count: 0, last attempt: false 2022-08-11 11:56:48,515 i.j.o.p.e.ReconciliationDispatcher [ERROR][job-namespace/job-namespace] Error during event processing ExecutionScope{ resource id: ResourceID{name='job-namespace', namespace='job-namespace'}, version: 109000114} failed. org.apache.flink.kubernetes.operator.exception.ReconciliationException: java.util.concurrent.ExecutionException: org.apache.flink.runtime.rest.util.RestClientException: [org.apache.flink.runtime.rest.NotFoundException: Operation not found under key: org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@a84a2ecd 2022-08-11 11:56:54,680 o.a.f.k.o.c.FlinkDeploymentController [INFO ][job-namespace/job-namespace] Starting reconciliation 2022-08-11 11:56:54,681 o.a.f.k.o.o.JobStatusObserver [INFO ][job-namespace/job-namespace] Observing job status 2022-08-11 11:57:06,695 o.a.f.s.n.i.n.c.AbstractChannel [WARN ] Force-closing a channel whose registration task was not accepted by an event loop: [id: 0xfd6e83a0] 2022-08-11 11:57:06,695 o.a.f.s.n.i.n.u.c.D.rejectedExecution [ERROR] Failed to submit a listener notification task. Event loop shut down? java.util.concurrent.RejectedExecutionException: event executor terminated 2022-08-11 11:57:06,696 o.a.f.k.o.o.JobStatusObserver [ERROR][job-namespace/job-namespace] Exception while listing jobs java.util.concurrent.TimeoutException 2022-08-11 11:57:06,696 o.a.f.k.o.o.d.ApplicationObserver [INFO ][job-namespace/job-namespace] Observing JobManager deployment. Previous status: READY 2022-08-11 11:57:06,696 o.a.f.k.o.o.d.ApplicationObserver [ERROR][job-namespace/job-namespace] Missing JobManager deployment 2022-08-11 11:57:06,735 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][job-namespace/job-namespace] Detected spec change, starting reconciliation. 2022-08-11 11:57:06,754 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][job-namespace/job-namespace] Upgrading/Restarting running job, suspending first... 2022-08-11 11:57:06,759 o.a.f.k.o.c.FlinkDeploymentController [ERROR][job-namespace/job-namespace] Flink Deployment failed org.apache.flink.kubernetes.operator.exception.DeploymentFailedException: JobManager deployment is missing and HA data is not available to make stateful upgrades. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required. After that all pods and configmaps in namespace are removed. I checked the s3 bucket, the savepoint was saved > In general the Flink JobManager HA /client mechanism ensures that the rest > requests end up at the current leader. 2022-08-11 11:43:57,357 o.a.f.k.KubernetesClusterDescriptor [WARN ][job-namespace/job-namespace] Please note that Flink client operations(e.g. cancel, list, stop, savepoint, etc.) won't work from outside the Kubernetes cluster since 'kubernetes.rest-service.exposed.type' has been set to ClusterIP. 2022-08-11 11:43:57,358 o.a.f.k.KubernetesClusterDescriptor [INFO ][job-namespace/job-namespace] Create flink application cluster job-namespace successfully, JobManager Web Interface: http://job-namespace-rest.job-namespace:8081 The Kubernetes operator uses the kubernetes service, which has 3 endpoints (according to the number of running jobmanagers) and when we access it, we get to different jobmanagers. I myself, when going to this address, do not see information of checkpoints in Flink UI, which indicates that the request did not go to the leader ________________________________ От: Gyula Fóra <gyula.f...@gmail.com> Отправлено: 11 августа 2022 г. 16:25:47 Кому: Evgeniy Lyutikov Копия: user@flink.apache.org Тема: Re: Savepoint problen on KubernetesOperator HA cluster In general the Flink JobManager HA /client mechanism ensures that the rest requests end up at the current leader. In your case it's not clear what the actual cause of the issue was. What I would do is to upgrade to the latest operator version (1.1.0) where the savepoint upgrade mechanism has been hardened. If your cluster is already stopped with the savepoint but the operator did not get the response back you might have to perform the steps outlined in: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#manual-recovery<https://eur04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fnightlies.apache.org%2Fflink%2Fflink-kubernetes-operator-docs-main%2Fdocs%2Fcustom-resource%2Fjob-management%2F%23manual-recovery&data=05%7C01%7Ceblyutikov%40avito.ru%7Ce7959d0212fd42f2878008da7b7b8685%7Caf0e07b3b90b472392e63fab11dd5396%7C0%7C0%7C637958067744252998%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=THnbw2fz6MI%2B%2BIKS%2FBSNwDbQncRssUKQjcyKDExbKBg%3D&reserved=0> Savepoint upgrades work significantly more robustly with Flink 1.15+ because there we have the ability to keep the cluster/rest api around even after the application was stopped. In Flink 1.14 and before after stopping the job the cluster disappears, making it difficult to handle all situations. Side note: I think in most cases you should not need more job manager replicas than 1. You still have the same HA guarantees with 1 replica, if it goes down it will be restarted. The behaviour is generally the same. Cheers, Gyula On Thu, Aug 11, 2022 at 11:15 AM Evgeniy Lyutikov <eblyuti...@avito.ru<mailto:eblyuti...@avito.ru>> wrote: Hi, I'm using flink 1.14.4 with flink kubernetes operator 1.0.1 with ha configuration on 3 jobmanager. When trying to change the job configuration, it restarts with trigger savepoint and an error occurs each time: 2022-08-10 12:04:21,142 mo.a.f.k.o.c.FlinkDeploymentController [INFO ][job-namespace/job-namespace] Starting reconciliation 2022-08-10 12:04:21,143 mo.a.f.k.o.o.JobStatusObserver [INFO ][job-namespace/job-namespace] Observing job status 2022-08-10 12:04:21,154 mo.a.f.k.o.o.JobStatusObserver [INFO ][job-namespace/job-namespace] Job status (RUNNING) unchanged 2022-08-10 12:04:21,155 mo.a.f.k.o.c.FlinkConfigManager [INFO ][job-namespace/job-namespace] Generating new config 2022-08-10 12:04:21,157 mo.a.f.k.o.r.d.ApplicationReconciler [INFO ][job-namespace/job-namespace] Upgrading/Restarting running job, suspending first... 2022-08-10 12:04:21,157 mo.a.f.k.o.r.d.ApplicationReconciler [INFO ][job-namespace/job-namespace] Job is in running state, ready for upgrade with SAVEPOINT 2022-08-10 12:04:21,157 mo.a.f.k.o.s.FlinkService [INFO ][job-namespace/job-namespace] Suspending job with savepoint. 2022-08-10 12:04:21,171 mo.a.f.k.o.r.ReconciliationUtils[WARN ][job-namespace/job-namespace] Attempt count: 5, last attempt: true 2022-08-10 12:04:21,242 mi.j.o.p.e.ReconciliationDispatcherESC[m ESC[1;31m[ERROR][job-namespace/job-namespace] Error during event processing ExecutionScope{ resource id: CustomResourceID{name='job-namespace', namespace='job-namespace'}, version: null} failed. org.apache.flink.kubernetes.operator.exception.ReconciliationException: java.util.concurrent.ExecutionException: org.apache.flink.runtime.rest.util.RestClientException: [org.apache.flink.runtime.rest.NotFoundException: Operation not found under key: org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@b41b16a8 After 5 retries 2022-08-10 12:04:21,157 o.a.f.k.o.r.d.ApplicationReconciler [INFO ][job-namespace/job-namespace] Job is in running state, ready for upgrade with SAVEPOINT 2022-08-10 12:04:21,157 o.a.f.k.o.s.FlinkService [INFO ][job-namespace/job-namespace] Suspending job with savepoint. 2022-08-10 12:04:21,171 o.a.f.k.o.r.ReconciliationUtils [WARN ][job-namespace/job-namespace] Attempt count: 5, last attempt: true 2022-08-10 12:04:21,242 i.j.o.p.e.ReconciliationDispatcher [ERROR][job-namespace/job-namespace] Error during event processing ExecutionScope{ resource id: CustomResourceID{name='job-namespace', namespace='job-namespace'}, version: null} failed. org.apache.flink.kubernetes.operator.exception.ReconciliationException: java.util.concurrent.ExecutionException: org.apache.flink.runtime.rest.util.RestClientException: [org.apache.flink.runtime.rest.NotFoundException: Operation not found under key: org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@b41b16a8 2022-08-10 12:04:21,243 i.j.o.p.e.EventProcessor [ERROR][job-namespace/job-namespace] Exhausted retries for ExecutionScope{ resource id: CustomResourceID{name='job-namespace', namespace='job-namespace'}, version: null} 2022-08-10 12:04:53,299 o.a.f.k.o.c.FlinkDeploymentController [INFO ][job-namespace/job-namespace] Starting reconciliation 2022-08-10 12:04:53,299 o.a.f.k.o.o.JobStatusObserver [INFO ][job-namespace/job-namespace] Observing job status 2022-08-10 12:05:03,322 o.a.f.s.n.i.n.c.AbstractChannel [WARN ] Force-closing a channel whose registration task was not accepted by an event loop: [id: 0x4fb8bb3b] java.util.concurrent.RejectedExecutionException: event executor terminated 2022-08-10 12:05:03,323 o.a.f.s.n.i.n.u.c.D.rejectedExecution [ERROR] Failed to submit a listener notification task. Event loop shut down? java.util.concurrent.RejectedExecutionException: event executor terminated 2022-08-10 12:05:03,323 o.a.f.k.o.o.JobStatusObserver [ERROR][job-namespace/job-namespace] Exception while listing jobs java.util.concurrent.TimeoutException 2022-08-10 12:05:03,324 o.a.f.k.o.o.d.ApplicationObserver [INFO ][job-namespace/job-namespace] Observing JobManager deployment. Previous status: READY 2022-08-10 12:05:03,324 o.a.f.k.o.o.d.ApplicationObserver [ERROR][job-namespace/job-namespace] Missing JobManager deployment As I suppose the problem is that the savepoint trigger request and getting its status are sent to different jobmanager Does the operator have a service discovery to get leader jobmanager and work with them? ________________________________ “This message contains confidential information/commercial secret. If you are not the intended addressee of this message you may not copy, save, print or forward it to any third party and you are kindly requested to destroy this message and notify the sender thereof by email. Данное сообщение содержит конфиденциальную информацию/информацию, являющуюся коммерческой тайной. Если Вы не являетесь надлежащим адресатом данного сообщения, Вы не вправе копировать, сохранять, печатать или пересылать его каким либо иным лицам. Просьба уничтожить данное сообщение и уведомить об этом отправителя электронным письмом.”