Re: Savepoint problen on KubernetesOperator HA cluster

Evgeniy Lyutikov Thu, 11 Aug 2022 05:24:05 -0700

Thanks for your reply
I update operator to version 1.1.0 and nothing has changed

2022-08-11 11:56:48,414 o.a.f.k.o.r.d.AbstractJobReconciler [INFO 
][job-namespace/job-namespace] Upgrading/Restarting running job, suspending 
first...
2022-08-11 11:56:48,414 o.a.f.k.o.r.d.AbstractJobReconciler [INFO 
][job-namespace/job-namespace] Job is in running state, ready for upgrade with 
SAVEPOINT
2022-08-11 11:56:48,423 o.a.f.k.o.s.FlinkService       [INFO 
][job-namespace/job-namespace] Suspending job with savepoint.
2022-08-11 11:56:48,480 o.a.f.k.o.r.ReconciliationUtils [WARN 
][job-namespace/job-namespace] Attempt count: 0, last attempt: false
2022-08-11 11:56:48,515 i.j.o.p.e.ReconciliationDispatcher 
[ERROR][job-namespace/job-namespace] Error during event processing 
ExecutionScope{ resource id: ResourceID{name='job-namespace', 
namespace='job-namespace'}, version: 109000114} failed.
org.apache.flink.kubernetes.operator.exception.ReconciliationException: 
java.util.concurrent.ExecutionException: 
org.apache.flink.runtime.rest.util.RestClientException: 
[org.apache.flink.runtime.rest.NotFoundException: Operation not found under 
key: 
org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@a84a2ecd
2022-08-11 11:56:54,680 o.a.f.k.o.c.FlinkDeploymentController [INFO 
][job-namespace/job-namespace] Starting reconciliation
2022-08-11 11:56:54,681 o.a.f.k.o.o.JobStatusObserver  [INFO 
][job-namespace/job-namespace] Observing job status
2022-08-11 11:57:06,695 o.a.f.s.n.i.n.c.AbstractChannel [WARN ] Force-closing a 
channel whose registration task was not accepted by an event loop: [id: 
0xfd6e83a0]
2022-08-11 11:57:06,695 o.a.f.s.n.i.n.u.c.D.rejectedExecution [ERROR] Failed to 
submit a listener notification task. Event loop shut down?
java.util.concurrent.RejectedExecutionException: event executor terminated
2022-08-11 11:57:06,696 o.a.f.k.o.o.JobStatusObserver  
[ERROR][job-namespace/job-namespace] Exception while listing jobs
java.util.concurrent.TimeoutException
2022-08-11 11:57:06,696 o.a.f.k.o.o.d.ApplicationObserver [INFO 
][job-namespace/job-namespace] Observing JobManager deployment. Previous 
status: READY
2022-08-11 11:57:06,696 o.a.f.k.o.o.d.ApplicationObserver 
[ERROR][job-namespace/job-namespace] Missing JobManager deployment
2022-08-11 11:57:06,735 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO 
][job-namespace/job-namespace] Detected spec change, starting reconciliation.
2022-08-11 11:57:06,754 o.a.f.k.o.r.d.AbstractJobReconciler [INFO 
][job-namespace/job-namespace] Upgrading/Restarting running job, suspending 
first...
2022-08-11 11:57:06,759 o.a.f.k.o.c.FlinkDeploymentController 
[ERROR][job-namespace/job-namespace] Flink Deployment failed
org.apache.flink.kubernetes.operator.exception.DeploymentFailedException: 
JobManager deployment is missing and HA data is not available to make stateful 
upgrades. It is possible that the job has finished or terminally failed, or the 
configmaps have been deleted. Manual restore required.

After that all pods and configmaps in namespace are removed. I checked the s3 
bucket, the savepoint was saved

> In general the Flink JobManager HA /client mechanism ensures that the rest 
> requests end up at the current leader.

2022-08-11 11:43:57,357 o.a.f.k.KubernetesClusterDescriptor [WARN 
][job-namespace/job-namespace] Please note that Flink client operations(e.g. 
cancel, list, stop, savepoint, etc.) won't work from outside the Kubernetes 
cluster since 'kubernetes.rest-service.exposed.type' has been set to ClusterIP.
2022-08-11 11:43:57,358 o.a.f.k.KubernetesClusterDescriptor [INFO 
][job-namespace/job-namespace] Create flink application cluster job-namespace 
successfully, JobManager Web Interface: 
http://job-namespace-rest.job-namespace:8081

The Kubernetes operator uses the kubernetes service, which has 3 endpoints 
(according to the number of running jobmanagers) and when we access it, we get 
to different jobmanagers. I myself, when going to this address, do not see 
information of checkpoints in Flink UI, which indicates that the request did 
not go to the leader

________________________________
От: Gyula Fóra <gyula.f...@gmail.com>
Отправлено: 11 августа 2022 г. 16:25:47
Кому: Evgeniy Lyutikov
Копия: user@flink.apache.org
Тема: Re: Savepoint problen on KubernetesOperator HA cluster

In general the Flink JobManager HA /client mechanism ensures that the rest 
requests end up at the current leader.

In your case it's not clear what the actual cause of the issue was.

What I would do is to upgrade to the latest operator version (1.1.0) where the 
savepoint upgrade mechanism has been hardened.
If your cluster is already stopped with the savepoint but the operator did not 
get the response back you might have to perform the steps outlined in: 
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#manual-recovery<https://eur04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fnightlies.apache.org%2Fflink%2Fflink-kubernetes-operator-docs-main%2Fdocs%2Fcustom-resource%2Fjob-management%2F%23manual-recovery&data=05%7C01%7Ceblyutikov%40avito.ru%7Ce7959d0212fd42f2878008da7b7b8685%7Caf0e07b3b90b472392e63fab11dd5396%7C0%7C0%7C637958067744252998%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=THnbw2fz6MI%2B%2BIKS%2FBSNwDbQncRssUKQjcyKDExbKBg%3D&reserved=0>

Savepoint upgrades work significantly more robustly with Flink 1.15+ because 
there we have the ability to keep the cluster/rest api around even after the 
application was stopped. In Flink 1.14 and before after stopping the job the 
cluster disappears, making it difficult to handle all situations.

Side note: I think in most cases you should not need more job manager replicas 
than 1. You still have the same HA guarantees with 1 replica, if it goes down 
it will be restarted. The behaviour is generally the same.

Cheers,
Gyula

On Thu, Aug 11, 2022 at 11:15 AM Evgeniy Lyutikov 
<eblyuti...@avito.ru<mailto:eblyuti...@avito.ru>> wrote:

Hi,

I'm using flink 1.14.4 with flink kubernetes operator 1.0.1 with ha 
configuration on 3 jobmanager.

When trying to change the job configuration, it restarts with trigger savepoint 
and an error occurs each time:

2022-08-10 12:04:21,142 mo.a.f.k.o.c.FlinkDeploymentController [INFO 
][job-namespace/job-namespace] Starting reconciliation
2022-08-10 12:04:21,143 mo.a.f.k.o.o.JobStatusObserver  [INFO 
][job-namespace/job-namespace] Observing job status
2022-08-10 12:04:21,154 mo.a.f.k.o.o.JobStatusObserver  [INFO 
][job-namespace/job-namespace] Job status (RUNNING) unchanged
2022-08-10 12:04:21,155 mo.a.f.k.o.c.FlinkConfigManager [INFO 
][job-namespace/job-namespace] Generating new config
2022-08-10 12:04:21,157 mo.a.f.k.o.r.d.ApplicationReconciler [INFO 
][job-namespace/job-namespace] Upgrading/Restarting running job, suspending 
first...
2022-08-10 12:04:21,157 mo.a.f.k.o.r.d.ApplicationReconciler [INFO 
][job-namespace/job-namespace] Job is in running state, ready for upgrade with 
SAVEPOINT
2022-08-10 12:04:21,157 mo.a.f.k.o.s.FlinkService       [INFO 
][job-namespace/job-namespace] Suspending job with savepoint.
2022-08-10 12:04:21,171 mo.a.f.k.o.r.ReconciliationUtils[WARN 
][job-namespace/job-namespace] Attempt count: 5, last attempt: true
2022-08-10 12:04:21,242 mi.j.o.p.e.ReconciliationDispatcherESC[m 
ESC[1;31m[ERROR][job-namespace/job-namespace] Error during event processing 
ExecutionScope{ resource id: CustomResourceID{name='job-namespace', 
namespace='job-namespace'}, version: null} failed.
org.apache.flink.kubernetes.operator.exception.ReconciliationException: 
java.util.concurrent.ExecutionException: 
org.apache.flink.runtime.rest.util.RestClientException: 
[org.apache.flink.runtime.rest.NotFoundException: Operation not found under 
key: 
org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@b41b16a8

After 5 retries

2022-08-10 12:04:21,157 o.a.f.k.o.r.d.ApplicationReconciler [INFO 
][job-namespace/job-namespace] Job is in running state, ready for upgrade with 
SAVEPOINT
2022-08-10 12:04:21,157 o.a.f.k.o.s.FlinkService       [INFO 
][job-namespace/job-namespace] Suspending job with savepoint.
2022-08-10 12:04:21,171 o.a.f.k.o.r.ReconciliationUtils [WARN 
][job-namespace/job-namespace] Attempt count: 5, last attempt: true
2022-08-10 12:04:21,242 i.j.o.p.e.ReconciliationDispatcher 
[ERROR][job-namespace/job-namespace] Error during event processing 
ExecutionScope{ resource id: CustomResourceID{name='job-namespace', 
namespace='job-namespace'}, version: null} failed.
org.apache.flink.kubernetes.operator.exception.ReconciliationException: 
java.util.concurrent.ExecutionException: 
org.apache.flink.runtime.rest.util.RestClientException: 
[org.apache.flink.runtime.rest.NotFoundException: Operation not found under 
key: 
org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@b41b16a8
2022-08-10 12:04:21,243 i.j.o.p.e.EventProcessor       
[ERROR][job-namespace/job-namespace] Exhausted retries for ExecutionScope{ 
resource id: CustomResourceID{name='job-namespace', namespace='job-namespace'}, 
version: null}
2022-08-10 12:04:53,299 o.a.f.k.o.c.FlinkDeploymentController [INFO 
][job-namespace/job-namespace] Starting reconciliation
2022-08-10 12:04:53,299 o.a.f.k.o.o.JobStatusObserver  [INFO 
][job-namespace/job-namespace] Observing job status
2022-08-10 12:05:03,322 o.a.f.s.n.i.n.c.AbstractChannel [WARN ] Force-closing a 
channel whose registration task was not accepted by an event loop: [id: 
0x4fb8bb3b]
java.util.concurrent.RejectedExecutionException: event executor terminated
2022-08-10 12:05:03,323 o.a.f.s.n.i.n.u.c.D.rejectedExecution [ERROR] Failed to 
submit a listener notification task. Event loop shut down?
java.util.concurrent.RejectedExecutionException: event executor terminated
2022-08-10 12:05:03,323 o.a.f.k.o.o.JobStatusObserver  
[ERROR][job-namespace/job-namespace] Exception while listing jobs
java.util.concurrent.TimeoutException
2022-08-10 12:05:03,324 o.a.f.k.o.o.d.ApplicationObserver [INFO 
][job-namespace/job-namespace] Observing JobManager deployment. Previous 
status: READY
2022-08-10 12:05:03,324 o.a.f.k.o.o.d.ApplicationObserver 
[ERROR][job-namespace/job-namespace] Missing JobManager deployment

As I suppose the problem is that the savepoint trigger request and getting its 
status are sent to different jobmanager

Does the operator have a service discovery to get leader jobmanager and work 
with them?

________________________________
“This message contains confidential information/commercial secret. If you are 
not the intended addressee of this message you may not copy, save, print or 
forward it to any third party and you are kindly requested to destroy this 
message and notify the sender thereof by email.
Данное сообщение содержит конфиденциальную информацию/информацию, являющуюся 
коммерческой тайной. Если Вы не являетесь надлежащим адресатом данного 
сообщения, Вы не вправе копировать, сохранять, печатать или пересылать его 
каким либо иным лицам. Просьба уничтожить данное сообщение и уведомить об этом 
отправителя электронным письмом.”

Re: Savepoint problen on KubernetesOperator HA cluster

Reply via email to