Hi,

I'm using Flink 1.14.4 with Flink Kubernetes Operator 1.0.1 and an HA 
configuration with 3 JobManagers.

When I change the job configuration, the operator restarts the job by 
triggering a savepoint, and the following error occurs each time:


2022-08-10 12:04:21,142 o.a.f.k.o.c.FlinkDeploymentController [INFO 
][job-namespace/job-namespace] Starting reconciliation
2022-08-10 12:04:21,143 o.a.f.k.o.o.JobStatusObserver  [INFO 
][job-namespace/job-namespace] Observing job status
2022-08-10 12:04:21,154 o.a.f.k.o.o.JobStatusObserver  [INFO 
][job-namespace/job-namespace] Job status (RUNNING) unchanged
2022-08-10 12:04:21,155 o.a.f.k.o.c.FlinkConfigManager [INFO 
][job-namespace/job-namespace] Generating new config
2022-08-10 12:04:21,157 o.a.f.k.o.r.d.ApplicationReconciler [INFO 
][job-namespace/job-namespace] Upgrading/Restarting running job, suspending 
first...
2022-08-10 12:04:21,157 o.a.f.k.o.r.d.ApplicationReconciler [INFO 
][job-namespace/job-namespace] Job is in running state, ready for upgrade with 
SAVEPOINT
2022-08-10 12:04:21,157 o.a.f.k.o.s.FlinkService       [INFO 
][job-namespace/job-namespace] Suspending job with savepoint.
2022-08-10 12:04:21,171 o.a.f.k.o.r.ReconciliationUtils [WARN 
][job-namespace/job-namespace] Attempt count: 5, last attempt: true
2022-08-10 12:04:21,242 i.j.o.p.e.ReconciliationDispatcher 
[ERROR][job-namespace/job-namespace] Error during event processing 
ExecutionScope{ resource id: CustomResourceID{name='job-namespace', 
namespace='job-namespace'}, version: null} failed.
org.apache.flink.kubernetes.operator.exception.ReconciliationException: 
java.util.concurrent.ExecutionException: 
org.apache.flink.runtime.rest.util.RestClientException: 
[org.apache.flink.runtime.rest.NotFoundException: Operation not found under 
key: 
org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@b41b16a8

After 5 retries, the log shows:

2022-08-10 12:04:21,157 o.a.f.k.o.r.d.ApplicationReconciler [INFO 
][job-namespace/job-namespace] Job is in running state, ready for upgrade with 
SAVEPOINT
2022-08-10 12:04:21,157 o.a.f.k.o.s.FlinkService       [INFO 
][job-namespace/job-namespace] Suspending job with savepoint.
2022-08-10 12:04:21,171 o.a.f.k.o.r.ReconciliationUtils [WARN 
][job-namespace/job-namespace] Attempt count: 5, last attempt: true
2022-08-10 12:04:21,242 i.j.o.p.e.ReconciliationDispatcher 
[ERROR][job-namespace/job-namespace] Error during event processing 
ExecutionScope{ resource id: CustomResourceID{name='job-namespace', 
namespace='job-namespace'}, version: null} failed.
org.apache.flink.kubernetes.operator.exception.ReconciliationException: 
java.util.concurrent.ExecutionException: 
org.apache.flink.runtime.rest.util.RestClientException: 
[org.apache.flink.runtime.rest.NotFoundException: Operation not found under 
key: 
org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@b41b16a8
2022-08-10 12:04:21,243 i.j.o.p.e.EventProcessor       
[ERROR][job-namespace/job-namespace] Exhausted retries for ExecutionScope{ 
resource id: CustomResourceID{name='job-namespace', namespace='job-namespace'}, 
version: null}
2022-08-10 12:04:53,299 o.a.f.k.o.c.FlinkDeploymentController [INFO 
][job-namespace/job-namespace] Starting reconciliation
2022-08-10 12:04:53,299 o.a.f.k.o.o.JobStatusObserver  [INFO 
][job-namespace/job-namespace] Observing job status
2022-08-10 12:05:03,322 o.a.f.s.n.i.n.c.AbstractChannel [WARN ] Force-closing a 
channel whose registration task was not accepted by an event loop: [id: 
0x4fb8bb3b]
java.util.concurrent.RejectedExecutionException: event executor terminated
2022-08-10 12:05:03,323 o.a.f.s.n.i.n.u.c.D.rejectedExecution [ERROR] Failed to 
submit a listener notification task. Event loop shut down?
java.util.concurrent.RejectedExecutionException: event executor terminated
2022-08-10 12:05:03,323 o.a.f.k.o.o.JobStatusObserver  
[ERROR][job-namespace/job-namespace] Exception while listing jobs
java.util.concurrent.TimeoutException
2022-08-10 12:05:03,324 o.a.f.k.o.o.d.ApplicationObserver [INFO 
][job-namespace/job-namespace] Observing JobManager deployment. Previous 
status: READY
2022-08-10 12:05:03,324 o.a.f.k.o.o.d.ApplicationObserver 
[ERROR][job-namespace/job-namespace] Missing JobManager deployment

I suppose the problem is that the savepoint trigger request and the request 
for its status are sent to different JobManagers.
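
To illustrate what I mean, here is a toy model in Python. It is not Flink 
or operator code: ToyJobManager, RoundRobinService and trigger_then_poll 
are made-up names, and the assumptions (each JobManager keeps its 
asynchronous operations only in its own memory, and requests are spread 
over the replicas without regard to leadership) are only my guess at what 
is happening.

```python
import itertools


class OperationNotFound(Exception):
    """Stands in for the NotFoundException seen in the operator log."""


class ToyJobManager:
    """Hypothetical JobManager replica; async operations live in local memory only."""

    def __init__(self, name):
        self.name = name
        self._operations = {}  # trigger id -> status, not shared with other replicas

    def trigger_savepoint(self, trigger_id):
        # Record the operation on this replica only.
        self._operations[trigger_id] = "IN_PROGRESS"
        return trigger_id

    def savepoint_status(self, trigger_id):
        # A replica that never saw the trigger cannot answer the status query.
        if trigger_id not in self._operations:
            raise OperationNotFound(
                f"{self.name}: operation not found under key {trigger_id}"
            )
        return self._operations[trigger_id]


class RoundRobinService:
    """Hypothetical service that rotates over replicas, ignoring leadership."""

    def __init__(self, replicas):
        self._cycle = itertools.cycle(replicas)

    def next_replica(self):
        return next(self._cycle)


def trigger_then_poll(service, trigger_id="b41b16a8"):
    """Send the trigger and the status query as two independently routed requests."""
    service.next_replica().trigger_savepoint(trigger_id)
    try:
        return service.next_replica().savepoint_status(trigger_id)
    except OperationNotFound as err:
        return f"NOT_FOUND ({err})"


if __name__ == "__main__":
    three = RoundRobinService([ToyJobManager(f"jm-{i}") for i in range(3)])
    print(trigger_then_poll(three))  # status query lands on a different replica

    one = RoundRobinService([ToyJobManager("jm-0")])
    print(trigger_then_poll(one))  # both requests reach the same replica
```

With three replicas the status query reaches a replica that never saw the 
trigger, which looks like the "Operation not found under key" error above. 
With a single replica both requests hit the same process and the status is 
found.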

Does the operator have service discovery to find the leader JobManager and 
send its requests to it?

