Hi Evgeniy,

Did you roll back your operator version? If so, did you run into any issues?

I ran into the following exception in my flink-kubernetes-operator pod
while rolling back, and I was wondering if you encountered this.

2023-10-18 21:01:15,251 i.f.k.c.e.l.LeaderElector      [ERROR] Exception occurred while releasing lock 'LeaseLock: flink-kubernetes-operator - flink-operator-lease (flink-kubernetes-operator-74f9688dd-bcqr2)'
io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LockException: Unable to update LeaseLock
    at io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LeaseLock.update(LeaseLock.java:102)
    at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.release(LeaderElector.java:139)
    at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.stopLeading(LeaderElector.java:120)
    at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$start$2(LeaderElector.java:104)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
    at io.fabric8.kubernetes.client.utils.Utils.lambda$null$12(Utils.java:523)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.241.0.1/apis/coordination.k8s.io/v1/namespaces/flink-kubernetes-operator/leases/flink-operator-lease. Message: Operation cannot be fulfilled on leases.coordination.k8s.io "flink-operator-lease": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=coordination.k8s.io, kind=leases, name=flink-operator-lease, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on leases.coordination.k8s.io "flink-operator-lease": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
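
For context, the 409 at the bottom of the trace is the API server rejecting a write that was based on a stale resourceVersion: the Lease changed between the read and the PUT. The sketch below only illustrates that optimistic-concurrency pattern and a bounded re-read-and-retry release. It is not the fabric8 LeaderElector code; LeaseStore, Lease, ConflictException and release are hypothetical names, and the in-memory store merely stands in for the API server.

```java
public class LeaseConflictSketch {

    /** Immutable snapshot of a lease: current holder plus the version it was read at. */
    static final class Lease {
        final String holder;
        final long resourceVersion;

        Lease(String holder, long resourceVersion) {
            this.holder = holder;
            this.resourceVersion = resourceVersion;
        }
    }

    /** Signals a write based on a stale resourceVersion (the HTTP 409 case). */
    static final class ConflictException extends RuntimeException {
        ConflictException(String message) {
            super(message);
        }
    }

    /** Hypothetical in-memory stand-in for the API server's lease storage. */
    static final class LeaseStore {
        private Lease current;

        LeaseStore(String initialHolder) {
            this.current = new Lease(initialHolder, 1);
        }

        synchronized Lease get() {
            return current;
        }

        /** Accepts the write only if it was based on the latest stored version. */
        synchronized Lease update(Lease basedOn, String newHolder) {
            if (basedOn.resourceVersion != current.resourceVersion) {
                throw new ConflictException("the object has been modified");
            }
            current = new Lease(newHolder, current.resourceVersion + 1);
            return current;
        }
    }

    /** Re-reads the lease before each attempt and gives up after maxAttempts conflicts. */
    static boolean release(LeaseStore store, String self, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Lease latest = store.get();
            if (!self.equals(latest.holder)) {
                return false; // another instance holds the lease, so there is nothing to release
            }
            try {
                store.update(latest, ""); // an empty holder marks the lease as released
                return true;
            } catch (ConflictException e) {
                // Stale read: loop to fetch the latest version and try again.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        LeaseStore store = new LeaseStore("old-pod");
        Lease stale = store.get();            // the old pod reads the lease
        store.update(store.get(), "new-pod"); // the new pod takes over in the meantime
        try {
            store.update(stale, "");          // the old pod writes with its stale copy
        } catch (ConflictException e) {
            System.out.println("409 Conflict: " + e.getMessage());
        }
        System.out.println("released by old-pod: " + release(store, "old-pod", 3));
        System.out.println("holder: " + store.get().holder);
    }
}
```

In this toy scenario the stale write is rejected and the lease stays with the new holder. Whether the error in my log is equally harmless is an assumption I have not verified.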

On Tue, Sep 12, 2023 at 5:51 AM Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hi!
>
> I think this issue is the same as
> https://issues.apache.org/jira/browse/FLINK-33011
> I'm not sure what exactly the underlying cause is, as I could not reproduce
> it, but the fix should be simple.
>
> Also I believe it's not 1.6.0 related unless a JOSDK/Fabric8 upgrade
> caused it.
>
> Cheers,
> Gyula
>
>
> On Mon, Sep 11, 2023 at 7:47 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> You don’t need it but you can really mess up clusters by rolling back CRD
>> changes…
>>
>> On Mon, 11 Sep 2023 at 19:42, Evgeniy Lyutikov <eblyuti...@avito.ru>
>> wrote:
>>
>>> Why do we need to use the latest CRD version with an older operator version?
>>> ------------------------------
>>> *From:* Gyula Fóra <gyula.f...@gmail.com>
>>> *Sent:* September 12, 2023, 0:36:26
>>>
>>> *To:* Evgeniy Lyutikov
>>> *Cc:* user@flink.apache.org
>>> *Subject:* Re: Flink kubernets operator delete HA metadata after resuming
>>> from suspend
>>>
>>> Do not change the CRD, but I believe you can roll back the operator
>>> itself.
>>>
>>> Gyula
>>>
>>> On Mon, 11 Sep 2023 at 18:52, Evgeniy Lyutikov <eblyuti...@avito.ru>
>>> wrote:
>>>
>>>> Is it safe to roll back the operator version and replace the CRDs with the old ones?
>>>> ------------------------------
>>>> *From:* Evgeniy Lyutikov <eblyuti...@avito.ru>
>>>> *Sent:* September 11, 2023, 23:50:26
>>>> *To:* Gyula Fóra
>>>>
>>>> *Cc:* user@flink.apache.org
>>>> *Subject:* Re: Flink kubernets operator delete HA metadata after resuming
>>>> from suspend
>>>>
>>>>
>>>> Hi!
>>>> No, no one could have restarted the JobManager.
>>>> I monitored the pods in real time, and they were all deleted on suspend,
>>>> as expected.
>>>>
>>>>
>>>> ------------------------------
>>>> *From:* Gyula Fóra <gyula.f...@gmail.com>
>>>> *Sent:* September 11, 2023, 20:34:52
>>>> *To:* Evgeniy Lyutikov
>>>> *Cc:* user@flink.apache.org
>>>> *Subject:* Re: Flink kubernets operator delete HA metadata after resuming
>>>> from suspend
>>>>
>>>> Hi!
>>>>
>>>> I could not reproduce your issue; last-state suspend/restore seems to
>>>> work as before.
>>>> However, these two log lines look very suspicious:
>>>>
>>>> 2023-09-11 06:02:07,481 o.a.f.k.o.o.d.ApplicationObserver [INFO
>>>> ][rec-job/rec-job] Observing JobManager deployment. Previous status: MISSING
>>>> 2023-09-11 06:02:07,488 o.a.f.k.o.o.d.ApplicationObserver [INFO
>>>> ][rec-job/rec-job] JobManager is being deployed
>>>>
>>>> Looks like after suspending (and deleting the JobManager Deployment)
>>>> somebody restarted the JobManager manually. Is that possible?
>>>>
>>>> Cheers,
>>>> Gyula
>>>>
>>>> On Mon, Sep 11, 2023 at 2:59 PM Evgeniy Lyutikov <eblyuti...@avito.ru>
>>>> wrote:
>>>>
>>>>> Hi all!
>>>>> After updating the operator to version 1.6.0, suspending and resuming
>>>>> Flink jobs stopped working.
>>>>> When a job resumes, the high-availability metadata is removed.
>>>>>
>>>>> Suspend job:
>>>>> 2023-09-11 06:01:41,548 o.a.f.k.o.l.AuditUtils         [INFO
>>>>> ][rec-job/rec-job] >>> Event  | Info    | SPECCHANGED     | UPGRADE
>>>>> change(s) detected (Diff: FlinkDeploymentSpec[job.state : running ->
>>>>> suspended]), starting reconciliation.
>>>>> 2023-09-11 06:01:41,548 o.a.f.k.o.r.d.AbstractJobReconciler [INFO
>>>>> ][rec-job/rec-job] Job is in running state, ready for upgrade with
>>>>> LAST_STATE
>>>>> 2023-09-11 06:01:41,558 o.a.f.k.o.l.AuditUtils         [INFO
>>>>> ][rec-job/rec-job] >>> Event  | Info    | SUSPENDED       | Suspending
>>>>> existing deployment.
>>>>> 2023-09-11 06:01:41,558 o.a.f.k.o.s.AbstractFlinkService [INFO
>>>>> ][rec-job/rec-job] Deleting cluster with Foreground propagation
>>>>> 2023-09-11 06:01:41,558 o.a.f.k.o.s.NativeFlinkService [INFO
>>>>> ][rec-job/rec-job] Deleting JobManager deployment while preserving HA
>>>>> metadata.
>>>>> 2023-09-11 06:01:41,598 o.a.f.k.o.s.AbstractFlinkService [INFO
>>>>> ][rec-job/rec-job] Waiting for cluster shutdown...
>>>>> 2023-09-11 06:01:45,667 o.a.f.k.o.s.AbstractFlinkService [INFO
>>>>> ][rec-job/rec-job] Waiting for cluster shutdown... (5s)
>>>>> 2023-09-11 06:01:50,730 o.a.f.k.o.s.AbstractFlinkService [INFO
>>>>> ][rec-job/rec-job] Waiting for cluster shutdown... (10s)
>>>>> 2023-09-11 06:01:55,837 o.a.f.k.o.s.AbstractFlinkService [INFO
>>>>> ][rec-job/rec-job] Waiting for cluster shutdown... (15s)
>>>>> 2023-09-11 06:02:00,885 o.a.f.k.o.s.AbstractFlinkService [INFO
>>>>> ][rec-job/rec-job] Waiting for cluster shutdown... (20s)
>>>>> 2023-09-11 06:02:01,895 o.a.f.k.o.s.AbstractFlinkService [INFO
>>>>> ][rec-job/rec-job] Cluster shutdown completed.
>>>>> 2023-09-11 06:02:01,973 o.a.f.k.o.l.AuditUtils         [INFO
>>>>> ][rec-job/rec-job] >>> Status | Info    | SUSPENDED       | The resource
>>>>> (job) has been suspended
>>>>> 2023-09-11 06:02:01,981 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler
>>>>> [INFO ][rec-job/rec-job] Resource fully reconciled, nothing to do...
>>>>>
>>>>> Resume:
>>>>> 2023-09-11 06:02:07,481 o.a.f.k.o.o.d.ApplicationObserver [INFO
>>>>> ][rec-job/rec-job] Observing JobManager deployment. Previous status: MISSING
>>>>> 2023-09-11 06:02:07,488 o.a.f.k.o.o.d.ApplicationObserver [INFO
>>>>> ][rec-job/rec-job] JobManager is being deployed
>>>>> 2023-09-11 06:02:07,563 o.a.f.k.o.l.AuditUtils         [INFO
>>>>> ][rec-job/rec-job] >>> Status | Info    | SUSPENDED       | The resource
>>>>> (job) has been suspended
>>>>> 2023-09-11 06:02:07,576 o.a.f.k.o.l.AuditUtils         [INFO
>>>>> ][rec-job/rec-job] >>> Event  | Info    | SPECCHANGED     | UPGRADE
>>>>> change(s) detected (Diff: FlinkDeploymentSpec[job.state : suspended ->
>>>>> running]), starting reconciliation.
>>>>> 2023-09-11 06:02:07,649 o.a.f.k.o.l.AuditUtils         [INFO
>>>>> ][rec-job/rec-job] >>> Status | Info    | UPGRADING       | The resource
>>>>> is being upgraded
>>>>> 2023-09-11 06:02:07,649 o.a.f.k.o.r.d.ApplicationReconciler [INFO
>>>>> ][rec-job/rec-job] Deleting deployment with terminated application before
>>>>> new deployment
>>>>> 2023-09-11 06:02:07,649 o.a.f.k.o.s.AbstractFlinkService [INFO
>>>>> ][rec-job/rec-job] Deleting cluster with Foreground propagation
>>>>> 2023-09-11 06:02:07,649 o.a.f.k.o.s.NativeFlinkService [INFO
>>>>> ][rec-job/rec-job] Deleting JobManager deployment and HA metadata.
>>>>> 2023-09-11 06:02:07,691 o.a.f.k.o.s.AbstractFlinkService [INFO
>>>>> ][rec-job/rec-job] Waiting for cluster shutdown...
>>>>> 2023-09-11 06:02:07,763 o.a.f.k.o.s.AbstractFlinkService [INFO
>>>>> ][rec-job/rec-job] Cluster shutdown completed.
>>>>> 2023-09-11 06:02:07,763 o.a.f.k.o.s.AbstractFlinkService [INFO
>>>>> ][rec-job/rec-job] Deleting Kubernetes HA metadata
>>>>> 2023-09-11 06:02:07,820 o.a.f.k.o.s.AbstractFlinkService [INFO
>>>>> ][rec-job/rec-job] Waiting for cluster shutdown...
>>>>> 2023-09-11 06:02:07,831 o.a.f.k.o.s.AbstractFlinkService [INFO
>>>>> ][rec-job/rec-job] Cluster shutdown completed.
>>>>> 2023-09-11 06:02:07,975 o.a.f.k.o.l.AuditUtils         [INFO
>>>>> ][rec-job/rec-job] >>> Status | Info    | UPGRADING       | The resource
>>>>> is being upgraded
>>>>> 2023-09-11 06:02:07,987 o.a.f.k.o.l.AuditUtils         [INFO
>>>>> ][rec-job/rec-job] >>> Event  | Info    | SUBMIT          | Starting
>>>>> deployment
>>>>> 2023-09-11 06:02:07,987 o.a.f.k.o.s.AbstractFlinkService [INFO
>>>>> ][rec-job/rec-job] Deploying application cluster requiring last-state from
>>>>> HA metadata
>>>>> 2023-09-11 06:02:07,999 o.a.f.k.o.c.FlinkDeploymentController
>>>>> [ERROR][rec-job/rec-job] Flink recovery failed
>>>>> 2023-09-11 06:02:08,012 o.a.f.k.o.l.AuditUtils         [INFO
>>>>> ][rec-job/rec-job] >>> Event  | Warning | RESTOREFAILED   | HA metadata
>>>>> not available to restore from last state. It is possible that the job has
>>>>> finished or terminally failed, or the configmaps have been deleted. Manual
>>>>> restore required.
>>>>> 2023-09-11 06:02:08,099 o.a.f.k.o.l.AuditUtils         [INFO
>>>>> ][rec-job/rec-job] >>> Status | Error   | UPGRADING       |
>>>>> {"type":"org.apache.flink.kubernetes.operator.exception.RecoveryFailureException","message":"HA
>>>>> metadata not available to restore from last state. It is possible that the
>>>>> job has finished or terminally failed, or the configmaps have been deleted.
>>>>> Manual restore required.","additionalMetadata":{},"throwableList":[]}
>>>>> 2023-09-11 06:02:08,193 o.a.f.k.o.l.AuditUtils         [INFO
>>>>> ][rec-job/rec-job] >>> Status | Info    | UPGRADING       | The resource
>>>>> is being upgraded
>>>>> 2023-09-11 06:02:08,218 o.a.f.k.o.l.AuditUtils         [INFO
>>>>> ][rec-job/rec-job] >>> Event  | Info    | SUBMIT          | Starting
>>>>> deployment
>>>>> 2023-09-11 06:02:08,218 o.a.f.k.o.s.AbstractFlinkService [INFO
>>>>> ][rec-job/rec-job] Deploying application cluster requiring last-state from
>>>>> HA metadata
>>>>> 2023-09-11 06:02:08,228 o.a.f.k.o.c.FlinkDeploymentController
>>>>> [ERROR][rec-job/rec-job] Flink recovery failed
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>

-- 

<http://www.robinhood.com/>

Tony Chen

Software Engineer

Menlo Park, CA
