I did see another email thread with instructions for getting the image from this link: https://github.com/apache/flink-kubernetes-operator/pkgs/container/flink-kubernetes-operator/127962962?tag=3f0dc2e
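For anyone else following along, a hypothetical sketch of pulling that commit-tagged image and pointing an existing Helm release at it. The ghcr.io path and the `3f0dc2e` tag come from the link above; the release name, repo alias, namespace, and the `image.repository`/`image.tag` value names are assumptions based on common operator-chart conventions, so verify them against your chart's values.yaml before running anything:

```shell
# Assumed image coordinates, taken from the GHCR link in this thread
IMAGE_REPO="ghcr.io/apache/flink-kubernetes-operator"
IMAGE_TAG="3f0dc2e"

# Optional: pre-pull locally to confirm the tag exists and is reachable
docker pull "${IMAGE_REPO}:${IMAGE_TAG}"

# Point the (assumed) Helm release at the commit-tagged image,
# keeping all other chart values as they are
helm upgrade flink-kubernetes-operator \
  flink-operator-repo/flink-kubernetes-operator \
  --namespace flink-kubernetes-operator \
  --reuse-values \
  --set image.repository="${IMAGE_REPO}" \
  --set image.tag="${IMAGE_TAG}"
```

This avoids needing to build from the release-1.6 branch yourself, at the cost of running an image that is not a formal release.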
On Wed, Oct 18, 2023 at 6:25 PM Tony Chen <tony.ch...@robinhood.com> wrote:

> We're using the Helm chart to deploy the operator right now, and the image
> that I'm using was downloaded from Docker Hub:
> https://hub.docker.com/r/apache/flink-kubernetes-operator/tags. I wouldn't
> be able to use the release-1.6 branch
> (https://github.com/apache/flink-kubernetes-operator/commits/release-1.6)
> to pick up the fix, unless I'm missing something.
>
> I was attempting to roll back the operator version to 1.4 today, and I ran
> into the following issues on some operator pods. I was wondering if you
> have seen these Lease issues before.
>
> 2023-10-18 21:01:15,251 i.f.k.c.e.l.LeaderElector [ERROR] Exception occurred while releasing lock 'LeaseLock: flink-kubernetes-operator - flink-operator-lease (flink-kubernetes-operator-74f9688dd-bcqr2)'
> io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LockException: Unable to update LeaseLock
>     at io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LeaseLock.update(LeaseLock.java:102)
>     at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.release(LeaderElector.java:139)
>     at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.stopLeading(LeaderElector.java:120)
>     at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$start$2(LeaderElector.java:104)
>     at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
>     at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
>     at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
>     at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
>     at io.fabric8.kubernetes.client.utils.Utils.lambda$null$12(Utils.java:523)
>     at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
>     at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
>     at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
>     at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown Source)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.base/java.lang.Thread.run(Unknown Source)
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.241.0.1/apis/coordination.k8s.io/v1/namespaces/flink-kubernetes-operator/leases/flink-operator-lease. Message: Operation cannot be fulfilled on leases.coordination.k8s.io "flink-operator-lease": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=coordination.k8s.io, kind=leases, name=flink-operator-lease, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on leases.coordination.k8s.io "flink-operator-lease": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
>
> On Wed, Oct 18, 2023 at 2:55 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> Hi!
>> Not sure if it's the same but could you try picking up the fix from the
>> release branch and confirming that it solves the problem?
>>
>> If it does we may consider a quick bug fix release.
>>
>> Cheers
>> Gyula
>>
>> On Wed, 18 Oct 2023 at 18:09, Tony Chen <tony.ch...@robinhood.com> wrote:
>>
>>> Hi Flink Community,
>>>
>>> Most of the Flink applications run on 1.14 at my company.
>>> After upgrading the Flink Operator to 1.6, we've seen many jobmanager
>>> pods show "JobManagerDeploymentStatus: MISSING".
>>>
>>> Here are some logs from the operator pod on one of our Flink applications:
>>>
>>> 2023-10-18 02:02:40,823 o.a.f.k.o.l.AuditUtils [INFO ][nemo/nemo-streaming-users-identi-updates] Event | Warning | SAVEPOINTERROR | Savepoint failed for savepointTriggerNonce: null
>>> ...
>>> 2023-10-18 02:02:40,883 o.a.f.k.o.l.AuditUtils [INFO ][nemo/nemo-streaming-users-identi-updates] Event | Warning | CLUSTERDEPLOYMENTEXCEPTION | Status have been modified externally in version 17447422864 Previous: <redacted>
>>> ...
>>> 2023-10-18 02:02:40,919 i.j.o.p.e.ReconciliationDispatcher [ERROR][nemo/nemo-streaming-users-identi-updates] Error during event processing ExecutionScope{ resource id: ResourceID{name='nemo-streaming-users-identi-updates', namespace='nemo'}, version: 17447420285} failed.
>>> ...
>>> org.apache.flink.kubernetes.operator.exception.ReconciliationException: org.apache.flink.kubernetes.operator.exception.StatusConflictException: Status have been modified externally in version 17447422864 Previous: <redacted>
>>> ...
>>> 2023-10-18 02:03:03,273 o.a.f.k.o.o.d.ApplicationObserver [ERROR][nemo/nemo-streaming-users-identi-updates] Missing JobManager deployment
>>> ...
>>> 2023-10-18 02:03:03,295 o.a.f.k.o.l.AuditUtils [INFO ][nemo/nemo-streaming-users-identi-updates] Event | Warning | MISSING | Missing JobManager deployment
>>> 2023-10-18 02:03:03,295 o.a.f.c.Configuration [WARN ][nemo/nemo-streaming-users-identi-updates] Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type'
>>>
>>> This seems related to this email thread:
>>> https://www.mail-archive.com/user@flink.apache.org/msg51439.html.
>>> However, I believe that we're not seeing the HA metadata getting deleted.
>>>
>>> What could cause the JobManagerDeploymentStatus to be MISSING?
>>>
>>> Thanks,
>>> Tony

--

<http://www.robinhood.com/>

Tony Chen

Software Engineer

Menlo Park, CA

Don't copy, share, or use this email without permission. If you received it by accident, please let us know and then delete it right away.
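For anyone hitting similar symptoms, a few read-only probes may help narrow things down. These are a sketch only: the namespaces, lease name, and resource names are taken from the logs in this thread, and the assumption that the JobManager Deployment shares the FlinkDeployment's name should be checked against your own cluster:

```shell
# Who currently holds the operator leader-election lease (the Lease object
# from the 409 Conflict above), and its current resourceVersion
kubectl -n flink-kubernetes-operator get lease flink-operator-lease -o yaml

# The operator's recorded view of the job; a value of MISSING means the
# operator could not find the JobManager Deployment it expects
kubectl -n nemo get flinkdeployment nemo-streaming-users-identi-updates \
  -o jsonpath='{.status.jobManagerDeploymentStatus}'

# Whether a JobManager Deployment for this job actually exists right now
kubectl -n nemo get deployment nemo-streaming-users-identi-updates
```

Comparing the last two can distinguish "the Deployment was really deleted" from "the operator's status record went stale after the external status modification reported above".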