Hi Flink Community,

I am currently running the Apache flink-kubernetes-operator on our Kubernetes clusters, and our Flink applications are deployed using FlinkDeployment Custom Resources (CRs). I am trying to automate rollbacks and am running into some issues.
I was testing a bad deployment where the JobManager never becomes healthy. I simulated this by creating a Flink image with a bug in it. The operator logs show that the JobManager is unhealthy:

2023-10-02 22:14:34,874 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][flink-testing-service/flink-testing-service] UPGRADE change(s) detected (FlinkDeploymentSpec[job.entryClass=com.robinhood.flink.chaos.StreamingSumByKeyJo] differs from FlinkDeploymentSpec[job.entryClass=com.robinhood.flink.chaos.StreamingSumByKeyJob]), starting reconciliation.
...
2023-10-02 22:15:09,001 o.a.f.k.o.l.AuditUtils [INFO ][flink-testing-service/flink-testing-service] >>> Status | Info | UPGRADING | The resource is being upgraded
...
2023-10-02 22:17:23,911 o.a.f.k.o.l.AuditUtils [INFO ][flink-testing-service/flink-testing-service] >>> Status | Error | DEPLOYED | {"type":"org.apache.flink.kubernetes.operator.exception.DeploymentFailedException","message":"back-off 20s restarting failed container=flink-main-container pod=flink-testing-service-749dd97c75-4w9ps_flink-testing-service(6db1adb3-4ca4-4924-a8c3-57a417818d85)","additionalMetadata":{"reason":"CrashLoopBackOff"},"throwableList":[]}
...
2023-10-02 22:17:33,576 o.a.f.k.o.o.d.ApplicationObserver [INFO ][flink-testing-service/flink-testing-service] Observing JobManager deployment. Previous status: ERROR

Next, I change the spec of the FlinkDeployment so that it uses a healthy Flink image.
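To make the spec change concrete, here is a minimal sketch of the kind of patch involved. It is illustrative only: it assumes the change is applied as a JSON merge patch against the FlinkDeployment, the helper name `build_spec_patch` is made up for this example, and the image reference is a placeholder rather than the real one. The entry class is the healthy value that appears in the operator logs.

```python
import json
from typing import Optional


def build_spec_patch(entry_class: str, image: Optional[str] = None) -> dict:
    """Build a JSON merge patch that points a FlinkDeployment at a known-good job.

    Only the fields being changed are included; a merge patch leaves every
    other field of the resource untouched.
    """
    spec: dict = {"job": {"entryClass": entry_class}}
    if image is not None:
        # spec.image is only patched when a replacement image is supplied.
        spec["image"] = image
    return {"spec": spec}


patch = build_spec_patch(
    entry_class="com.robinhood.flink.chaos.StreamingSumByKeyJob",
    # Hypothetical placeholder, not the actual image used in this deployment.
    image="example.invalid/flink-testing-service:known-good",
)
print(json.dumps(patch, indent=2))
```

The printed JSON could then be applied with something like `kubectl patch flinkdeployment flink-testing-service -n flink-testing-service --type merge -p "$PATCH"`, assuming the namespace and resource name match the `flink-testing-service/flink-testing-service` pair shown in the logs.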
The operator shows that the spec has changed:

2023-10-02 22:45:37,445 o.a.f.k.o.l.AuditUtils [INFO ][flink-testing-service/flink-testing-service] >>> Event | Info | SPECCHANGED | UPGRADE change(s) detected (FlinkDeploymentSpec[job.entryClass=com.robinhood.flink.chaos.StreamingSumByKeyJob,job.initialSavepointPath=s3a://robinhood-dev-core-flink-states/flink-testing-service/savepoints/savepoint-329a14-2f8264206b1d] differs from FlinkDeploymentSpec[job.entryClass=com.robinhood.flink.chaos.StreamingSumByKeyJo,job.initialSavepointPath=s3a://robinhood-dev-core-flink-states/flink-testing-service/savepoints/savepoint-dc1077-134923759e30]), starting reconciliation.

However, the Flink operator cannot reconcile this spec change, and the JobManager is now permanently failing because it is still running the bad Flink image:

2023-10-02 22:45:37,461 o.a.f.k.o.l.AuditUtils [INFO ][flink-testing-service/flink-testing-service] >>> Event | Warning | UPGRADEFAILED | JobManager deployment is missing and HA data is not available to make stateful upgrades. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.

I can simply delete this FlinkDeployment and redeploy with the healthy Flink image, but I would like to avoid manual restores if possible. Is it possible to recover by just changing the FlinkDeployment spec?

Thanks,
Tony

--
<http://www.robinhood.com/>
Tony Chen
Software Engineer
Menlo Park, CA