Hi Gyula,

Thanks for the prompt response.
> The Flink operator currently does not delete the jobmanager pod when a
> deployment is suspended.

Are you sure this is true? I have re-tried this many times, but each time the pods get deleted, along with the deployment resources. Additionally, the flink-operator logs also show that the resources are being deleted (see the FlinkUtils lines below) after I change the state in the FlinkDeployment yaml from running --> suspended (note: my FlinkDeployment name is *my-sample-dagger-v7*):

2022-10-13 06:11:47,392 o.a.f.k.o.c.FlinkDeploymentController [INFO ][flink-operator/my-sample-dagger-v7] End of reconciliation
2022-10-13 06:11:49,879 o.a.f.k.o.c.FlinkDeploymentController [INFO ][flink-operator/parquet-savepoint-test] Starting reconciliation
2022-10-13 06:11:49,880 o.a.f.k.o.o.JobStatusObserver [INFO ][flink-operator/parquet-savepoint-test] Observing job status
2022-10-13 06:11:52,710 o.a.f.k.o.c.FlinkDeploymentController [INFO ][flink-operator/my-sample-dagger-v7] Starting reconciliation
2022-10-13 06:11:52,712 o.a.f.k.o.o.JobStatusObserver [INFO ][flink-operator/my-sample-dagger-v7] Observing job status
2022-10-13 06:11:52,721 o.a.f.k.o.o.JobStatusObserver [INFO ][flink-operator/my-sample-dagger-v7] Job status (RUNNING) unchanged
2022-10-13 06:11:52,723 o.a.f.k.o.c.FlinkConfigManager [INFO ][flink-operator/my-sample-dagger-v7] Generating new config
2022-10-13 06:11:52,725 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][flink-operator/my-sample-dagger-v7] Detected spec change, starting reconciliation.
2022-10-13 06:11:52,788 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][flink-operator/my-sample-dagger-v7] Stateless job, ready for upgrade
2022-10-13 06:11:52,798 o.a.f.k.o.s.FlinkService [INFO ][flink-operator/my-sample-dagger-v7] Job is running, cancelling job.
2022-10-13 06:11:52,815 o.a.f.k.o.s.FlinkService [INFO ][flink-operator/my-sample-dagger-v7] Job successfully cancelled.
2022-10-13 06:11:52,815 o.a.f.k.o.u.FlinkUtils [INFO ][flink-operator/my-sample-dagger-v7] Deleting JobManager deployment and HA metadata.
2022-10-13 06:11:56,863 o.a.f.k.o.u.FlinkUtils [INFO ][flink-operator/my-sample-dagger-v7] Cluster shutdown completed.
2022-10-13 06:11:56,903 o.a.f.k.o.u.FlinkUtils [INFO ][flink-operator/my-sample-dagger-v7] Cluster shutdown completed.
2022-10-13 06:11:56,904 o.a.f.k.o.c.FlinkDeploymentController [INFO ][flink-operator/my-sample-dagger-v7] End of reconciliation
2022-10-13 06:11:56,928 o.a.f.k.o.c.FlinkDeploymentController [INFO ][flink-operator/my-sample-dagger-v7] Starting reconciliation
2022-10-13 06:11:56,930 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][flink-operator/my-sample-dagger-v7] Resource fully reconciled, nothing to do...

Also, my original doubt was around the uptime metric itself. What is the correct metric to use for monitoring the status (running or suspended) of a job that is managed by the Flink Operator? The *jobmanager_job_uptime_value* metric seems to report the wrong status, as mentioned in the earlier mail.

Regards,
Meghajit

On Wed, Oct 12, 2022 at 7:32 PM Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hello!
> The Flink operator currently does not delete the jobmanager pod when a
> deployment is suspended.
> This way the rest api stays available but no other resources are consumed
> (taskmanagers are deleted).
>
> When you delete the FlinkDeployment resource completely, then the
> jobmanager deployment is also deleted.
>
> In theory we could improve the logic to eventually delete the jobmanager
> for suspended resources, but we currently use this as a way to guarantee
> more resiliency for the operator flow.
>
> Cheers,
> Gyula
>
> On Wed, Oct 12, 2022 at 3:56 PM Meghajit Mazumdar <
> meghajit.mazum...@gojek.com> wrote:
>
>> Hello,
>>
>> I recently deployed a Flink Operator in Kubernetes and wrote a simple
>> FlinkDeployment CRD to run it in application mode, following this
>> <https://github.com/apache/flink-kubernetes-operator/blob/main/examples/pod-template.yaml>.
>>
>> I noticed that, even after I edited the CRD and marked the spec.job.state
>> field as *suspended*, the metric *jobmanager_job_uptime_value* continued
>> to show the job status as *running*. I did verify that after re-applying
>> these changes, the JM and TM pods were deleted and the cluster was not
>> running anymore.
>>
>> Am I doing something incorrect, or is there some other metric to monitor
>> the job status when using the Flink Operator?
>>
>> --
>> *Regards,*
>> *Meghajit*

--
*Regards,*
*Meghajit*
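
[Editor's note: for readers following along, the suspend that triggered the logs above is a single-field change to the FlinkDeployment spec. A minimal sketch, with most spec fields omitted and the name and namespace taken from the thread's logs:]

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-sample-dagger-v7
  namespace: flink-operator
spec:
  job:
    # Changing this field from "running" to "suspended" is the spec
    # change that the operator's reconciler detects in the logs above.
    state: suspended
```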