Hi Gyula,

Thanks for the prompt response.
> The Flink operator currently does not delete the jobmanager pod when a
> deployment is suspended.

Are you sure this is true? I have re-tried this many times, but each time the pods get deleted, along with the deployment resources. Additionally, the flink-operator logs also show that the resources are being deleted (see the FlinkUtils lines below) after I change the state in the FlinkDeployment yaml from running --> suspended (note: my FlinkDeployment name is *my-sample-dagger-v7*):

2022-10-13 06:11:47,392 o.a.f.k.o.c.FlinkDeploymentController [INFO ][flink-operator/my-sample-dagger-v7] End of reconciliation
2022-10-13 06:11:49,879 o.a.f.k.o.c.FlinkDeploymentController [INFO ][flink-operator/parquet-savepoint-test] Starting reconciliation
2022-10-13 06:11:49,880 o.a.f.k.o.o.JobStatusObserver [INFO ][flink-operator/parquet-savepoint-test] Observing job status
2022-10-13 06:11:52,710 o.a.f.k.o.c.FlinkDeploymentController [INFO ][flink-operator/my-sample-dagger-v7] Starting reconciliation
2022-10-13 06:11:52,712 o.a.f.k.o.o.JobStatusObserver [INFO ][flink-operator/my-sample-dagger-v7] Observing job status
2022-10-13 06:11:52,721 o.a.f.k.o.o.JobStatusObserver [INFO ][flink-operator/my-sample-dagger-v7] Job status (RUNNING) unchanged
2022-10-13 06:11:52,723 o.a.f.k.o.c.FlinkConfigManager [INFO ][flink-operator/my-sample-dagger-v7] Generating new config
2022-10-13 06:11:52,725 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][flink-operator/my-sample-dagger-v7] Detected spec change, starting reconciliation.
2022-10-13 06:11:52,788 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][flink-operator/my-sample-dagger-v7] Stateless job, ready for upgrade
2022-10-13 06:11:52,798 o.a.f.k.o.s.FlinkService [INFO ][flink-operator/my-sample-dagger-v7] Job is running, cancelling job.
2022-10-13 06:11:52,815 o.a.f.k.o.s.FlinkService [INFO ][flink-operator/my-sample-dagger-v7] Job successfully cancelled.
2022-10-13 06:11:52,815 o.a.f.k.o.u.FlinkUtils [INFO ][flink-operator/my-sample-dagger-v7] Deleting JobManager deployment and HA metadata.
2022-10-13 06:11:56,863 o.a.f.k.o.u.FlinkUtils [INFO ][flink-operator/my-sample-dagger-v7] Cluster shutdown completed.
2022-10-13 06:11:56,903 o.a.f.k.o.u.FlinkUtils [INFO ][flink-operator/my-sample-dagger-v7] Cluster shutdown completed.
2022-10-13 06:11:56,904 o.a.f.k.o.c.FlinkDeploymentController [INFO ][flink-operator/my-sample-dagger-v7] End of reconciliation
2022-10-13 06:11:56,928 o.a.f.k.o.c.FlinkDeploymentController [INFO ][flink-operator/my-sample-dagger-v7] Starting reconciliation
2022-10-13 06:11:56,930 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][flink-operator/my-sample-dagger-v7] Resource fully reconciled, nothing to do...

Also, my original doubt was around the uptime metric itself. What is the correct metric to use for monitoring the status (running or suspended) of a job that is managed by the Flink Operator? The *jobmanager_job_uptime_value* metric seems to report the wrong status, as mentioned in the earlier mail.

Regards,
Meghajit

On Wed, Oct 12, 2022 at 7:32 PM Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hello!
> The Flink operator currently does not delete the jobmanager pod when a
> deployment is suspended.
> This way the rest api stays available but no other resources are consumed
> (taskmanagers are deleted).
>
> When you delete the FlinkDeployment resource completely, then the
> jobmanager deployment is also deleted.
>
> In theory we could improve the logic to eventually delete the jobmanager
> for suspended resources, but we currently use this as a way to guarantee
> more resiliency for the operator flow.
>
> Cheers,
> Gyula
>
> On Wed, Oct 12, 2022 at 3:56 PM Meghajit Mazumdar <
> meghajit.mazum...@gojek.com> wrote:
>
>> Hello,
>>
>> I recently deployed a Flink Operator in Kubernetes and wrote a simple
>> FlinkDeployment CRD to run it in application mode, following this
>> <https://github.com/apache/flink-kubernetes-operator/blob/main/examples/pod-template.yaml>.
>>
>> I noticed that, even after I edited the CRD and marked the spec.job.state
>> field as *suspended*, the metric *jobmanager_job_uptime_value* continued
>> to show the job status as *running*. I did verify that after re-applying
>> these changes, the JM and TM pods were deleted and the cluster was not
>> running anymore.
>>
>> Am I doing something incorrect, or is there some other metric to monitor
>> the job status when using the Flink Operator?
>>
>> --
>> *Regards,*
>> *Meghajit*

--
*Regards,*
*Meghajit*
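
[Editor's note: for readers following along, the suspend that triggered the logs above is a single-field change to the FlinkDeployment spec. A minimal sketch, with most spec fields omitted and the name and namespace taken from the thread's logs:]

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-sample-dagger-v7
  namespace: flink-operator
spec:
  job:
    # Changing this field from "running" to "suspended" is the spec
    # change that the operator's reconciler detects in the logs above.
    state: suspended
```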