Re: Job uptime metric in Flink Operator managed cluster

Gyula Fóra Wed, 12 Oct 2022 23:36:18 -0700

Sorry, what I said applies to Flink 1.15+ and the savepoint upgrade mode
(not stateless).

In any case, if there is no JobManager, there are no metrics, so I am not
sure how to answer your question.

Gyula
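
A possible alternative, sketched here rather than taken from this thread:
because the FlinkDeployment resource still exists after the JobManager is
gone, the job state can be read from the resource itself instead of from
JobManager metrics. The Python below assumes the resource has been exported
as JSON (for example with `kubectl get flinkdeployment <name> -o json`) and
that the operator version in use populates `status.jobStatus.state`. The
function name and the sample document are hypothetical.

```python
import json


def job_states(resource: dict) -> tuple:
    """Return (desired, observed) job state from a FlinkDeployment dict.

    desired  comes from spec.job.state (what was applied to the resource).
    observed comes from status.jobStatus.state (assumed field: the last
    job state recorded by the operator). Missing fields yield "UNKNOWN".
    """
    job_spec = (resource.get("spec") or {}).get("job") or {}
    job_status = (resource.get("status") or {}).get("jobStatus") or {}
    desired = str(job_spec.get("state") or "unknown").upper()
    observed = str(job_status.get("state") or "unknown").upper()
    return desired, observed


# Hypothetical, trimmed-down resource; not output captured from a real cluster.
sample = json.loads("""
{
  "metadata": {"name": "my-sample-dagger-v7"},
  "spec": {"job": {"state": "suspended", "upgradeMode": "stateless"}},
  "status": {"jobStatus": {"state": "FINISHED"}}
}
""")

desired, observed = job_states(sample)
print(f"desired={desired} observed={observed}")
```

The desired state reflects what was applied to the resource, so it can be
read even when no JobManager is left to emit metrics.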

On Thu, Oct 13, 2022 at 8:24 AM Meghajit Mazumdar <
meghajit.mazum...@gojek.com> wrote:

> Hi Gyula,
>
> Thanks for the prompt response.
>
> > The Flink operator currently does not delete the jobmanager pod when a
> > deployment is suspended.
>
> Are you sure this is true? I have retried this many times, but each time
> the pods get deleted, along with the deployment resources.
>
> Additionally, the flink-operator logs also show that the resources are
> being deleted (highlighted in red) after I change the state in the
> FlinkDeployment YAML from running to suspended
> (note: my FlinkDeployment name is *my-sample-dagger-v7*):
>
> 2022-10-13 06:11:47,392 o.a.f.k.o.c.FlinkDeploymentController [INFO
> ][flink-operator/my-sample-dagger-v7] End of reconciliation
> 2022-10-13 06:11:49,879 o.a.f.k.o.c.FlinkDeploymentController [INFO
> ][flink-operator/parquet-savepoint-test] Starting reconciliation
> 2022-10-13 06:11:49,880 o.a.f.k.o.o.JobStatusObserver  [INFO
> ][flink-operator/parquet-savepoint-test] Observing job status
> 2022-10-13 06:11:52,710 o.a.f.k.o.c.FlinkDeploymentController [INFO
> ][flink-operator/my-sample-dagger-v7] Starting reconciliation
> 2022-10-13 06:11:52,712 o.a.f.k.o.o.JobStatusObserver  [INFO
> ][flink-operator/my-sample-dagger-v7] Observing job status
> 2022-10-13 06:11:52,721 o.a.f.k.o.o.JobStatusObserver  [INFO
> ][flink-operator/my-sample-dagger-v7] Job status (RUNNING) unchanged
> 2022-10-13 06:11:52,723 o.a.f.k.o.c.FlinkConfigManager [INFO
> ][flink-operator/my-sample-dagger-v7] Generating new config
> 2022-10-13 06:11:52,725 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler
> [INFO ][flink-operator/my-sample-dagger-v7] Detected spec change, starting
> reconciliation.
>
>
> 2022-10-13 06:11:52,788 o.a.f.k.o.r.d.AbstractJobReconciler [INFO
> ][flink-operator/my-sample-dagger-v7] Stateless job, ready for upgrade
> 2022-10-13 06:11:52,798 o.a.f.k.o.s.FlinkService       [INFO
> ][flink-operator/my-sample-dagger-v7] Job is running, cancelling job.
> 2022-10-13 06:11:52,815 o.a.f.k.o.s.FlinkService       [INFO
> ][flink-operator/my-sample-dagger-v7] Job successfully cancelled.
> 2022-10-13 06:11:52,815 o.a.f.k.o.u.FlinkUtils         [INFO
> ][flink-operator/my-sample-dagger-v7] Deleting JobManager deployment and HA
> metadata.
> 2022-10-13 06:11:56,863 o.a.f.k.o.u.FlinkUtils         [INFO
> ][flink-operator/my-sample-dagger-v7] Cluster shutdown completed.
> 2022-10-13 06:11:56,903 o.a.f.k.o.u.FlinkUtils         [INFO
> ][flink-operator/my-sample-dagger-v7] Cluster shutdown completed.
> 2022-10-13 06:11:56,904 o.a.f.k.o.c.FlinkDeploymentController [INFO
> ][flink-operator/my-sample-dagger-v7] End of reconciliation
> 2022-10-13 06:11:56,928 o.a.f.k.o.c.FlinkDeploymentController [INFO
> ][flink-operator/my-sample-dagger-v7] Starting reconciliation
> 2022-10-13 06:11:56,930 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler
> [INFO ][flink-operator/my-sample-dagger-v7] Resource fully reconciled,
> nothing to do...
>
> Also, my original question was about the uptime metric itself. What is the
> correct metric to use for monitoring the status (running or suspended) of
> a job that is managed by the Flink Operator?
> The *jobmanager_job_uptime_value* metric seems to report the wrong status,
> as mentioned in the earlier mail.
>
> Regards,
> Meghajit
>
>
> On Wed, Oct 12, 2022 at 7:32 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> Hello!
>> The Flink operator currently does not delete the jobmanager pod when a
>> deployment is suspended.
>> This way the REST API stays available but no other resources are consumed
>> (taskmanagers are deleted).
>>
>> When you delete the FlinkDeployment resource completely, then the
>> jobmanager deployment is also deleted.
>>
>> In theory we could improve the logic to eventually delete the jobmanager
>> for suspended resources, but we currently use this as a way to guarantee
>> more resiliency for the operator flow.
>>
>> Cheers,
>> Gyula
>>
>> On Wed, Oct 12, 2022 at 3:56 PM Meghajit Mazumdar <
>> meghajit.mazum...@gojek.com> wrote:
>>
>>> Hello,
>>>
>>> I recently deployed a Flink Operator in Kubernetes and wrote a simple
>>> FlinkDeployment custom resource to run it in application mode, following
>>> this example:
>>> <https://github.com/apache/flink-kubernetes-operator/blob/main/examples/pod-template.yaml>
>>>
>>> I noticed that, even after I edited the custom resource and set the
>>> spec.job.state field to *suspended*, the metric
>>> *jobmanager_job_uptime_value* continued to show the job status as
>>> *running*. I did verify that after re-applying these changes, the JM and
>>> TM pods were deleted and the cluster was not running anymore.
>>>
>>> Am I doing something incorrect, or is there some other metric to monitor
>>> the job status when using the Flink Operator?
>>>
>>>
>>>
>>> --
>>> *Regards,*
>>> *Meghajit*
>>>
>>
>
> --
> *Regards,*
> *Meghajit*
>
