Thanks for sharing your thoughts, Gyula!

> Not really sure what you are trying to achieve with it, maybe some other audit feature would be enough to simply track the spec changes over time of the CR?

Basically, we wanted to leverage the Flink history server to get details / insights (e.g. DAG/operators, exceptions, checkpoints) of past completed jobs, similar to what we can get from the Flink Web UI for running jobs.
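For context, this is a minimal sketch of the Flink configuration we have in mind for archiving completed jobs so the history server can serve them; the S3 paths are illustrative placeholders, not our actual setup:

```yaml
# flink-conf.yaml (sketch; bucket/paths are placeholders)

# The JobManager uploads the archive of each completed job here:
jobmanager.archive.fs.dir: s3://my-bucket/flink-completed-jobs/

# The history server scans the same directory for new archives:
historyserver.archive.fs.dir: s3://my-bucket/flink-completed-jobs/

# Poll interval for new archives, in milliseconds:
historyserver.archive.fs.refresh-interval: 10000
```

With last-state upgrades reusing the same job.id, consecutive deployments would overwrite each other's entry in this archive, which is the gap discussed below.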
> Cancelling during the last-state upgrade instead of using the HA metadata will generate a new job.id but it will also be a slightly slower process overall.

Thanks for sharing more details and plans. This new approach makes sense to me: in general, Flink keeps the same job.id across job failovers and uses a different job.id for job redeployments. With it, there should be no gap in integrating the Flink history server with the operator. Looking forward to this change.

Let me play with the latest operator code to check the current savepoint upgrade behavior.

> Savepoint upgrades on the other hand would generate a new job id (at least after a recent fix on operator main).

On Wed, Aug 21, 2024 at 11:38 PM Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hey!
>
> Overall, if your SLA permits, we generally recommend savepoint upgrades for prod; you can configure `kubernetes.operator.savepoint.format.type` to NATIVE to get close to checkpoint performance on your savepoints.
> If the job is not running, savepoint upgrades fall back to "last-state" by default, so that's not a big problem.
>
> Cancelling during the last-state upgrade instead of using the HA metadata will generate a new job.id but it will also be a slightly slower process overall.
>
> To be honest, I don't know many people who use the history server like this for streaming jobs with the operator. Not really sure what you are trying to achieve with it; maybe some other audit feature would be enough to simply track the spec changes over time of the CR?
>
> Cheers,
> Gyula
>
> On Thu, Aug 22, 2024 at 8:31 AM Alan Zhang <shuai....@gmail.com> wrote:
>
>> Hi Gyula,
>> Thanks for answering my questions!
>>
>> > Savepoint upgrades on the other hand would generate a new job id (at least after a recent fix on operator main).
>>
>> Yes, the savepoint can help.
>> However, IMO savepoints are not ideal compared with checkpoints because of:
>> 1) performance concerns: a savepoint takes a full snapshot, which can take a long time, especially for jobs with large state;
>> 2) the Flink job needs to be running for the savepoint to be created.
>> So simply using savepoints instead of checkpoints for all job redeployments / upgrades may not be practical (e.g. job downtime could exceed the SLA).
>>
>> My understanding is that the "last-state" upgrade is recommended for deployments of near-real-time use cases that are usually sensitive to latency (including job downtime). So when we want to redeploy a Flink job, we just do an in-place update on the existing FlinkDeployment with "last-state" enabled, i.e. we use the job failover mechanism to achieve job redeployment.
>>
>> What are your thoughts on using "last-state" vs "savepoint"? Would you mind sharing how you use / decide between "last-state" and "savepoint" in production?
>>
>> > I am actually working on adding a new way to perform the last-state upgrade via simple cancellation but that's a slightly orthogonal question.
>>
>> Will this new way help generate a new job.id during a last-state upgrade?
>>
>> Thanks,
>> Alan
>>
>> On Tue, Aug 20, 2024 at 10:17 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>
>>> Hi Alan!
>>>
>>> The job.id remains the same because the last-state mode uses Flink's internal failover mechanism to access the state. We cannot change the job.id while doing this, unfortunately.
>>>
>>> Savepoint upgrades on the other hand would generate a new job id (at least after a recent fix on operator main). I am actually working on adding a new way to perform the last-state upgrade via simple cancellation, but that's a slightly orthogonal question.
>>>
>>> Long story short, if you really need to integrate this with the history server, then you should switch to savepoint upgrades.
>>>
>>> Cheers,
>>> Gyula
>>>
>>> On Wed, Aug 21, 2024 at 12:14 AM Alan Zhang <shuai....@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> We are using the Apache Flink Kubernetes operator to manage the deployment lifecycle of our Flink jobs, running each FlinkDeployment in application mode with the "last-state" upgrade mode.
>>>>
>>>> As I understand it, each FlinkDeployment keeps using the same job id across different job deployments / upgrades, because the operator uses the job failover mechanism to implement the "last-state" upgrade mode. However, this seems to make it impossible to integrate with the Flink history server, which uses job.id to differentiate job deployments.
>>>>
>>>> Questions:
>>>>
>>>> - Is there any way to make the job.id different for the "last-state" upgrade mode?
>>>> - What could be the right way to enable the Flink history server in this case?
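To summarize the thread so far, a sketch of the FlinkDeployment spec this discussion points toward — savepoint upgrades plus NATIVE savepoint format, so each upgrade gets a new job.id that the history server can distinguish. The name, image, jar path, and resource figures are illustrative placeholders, not values from the thread:

```yaml
# Sketch of a FlinkDeployment using savepoint upgrades (placeholders throughout).
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-job
spec:
  image: flink:1.18
  flinkVersion: v1_18
  serviceAccount: flink
  flinkConfiguration:
    state.savepoints.dir: s3://my-bucket/savepoints/
    # As suggested above, NATIVE savepoints get close to checkpoint performance:
    kubernetes.operator.savepoint.format.type: NATIVE
  job:
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
    # "savepoint" (unlike "last-state") yields a new job.id per upgrade,
    # which the history server uses to differentiate deployments:
    upgradeMode: savepoint
  jobManager:
    resource: {memory: "2048m", cpu: 1}
  taskManager:
    resource: {memory: "2048m", cpu: 1}
```

If the job is not running, the operator falls back from savepoint to last-state upgrades by default, as noted earlier in the thread.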