Hey!

Overall, if your SLA permits, we generally recommend savepoint upgrades for
prod. You can set `kubernetes.operator.savepoint.format.type` to NATIVE to
get close to checkpoint performance on your savepoints.
If the job is not running, savepoint upgrades fall back to "last-state" by
default, so that's not a big problem.

Cancelling the job during the last-state upgrade instead of using the HA
metadata will generate a new job.id, but it will also be a slightly slower
process overall.

To be honest, I don't know many people who use the history server like
this for streaming jobs with the operator. I'm not really sure what you are
trying to achieve with it; maybe some other audit mechanism would be enough
to simply track the spec changes of the CR over time?
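
To illustrate what I mean (just a sketch, and exact field names may differ a
bit between operator versions): the operator already records the last
reconciled spec in the CR status, so diffing or logging that field over time
may already give you the audit trail you need.

    # illustrative fragment of a FlinkDeployment status kept by the operator
    status:
      reconciliationStatus:
        reconciliationTimestamp: 1724300000000   # example epoch-millis value
        lastReconciledSpec: '{"job":{"upgradeMode":"last-state"}}'  # JSON string of the spec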

Cheers,
Gyula


On Thu, Aug 22, 2024 at 8:31 AM Alan Zhang <shuai....@gmail.com> wrote:

> Hi Gyula,
> Thanks for answering my questions!
>
> >Savepoint upgrades on the other hand would generate a new job id (at
> least after a recent fix on operator main).
> Yes, the savepoint can help. However, IMO savepoints are not ideal compared
> with checkpoints because of 1) performance concerns: a savepoint takes a full
> snapshot, which can take a long time, especially for jobs with large state, and
> 2) the Flink job needs to be running for the savepoint to be created. So simply
> using savepoints instead of checkpoints for all job redeployments / upgrades
> may not be practical (e.g. job downtime could be longer than the SLA allows).
>
> My understanding is that the "last-state" upgrade is recommended for
> deployments of near-real-time use cases that are usually sensitive to
> latency (including job downtime). So when we want to redeploy a Flink job,
> we just do in-place updates on the existing FlinkDeployment with
> "last-state" enabled. In other words, we use a job failover trick to achieve
> job redeployment.
>
> What are your thoughts on using "last-state" vs "savepoint"? Would you
> mind sharing how you decide between "last-state" and "savepoint" in production?
>
> >I am actually working on adding a new way to perform the last-state
> upgrade via simple cancellation but that's a slightly orthogonal question.
> Will this new way help generate a new job.id during last-state upgrade?
>
> Thanks,
> Alan
>
> On Tue, Aug 20, 2024 at 10:17 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> Hi Alan!
>>
>> The job.id remains the same because the last-state mode uses Flink's internal
>> failover mechanism to access the state. Unfortunately, we cannot change the
>> job.id while doing this.
>>
>> Savepoint upgrades on the other hand would generate a new job id (at
>> least after a recent fix on operator main). I am actually working on adding
>> a new way to perform the last-state upgrade via simple cancellation but
>> that's a slightly orthogonal question.
>>
>> Long story short, if you really need to integrate this with the history
>> server, then you should switch to savepoint upgrades.
>>
>> Cheers,
>> Gyula
>>
>> On Wed, Aug 21, 2024 at 12:14 AM Alan Zhang <shuai....@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We are using the Apache Flink Kubernetes operator to manage the deployment
>>> lifecycle of our Flink jobs, and we are using application mode with the
>>> "last-state" upgrade mode for each FlinkDeployment.
>>>
>>> As far as I know, each FlinkDeployment keeps using the same job id across
>>> different job deployments / upgrades, because the operator uses the
>>> job failover mechanism to achieve the "last-state" upgrade mode.
>>> However, this seems to make it impossible to integrate with the Flink
>>> history server, which uses job.id to differentiate job deployments.
>>>
>>> Questions:
>>>
>>>    - Is there any way to make the job.id different for "last-state"
>>>    upgrade mode?
>>>    - What could be the right way to enable Flink history server in this
>>>    case?
>>>
>>>
