Re: [Flink Kubernetes Operator] The "last-state" upgrade mode is only supported in FlinkDeployments

Gyula Fóra Tue, 30 Apr 2024 12:01:07 -0700

The application mode indeed has a sticky jobId (at least when we are
performing a last-state upgrade, otherwise a new jobId is generated during
stateless deployments). But that's only part of the story and arguably the
less important bit. The last-state upgrade mechanism for running/failing
(but otherwise non-terminal) jobs relies on the Flink HA metadata to carry
over the state information automagically. In Flink the HA mechanism always
keeps track of the last state of a job so that even in the case of a JM
loss the job can correctly recover.

The operator last-state upgrade uses this exact mechanism: we delete the
deployment (JMs and TMs) but keep the HA metadata, and then start the new
cluster with the upgraded spec. The JM will recover thinking that it's only
a failover and pick up the state automatically. We can do this because each
cluster runs exactly one job, so upgrading the job means upgrading the
entire deployment.
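
As a toy illustration of the sequence above, here is a minimal Python model.
Every name in it (HaStore, Deployment, start, last_state_upgrade) is made up
for this example and is not an operator or Flink class; it only shows why
keeping the HA metadata and the job id is enough for the new JM to pick up
the old state.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class HaStore:
    """Hypothetical stand-in for the Flink HA metadata."""
    # job id -> pointer to the latest completed checkpoint
    latest_checkpoint: Dict[str, str] = field(default_factory=dict)


@dataclass
class Deployment:
    """Hypothetical stand-in for one application cluster running one job."""
    spec_version: str
    job_id: str
    restored_from: Optional[str] = None


def start(spec_version: str, job_id: str, ha: HaStore) -> Deployment:
    # On startup the JM looks up the HA metadata for its job id, just as it
    # would after a failover. It cannot tell an upgrade from a crash.
    return Deployment(spec_version, job_id, ha.latest_checkpoint.get(job_id))


def last_state_upgrade(old: Deployment, new_spec: str, ha: HaStore) -> Deployment:
    # Step 1: the old deployment (JM and TMs) is deleted. Nothing in the
    # HA store is touched, so the checkpoint pointer survives.
    job_id = old.job_id  # sticky job id, so the HA lookup still matches
    # Step 2: a new cluster is started with the upgraded spec.
    return start(new_spec, job_id, ha)
```

For example, with ha.latest_checkpoint["job-1"] set to some checkpoint path,
upgrading a Deployment started for "job-1" yields a new Deployment whose
restored_from is that same path, without the caller ever reading it.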

The same is not true for session jobs, where we can't use the HA metadata
trick and actually need to figure out the last state (the checkpoint or
savepoint path) ourselves. This can only be done through the JM REST API,
which should be possible in most cases as long as the JM is healthy after
cancelling the session job. By the way, for terminal jobs
(FAILED/FINISHED/CANCELLED) we do something similar for FlinkDeployments
as well: the last checkpoint info is queried from the JM (
https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/SnapshotObserver.java#L74-L78
).

For session jobs you will not need sticky job ids because they are simply
not relevant.
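
To make the REST-based lookup concrete, here is a rough Python sketch. It
assumes the JM's /jobs/<jobid>/checkpoints endpoint and the field names
latest, completed, savepoint, external_path and id as I understand them from
the Flink REST API docs; please verify them against the Flink version you
run. The function names are hypothetical, and this is not the operator's
implementation (that one is in the Java class linked above).

```python
import json
from typing import Optional
from urllib.request import urlopen


def last_state_path(checkpoint_stats: dict) -> Optional[str]:
    """Return the path of the newest completed checkpoint or savepoint.

    Assumes the response shape of GET /jobs/<jobid>/checkpoints.
    """
    latest = checkpoint_stats.get("latest") or {}
    candidates = [latest.get("completed"), latest.get("savepoint")]
    # Keep only snapshots that exist and expose a usable path.
    candidates = [c for c in candidates if c and c.get("external_path")]
    if not candidates:
        return None
    # Assumption: checkpoints and savepoints share one increasing id.
    newest = max(candidates, key=lambda c: c.get("id", -1))
    return newest["external_path"]


def fetch_last_state_path(rest_url: str, job_id: str) -> Optional[str]:
    """Query a reachable JM; this only works while the REST API is up."""
    with urlopen(f"{rest_url}/jobs/{job_id}/checkpoints", timeout=10) as resp:
        return last_state_path(json.load(resp))
```

The parsing is kept separate from the HTTP call so it can be tried on a
saved response. If the JM is unreachable, the fetch simply raises, which is
exactly the case where this approach is less reliable than HA metadata.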

Gyula

On Tue, Apr 30, 2024 at 7:51 PM Alan Zhang <shuai....@gmail.com> wrote:

> Hi Gyula,
>
> Thanks for your reply! Good suggestion on the JIRA ticket; I created one
> to track it: https://issues.apache.org/jira/browse/FLINK-35279.
> We could be interested in working on it because of our own requirements. I
> will check back with you and the community once we have some updates.
>
> >We don't have the same robust way of getting the last-state information
> for session jobs as we do for applications, so it will be slightly less
> reliable overall.
> My understanding is that application mode has a sticky job id but session
> mode doesn't, and with a sticky job id it is easier to implement the
> "last-state" upgrade mode. When you said "robust way", did you mean the
> "sticky job id" in application mode?
>
>
> On Mon, Apr 29, 2024 at 10:28 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> Hi Alan!
>>
>> I think it should be possible to address this gap for most cases. We
>> don't have the same robust way of getting the last-state information for
>> session jobs as we do for applications, so it will be slightly less
>> reliable overall.
>> For session jobs the last checkpoint info has to be queried from the JM
>> REST API, so as long as that is available it should work fine.
>>
>> I am not aware of anyone working on this at the moment, it would be great
>> if you could open a JIRA ticket to track this. If you are interested in
>> working on this, we can also support you but this is a fairly complex
>> feature that involves many layers of operator logic.
>>
>> Cheers,
>> Gyula
>>
>> On Tue, Apr 30, 2024 at 1:08 AM Alan Zhang <shuai....@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We wanted to use the Apache Flink Kubernetes operator to manage the
>>> lifecycle of our Flink jobs in Flink session clusters. And we wanted to
>>> have the "last-state" upgrade feature for our use cases.
>>>
>>> However, the latest official doc states that the "last-state" upgrade mode
>>> is currently not supported in session mode (aka FlinkSessionJob):
>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#stateful-and-stateless-application-upgrades
>>>
>>> Last state upgrade mode is currently only supported for FlinkDeployments.
>>>
>>> Why didn't we support this upgrade mode in session mode? Do we have a
>>> plan to address this gap? Any suggestions for us if we want to stick with
>>> session mode?
>>>
>>> --
>>> Thanks,
>>> Alan
>>>
>>
>
> --
> Thanks,
> Alan
>
