Re: [Flink Kubernetes Operator] The "last-state" upgrade mode is only supported in FlinkDeployments

Alan Zhang Wed, 01 May 2024 10:12:15 -0700

Thanks for answering my questions, Gyula! Your insights are very
helpful. Let me take a deeper look at the existing logic and think it over.


On Tue, Apr 30, 2024 at 12:00 PM Gyula Fóra <[email protected]> wrote:

> The application mode indeed has a sticky jobId (at least when we are
> performing a last-state upgrade, otherwise a new jobId is generated during
> stateless deployments). But that's only part of the story and arguably the
> less important bit. The last-state upgrade mechanism for running/failing
> (but otherwise non-terminal) jobs relies on the Flink HA metadata to carry
> over the state information automagically. In Flink the HA mechanism always
> keeps track of the last state of a job so that even in the case of a JM
> loss the job can correctly recover.
>
> The operator last-state upgrade uses this exact mechanism: we delete the
> deployment (JMs and TMs) but keep the HA metadata and then start the new
> cluster with the upgraded spec. The JM will recover thinking that it's only
> a failover and pick up the state automatically. We can do this because we
> have 1 cluster - 1 job and upgrading means upgrading the entire deployment.
>
> The same is not true for session jobs where we can't use the HA metadata
> trick and we actually need to figure out the last state (the checkpoint or
> savepoint path). This can only be done through the JM REST API. This should
> be possible in most cases when the JM is healthy after cancelling the
> session job. By the way, for terminal jobs (FAILED/FINISHED/CANCELLED) we
> do something similar for FlinkDeployments, where the last checkpoint
> info is queried from the JM (
> https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/SnapshotObserver.java#L74-L78
> )
> For session jobs you will not need sticky job ids because it's simply not
> relevant.
>
> Gyula
>
> On Tue, Apr 30, 2024 at 7:51 PM Alan Zhang <[email protected]> wrote:
>
>> Hi Gyula,
>>
>> Thanks for your reply! Good suggestion on the JIRA ticket; I created one
>> to track it: https://issues.apache.org/jira/browse/FLINK-35279.
>> We may be interested in working on it because of our own requirements. I
>> will check back with you and the community once we have some updates.
>>
>> > We don't have the same robust way of getting the last-state information
>> > for session jobs as we do for applications, so it will be slightly less
>> > reliable overall.
>> My understanding is that application mode has a sticky job id but session
>> mode doesn't, and that a sticky job id makes the "last-state" upgrade mode
>> easier to implement. When you said "robust way", did you mean the
>> "sticky job id" in application mode?
>>
>>
>> On Mon, Apr 29, 2024 at 10:28 PM Gyula Fóra <[email protected]> wrote:
>>
>>> Hi Alan!
>>>
>>> I think it should be possible to address this gap for most cases. We
>>> don't have the same robust way of getting the last-state information for
>>> session jobs as we do for applications, so it will be slightly less
>>> reliable overall.
>>> For session jobs the last checkpoint info has to be queried from the JM
>>> REST API, so as long as that is available it should work fine.
>>>
>>> I am not aware of anyone working on this at the moment, so it would be
>>> great if you could open a JIRA ticket to track this. If you are interested
>>> in working on this, we can also support you, but this is a fairly complex
>>> feature that involves many layers of operator logic.
>>>
>>> Cheers,
>>> Gyula
>>>
>>> On Tue, Apr 30, 2024 at 1:08 AM Alan Zhang <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> We wanted to use the Apache Flink Kubernetes operator to manage the
>>>> lifecycle of our Flink jobs in Flink session clusters. And we wanted to
>>>> have the "last-state" upgrade feature for our use cases.
>>>>
>>>> However, the latest official doc states that the "last-state" upgrade mode
>>>> is currently not supported in session mode (a.k.a. FlinkSessionJob):
>>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#stateful-and-stateless-application-upgrades
>>>>
>>>> Last state upgrade mode is currently only supported for
>>>> FlinkDeployments.
>>>>
>>>> Why isn't this upgrade mode supported in session mode? Do we have a
>>>> plan to address this gap? Any suggestions for us if we want to stick with
>>>> session mode?
>>>>
>>>> --
>>>> Thanks,
>>>> Alan
>>>>
>>>
>>
>> --
>> Thanks,
>> Alan
>>
>

-- 
Thanks,
Alan
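
For reference, the REST lookup Gyula describes for session jobs (finding
the last checkpoint or savepoint path through the JM) can be sketched
roughly as follows. This is a minimal illustration, not the operator's
implementation. It assumes the JobManager's /jobs/<jobid>/checkpoints
endpoint and the "latest.completed" / "latest.savepoint" entries with
their "external_path" and "trigger_timestamp" fields. The function names
and the rest_url parameter are hypothetical, and picking the newer of
the two snapshots by trigger timestamp is an illustrative choice only.

```python
import json
from typing import Optional
from urllib.request import urlopen


def fetch_checkpoint_stats(rest_url: str, job_id: str) -> dict:
    """GET /jobs/<job_id>/checkpoints from the JobManager REST API.

    rest_url is a placeholder such as "http://my-session-cluster-rest:8081".
    """
    with urlopen(f"{rest_url}/jobs/{job_id}/checkpoints", timeout=10) as resp:
        return json.load(resp)


def last_state_path(stats: dict) -> Optional[str]:
    """Return the external path of the newest completed checkpoint or savepoint.

    Returns None when the response carries no usable snapshot, which is the
    case where a last-state upgrade of a session job could not proceed.
    Comparing trigger timestamps is an assumption made for this sketch.
    """
    latest = stats.get("latest") or {}
    candidates = [
        snap
        for snap in (latest.get("completed"), latest.get("savepoint"))
        if snap and snap.get("external_path")
    ]
    if not candidates:
        return None
    newest = max(candidates, key=lambda snap: snap.get("trigger_timestamp", 0))
    return newest["external_path"]
```

The lookup only works while the JM REST API is reachable, which is why
this path is described above as slightly less reliable than the HA
metadata mechanism used for FlinkDeployments.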
