Thanks for answering my questions, Gyula! Your insights are very helpful. Let me take a deeper look at the existing logic and think it through.
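
To make the session-job case concrete, here is a rough sketch of the JM REST lookup described in the quoted reply below. It is not the operator's actual implementation. It assumes the Flink REST endpoint `/jobs/<jobId>/checkpoints` returns a `latest` object whose `completed` and `savepoint` entries carry `id`, `external_path` and `discarded` fields, and that checkpoints and savepoints share one increasing id counter. The function names and the `rest_url` parameter are made up for illustration.

```python
import json
from typing import Optional
from urllib.request import urlopen


def latest_snapshot_path(stats: dict) -> Optional[str]:
    """Pick the newest retained checkpoint or savepoint path from a
    checkpoint-statistics response (field names are assumptions)."""
    latest = stats.get("latest") or {}
    candidates = []
    for key in ("completed", "savepoint"):
        entry = latest.get(key)
        # Skip missing entries, discarded snapshots, and entries without a path.
        if entry and entry.get("external_path") and not entry.get("discarded", False):
            candidates.append(entry)
    if not candidates:
        return None
    # Assumption: checkpoints and savepoints share one increasing id counter,
    # so the highest id is the most recent snapshot.
    return max(candidates, key=lambda e: e["id"])["external_path"]


def fetch_latest_snapshot_path(rest_url: str, job_id: str) -> Optional[str]:
    """Query the JobManager; this only works while its REST endpoint is reachable."""
    with urlopen(f"{rest_url}/jobs/{job_id}/checkpoints", timeout=10) as resp:
        return latest_snapshot_path(json.load(resp))
```

If the JM is unreachable, `fetch_latest_snapshot_path` raises instead of returning a path. That is the reliability gap compared with the HA-metadata approach used for FlinkDeployments.
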

On Tue, Apr 30, 2024 at 12:00 PM Gyula Fóra <gyula.f...@gmail.com> wrote:

> The application mode indeed has a sticky jobId (at least when we are
> performing a last-state upgrade, otherwise a new jobId is generated during
> stateless deployments). But that's only part of the story and arguably the
> less important bit. The last-state upgrade mechanism for running/failing
> (but otherwise non-terminal) jobs relies on the Flink HA metadata to carry
> over the state information automagically. In Flink the HA mechanism always
> keeps track of the last state of a job so that even in the case of a JM
> loss the job can correctly recover.
>
> The operator last-state upgrade uses this exact mechanism: we delete the
> deployment (JMs, and TMs) but keep the HA metadata and then start the new
> cluster with the upgraded spec. The JM will recover thinking that it's only
> a failover and pick up the state automatically. We can do this because we
> have 1 cluster - 1 job and upgrading means upgrading the entire deployment.
>
> The same is not true for session jobs where we can't use the HA metadata
> trick and we actually need to figure out the last state (the checkpoint or
> savepoint path). This can only be done through the JM rest api. This should
> be possible in most cases when the JM is healthy after cancelling the
> session job. By the way for terminal jobs (FAILED/FINISHED/CANCELLED) we
> also do similarly in case of the FlinkDeployments, where the last
> checkpoint info is queried from the JM (
> https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/SnapshotObserver.java#L74-L78
> )
> For session jobs you will not need sticky job ids because it's simply not
> relevant.
>
> Gyula
>
> On Tue, Apr 30, 2024 at 7:51 PM Alan Zhang <shuai....@gmail.com> wrote:
>
>> Hi Gyula,
>>
>> Thanks for your reply! Good suggestion on JIRA ticket, I created a JIRA
>> ticket for tracking it: https://issues.apache.org/jira/browse/FLINK-35279.
>> We could be interested in working on it because of our own requirement, I
>> will check you and the community again once we have some updates.
>>
>> >We don't have the same robust way of getting the last-state information
>> for session jobs as we do for applications, so it will be slightly less
>> reliable overall.
>> My understanding is that application mode has sticky job id but session
>> mode doesn't have, with sticky job id it is easier to implement
>> "last-state" upgrade mode. When you were saying "robust way", does it mean
>> "sticky job id" in application mode?
>>
>>
>> On Mon, Apr 29, 2024 at 10:28 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>
>>> Hi Alan!
>>>
>>> I think it should be possible to address this gap for most cases. We
>>> don't have the same robust way of getting the last-state information for
>>> session jobs as we do for applications, so it will be slightly less
>>> reliable overall.
>>> For session jobs the last checkpoint info has to be queried from the JM
>>> rest api, so as long that is available it should work fine.
>>>
>>> I am not aware of anyone working on this at the moment, it would be
>>> great if you could open a JIRA ticket to track this. If you are interested
>>> in working on this, we can also support you but this is a fairly complex
>>> feature that involves many layers of operator logic.
>>>
>>> Cheers,
>>> Gyula
>>>
>>> On Tue, Apr 30, 2024 at 1:08 AM Alan Zhang <shuai....@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> We wanted to use the Apache Flink Kubernetes operator to manage the
>>>> lifecycle of our Flink jobs in Flink session clusters. And we wanted to
>>>> have the "last-state" upgrade feature for our use cases.
>>>>
>>>> However, the latest official doc states the "last-state" upgrade mode
>>>> is not supported in the session mode(aka. FlinkSessionJob) currently:
>>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#stateful-and-stateless-application-upgrades
>>>>
>>>> Last state upgrade mode is currently only supported for
>>>> FlinkDeployments.
>>>>
>>>> Why didn't we support this upgrade mode in session mode? Do we have a
>>>> plan to address this gap? Any suggestions for us if we want to stick with
>>>> session mode?
>>>>
>>>> --
>>>> Thanks,
>>>> Alan
>>>>
>>>
>>
>> --
>> Thanks,
>> Alan
>>
>

--
Thanks,
Alan