Hey! Overall, if the SLA permits, we generally recommend savepoint upgrades for prod. You can set `kubernetes.operator.savepoint.format.type` to NATIVE to get close to checkpoint performance for your savepoints. If the job is not running, savepoint upgrades fall back to "last-state" by default, so that's not a big problem.
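For reference, that would look roughly like this in the FlinkDeployment CR (just a sketch; the name and jar path below are placeholders, not from your setup):

    apiVersion: flink.apache.org/v1beta1
    kind: FlinkDeployment
    metadata:
      name: my-job                                            # placeholder name
    spec:
      flinkConfiguration:
        kubernetes.operator.savepoint.format.type: NATIVE     # native format instead of the default CANONICAL
      job:
        jarURI: local:///opt/flink/usrlib/my-job.jar          # placeholder jar
        upgradeMode: savepoint

With upgradeMode set to savepoint the operator takes a savepoint on upgrade and, as mentioned above, falls back to last-state when the job is not running.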
Cancelling during the last-state upgrade instead of using the HA metadata will generate a new job.id, but it will also be a slightly slower process overall. To be honest, I don't know many people who use the history server like this for streaming jobs with the operator. I'm not really sure what you are trying to achieve with it; maybe some other audit feature that simply tracks the spec changes of the CR over time would be enough?

Cheers,
Gyula

On Thu, Aug 22, 2024 at 8:31 AM Alan Zhang <shuai....@gmail.com> wrote:

> Hi Gyula,
> Thanks for answering my questions!
>
> > Savepoint upgrades on the other hand would generate a new job id (at least after a recent fix on operator main).
>
> Yes, the savepoint can help. However, IMO savepoints are not ideal compared with checkpoints because of 1) performance concerns: a savepoint takes a full snapshot, which could take a long time, especially for jobs with large state, and 2) Flink jobs need to be running to allow the savepoint to be created. So simply leveraging savepoints instead of checkpoints for all job redeployments / upgrades may not be practical (e.g. job downtime could be longer than the SLA).
>
> My understanding is that the "last-state" upgrade is recommended for deployments of near-real-time use cases that are usually sensitive to latency (including job downtime). So once we want to redeploy a Flink job, we just do in-place updates on the existing FlinkDeployment with "last-state" enabled. In other words, we use a job failover trick to achieve job redeployment.
>
> What are your thoughts on using "last-state" vs "savepoint"? Would you mind sharing how you use / decide between "last-state" and "savepoint" in production?
>
> > I am actually working on adding a new way to perform the last-state upgrade via simple cancellation but that's a slightly orthogonal question.
>
> Will this new way help generate a new job.id during the last-state upgrade?
>
> Thanks,
> Alan
>
> On Tue, Aug 20, 2024 at 10:17 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> Hi Alan!
>>
>> The job.id remains the same, as the last-state mode uses Flink's internal failover mechanism to access the state. We cannot change the job.id while doing this, unfortunately.
>>
>> Savepoint upgrades on the other hand would generate a new job id (at least after a recent fix on operator main). I am actually working on adding a new way to perform the last-state upgrade via simple cancellation, but that's a slightly orthogonal question.
>>
>> Long story short, if you really need to integrate this with the history server, then you should switch to savepoint upgrades.
>>
>> Cheers,
>> Gyula
>>
>> On Wed, Aug 21, 2024 at 12:14 AM Alan Zhang <shuai....@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We are using the Apache Flink Kubernetes operator to manage the deployment lifecycle of our Flink jobs, and we are using application mode with the "last-state" upgrade mode for each FlinkDeployment.
>>>
>>> As far as I know, each FlinkDeployment will keep using the same job id across different job deployments / upgrades, because the operator uses the job failover mechanism to achieve the "last-state" upgrade mode. However, this seems to make it impossible to integrate with the Flink history server, which uses the job.id to differentiate job deployments.
>>>
>>> Questions:
>>>
>>> - Is there any way to make the job.id different for the "last-state" upgrade mode?
>>> - What could be the right way to enable the Flink history server in this case?