I think I overlooked it. Good point. I am using Redis to save the path to my savepoint, so I might be able to set a TTL to avoid such an issue.
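For example, a minimal redis-py sketch of the TTL idea (the key name, savepoint path, and one-hour TTL here are all illustrative, not from the thread):

import redis

r = redis.Redis(host="localhost", port=6379)

# Store the savepoint path with a one-hour TTL; once the key expires,
# GET returns None and the deployment can fall back to the latest checkpoint.
r.set("my-job:savepoint-path",
      "gs://my-bucket/savepoints/savepoint-abc123",
      ex=3600)

savepoint = r.get("my-job:savepoint-path")  # None after the TTL expires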
Hao Sun

On Tue, Sep 24, 2019 at 9:54 AM Yuval Itzchakov <yuva...@gmail.com> wrote:

> Hi Hao,
>
> I think he's exactly talking about the use case where the JM/TM restart
> and they come back up from the latest savepoint, which might be stale by
> that time.
>
> On Tue, 24 Sep 2019, 19:24 Hao Sun, <ha...@zendesk.com> wrote:
>
>> We always make a savepoint before we shut down the job cluster, so the
>> savepoint is always the latest. When we fix a bug or change the job
>> graph, it can resume well.
>> We only use checkpoints for unplanned downtime, e.g. K8S killed JM/TM,
>> uncaught exception, etc.
>>
>> Maybe I do not understand your use case well; I do not see a need to
>> start from a checkpoint after a bug fix.
>> From what I know, you can currently use a checkpoint as a savepoint as
>> well.
>>
>> Hao Sun
>>
>> On Tue, Sep 24, 2019 at 7:48 AM Yuval Itzchakov <yuva...@gmail.com>
>> wrote:
>>
>>> AFAIK there's currently nothing implemented to solve this problem, but
>>> a possible fix could be implemented on top of
>>> https://github.com/lyft/flinkk8soperator, which already has a pretty
>>> fancy state machine for rolling upgrades. I'd love to be involved, as
>>> this is an issue I've been thinking about as well.
>>>
>>> Yuval
>>>
>>> On Tue, Sep 24, 2019 at 5:02 PM Sean Hester <sean.hes...@bettercloud.com>
>>> wrote:
>>>
>>>> hi all--we've run into a gap (knowledge? design? tbd?) for our use
>>>> cases when deploying Flink jobs to start from savepoints using the
>>>> job-cluster mode in Kubernetes.
>>>>
>>>> we're running ~15 different jobs, all in job-cluster mode, using a mix
>>>> of Flink 1.8.1 and 1.9.0, under GKE (Google Kubernetes Engine). these
>>>> are all long-running streaming jobs, all essentially acting as
>>>> microservices. we're using Helm charts to configure all of our
>>>> deployments.
>>>>
>>>> we have a number of use cases where we want to restart jobs from a
>>>> savepoint to replay recent events, i.e. when we've enhanced the job
>>>> logic or fixed a bug. but after the deployment we want the job to
>>>> resume its "long-running" behavior, where any unplanned restarts
>>>> resume from the latest checkpoint.
>>>>
>>>> the issue we run into is that any obvious/standard/idiomatic Kubernetes
>>>> deployment includes the savepoint argument in the configuration. if the
>>>> Job Manager container(s) have an unplanned restart, when they come back
>>>> up they will start from the savepoint instead of resuming from the
>>>> latest checkpoint. everything is working as configured, but that's not
>>>> exactly what we want. we want the savepoint argument to be transient
>>>> somehow (only used during the initial deployment), but Kubernetes
>>>> doesn't really support the concept of transient configuration.
>>>>
>>>> i can see a couple of potential solutions that either involve custom
>>>> code in the jobs or custom logic in the container, i.e. a custom
>>>> entrypoint script that records that the configured savepoint has
>>>> already been used in a file on a persistent volume or GCS (and
>>>> potentially when/why/by which deployment) -- a minimal sketch of this
>>>> appears after the thread. but these seem like unexpected and hacky
>>>> solutions, so before we head down that road i wanted to ask:
>>>>
>>>> - is this already a solved problem that i've missed?
>>>> - is this issue already on the community's radar?
>>>>
>>>> thanks in advance!
>>>>
>>>> --
>>>> *Sean Hester* | Senior Staff Software Engineer
>>>
>>> --
>>> Best Regards,
>>> Yuval Itzchakov.
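For reference, a minimal sketch of the run-once entrypoint idea Sean describes above. Everything here is an assumption for illustration: the SAVEPOINT_PATH environment variable, the marker-file path on the persistent volume, and the invocation of the job-cluster entrypoint via standalone-job.sh with a --fromSavepoint argument (as in Flink's flink-container examples).

#!/usr/bin/env python3
# Run-once savepoint entrypoint (sketch). On the first start after a
# deployment, restore from the configured savepoint and leave a marker
# file; on any later (unplanned) restart the marker exists, the flag is
# omitted, and Flink resumes from the latest checkpoint instead.
import os
import subprocess
import sys

SAVEPOINT = os.environ.get("SAVEPOINT_PATH")   # set by the Helm chart (illustrative)
MARKER = "/flink/pv/savepoint-consumed"        # file on a persistent volume (illustrative)

cmd = ["/opt/flink/bin/standalone-job.sh", "start-foreground"] + sys.argv[1:]

if SAVEPOINT and not os.path.exists(MARKER):
    cmd += ["--fromSavepoint", SAVEPOINT]
    with open(MARKER, "w") as f:
        # record which savepoint was consumed; when/why/by which
        # deployment could be recorded here too, as Sean suggests
        f.write(SAVEPOINT + "\n")

sys.exit(subprocess.call(cmd))

One caveat with this sketch: it writes the marker before the restore is known to have succeeded, so a failed first restore would also skip the savepoint on retry; recording success only after the job is running would be safer but needs more plumbing.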