Hi Vijay,

If you are using the HA solution, I think you do not need to specify the savepoint. Instead, the checkpoint is used. Checkpoints are taken automatically and periodically based on your configuration. When the JobManager/TaskManager fails or the whole cluster crashes, it can always recover from the latest checkpoint. Does this meet your requirement?
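For example, a minimal flink-conf.yaml sketch along these lines (the quorum hosts, bucket paths, and cluster-id below are placeholders, not a tested configuration):

    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-0:2181,zk-1:2181,zk-2:2181
    high-availability.storageDir: gs://my-bucket/flink/ha        # JobManager metadata
    high-availability.cluster-id: /my-job-cluster                # keep this stable across restarts
    state.backend: rocksdb
    state.checkpoints.dir: gs://my-bucket/flink/checkpoints      # where retained checkpoints go
    state.checkpoints.num-retained: 3

The checkpoint interval itself is typically enabled in the job code (e.g. env.enableCheckpointing(60000)).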
Best,
Yang

Sean Hester <sean.hes...@bettercloud.com> wrote on Tue, Oct 1, 2019 at 1:47 AM:

> Vijay,
>
> That is my understanding as well: the HA solution only solves the problem
> up to the point where all Job Managers fail/restart at the same time.
> That's where my original concern was.
>
> But to Aleksandar and Yun's point, running in HA with 2 or 3 Job Managers
> per cluster--as long as they are all deployed to separate GKE nodes--would
> provide a very high uptime/low failure rate, at least on paper. It's a
> promising enough option that we're going to run in HA for a month or two
> and monitor results before we put in any extra work to customize the
> savepoint start-up behavior.
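>
> To illustrate the "separate GKE nodes" part, a hypothetical Deployment
> fragment (names are placeholders) that spreads Job Manager replicas across
> nodes with pod anti-affinity might look roughly like:
>
>     spec:
>       replicas: 2
>       template:
>         metadata:
>           labels:
>             app: flink-jobmanager
>         spec:
>           affinity:
>             podAntiAffinity:
>               requiredDuringSchedulingIgnoredDuringExecution:
>               - labelSelector:
>                   matchLabels:
>                     app: flink-jobmanager
>                 topologyKey: kubernetes.io/hostname   # at most one Job Manager per node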
>
> On Fri, Sep 27, 2019 at 2:24 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>
>> I don't think HA will help to recover from a cluster crash; for that we
>> should take periodic savepoints, right? Please correct me in case I am
>> wrong.
>>
>> Regards
>> Bhaskar
>>
>> On Fri, Sep 27, 2019 at 11:48 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>
>>> Suppose my cluster crashed and I need to bring the entire cluster back
>>> up. Does HA still help to run the cluster from the latest savepoint?
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Thu, Sep 26, 2019 at 7:44 PM Sean Hester <sean.hes...@bettercloud.com> wrote:
>>>
>>>> Thanks to everyone for all the replies.
>>>>
>>>> I think the original concern here with "just" relying on the HA option
>>>> is that there are some disaster recovery and data center migration use
>>>> cases where the continuity of the Job Managers is difficult to preserve.
>>>> But those are admittedly edge cases. I think it's definitely worth
>>>> reviewing the SLAs with our site reliability engineers to see how likely
>>>> it would be to completely lose all Job Managers under an HA
>>>> configuration. That small a risk might be acceptable/preferable to a
>>>> one-off solution.
>>>>
>>>> @Aleksandar, I would love to learn more about ZooKeeper-less HA. I
>>>> think I spotted a thread somewhere between Till and someone (perhaps you)
>>>> about that. Feel free to DM me.
>>>>
>>>> Thanks again to everyone!
>>>>
>>>> On Thu, Sep 26, 2019 at 7:32 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>>>
>>>>> Hi Aleksandar,
>>>>>
>>>>> The savepoint option in a standalone job cluster is optional. If you
>>>>> want to always recover from the latest checkpoint, you can use the
>>>>> high-availability configuration, just as Aleksandar and Yun Tang said.
>>>>> Make sure the cluster-id is not changed; then I think the job should
>>>>> recover as expected, both after an unexpected crash and after a planned
>>>>> restart.
>>>>>
>>>>> @Aleksandar Mastilovic <amastilo...@sightmachine.com>, we also have a
>>>>> ZooKeeper-less high-availability implementation [1]. Maybe we could
>>>>> have some discussion and contribute this useful feature to the
>>>>> community.
>>>>>
>>>>> [1] https://docs.google.com/document/d/1Z-VdJlPPEQoWT1WLm5woM4y0bFep4FrgdJ9ipQuRv8g/edit
>>>>>
>>>>> Best,
>>>>> Yang
>>>>>
>>>>> Aleksandar Mastilovic <amastilo...@sightmachine.com> wrote on Thu, Sep 26, 2019 at 4:11 AM:
>>>>>
>>>>>> Would you guys (Flink devs) be interested in our solution for
>>>>>> ZooKeeper-less HA? I could ask the managers how they feel about
>>>>>> open-sourcing the improvement.
>>>>>>
>>>>>> On Sep 25, 2019, at 11:49 AM, Yun Tang <myas...@live.com> wrote:
>>>>>>
>>>>>> As Aleksandar said, k8s with an HA configuration could solve your
>>>>>> problem. There has already been some discussion about how to implement
>>>>>> such HA in k8s without a ZooKeeper service: FLINK-11105 [1] and
>>>>>> FLINK-12884 [2]. Currently, you might have to choose ZooKeeper as the
>>>>>> high-availability service.
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-11105
>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-12884
>>>>>>
>>>>>> Best
>>>>>> Yun Tang
>>>>>> ------------------------------
>>>>>> *From:* Aleksandar Mastilovic <amastilo...@sightmachine.com>
>>>>>> *Sent:* Thursday, September 26, 2019 1:57
>>>>>> *To:* Sean Hester <sean.hes...@bettercloud.com>
>>>>>> *Cc:* Hao Sun <ha...@zendesk.com>; Yuval Itzchakov <yuva...@gmail.com>; user <user@flink.apache.org>
>>>>>> *Subject:* Re: Challenges Deploying Flink With Savepoints On Kubernetes
>>>>>>
>>>>>> Can’t you simply use the JobManager in HA mode? It would pick up where
>>>>>> it left off if you don’t provide a savepoint.
>>>>>>
>>>>>> On Sep 25, 2019, at 6:07 AM, Sean Hester <sean.hes...@bettercloud.com> wrote:
>>>>>>
>>>>>> Thanks for all the replies! I'll definitely take a look at the Flink
>>>>>> k8s Operator project.
>>>>>>
>>>>>> I'll try to restate the issue to clarify. This issue is specific to
>>>>>> starting a job from a savepoint in job-cluster mode. In these cases the
>>>>>> Job Manager container is configured to run a single Flink job at
>>>>>> start-up, and the savepoint needs to be provided as an argument to the
>>>>>> entrypoint. The Flink documentation for this approach is here:
>>>>>>
>>>>>> https://github.com/apache/flink/tree/master/flink-container/kubernetes#resuming-from-a-savepoint
>>>>>>
>>>>>> The issue is that taking this approach means that the job will
>>>>>> *always* start from the savepoint provided as the start argument in
>>>>>> the Kubernetes YAML. This includes unplanned restarts of the Job
>>>>>> Manager, but we'd really prefer any *unplanned* restarts to resume
>>>>>> from the most recent checkpoint instead of restarting from the
>>>>>> configured savepoint. So in a sense we want the savepoint argument to
>>>>>> be transient, only being used during the initial deployment, but this
>>>>>> runs counter to the design of Kubernetes, which always wants to restore
>>>>>> a deployment to the "goal state" as defined in the YAML.
>>>>>>
>>>>>> I hope this helps. If you want more details please let me know, and
>>>>>> thanks again for your time.
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 1:09 PM Hao Sun <ha...@zendesk.com> wrote:
>>>>>>
>>>>>> I think I overlooked it. Good point. I am using Redis to save the
>>>>>> path to my savepoint; I might be able to set a TTL to avoid such an
>>>>>> issue.
>>>>>>
>>>>>> Hao Sun
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 9:54 AM Yuval Itzchakov <yuva...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Hao,
>>>>>>
>>>>>> I think he's exactly talking about the use case where the JM/TM
>>>>>> restart and they come back up from the latest savepoint, which might
>>>>>> be stale by that time.
>>>>>>
>>>>>> On Tue, 24 Sep 2019, 19:24 Hao Sun, <ha...@zendesk.com> wrote:
>>>>>>
>>>>>> We always make a savepoint before we shut down the job cluster, so
>>>>>> the savepoint is always the latest. When we fix a bug or change the
>>>>>> job graph, it can resume well. We only use checkpoints for unplanned
>>>>>> downtime, e.g. K8S killed JM/TM, uncaught exceptions, etc.
>>>>>>
>>>>>> Maybe I do not understand your use case well, but I do not see a need
>>>>>> to start from a checkpoint after a bug fix. From what I know, you can
>>>>>> currently use a checkpoint as a savepoint as well.
>>>>>>
>>>>>> Hao Sun
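>>>>>>
>>>>>> For example, resuming the job cluster from a retained checkpoint
>>>>>> instead of a savepoint could look roughly like the sketch below; the
>>>>>> class name and paths are placeholders, and the flags are the ones
>>>>>> from the flink-container docs linked above:
>>>>>>
>>>>>>     args: ["job-cluster",
>>>>>>            "--job-classname", "com.example.MyJob",
>>>>>>            "--fromSavepoint", "gs://my-bucket/flink/checkpoints/<job-id>/chk-1234",
>>>>>>            "--allowNonRestoredState"]   # only if the job graph changed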
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 7:48 AM Yuval Itzchakov <yuva...@gmail.com> wrote:
>>>>>>
>>>>>> AFAIK there's currently nothing implemented to solve this problem,
>>>>>> but a possible fix could be implemented on top of
>>>>>> https://github.com/lyft/flinkk8soperator, which already has a pretty
>>>>>> fancy state machine for rolling upgrades. I'd love to be involved, as
>>>>>> this is an issue I've been thinking about as well.
>>>>>>
>>>>>> Yuval
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 5:02 PM Sean Hester <sean.hes...@bettercloud.com> wrote:
>>>>>>
>>>>>> Hi all--we've run into a gap (knowledge? design? TBD?) for our use
>>>>>> cases when deploying Flink jobs to start from savepoints using the
>>>>>> job-cluster mode in Kubernetes.
>>>>>>
>>>>>> We're running ~15 different jobs, all in job-cluster mode, using a
>>>>>> mix of Flink 1.8.1 and 1.9.0, under GKE (Google Kubernetes Engine).
>>>>>> These are all long-running streaming jobs, all essentially acting as
>>>>>> microservices. We're using Helm charts to configure all of our
>>>>>> deployments.
>>>>>>
>>>>>> We have a number of use cases where we want to restart jobs from a
>>>>>> savepoint to replay recent events, i.e. when we've enhanced the job
>>>>>> logic or fixed a bug. But after the deployment we want the job to
>>>>>> resume its "long-running" behavior, where any unplanned restarts
>>>>>> resume from the latest checkpoint.
>>>>>>
>>>>>> The issue we run into is that any obvious/standard/idiomatic
>>>>>> Kubernetes deployment includes the savepoint argument in the
>>>>>> configuration. If the Job Manager container(s) have an unplanned
>>>>>> restart, when they come back up they will start from the savepoint
>>>>>> instead of resuming from the latest checkpoint. Everything is working
>>>>>> as configured, but that's not exactly what we want. We want the
>>>>>> savepoint argument to be transient somehow (only used during the
>>>>>> initial deployment), but Kubernetes doesn't really support the
>>>>>> concept of transient configuration.
>>>>>>
>>>>>> I can see a couple of potential solutions that either involve custom
>>>>>> code in the jobs or custom logic in the container (i.e. a custom
>>>>>> entrypoint script that records that the configured savepoint has
>>>>>> already been used in a file on a persistent volume or GCS, and
>>>>>> potentially when/why/by which deployment). But these seem like
>>>>>> unexpected and hacky solutions. Before we head down that road I
>>>>>> wanted to ask:
>>>>>>
>>>>>> - is this already a solved problem that I've missed?
>>>>>> - is this issue already on the community's radar?
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>> --
>>>>>> *Sean Hester* | Senior Staff Software Engineer | m. 404-828-0865
>>>>>> 3525 Piedmont Rd. NE, Building 6, Suite 500, Atlanta, GA 30305
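>>>>>>
>>>>>> For concreteness, the entrypoint-script idea above could be sketched
>>>>>> roughly as follows; the marker path, SAVEPOINT_PATH variable, and the
>>>>>> standalone-job.sh invocation are assumptions for illustration, not a
>>>>>> tested implementation:
>>>>>>
>>>>>>     #!/usr/bin/env bash
>>>>>>     # Use the configured savepoint only on first start; record that it
>>>>>>     # was consumed in a marker file on a persistent volume.
>>>>>>     MARKER=/data/savepoint-used              # assumes a PV mounted at /data
>>>>>>     if [ -n "$SAVEPOINT_PATH" ] && [ ! -f "$MARKER" ]; then
>>>>>>       touch "$MARKER"
>>>>>>       exec bin/standalone-job.sh start-foreground --fromSavepoint "$SAVEPOINT_PATH" "$@"
>>>>>>     fi
>>>>>>     # On any later (unplanned) restart, fall through: the HA metadata
>>>>>>     # and the latest checkpoint take over.
>>>>>>     exec bin/standalone-job.sh start-foreground "$@"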
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Yuval Itzchakov.
>
> --
> *Sean Hester* | Senior Staff Software Engineer | m. 404-828-0865
> 3525 Piedmont Rd. NE, Building 6, Suite 500, Atlanta, GA 30305