Hi Vijay,

If you are using the HA solution, I think you do not need to specify the savepoint. Instead, the checkpoint is used. Checkpoints are taken automatically and periodically based on your configuration. When the JobManager/TaskManager fails or the whole cluster crashes, it can always recover from the latest checkpoint. Does this meet your requirement?
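For example, a minimal flink-conf.yaml sketch along these lines (the quorum hosts, bucket paths, and cluster-id below are placeholders, not a tested configuration):

    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-0:2181,zk-1:2181,zk-2:2181
    high-availability.storageDir: gs://my-bucket/flink/ha        # JobManager metadata
    high-availability.cluster-id: /my-job-cluster                # keep this stable across restarts
    state.backend: rocksdb
    state.checkpoints.dir: gs://my-bucket/flink/checkpoints      # where retained checkpoints go
    state.checkpoints.num-retained: 3

The checkpoint interval itself is typically enabled in the job code (e.g. env.enableCheckpointing(60000)).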
Best,
Yang

Sean Hester <sean.hes...@bettercloud.com> wrote on Tue, Oct 1, 2019 at 1:47 AM:

> Vijay,
>
> That is my understanding as well: the HA solution only solves the problem
> up to the point where all Job Managers fail/restart at the same time.
> That's where my original concern was.
>
> But to Aleksandar and Yun's point, running in HA with 2 or 3 Job Managers
> per cluster--as long as they are all deployed to separate GKE nodes--would
> provide a very high uptime/low failure rate, at least on paper. It's a
> promising enough option that we're going to run in HA for a month or two
> and monitor results before we put in any extra work to customize the
> savepoint start-up behavior.
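>
> To illustrate the "separate GKE nodes" part, a hypothetical Deployment
> fragment (names are placeholders) that spreads Job Manager replicas across
> nodes with pod anti-affinity might look roughly like:
>
>     spec:
>       replicas: 2
>       template:
>         metadata:
>           labels:
>             app: flink-jobmanager
>         spec:
>           affinity:
>             podAntiAffinity:
>               requiredDuringSchedulingIgnoredDuringExecution:
>               - labelSelector:
>                   matchLabels:
>                     app: flink-jobmanager
>                 topologyKey: kubernetes.io/hostname   # at most one Job Manager per node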
>
> On Fri, Sep 27, 2019 at 2:24 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>
>> I don't think HA will help to recover from a cluster crash; for that we
>> should take periodic savepoints, right? Please correct me in case I am
>> wrong.
>>
>> Regards
>> Bhaskar
>>
>> On Fri, Sep 27, 2019 at 11:48 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>
>>> Suppose my cluster crashed and I need to bring the entire cluster back
>>> up. Does HA still help to run the cluster from the latest savepoint?
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Thu, Sep 26, 2019 at 7:44 PM Sean Hester <sean.hes...@bettercloud.com> wrote:
>>>
>>>> Thanks to everyone for all the replies.
>>>>
>>>> I think the original concern here with "just" relying on the HA option
>>>> is that there are some disaster recovery and data center migration use
>>>> cases where the continuity of the Job Managers is difficult to preserve.
>>>> But those are admittedly edge cases. I think it's definitely worth
>>>> reviewing the SLAs with our site reliability engineers to see how likely
>>>> it would be to completely lose all Job Managers under an HA
>>>> configuration. That small a risk might be acceptable/preferable to a
>>>> one-off solution.
>>>>
>>>> @Aleksandar, I would love to learn more about ZooKeeper-less HA. I
>>>> think I spotted a thread somewhere between Till and someone (perhaps you)
>>>> about that. Feel free to DM me.
>>>>
>>>> Thanks again to everyone!
>>>>
>>>> On Thu, Sep 26, 2019 at 7:32 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>>>
>>>>> Hi Aleksandar,
>>>>>
>>>>> The savepoint option in a standalone job cluster is optional. If you
>>>>> want to always recover from the latest checkpoint, you can use the
>>>>> high-availability configuration, just as Aleksandar and Yun Tang said.
>>>>> Make sure the cluster-id is not changed; then I think the job should
>>>>> recover as expected, both after an unexpected crash and after a planned
>>>>> restart.
>>>>>
>>>>> @Aleksandar Mastilovic <amastilo...@sightmachine.com>, we also have a
>>>>> ZooKeeper-less high-availability implementation [1]. Maybe we could
>>>>> have some discussion and contribute this useful feature to the
>>>>> community.
>>>>>
>>>>> [1] https://docs.google.com/document/d/1Z-VdJlPPEQoWT1WLm5woM4y0bFep4FrgdJ9ipQuRv8g/edit
>>>>>
>>>>> Best,
>>>>> Yang
>>>>>
>>>>> Aleksandar Mastilovic <amastilo...@sightmachine.com> wrote on Thu, Sep 26, 2019 at 4:11 AM:
>>>>>
>>>>>> Would you guys (Flink devs) be interested in our solution for
>>>>>> ZooKeeper-less HA? I could ask the managers how they feel about
>>>>>> open-sourcing the improvement.
>>>>>>
>>>>>> On Sep 25, 2019, at 11:49 AM, Yun Tang <myas...@live.com> wrote:
>>>>>>
>>>>>> As Aleksandar said, k8s with an HA configuration could solve your
>>>>>> problem. There has already been some discussion about how to implement
>>>>>> such HA in k8s without a ZooKeeper service: FLINK-11105 [1] and
>>>>>> FLINK-12884 [2]. Currently, you might have to choose ZooKeeper as the
>>>>>> high-availability service.
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-11105
>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-12884
>>>>>>
>>>>>> Best
>>>>>> Yun Tang
>>>>>> ------------------------------
>>>>>> *From:* Aleksandar Mastilovic <amastilo...@sightmachine.com>
>>>>>> *Sent:* Thursday, September 26, 2019 1:57
>>>>>> *To:* Sean Hester <sean.hes...@bettercloud.com>
>>>>>> *Cc:* Hao Sun <ha...@zendesk.com>; Yuval Itzchakov <yuva...@gmail.com>; user <user@flink.apache.org>
>>>>>> *Subject:* Re: Challenges Deploying Flink With Savepoints On Kubernetes
>>>>>>
>>>>>> Can’t you simply use the JobManager in HA mode? It would pick up where
>>>>>> it left off if you don’t provide a savepoint.
>>>>>>
>>>>>> On Sep 25, 2019, at 6:07 AM, Sean Hester <sean.hes...@bettercloud.com> wrote:
>>>>>>
>>>>>> Thanks for all the replies! I'll definitely take a look at the Flink
>>>>>> k8s Operator project.
>>>>>>
>>>>>> I'll try to restate the issue to clarify. This issue is specific to
>>>>>> starting a job from a savepoint in job-cluster mode. In these cases the
>>>>>> Job Manager container is configured to run a single Flink job at
>>>>>> start-up, and the savepoint needs to be provided as an argument to the
>>>>>> entrypoint. The Flink documentation for this approach is here:
>>>>>>
>>>>>> https://github.com/apache/flink/tree/master/flink-container/kubernetes#resuming-from-a-savepoint
>>>>>>
>>>>>> The issue is that taking this approach means that the job will
>>>>>> *always* start from the savepoint provided as the start argument in
>>>>>> the Kubernetes YAML. This includes unplanned restarts of the Job
>>>>>> Manager, but we'd really prefer any *unplanned* restarts to resume
>>>>>> from the most recent checkpoint instead of restarting from the
>>>>>> configured savepoint. So in a sense we want the savepoint argument to
>>>>>> be transient, only being used during the initial deployment, but this
>>>>>> runs counter to the design of Kubernetes, which always wants to restore
>>>>>> a deployment to the "goal state" as defined in the YAML.
>>>>>>
>>>>>> I hope this helps. If you want more details please let me know, and
>>>>>> thanks again for your time.
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 1:09 PM Hao Sun <ha...@zendesk.com> wrote:
>>>>>>
>>>>>> I think I overlooked it. Good point. I am using Redis to save the
>>>>>> path to my savepoint; I might be able to set a TTL to avoid such an
>>>>>> issue.
>>>>>>
>>>>>> Hao Sun
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 9:54 AM Yuval Itzchakov <yuva...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Hao,
>>>>>>
>>>>>> I think he's exactly talking about the use case where the JM/TM
>>>>>> restart and they come back up from the latest savepoint, which might
>>>>>> be stale by that time.
>>>>>>
>>>>>> On Tue, 24 Sep 2019, 19:24 Hao Sun, <ha...@zendesk.com> wrote:
>>>>>>
>>>>>> We always make a savepoint before we shut down the job cluster, so
>>>>>> the savepoint is always the latest. When we fix a bug or change the
>>>>>> job graph, it can resume well. We only use checkpoints for unplanned
>>>>>> downtime, e.g. K8S killed JM/TM, uncaught exceptions, etc.
>>>>>>
>>>>>> Maybe I do not understand your use case well, but I do not see a need
>>>>>> to start from a checkpoint after a bug fix. From what I know, you can
>>>>>> currently use a checkpoint as a savepoint as well.
>>>>>>
>>>>>> Hao Sun
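>>>>>>
>>>>>> For example, resuming the job cluster from a retained checkpoint
>>>>>> instead of a savepoint could look roughly like the sketch below; the
>>>>>> class name and paths are placeholders, and the flags are the ones
>>>>>> from the flink-container docs linked above:
>>>>>>
>>>>>>     args: ["job-cluster",
>>>>>>            "--job-classname", "com.example.MyJob",
>>>>>>            "--fromSavepoint", "gs://my-bucket/flink/checkpoints/<job-id>/chk-1234",
>>>>>>            "--allowNonRestoredState"]   # only if the job graph changed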
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 7:48 AM Yuval Itzchakov <yuva...@gmail.com> wrote:
>>>>>>
>>>>>> AFAIK there's currently nothing implemented to solve this problem,
>>>>>> but a possible fix could be implemented on top of
>>>>>> https://github.com/lyft/flinkk8soperator, which already has a pretty
>>>>>> fancy state machine for rolling upgrades. I'd love to be involved, as
>>>>>> this is an issue I've been thinking about as well.
>>>>>>
>>>>>> Yuval
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 5:02 PM Sean Hester <sean.hes...@bettercloud.com> wrote:
>>>>>>
>>>>>> Hi all--we've run into a gap (knowledge? design? TBD?) for our use
>>>>>> cases when deploying Flink jobs to start from savepoints using the
>>>>>> job-cluster mode in Kubernetes.
>>>>>>
>>>>>> We're running ~15 different jobs, all in job-cluster mode, using a
>>>>>> mix of Flink 1.8.1 and 1.9.0, under GKE (Google Kubernetes Engine).
>>>>>> These are all long-running streaming jobs, all essentially acting as
>>>>>> microservices. We're using Helm charts to configure all of our
>>>>>> deployments.
>>>>>>
>>>>>> We have a number of use cases where we want to restart jobs from a
>>>>>> savepoint to replay recent events, i.e. when we've enhanced the job
>>>>>> logic or fixed a bug. But after the deployment we want the job to
>>>>>> resume its "long-running" behavior, where any unplanned restarts
>>>>>> resume from the latest checkpoint.
>>>>>>
>>>>>> The issue we run into is that any obvious/standard/idiomatic
>>>>>> Kubernetes deployment includes the savepoint argument in the
>>>>>> configuration. If the Job Manager container(s) have an unplanned
>>>>>> restart, when they come back up they will start from the savepoint
>>>>>> instead of resuming from the latest checkpoint. Everything is working
>>>>>> as configured, but that's not exactly what we want. We want the
>>>>>> savepoint argument to be transient somehow (only used during the
>>>>>> initial deployment), but Kubernetes doesn't really support the
>>>>>> concept of transient configuration.
>>>>>>
>>>>>> I can see a couple of potential solutions that either involve custom
>>>>>> code in the jobs or custom logic in the container (i.e. a custom
>>>>>> entrypoint script that records that the configured savepoint has
>>>>>> already been used in a file on a persistent volume or GCS, and
>>>>>> potentially when/why/by which deployment). But these seem like
>>>>>> unexpected and hacky solutions. Before we head down that road I
>>>>>> wanted to ask:
>>>>>>
>>>>>> - is this already a solved problem that I've missed?
>>>>>> - is this issue already on the community's radar?
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>> --
>>>>>> *Sean Hester* | Senior Staff Software Engineer | m. 404-828-0865
>>>>>> 3525 Piedmont Rd. NE, Building 6, Suite 500, Atlanta, GA 30305
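>>>>>>
>>>>>> For concreteness, the entrypoint-script idea above could be sketched
>>>>>> roughly as follows; the marker path, SAVEPOINT_PATH variable, and the
>>>>>> standalone-job.sh invocation are assumptions for illustration, not a
>>>>>> tested implementation:
>>>>>>
>>>>>>     #!/usr/bin/env bash
>>>>>>     # Use the configured savepoint only on first start; record that it
>>>>>>     # was consumed in a marker file on a persistent volume.
>>>>>>     MARKER=/data/savepoint-used              # assumes a PV mounted at /data
>>>>>>     if [ -n "$SAVEPOINT_PATH" ] && [ ! -f "$MARKER" ]; then
>>>>>>       touch "$MARKER"
>>>>>>       exec bin/standalone-job.sh start-foreground --fromSavepoint "$SAVEPOINT_PATH" "$@"
>>>>>>     fi
>>>>>>     # On any later (unplanned) restart, fall through: the HA metadata
>>>>>>     # and the latest checkpoint take over.
>>>>>>     exec bin/standalone-job.sh start-foreground "$@"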
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Yuval Itzchakov.
>
> --
> *Sean Hester* | Senior Staff Software Engineer | m. 404-828-0865
> 3525 Piedmont Rd. NE, Building 6, Suite 500, Atlanta, GA 30305