Re: Challenges Deploying Flink With Savepoints On Kubernetes

Vijay Bhaskar Thu, 10 Oct 2019 04:25:24 -0700

Thanks, Yang. We will try it and let you know if any issues arise.

Regards
Bhaskar


On Thu, Oct 10, 2019 at 1:53 PM Yang Wang <danrtsey...@gmail.com> wrote:

> @ Hao Sun,
> I have confirmed that even if we change the parallelism, modify operators,
> or add new operators, the Flink cluster can still recover from the latest
> checkpoint.
>
> @ Vijay
> a) If some individual jobmanagers/taskmanagers crash unexpectedly (while
> other jobmanagers and taskmanagers are still alive), the job can recover
> from the latest checkpoint.
> b) If all jobmanagers and taskmanagers fail, the job can still recover from
> the latest checkpoint, as long as the cluster-id is not changed.
>
> When HA is enabled, the metadata of the job graph and checkpoints is saved
> in ZooKeeper, and the actual files are saved on high-availability storage
> (HDFS). So when the Flink application is submitted again with the same
> cluster-id, it can recover its jobs and checkpoints from ZooKeeper. I think
> this has been supported for a long time. Maybe you could try it with
> Flink 1.8 or 1.9.
>
> Best,
> Yang
>
>
> On Thu, Oct 10, 2019 at 2:26 PM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>
>> Thanks Yang and Sean. I have a couple of questions:
>>
>> 1) Suppose we need to bring back the entire cluster:
>>      a) In that case, at least one job manager out of the HA group should
>> be up and running, right? or
>>      b) If all the job managers fail, does this still work? In that case,
>> please let me know the procedure or share the documentation.
>>          How do we start from the previous checkpoint?
>>          From which Flink version onwards is this feature stable?
>>
>> Regards
>> Bhaskar
>>
>>
>> On Wed, Oct 9, 2019 at 8:51 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>
>>> Hi Vijay,
>>>
>>> If you are using the HA solution, I think you do not need to specify the
>>> savepoint. Instead, the checkpoint is used.
>>> Checkpoints are taken automatically and periodically based on your
>>> configuration. When the jobmanager/taskmanager fails or the whole cluster
>>> crashes, the job can always recover from the latest checkpoint. Does this
>>> meet your requirement?
>>>
>>> Best,
>>> Yang
>>>
>>> On Tue, Oct 1, 2019 at 1:47 AM Sean Hester <sean.hes...@bettercloud.com> wrote:
>>>
>>>> Vijay,
>>>>
>>>> That is my understanding as well: the HA solution only solves the
>>>> problem up to the point all job managers fail/restart at the same time.
>>>> That's where my original concern was.
>>>>
>>>> But to Aleksandar and Yun's point, running in HA with 2 or 3 Job
>>>> Managers per cluster--as long as they are all deployed to separate GKE
>>>> nodes--would provide a very high uptime/low failure rate, at least on
>>>> paper. It's a promising enough option that we're going to run in HA for a
>>>> month or two and monitor results before we put in any extra work to
>>>> customize the savepoint start-up behavior.
>>>>
>>>> On Fri, Sep 27, 2019 at 2:24 AM Vijay Bhaskar <bhaskar.eba...@gmail.com>
>>>> wrote:
>>>>
>>>>> I don't think HA will help to recover from a cluster crash; for that we
>>>>> should take periodic savepoints, right? Please correct me if I am wrong.
>>>>>
>>>>> Regards
>>>>> Bhaskar
>>>>>
>>>>> On Fri, Sep 27, 2019 at 11:48 AM Vijay Bhaskar <
>>>>> bhaskar.eba...@gmail.com> wrote:
>>>>>
>>>>>> Suppose my cluster crashes and I need to bring the entire cluster
>>>>>> back up. Does HA still help to run the cluster from the latest
>>>>>> savepoint?
>>>>>>
>>>>>> Regards
>>>>>> Bhaskar
>>>>>>
>>>>>> On Thu, Sep 26, 2019 at 7:44 PM Sean Hester <
>>>>>> sean.hes...@bettercloud.com> wrote:
>>>>>>
>>>>>>> thanks to everyone for all the replies.
>>>>>>>
>>>>>>> i think the original concern here with "just" relying on the HA
>>>>>>> option is that there are some disaster recovery and data center
>>>>>>> migration use cases where the continuity of the job managers is
>>>>>>> difficult to preserve. but those are admittedly very edgy use cases. i
>>>>>>> think it's definitely worth reviewing the SLAs with our site
>>>>>>> reliability engineers to see how likely it would be to completely lose
>>>>>>> all job managers under an HA configuration. that small a risk might be
>>>>>>> acceptable/preferable to a one-off solution.
>>>>>>>
>>>>>>> @Aleksandar, would love to learn more about Zookeeper-less HA. i
>>>>>>> think i spotted a thread somewhere between Till and someone (perhaps
>>>>>>> you) about that. feel free to DM me.
>>>>>>>
>>>>>>> thanks again to everyone!
>>>>>>>
>>>>>>> On Thu, Sep 26, 2019 at 7:32 AM Yang Wang <danrtsey...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi, Aleksandar
>>>>>>>>
>>>>>>>> The savepoint option in a standalone job cluster is optional. If you
>>>>>>>> want to always recover from the latest checkpoint, just as
>>>>>>>> Aleksandar and Yun Tang said, you could use the high-availability
>>>>>>>> configuration. As long as the cluster-id is not changed, I think the
>>>>>>>> job can recover both after an unexpected crash and after a planned
>>>>>>>> restart.
>>>>>>>>
>>>>>>>> @Aleksandar Mastilovic <amastilo...@sightmachine.com>, we also have a
>>>>>>>> ZooKeeper-less high-availability implementation [1].
>>>>>>>> Maybe we could discuss it and contribute this useful feature to the
>>>>>>>> community.
>>>>>>>>
>>>>>>>> [1].
>>>>>>>> https://docs.google.com/document/d/1Z-VdJlPPEQoWT1WLm5woM4y0bFep4FrgdJ9ipQuRv8g/edit
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Yang
>>>>>>>>
>>>>>>>> On Thu, Sep 26, 2019 at 4:11 AM Aleksandar Mastilovic
>>>>>>>> <amastilo...@sightmachine.com> wrote:
>>>>>>>>
>>>>>>>>> Would you guys (Flink devs) be interested in our solution for
>>>>>>>>> zookeeper-less HA? I could ask the managers how they feel about
>>>>>>>>> open-sourcing the improvement.
>>>>>>>>>
>>>>>>>>> On Sep 25, 2019, at 11:49 AM, Yun Tang <myas...@live.com> wrote:
>>>>>>>>>
>>>>>>>>> As Aleksandar said, k8s with an HA configuration could solve your
>>>>>>>>> problem. There has already been some discussion about how to
>>>>>>>>> implement such HA in k8s without a ZooKeeper service: FLINK-11105 [1]
>>>>>>>>> and FLINK-12884 [2]. Currently, ZooKeeper may be the only
>>>>>>>>> high-availability service you can choose.
>>>>>>>>>
>>>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-11105
>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-12884
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>> Yun Tang
>>>>>>>>> ------------------------------
>>>>>>>>> *From:* Aleksandar Mastilovic <amastilo...@sightmachine.com>
>>>>>>>>> *Sent:* Thursday, September 26, 2019 1:57
>>>>>>>>> *To:* Sean Hester <sean.hes...@bettercloud.com>
>>>>>>>>> *Cc:* Hao Sun <ha...@zendesk.com>; Yuval Itzchakov <
>>>>>>>>> yuva...@gmail.com>; user <user@flink.apache.org>
>>>>>>>>> *Subject:* Re: Challenges Deploying Flink With Savepoints On
>>>>>>>>> Kubernetes
>>>>>>>>>
>>>>>>>>> Can’t you simply use JobManager in HA mode? It would pick up where
>>>>>>>>> it left off if you don’t provide a Savepoint.
>>>>>>>>>
>>>>>>>>> On Sep 25, 2019, at 6:07 AM, Sean Hester <
>>>>>>>>> sean.hes...@bettercloud.com> wrote:
>>>>>>>>>
>>>>>>>>> thanks for all replies! i'll definitely take a look at the Flink
>>>>>>>>> k8s Operator project.
>>>>>>>>>
>>>>>>>>> i'll try to restate the issue to clarify. this issue is specific
>>>>>>>>> to starting a job from a savepoint in job-cluster mode. in these
>>>>>>>>> cases the Job Manager container is configured to run a single Flink
>>>>>>>>> job at start-up. the savepoint needs to be provided as an argument to
>>>>>>>>> the entrypoint. the Flink documentation for this approach is here:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://github.com/apache/flink/tree/master/flink-container/kubernetes#resuming-from-a-savepoint
>>>>>>>>>
>>>>>>>>> the issue is that taking this approach means that the job will
>>>>>>>>> *always* start from the savepoint provided as the start argument
>>>>>>>>> in the Kubernetes YAML. this includes unplanned restarts of the job
>>>>>>>>> manager, but we'd really prefer any *unplanned* restarts resume
>>>>>>>>> from the most recent checkpoint instead of restarting from the
>>>>>>>>> configured savepoint. so in a sense we want the savepoint argument to
>>>>>>>>> be transient, only being used during the initial deployment, but this
>>>>>>>>> runs counter to the design of Kubernetes, which always wants to
>>>>>>>>> restore a deployment to the "goal state" as defined in the YAML.
>>>>>>>>>
>>>>>>>>> i hope this helps. if you want more details please let me know,
>>>>>>>>> and thanks again for your time.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 24, 2019 at 1:09 PM Hao Sun <ha...@zendesk.com> wrote:
>>>>>>>>>
>>>>>>>>> I think I overlooked it. Good point. I am using Redis to save the
>>>>>>>>> path to my savepoint; I might be able to set a TTL to avoid such an
>>>>>>>>> issue.
>>>>>>>>>
>>>>>>>>> Hao Sun
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 24, 2019 at 9:54 AM Yuval Itzchakov <yuva...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Hao,
>>>>>>>>>
>>>>>>>>> I think he's talking about exactly the use case where the JM/TM
>>>>>>>>> restart and come back up from the latest savepoint, which might be
>>>>>>>>> stale by that time.
>>>>>>>>>
>>>>>>>>> On Tue, 24 Sep 2019, 19:24 Hao Sun, <ha...@zendesk.com> wrote:
>>>>>>>>>
>>>>>>>>> We always make a savepoint before we shut down the job-cluster, so
>>>>>>>>> the savepoint is always the latest. When we fix a bug or change the
>>>>>>>>> job graph, it can resume well.
>>>>>>>>> We only use checkpoints for unplanned downtime, e.g. K8S killed
>>>>>>>>> JM/TM, uncaught exception, etc.
>>>>>>>>>
>>>>>>>>> Maybe I do not understand your use case well; I do not see a need
>>>>>>>>> to start from a checkpoint after a bug fix.
>>>>>>>>> From what I know, currently you can use a checkpoint as a savepoint
>>>>>>>>> as well.
>>>>>>>>>
>>>>>>>>> Hao Sun
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 24, 2019 at 7:48 AM Yuval Itzchakov <yuva...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> AFAIK there's currently nothing implemented to solve this problem,
>>>>>>>>> but a possible fix could be implemented on top of
>>>>>>>>> https://github.com/lyft/flinkk8soperator which already has a
>>>>>>>>> pretty fancy state machine for rolling upgrades. I'd love to be
>>>>>>>>> involved as this is an issue I've been thinking about as well.
>>>>>>>>>
>>>>>>>>> Yuval
>>>>>>>>>
>>>>>>>>> On Tue, Sep 24, 2019 at 5:02 PM Sean Hester <
>>>>>>>>> sean.hes...@bettercloud.com> wrote:
>>>>>>>>>
>>>>>>>>> hi all--we've run into a gap (knowledge? design? tbd?) for our use
>>>>>>>>> cases when deploying Flink jobs to start from savepoints using the
>>>>>>>>> job-cluster mode in Kubernetes.
>>>>>>>>>
>>>>>>>>> we're running ~15 different jobs, all in job-cluster mode, using
>>>>>>>>> a mix of Flink 1.8.1 and 1.9.0, under GKE (Google Kubernetes Engine).
>>>>>>>>> these are all long-running streaming jobs, all essentially acting as
>>>>>>>>> microservices. we're using Helm charts to configure all of our
>>>>>>>>> deployments.
>>>>>>>>>
>>>>>>>>> we have a number of use cases where we want to restart jobs from a
>>>>>>>>> savepoint to replay recent events, i.e. when we've enhanced the job
>>>>>>>>> logic or fixed a bug. but after the deployment we want to have the
>>>>>>>>> job resume its "long-running" behavior, where any unplanned restarts
>>>>>>>>> resume from the latest checkpoint.
>>>>>>>>>
>>>>>>>>> the issue we run into is that any obvious/standard/idiomatic
>>>>>>>>> Kubernetes deployment includes the savepoint argument in the
>>>>>>>>> configuration. if the Job Manager container(s) have an unplanned
>>>>>>>>> restart, when they come back up they will start from the savepoint
>>>>>>>>> instead of resuming from the latest checkpoint. everything is working
>>>>>>>>> as configured, but that's not exactly what we want. we want the
>>>>>>>>> savepoint argument to be transient somehow (only used during the
>>>>>>>>> initial deployment), but Kubernetes doesn't really support the
>>>>>>>>> concept of transient configuration.
>>>>>>>>>
>>>>>>>>> i can see a couple of potential solutions that either involve
>>>>>>>>> custom code in the jobs or custom logic in the container (i.e. a
>>>>>>>>> custom entrypoint script that records that the configured savepoint
>>>>>>>>> has already been used in a file on a persistent volume or GCS, and
>>>>>>>>> potentially when/why/by which deployment). but these seem like
>>>>>>>>> unexpected and hacky solutions. before we head down that road i
>>>>>>>>> wanted to ask:
>>>>>>>>>
>>>>>>>>>    - is this already a solved problem that i've missed?
>>>>>>>>>    - is this issue already on the community's radar?
>>>>>>>>>
>>>>>>>>> thanks in advance!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *Sean Hester* | Senior Staff Software Engineer | m. 404-828-0865
>>>>>>>>> 3525 Piedmont Rd. NE, Building 6, Suite 500, Atlanta, GA 30305
>>>>>>>>> <http://www.bettercloud.com/> <http://www.bettercloud.com/>
>>>>>>>>> *Altitude 2019 in San Francisco | Sept. 23 - 25*
>>>>>>>>> It’s not just an IT conference, it’s “a complete learning and
>>>>>>>>> networking experience”
>>>>>>>>> <https://altitude.bettercloud.com/?utm_source=gmail&utm_medium=signature&utm_campaign=2019-altitude>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best Regards,
>>>>>>>>> Yuval Itzchakov.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>
>>>>
>>>>
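
Editorial sketch (not part of the original thread): the custom entrypoint idea
Sean describes in the first message, recording on a persistent volume that the
configured savepoint has already been used, might look roughly like the
following. This is only an illustration under stated assumptions: the function
name `build_start_args`, the marker-file location, and the JSON layout are
hypothetical, and `--fromSavepoint` is assumed to be the savepoint argument
described in the flink-container README linked in the thread, so check it
against your Flink version.

```python
import json
import os


def build_start_args(savepoint_path, marker_file):
    """Return extra entrypoint arguments for the job-cluster container.

    The savepoint argument is emitted only the first time a given savepoint
    path is seen; later restarts with the same path get no savepoint argument.
    `marker_file` is a hypothetical JSON file on a persistent volume.
    """
    used = set()
    if os.path.exists(marker_file):
        with open(marker_file) as fh:
            used = set(json.load(fh))

    # No savepoint configured, or it was already consumed by an earlier start.
    if not savepoint_path or savepoint_path in used:
        return []

    # Record the savepoint as used, replacing the marker file atomically.
    used.add(savepoint_path)
    tmp_file = marker_file + ".tmp"
    with open(tmp_file, "w") as fh:
        json.dump(sorted(used), fh)
    os.replace(tmp_file, marker_file)

    # Assumed flag name; verify against the entrypoint of your Flink version.
    return ["--fromSavepoint", savepoint_path]
```

On an unplanned restart the savepoint path is already recorded, so no savepoint
argument is passed; with HA enabled and an unchanged cluster-id, the job would
then be expected to recover from the latest checkpoint as discussed in the
thread. Note that this sketch marks the savepoint as used before the job has
actually started, so a failed first start would need manual cleanup of the
marker file.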
