Thank you, Till. We will try to shift to the latest Flink version.

Regards
Bhaskar
On Mon, Oct 14, 2019 at 7:43 PM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Vijay,
>
> Flink usually writes the checkpoint data to disk first and then writes the
> pointer to the files to ZooKeeper. Hence, if you see a ZooKeeper entry,
> then the files should be there. I assume that there is no other process
> accessing and potentially removing files from the checkpoint directories,
> right?
>
> Have you tried to run one of the latest Flink versions? Flink 1.6.2 is no
> longer actively supported by the community.
>
> Cheers,
> Till
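Till's point above (the checkpoint files are written first, the pointer in ZooKeeper second) can be cross-checked by hand. A minimal sketch, assuming the default high-availability.zookeeper.path.* layout and an HDFS checkpoint directory -- the znode path and directories below are assumptions, adjust them to your flink-conf.yaml:

    # list the checkpoint pointers Flink registered in ZooKeeper
    # (assumed defaults: root /flink, cluster-id /default)
    bin/zkCli.sh -server zk-host:2181 ls /flink/default/checkpoints/<job-id>

    # check that the files those pointers reference still exist on the persistent store
    hdfs dfs -ls hdfs:///flink/checkpoints/<job-id>/

If a pointer exists under the znode but the matching chk-* directory is gone, something other than Flink has most likely removed the files, since Flink only registers the pointer after the data has been written.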
> On Fri, Oct 11, 2019 at 11:39 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>
>> Apart from these we have another environment, and there the checkpoints
>> worked fine in HA mode with a complete cluster restart. But for one of the
>> jobs we are seeing an issue: the checkpoint path is retrieved from
>> ZooKeeper, but that checkpoint path cannot be found in persistent storage.
>> I am wondering why this would happen in the first place.
>> Is there any sync issue between writing the files to the persistent path
>> and registering the files with the HA service? For example, could a
>> checkpoint have been registered in ZooKeeper but not yet written while the
>> cluster was restarting? I suspect this kind of problem can happen. We are
>> using Flink 1.6.2 in production. Is this an issue that was already known
>> and fixed recently?
>>
>> Regards
>> Bhaskar
>>
>> On Fri, Oct 11, 2019 at 2:08 PM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>
>>> We saw the below logs in production some time ago; after that we stopped
>>> using HA. Do you think HA was enabled properly, judging from these logs?
>>>
>>> Regards
>>> Bhaskar
>>>
>>> 2019-09-24 17:40:17,675 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
>>> 2019-09-24 17:40:17,675 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
>>> 2019-09-24 17:40:20,975 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
>>> 2019-09-24 17:40:20,976 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123]] Caused by: [No route to host]
>>> 2019-09-24 17:40:23,976 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
>>> 2019-09-24 17:40:23,977 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123]] Caused by: [No route to host]
>>> 2019-09-24 17:40:26,982 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
>>> 2019-09-24 17:40:26,983 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123]] Caused by: [No route to host]
>>> 2019-09-24 17:40:29,988 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
>>> 2019-09-24 17:40:29,988 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123]] Caused by: [No route to host]
>>> 2019-09-24 17:40:32,994 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
>>> 2019-09-24 17:40:32,995 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123]] Caused by: [No route to host]
>>> 2019-09-24 17:40:36,000 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
>>> 2019-09-24 17:40:36,001 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123]] Caused by: [No route to host]
>>> 2019-09-24 17:40:39,006 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
>>>
>>> On Fri, Oct 11, 2019 at 9:39 AM Yun Tang <myas...@live.com> wrote:
>>>
>>>> Hi Hao
>>>>
>>>> It seems that I misunderstood the background of your use case. The
>>>> high-availability configuration targets fault tolerance, not general job
>>>> evolution. If you want to change your job topology, just follow the
>>>> general rule and restore from a savepoint/checkpoint; do not rely on HA
>>>> to do job-migration things.
>>>>
>>>> Best
>>>> Yun Tang
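Following Yun Tang's advice above, a planned topology change is driven through the CLI rather than through HA recovery. A rough sketch -- the job id, savepoint directory and jar name are placeholders:

    # take a savepoint of the running job
    flink savepoint <job-id> hdfs:///flink/savepoints

    # resubmit the changed job from that savepoint; -n / --allowNonRestoredState
    # allows dropping state that no longer maps to an operator in the new graph
    flink run -d -s hdfs:///flink/savepoints/savepoint-<job-id>-xxxx -n my-job.jar

For unplanned restarts HA then takes over again: as long as the cluster-id stays the same, no savepoint argument is needed.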
>>>> ------------------------------
>>>> *From:* Hao Sun <ha...@zendesk.com>
>>>> *Sent:* Friday, October 11, 2019 8:33
>>>> *To:* Yun Tang <myas...@live.com>
>>>> *Cc:* Vijay Bhaskar <bhaskar.eba...@gmail.com>; Yang Wang <danrtsey...@gmail.com>; Sean Hester <sean.hes...@bettercloud.com>; Aleksandar Mastilovic <amastilo...@sightmachine.com>; Yuval Itzchakov <yuva...@gmail.com>; user <user@flink.apache.org>
>>>> *Subject:* Re: Challenges Deploying Flink With Savepoints On Kubernetes
>>>>
>>>> Yep, I know that option. That's what got me confused as well. In an HA
>>>> setup, where do I supply this option (allowNonRestoredState)?
>>>> As I remember, this option requires a savepoint path when I start a Flink
>>>> job, and HA does not require that path.
>>>>
>>>> Hao Sun
>>>>
>>>> On Thu, Oct 10, 2019 at 11:16 AM Yun Tang <myas...@live.com> wrote:
>>>>
>>>> Just a minor supplement @Hao Sun <ha...@zendesk.com>: if you decide to
>>>> drop an operator, don't forget to add the --allowNonRestoredState
>>>> (short: -n) option [1].
>>>>
>>>> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#allowing-non-restored-state
>>>>
>>>> Best
>>>> Yun Tang
>>>>
>>>> ------------------------------
>>>> *From:* Vijay Bhaskar <bhaskar.eba...@gmail.com>
>>>> *Sent:* Thursday, October 10, 2019 19:24
>>>> *To:* Yang Wang <danrtsey...@gmail.com>
>>>> *Cc:* Sean Hester <sean.hes...@bettercloud.com>; Aleksandar Mastilovic <amastilo...@sightmachine.com>; Yun Tang <myas...@live.com>; Hao Sun <ha...@zendesk.com>; Yuval Itzchakov <yuva...@gmail.com>; user <user@flink.apache.org>
>>>> *Subject:* Re: Challenges Deploying Flink With Savepoints On Kubernetes
>>>>
>>>> Thanks Yang. We will try and let you know if any issues arise.
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>> On Thu, Oct 10, 2019 at 1:53 PM Yang Wang <danrtsey...@gmail.com> wrote:
>>>>
>>>> @Hao Sun,
>>>> I have confirmed that even if we change the parallelism and/or modify
>>>> operators or add new operators, the Flink cluster can still recover from
>>>> the latest checkpoint.
>>>>
>>>> @Vijay
>>>> a) If an individual jobmanager/taskmanager crashes unexpectedly (the other
>>>> jobmanagers and taskmanagers are still alive), it can recover from the
>>>> latest checkpoint.
>>>> b) If all jobmanagers and taskmanagers fail, it can still recover from the
>>>> latest checkpoint as long as the cluster-id is not changed.
>>>>
>>>> When we enable HA, the metadata of the job graph and checkpoints is saved
>>>> in ZooKeeper and the real files are saved on high-availability storage
>>>> (HDFS). So when the Flink application is submitted again with the same
>>>> cluster-id, it can recover the jobs and checkpoints from ZooKeeper. I think
>>>> this has been supported for a long time. Maybe you could give Flink 1.8 or
>>>> 1.9 a try.
>>>>
>>>> Best,
>>>> Yang
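To make Yang Wang's description above concrete, the relevant flink-conf.yaml entries would look roughly like the sketch below; the hostnames and paths are placeholders, not values taken from this thread:

    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-0:2181,zk-1:2181,zk-2:2181
    # keep this stable across restarts so the cluster finds its old job graphs and checkpoints
    high-availability.cluster-id: /my-job-cluster
    # only pointers live in ZooKeeper; the real metadata and files live here
    high-availability.storageDir: hdfs:///flink/ha/
    state.checkpoints.dir: hdfs:///flink/checkpoints

With these settings, a restarted cluster that keeps the same cluster-id should pick up the latest completed checkpoint on its own, which is the behavior described in a) and b) above.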
>>>> On Thu, Oct 10, 2019 at 2:26 PM, Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>
>>>> Thanks Yang and Sean. I have a couple of questions:
>>>>
>>>> 1) Consider the scenario of bringing back the entire cluster:
>>>>     a) In that case, at least one job manager out of the HA group should be
>>>>        up and running, right? Or
>>>>     b) If all the job managers fail, does this still work? In that case,
>>>>        please let me know the procedure / share the documentation.
>>>>        How do we start from the previous checkpoint?
>>>>        From which Flink version onwards is this feature stable?
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>> On Wed, Oct 9, 2019 at 8:51 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>>>
>>>> Hi Vijay,
>>>>
>>>> If you are using the HA solution, I think you do not need to specify a
>>>> savepoint. Instead, the checkpoint is used. Checkpoints are taken
>>>> automatically and periodically based on your configuration. When a
>>>> jobmanager/taskmanager fails or the whole cluster crashes, it can always
>>>> recover from the latest checkpoint. Does this meet your requirement?
>>>>
>>>> Best,
>>>> Yang
>>>>
>>>> On Tue, Oct 1, 2019 at 1:47 AM, Sean Hester <sean.hes...@bettercloud.com> wrote:
>>>>
>>>> Vijay,
>>>>
>>>> That is my understanding as well: the HA solution only solves the problem
>>>> up to the point where all job managers fail/restart at the same time.
>>>> That's where my original concern was.
>>>>
>>>> But to Aleksandar and Yun's point, running in HA with 2 or 3 Job Managers
>>>> per cluster--as long as they are all deployed to separate GKE nodes--would
>>>> provide a very high uptime/low failure rate, at least on paper. It's a
>>>> promising enough option that we're going to run in HA for a month or two
>>>> and monitor results before we put in any extra work to customize the
>>>> savepoint start-up behavior.
>>>>
>>>> On Fri, Sep 27, 2019 at 2:24 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>
>>>> I don't think HA will help to recover from a cluster crash; for that we
>>>> should take periodic savepoints, right? Please correct me in case I am wrong.
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>> On Fri, Sep 27, 2019 at 11:48 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>
>>>> Suppose my cluster crashed and I need to bring the entire cluster back up.
>>>> Does HA still help to run the cluster from the latest savepoint?
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>> On Thu, Sep 26, 2019 at 7:44 PM Sean Hester <sean.hes...@bettercloud.com> wrote:
>>>>
>>>> thanks to everyone for all the replies.
>>>>
>>>> i think the original concern here with "just" relying on the HA option is
>>>> that there are some disaster recovery and data center migration use cases
>>>> where the continuity of the job managers is difficult to preserve. but
>>>> those are admittedly very edgy use cases. i think it's definitely worth
>>>> reviewing the SLAs with our site reliability engineers to see how likely it
>>>> would be to completely lose all job managers under an HA configuration.
>>>> that small a risk might be acceptable/preferable to a one-off solution.
>>>>
>>>> @Aleksandar, would love to learn more about Zookeeper-less HA. i think i
>>>> spotted a thread somewhere between Till and someone (perhaps you) about
>>>> that. feel free to DM me.
>>>>
>>>> thanks again to everyone!
>>>>
>>>> On Thu, Sep 26, 2019 at 7:32 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>>>
>>>> Hi, Aleksandar
>>>>
>>>> The savepoint option in a standalone job cluster is optional. If you want
>>>> to always recover from the latest checkpoint, then, as Aleksandar and Yun
>>>> Tang said, you can use the high-availability configuration. Make sure the
>>>> cluster-id is not changed; I think the job can then recover both from an
>>>> unexpected crash and from an intentional restart.
>>>>
>>>> @Aleksandar Mastilovic <amastilo...@sightmachine.com>, we also have a
>>>> zookeeper-less high-availability implementation [1]. Maybe we could have
>>>> some discussion and contribute this useful feature to the community.
>>>> [1] https://docs.google.com/document/d/1Z-VdJlPPEQoWT1WLm5woM4y0bFep4FrgdJ9ipQuRv8g/edit
>>>>
>>>> Best,
>>>> Yang
>>>>
>>>> On Thu, Sep 26, 2019 at 4:11 AM, Aleksandar Mastilovic <amastilo...@sightmachine.com> wrote:
>>>>
>>>> Would you guys (Flink devs) be interested in our solution for
>>>> zookeeper-less HA? I could ask the managers how they feel about
>>>> open-sourcing the improvement.
>>>>
>>>> On Sep 25, 2019, at 11:49 AM, Yun Tang <myas...@live.com> wrote:
>>>>
>>>> As Aleksandar said, k8s with an HA configuration could solve your problem.
>>>> There has already been some discussion about how to implement such HA in
>>>> k8s if we don't have a ZooKeeper service: FLINK-11105 [1] and FLINK-12884
>>>> [2]. For now, you would still have to choose ZooKeeper as the
>>>> high-availability service.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-11105
>>>> [2] https://issues.apache.org/jira/browse/FLINK-12884
>>>>
>>>> Best
>>>> Yun Tang
>>>>
>>>> ------------------------------
>>>> *From:* Aleksandar Mastilovic <amastilo...@sightmachine.com>
>>>> *Sent:* Thursday, September 26, 2019 1:57
>>>> *To:* Sean Hester <sean.hes...@bettercloud.com>
>>>> *Cc:* Hao Sun <ha...@zendesk.com>; Yuval Itzchakov <yuva...@gmail.com>; user <user@flink.apache.org>
>>>> *Subject:* Re: Challenges Deploying Flink With Savepoints On Kubernetes
>>>>
>>>> Can't you simply use the JobManager in HA mode? It would pick up where it
>>>> left off if you don't provide a savepoint.
>>>>
>>>> On Sep 25, 2019, at 6:07 AM, Sean Hester <sean.hes...@bettercloud.com> wrote:
>>>>
>>>> thanks for all the replies! i'll definitely take a look at the Flink k8s
>>>> Operator project.
>>>>
>>>> i'll try to restate the issue to clarify. this issue is specific to
>>>> starting a job from a savepoint in job-cluster mode. in these cases the Job
>>>> Manager container is configured to run a single Flink job at start-up. the
>>>> savepoint needs to be provided as an argument to the entrypoint. the Flink
>>>> documentation for this approach is here:
>>>>
>>>> https://github.com/apache/flink/tree/master/flink-container/kubernetes#resuming-from-a-savepoint
>>>>
>>>> the issue is that taking this approach means that the job will *always*
>>>> start from the savepoint provided as the start argument in the Kubernetes
>>>> YAML. this includes unplanned restarts of the job manager, but we'd really
>>>> prefer any *unplanned* restarts resume from the most recent checkpoint
>>>> instead of restarting from the configured savepoint. so in a sense we want
>>>> the savepoint argument to be transient, only being used during the initial
>>>> deployment, but this runs counter to the design of Kubernetes, which always
>>>> wants to restore a deployment to the "goal state" as defined in the YAML.
>>>>
>>>> i hope this helps. if you want more details please let me know, and thanks
>>>> again for your time.
>>>>
>>>> On Tue, Sep 24, 2019 at 1:09 PM Hao Sun <ha...@zendesk.com> wrote:
>>>>
>>>> I think I overlooked it. Good point. I am using Redis to save the path to
>>>> my savepoint; I might be able to set a TTL to avoid such an issue.
>>>>
>>>> Hao Sun
>>>>
>>>> On Tue, Sep 24, 2019 at 9:54 AM Yuval Itzchakov <yuva...@gmail.com> wrote:
>>>>
>>>> Hi Hao,
>>>>
>>>> I think he's talking exactly about the use case where the JM/TM restart and
>>>> they come back up from the latest savepoint, which might be stale by that
>>>> time.
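Sean's requirement above -- use the savepoint argument only on the first deployment, then let checkpoints take over -- is essentially the "custom entrypoint script" idea he mentions in his original mail further down. A hedged sketch of that idea: the marker-file convention, environment variable names and paths are assumptions for illustration, and the --fromSavepoint / --allowNonRestoredState arguments are assumed to be accepted by the job-cluster entrypoint as described in the flink-container docs linked above.

    #!/usr/bin/env bash
    # wrapper-entrypoint.sh: pass the savepoint only the first time this deployment starts.
    # SAVEPOINT_PATH and JOB_CLASS are hypothetical env vars set by the Deployment spec.
    MARKER=/persistent/savepoint-consumed   # on a persistent volume, survives pod restarts

    ARGS=()
    if [ -n "${SAVEPOINT_PATH}" ] && [ ! -f "${MARKER}" ]; then
      ARGS+=(--fromSavepoint "${SAVEPOINT_PATH}" --allowNonRestoredState)
      touch "${MARKER}"
    fi

    # hand off to the normal job-cluster entrypoint
    exec /opt/flink/bin/standalone-job.sh start-foreground --job-classname "${JOB_CLASS}" "${ARGS[@]}"

On an unplanned restart the marker already exists, so the job comes back without the savepoint argument and resumes from the latest checkpoint via HA, which is the behavior Sean and Yuval are after. Whether to write the marker before or after a successful restore (and where to record when and why it was consumed) is a judgment call.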
>>>> On Tue, 24 Sep 2019, 19:24 Hao Sun, <ha...@zendesk.com> wrote:
>>>>
>>>> We always make a savepoint before we shut down the job-cluster, so the
>>>> savepoint is always the latest. When we fix a bug or change the job graph,
>>>> it can resume well.
>>>> We only use checkpoints for unplanned downtime, e.g. K8S killed JM/TM,
>>>> uncaught exception, etc.
>>>>
>>>> Maybe I do not understand your use case well, but I do not see a need to
>>>> start from a checkpoint after a bug fix.
>>>> From what I know, currently you can use a checkpoint as a savepoint as well.
>>>>
>>>> Hao Sun
>>>>
>>>> On Tue, Sep 24, 2019 at 7:48 AM Yuval Itzchakov <yuva...@gmail.com> wrote:
>>>>
>>>> AFAIK there's currently nothing implemented to solve this problem, but a
>>>> possible fix could be implemented on top of
>>>> https://github.com/lyft/flinkk8soperator, which already has a pretty fancy
>>>> state machine for rolling upgrades. I'd love to be involved, as this is an
>>>> issue I've been thinking about as well.
>>>>
>>>> Yuval
>>>>
>>>> On Tue, Sep 24, 2019 at 5:02 PM Sean Hester <sean.hes...@bettercloud.com> wrote:
>>>>
>>>> hi all--we've run into a gap (knowledge? design? tbd?) for our use cases
>>>> when deploying Flink jobs to start from savepoints using the job-cluster
>>>> mode in Kubernetes.
>>>>
>>>> we're running ~15 different jobs, all in job-cluster mode, using a mix of
>>>> Flink 1.8.1 and 1.9.0, under GKE (Google Kubernetes Engine). these are all
>>>> long-running streaming jobs, all essentially acting as microservices. we're
>>>> using Helm charts to configure all of our deployments.
>>>>
>>>> we have a number of use cases where we want to restart jobs from a
>>>> savepoint to replay recent events, i.e. when we've enhanced the job logic
>>>> or fixed a bug. but after the deployment we want to have the job resume its
>>>> "long-running" behavior, where any unplanned restarts resume from the
>>>> latest checkpoint.
>>>>
>>>> the issue we run into is that any obvious/standard/idiomatic Kubernetes
>>>> deployment includes the savepoint argument in the configuration. if the Job
>>>> Manager container(s) have an unplanned restart, when they come back up they
>>>> will start from the savepoint instead of resuming from the latest
>>>> checkpoint. everything is working as configured, but that's not exactly
>>>> what we want. we want the savepoint argument to be transient somehow (only
>>>> used during the initial deployment), but Kubernetes doesn't really support
>>>> the concept of transient configuration.
>>>>
>>>> i can see a couple of potential solutions that either involve custom code
>>>> in the jobs or custom logic in the container (i.e. a custom entrypoint
>>>> script that records that the configured savepoint has already been used in
>>>> a file on a persistent volume or GCS, and potentially when/why/by which
>>>> deployment). but these seem like unexpected and hacky solutions. before we
>>>> head down that road i wanted to ask:
>>>>
>>>> - is this already a solved problem that i've missed?
>>>> - is this issue already on the community's radar?
>>>>
>>>> thanks in advance!
>>>>
>>>> --
>>>> *Sean Hester* | Senior Staff Software Engineer | m. 404-828-0865
>>>> 3525 Piedmont Rd. NE, Building 6, Suite 500, Atlanta, GA 30305
>>>> <http://www.bettercloud.com/>
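Hao Sun's workflow above -- a savepoint for every planned shutdown, checkpoints only for unplanned downtime -- and his note that a checkpoint can be used like a savepoint could look roughly like this sketch. The job id and GCS paths are placeholders, and resuming from a checkpoint path requires externalized/retained checkpoints to be enabled in the job:

    # planned redeploy: take a savepoint and cancel the old job in one step
    flink cancel -s gs://my-bucket/savepoints <job-id>

    # start the fixed/upgraded job from that savepoint
    flink run -d -s gs://my-bucket/savepoints/savepoint-<job-id>-xxxx my-job.jar

    # "checkpoint as a savepoint": resume from a retained checkpoint directory instead
    flink run -d -s gs://my-bucket/checkpoints/<job-id>/chk-1234 my-job.jar

Unplanned restarts need no -s argument at all when HA is enabled; the job is expected to come back from its latest completed checkpoint.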