Thank you, Till. We will try to shift to the latest Flink version.

Regards
Bhaskar

On Mon, Oct 14, 2019 at 7:43 PM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Vijay,
>
> Flink usually writes the checkpoint data to disk first and then writes the
> pointer to those files to ZooKeeper. Hence, if you see a ZooKeeper entry,
> then the files should be there. I assume that there is no other process
> accessing and potentially removing files from the checkpoint directories,
> right?
>
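> If you want to cross-check this yourself, a rough sketch (this assumes the
> ZooKeeper CLI is available, the default high-availability.zookeeper.path.root
> of /flink, and an HDFS checkpoint directory; the cluster id, job id and paths
> are placeholders):
>
>   # list the checkpoint handles Flink has registered in ZooKeeper
>   zkCli.sh -server zk-host:2181 ls /flink/<cluster-id>/checkpoints/<job-id>
>
>   # compare against what is actually present on the persistent storage
>   hdfs dfs -ls hdfs:///flink/checkpoints/<job-id>
>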
> Have you tried to run one of the latest Flink versions? Flink 1.6.2 is no
> longer actively supported by the community.
>
> Cheers,
> Till
>
> On Fri, Oct 11, 2019 at 11:39 AM Vijay Bhaskar <bhaskar.eba...@gmail.com>
> wrote:
>
>> Apart from this we have another environment where checkpointing worked
>> fine in HA mode with a complete cluster restart. But for one of the jobs we
>> are seeing an issue: the checkpoint path is retrieved from ZooKeeper, but
>> that checkpoint path cannot be found in persistent storage. I am wondering
>> why this would happen in the first place.
>> Is there any synchronization issue between writing the files to the
>> persistent path and registering them with the HA service? For example, could
>> a checkpoint be registered in ZooKeeper but not yet written when the cluster
>> is restarted? I suspect this kind of problem can happen. We are using Flink
>> 1.6.2 in production. Is this a known issue that has been fixed recently?
>>
>> Regards
>> Bhaskar
>>
>> On Fri, Oct 11, 2019 at 2:08 PM Vijay Bhaskar <bhaskar.eba...@gmail.com>
>> wrote:
>>
>>> We saw the logs below in production some time ago; after that we
>>> stopped using HA. Do you think HA was enabled properly, based on these logs?
>>>
>>> Regards
>>> Bhaskar
>>>
>>> 2019-09-24 17:40:17,675 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
>>> 2019-09-24 17:40:17,675 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
>>> 2019-09-24 17:40:20,975 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
>>> 2019-09-24 17:40:20,976 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123]] Caused by: [No route to host]
>>> 2019-09-24 17:40:23,976 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
>>> 2019-09-24 17:40:23,977 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123]] Caused by: [No route to host]
>>> 2019-09-24 17:40:26,982 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
>>> 2019-09-24 17:40:26,983 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123]] Caused by: [No route to host]
>>> 2019-09-24 17:40:29,988 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
>>> 2019-09-24 17:40:29,988 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123]] Caused by: [No route to host]
>>> 2019-09-24 17:40:32,994 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
>>> 2019-09-24 17:40:32,995 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123]] Caused by: [No route to host]
>>> 2019-09-24 17:40:36,000 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
>>> 2019-09-24 17:40:36,001 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://fl...@service-flink-jobmanager-1.cloudsecure.svc.cluster.local:6123]] Caused by: [No route to host]
>>> 2019-09-24 17:40:39,006 WARN  akka.remote.transport.netty.NettyTransport  - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
>>>
>>> On Fri, Oct 11, 2019 at 9:39 AM Yun Tang <myas...@live.com> wrote:
>>>
>>>> Hi Hao
>>>>
>>>> It seems that I misunderstood the background of your use case. The
>>>> high-availability configuration targets fault tolerance, not general
>>>> development evolution. If you want to change your job topology, just follow
>>>> the general rule of restoring from a savepoint/checkpoint; do not rely on HA
>>>> to handle job migration.
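>>>>
>>>> In practice that is roughly the following, as a sketch (the job id,
>>>> savepoint directory and jar name are placeholders):
>>>>
>>>>   # trigger a savepoint for the running job (or use "flink cancel -s" to
>>>>   # stop it and take a savepoint in one step)
>>>>   bin/flink savepoint <jobId> hdfs:///flink/savepoints
>>>>
>>>>   # resubmit the changed job from that savepoint
>>>>   bin/flink run -s hdfs:///flink/savepoints/savepoint-<jobId>-xxxxxxxx new-job.jar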
>>>>
>>>> Best
>>>> Yun Tang
>>>> ------------------------------
>>>> *From:* Hao Sun <ha...@zendesk.com>
>>>> *Sent:* Friday, October 11, 2019 8:33
>>>> *To:* Yun Tang <myas...@live.com>
>>>> *Cc:* Vijay Bhaskar <bhaskar.eba...@gmail.com>; Yang Wang <
>>>> danrtsey...@gmail.com>; Sean Hester <sean.hes...@bettercloud.com>;
>>>> Aleksandar Mastilovic <amastilo...@sightmachine.com>; Yuval Itzchakov <
>>>> yuva...@gmail.com>; user <user@flink.apache.org>
>>>> *Subject:* Re: Challenges Deploying Flink With Savepoints On Kubernetes
>>>>
>>>> Yep, I know that option. That's what got me confused as well. In an HA
>>>> setup, where do I supply this option (allowNonRestoredState)?
>>>> As I remember, this option requires a savepoint path when I start a Flink
>>>> job, and HA does not require the path.
>>>>
>>>> Hao Sun
>>>>
>>>>
>>>> On Thu, Oct 10, 2019 at 11:16 AM Yun Tang <myas...@live.com> wrote:
>>>>
>>>> Just a minor supplement @Hao Sun <ha...@zendesk.com>: if you decide
>>>> to drop an operator, don't forget to add the --allowNonRestoredState
>>>> (short: -n) option [1].
>>>>
>>>>
>>>> [1]
>>>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#allowing-non-restored-state
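>>>>
>>>> For example, as a sketch (the savepoint path and jar are placeholders; the
>>>> flag is only needed when the new job graph drops state that exists in the
>>>> savepoint):
>>>>
>>>>   # resume from a savepoint, skipping state of operators that no longer exist
>>>>   bin/flink run -s hdfs:///flink/savepoints/savepoint-abc123 -n -d my-job.jar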
>>>>
>>>> Best
>>>> Yun Tang
>>>>
>>>> ------------------------------
>>>> *From:* Vijay Bhaskar <bhaskar.eba...@gmail.com>
>>>> *Sent:* Thursday, October 10, 2019 19:24
>>>> *To:* Yang Wang <danrtsey...@gmail.com>
>>>> *Cc:* Sean Hester <sean.hes...@bettercloud.com>; Aleksandar Mastilovic
>>>> <amastilo...@sightmachine.com>; Yun Tang <myas...@live.com>; Hao Sun <
>>>> ha...@zendesk.com>; Yuval Itzchakov <yuva...@gmail.com>; user <
>>>> user@flink.apache.org>
>>>> *Subject:* Re: Challenges Deploying Flink With Savepoints On Kubernetes
>>>>
>>>> Thanks Yang. We will try and let you know if any issues arise
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>> On Thu, Oct 10, 2019 at 1:53 PM Yang Wang <danrtsey...@gmail.com>
>>>> wrote:
>>>>
>>>> @Hao Sun,
>>>> I have confirmed that even if we change the parallelism, modify operators,
>>>> or add new operators, the Flink cluster can still recover from the latest
>>>> checkpoint.
>>>>
>>>> @Vijay
>>>> a) If an individual jobmanager/taskmanager crashes unexpectedly (while the
>>>> other jobmanagers and taskmanagers stay alive), the job can recover from the
>>>> latest checkpoint.
>>>> b) If all jobmanagers and taskmanagers fail, it can still recover from the
>>>> latest checkpoint as long as the cluster-id is not changed.
>>>>
>>>> When we enable HA, the metadata of the job graph and checkpoints is saved in
>>>> ZooKeeper and the real files are saved on the high-availability storage
>>>> (HDFS). So when the Flink application is submitted again with the same
>>>> cluster-id, it can recover the jobs and checkpoints from ZooKeeper. I think
>>>> this has been supported for a long time. Maybe you could give it a try with
>>>> Flink 1.8 or 1.9.
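>>>>
>>>> As a rough sketch, the relevant flink-conf.yaml settings look like this (the
>>>> quorum, cluster id and storage path below are made-up examples, adjust them
>>>> to your setup):
>>>>
>>>>   high-availability: zookeeper
>>>>   high-availability.zookeeper.quorum: zk-0:2181,zk-1:2181,zk-2:2181
>>>>   high-availability.cluster-id: /my-job-cluster
>>>>   high-availability.storageDir: hdfs:///flink/ha/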
>>>>
>>>> Best,
>>>> Yang
>>>>
>>>>
>>>> Vijay Bhaskar <bhaskar.eba...@gmail.com> 于2019年10月10日周四 下午2:26写道:
>>>>
>>>> Thanks Yang and Sean. I have couple of questions:
>>>>
>>>> 1) Suppose the scenario of , bringing back entire cluster,
>>>>      a) In that case, at least one job manager out of HA group should
>>>> be up and running right? or
>>>>      b) All the job managers fails, then also this works? In that case
>>>> please let me know the procedure/share the documentation?
>>>>          How to start from previous check point?
>>>>          What Flink version onwards this feature is stable?
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>>
>>>> On Wed, Oct 9, 2019 at 8:51 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>>>
>>>> Hi Vijay,
>>>>
>>>> If you are using the HA solution, I think you do not need to specify the
>>>> savepoint; instead, the checkpoint is used.
>>>> Checkpoints are taken automatically and periodically based on your
>>>> configuration. When a jobmanager/taskmanager fails or the whole cluster
>>>> crashes, it can always recover from the latest checkpoint. Does this meet
>>>> your requirement?
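>>>>
>>>> For reference, a rough sketch of the pieces involved: the checkpoint interval
>>>> itself is enabled in the job code, while where checkpoints and savepoints are
>>>> written is configured in flink-conf.yaml (the paths below are placeholders):
>>>>
>>>>   state.backend: rocksdb
>>>>   state.checkpoints.dir: hdfs:///flink/checkpoints
>>>>   state.savepoints.dir: hdfs:///flink/savepoints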
>>>>
>>>> Best,
>>>> Yang
>>>>
>>>> Sean Hester <sean.hes...@bettercloud.com> 于2019年10月1日周二 上午1:47写道:
>>>>
>>>> Vijay,
>>>>
>>>> That is my understanding as well: the HA solution only solves the
>>>> problem up to the point where all job managers fail/restart at the same time.
>>>> That's where my original concern was.
>>>>
>>>> But to Aleksandar and Yun's point, running in HA with 2 or 3 Job
>>>> Managers per cluster--as long as they are all deployed to separate GKE
>>>> nodes--would provide a very high uptime/low failure rate, at least on
>>>> paper. It's a promising enough option that we're going to run in HA for a
>>>> month or two and monitor results before we put in any extra work to
>>>> customize the savepoint start-up behavior.
>>>>
>>>> On Fri, Sep 27, 2019 at 2:24 AM Vijay Bhaskar <bhaskar.eba...@gmail.com>
>>>> wrote:
>>>>
>>>> I don't think HA will help to recover from a cluster crash; for that we
>>>> should take periodic savepoints, right? Please correct me if I am wrong.
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>> On Fri, Sep 27, 2019 at 11:48 AM Vijay Bhaskar <
>>>> bhaskar.eba...@gmail.com> wrote:
>>>>
>>>> Suppose my cluster crashed and I need to bring the entire cluster back up.
>>>> Does HA still help to run the cluster from the latest savepoint?
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>> On Thu, Sep 26, 2019 at 7:44 PM Sean Hester <
>>>> sean.hes...@bettercloud.com> wrote:
>>>>
>>>> thanks to everyone for all the replies.
>>>>
>>>> i think the original concern here with "just" relying on the HA option
>>>> is that there are some disaster recovery and data center migration use
>>>> cases where the continuity of the job managers is difficult to preserve.
>>>> but those are admittedly edge cases. i think it's definitely worth
>>>> reviewing the SLAs with our site reliability engineers to see how likely it
>>>> would be to completely lose all job managers under an HA configuration.
>>>> that small a risk might be acceptable/preferable to a one-off solution.
>>>>
>>>> @Aleksandar, would love to learn more about Zookeeper-less HA. i
>>>> think i spotted a thread somewhere between Till and someone (perhaps you)
>>>> about that. feel free to DM me.
>>>>
>>>> thanks again to everyone!
>>>>
>>>> On Thu, Sep 26, 2019 at 7:32 AM Yang Wang <danrtsey...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi, Aleksandar
>>>>
>>>> The savepoint option in a standalone job cluster is optional. If you want to
>>>> always recover from the latest checkpoint, then, as Aleksandar and Yun Tang
>>>> said, you could use the high-availability configuration. Make sure the
>>>> cluster-id is not changed; I think the job can then recover both from an
>>>> unexpected crash and from an intentional restart, as expected.
>>>>
>>>> @Aleksandar Mastilovic <amastilo...@sightmachine.com>, we also
>>>> have a ZooKeeper-less high-availability implementation [1].
>>>> Maybe we could have some discussion and contribute this useful feature
>>>> to the community.
>>>>
>>>> [1].
>>>> https://docs.google.com/document/d/1Z-VdJlPPEQoWT1WLm5woM4y0bFep4FrgdJ9ipQuRv8g/edit
>>>>
>>>> Best,
>>>> Yang
>>>>
>>>> Aleksandar Mastilovic <amastilo...@sightmachine.com> 于2019年9月26日周四
>>>> 上午4:11写道:
>>>>
>>>> Would you guys (Flink devs) be interested in our solution for
>>>> zookeeper-less HA? I could ask the managers how they feel about
>>>> open-sourcing the improvement.
>>>>
>>>> On Sep 25, 2019, at 11:49 AM, Yun Tang <myas...@live.com> wrote:
>>>>
>>>> As Aleksandar said, k8s with an HA configuration could solve your problem.
>>>> There has already been some discussion about how to implement such HA on k8s
>>>> when there is no ZooKeeper service: FLINK-11105 [1] and FLINK-12884 [2].
>>>> Currently, though, you can only choose ZooKeeper as the high-availability
>>>> service.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-11105
>>>> [2] https://issues.apache.org/jira/browse/FLINK-12884
>>>>
>>>> Best
>>>> Yun Tang
>>>> ------------------------------
>>>> *From:* Aleksandar Mastilovic <amastilo...@sightmachine.com>
>>>> *Sent:* Thursday, September 26, 2019 1:57
>>>> *To:* Sean Hester <sean.hes...@bettercloud.com>
>>>> *Cc:* Hao Sun <ha...@zendesk.com>; Yuval Itzchakov <yuva...@gmail.com>;
>>>> user <user@flink.apache.org>
>>>> *Subject:* Re: Challenges Deploying Flink With Savepoints On Kubernetes
>>>>
>>>> Can’t you simply use JobManager in HA mode? It would pick up where it
>>>> left off if you don’t provide a Savepoint.
>>>>
>>>> On Sep 25, 2019, at 6:07 AM, Sean Hester <sean.hes...@bettercloud.com>
>>>> wrote:
>>>>
>>>> thanks for all replies! i'll definitely take a look at the Flink k8s
>>>> Operator project.
>>>>
>>>> i'll try to restate the issue to clarify. this issue is specific to
>>>> starting a job from a savepoint in job-cluster mode. in these cases the Job
>>>> Manager container is configured to run a single Flink job at start-up. the
>>>> savepoint needs to be provided as an argument to the entrypoint. the Flink
>>>> documentation for this approach is here:
>>>>
>>>>
>>>> https://github.com/apache/flink/tree/master/flink-container/kubernetes#resuming-from-a-savepoint
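>>>>
>>>> roughly, per those docs, the relevant part of the job-cluster deployment is
>>>> just the container args (the class name and savepoint path here are made-up
>>>> placeholders):
>>>>
>>>>   args: ["job-cluster",
>>>>          "--job-classname", "com.example.MyStreamingJob",
>>>>          "--fromSavepoint", "s3://my-bucket/savepoints/savepoint-abc123",
>>>>          "--allowNonRestoredState"]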
>>>>
>>>> the issue is that taking this approach means that the job will *always*
>>>>  start from the savepoint provided as the start argument in the
>>>> Kubernetes YAML. this includes unplanned restarts of the job manager, but
>>>> we'd really prefer any *unplanned* restarts resume from the most recent
>>>> checkpoint instead of restarting from the configured savepoint. so in a
>>>> sense we want the savepoint argument to be transient, only being used
>>>> during the initial deployment, but this runs counter to the design of
>>>> Kubernetes which always wants to restore a deployment to the "goal state"
>>>> as defined in the YAML.
>>>>
>>>> i hope this helps. if you want more details please let me know, and
>>>> thanks again for your time.
>>>>
>>>>
>>>> On Tue, Sep 24, 2019 at 1:09 PM Hao Sun <ha...@zendesk.com> wrote:
>>>>
>>>> I think I overlooked it. Good point. I am using Redis to save the path
>>>> to my savepoint; I might be able to set a TTL to avoid such an issue.
>>>>
>>>> Hao Sun
>>>>
>>>>
>>>> On Tue, Sep 24, 2019 at 9:54 AM Yuval Itzchakov <yuva...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Hao,
>>>>
>>>> I think he's talking exactly about the use case where the JM/TM restart
>>>> and come back up from the latest savepoint, which might be stale by
>>>> that time.
>>>>
>>>> On Tue, 24 Sep 2019, 19:24 Hao Sun, <ha...@zendesk.com> wrote:
>>>>
>>>> We always take a savepoint before we shut down the job cluster, so the
>>>> savepoint is always the latest. When we fix a bug or change the job graph,
>>>> it can resume well.
>>>> We only use checkpoints for unplanned downtime, e.g. K8s killing a JM/TM,
>>>> an uncaught exception, etc.
>>>>
>>>> Maybe I do not understand your use case well, but I do not see a need to
>>>> start from a checkpoint after a bug fix.
>>>> From what I know, you can currently use a checkpoint as a savepoint as
>>>> well.
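>>>>
>>>> As a sketch of that last point (the job id and paths are placeholders; this
>>>> assumes a retained checkpoint is still present under the configured
>>>> checkpoint directory):
>>>>
>>>>   # retained checkpoints live under the configured checkpoints directory
>>>>   hdfs dfs -ls hdfs:///flink/checkpoints/<job-id>/
>>>>
>>>>   # a retained checkpoint (chk-<n>) can be used like a savepoint with -s
>>>>   bin/flink run -s hdfs:///flink/checkpoints/<job-id>/chk-1234 my-job.jar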
>>>>
>>>> Hao Sun
>>>>
>>>>
>>>> On Tue, Sep 24, 2019 at 7:48 AM Yuval Itzchakov <yuva...@gmail.com>
>>>> wrote:
>>>>
>>>> AFAIK there's currently nothing implemented to solve this problem, but
>>>> a possible fix could be implemented on top of
>>>> https://github.com/lyft/flinkk8soperator, which already has a pretty
>>>> fancy state machine for rolling upgrades. I'd love to be involved as this
>>>> is an issue I've been thinking about as well.
>>>>
>>>> Yuval
>>>>
>>>> On Tue, Sep 24, 2019 at 5:02 PM Sean Hester <
>>>> sean.hes...@bettercloud.com> wrote:
>>>>
>>>> hi all--we've run into a gap (knowledge? design? tbd?) for our use
>>>> cases when deploying Flink jobs to start from savepoints using the
>>>> job-cluster mode in Kubernetes.
>>>>
>>>> we're running ~15 different jobs, all in job-cluster mode, using a
>>>> mix of Flink 1.8.1 and 1.9.0, under GKE (Google Kubernetes Engine). these
>>>> are all long-running streaming jobs, all essentially acting as
>>>> microservices. we're using Helm charts to configure all of our deployments.
>>>>
>>>> we have a number of use cases where we want to restart jobs from a
>>>> savepoint to replay recent events, e.g. when we've enhanced the job logic
>>>> or fixed a bug. but after the deployment we want to have the job resume
>>>> its "long-running" behavior, where any unplanned restarts resume from the
>>>> latest checkpoint.
>>>>
>>>> the issue we run into is that any obvious/standard/idiomatic Kubernetes
>>>> deployment includes the savepoint argument in the configuration. if the Job
>>>> Manager container(s) have an unplanned restart, when they come back up they
>>>> will start from the savepoint instead of resuming from the latest
>>>> checkpoint. everything is working as configured, but that's not exactly
>>>> what we want. we want the savepoint argument to be transient somehow (only
>>>> used during the initial deployment), but Kubernetes doesn't really support
>>>> the concept of transient configuration.
>>>>
>>>> i can see a couple of potential solutions that either involve custom
>>>> code in the jobs or custom logic in the container (i.e. a custom entrypoint
>>>> script that records that the configured savepoint has already been used in
>>>> a file on a persistent volume or GCS, and potentially when/why/by which
>>>> deployment). but these seem like unexpected and hacky solutions. before we
>>>> head down that road i wanted to ask:
>>>>
>>>>    - is this is already a solved problem that i've missed?
>>>>    - is this issue already on the community's radar?
>>>>
>>>> thanks in advance!
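>>>>
>>>> (for concreteness, a rough sketch of the entrypoint-wrapper idea mentioned
>>>> above; everything here is hypothetical and untested: the SAVEPOINT_PATH
>>>> variable, the marker location, and the underlying docker-entrypoint.sh call
>>>> are all assumptions about the image:)
>>>>
>>>>   #!/bin/sh
>>>>   # pass --fromSavepoint only on the very first start, using a marker file
>>>>   # on a persistent volume to remember that the savepoint was already used
>>>>   MARKER=/persistent/savepoint-consumed
>>>>   if [ -n "$SAVEPOINT_PATH" ] && [ ! -f "$MARKER" ]; then
>>>>     touch "$MARKER"
>>>>     exec /docker-entrypoint.sh job-cluster --fromSavepoint "$SAVEPOINT_PATH" "$@"
>>>>   else
>>>>     exec /docker-entrypoint.sh job-cluster "$@"
>>>>   fi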
>>>>
>>>> --
>>>> *Sean Hester* | Senior Staff Software Engineer | m. 404-828-0865
>>>> 3525 Piedmont Rd. NE, Building 6, Suite 500, Atlanta, GA 30305
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Yuval Itzchakov.
>>>>
>>>>
>>>>
>>>> --
>>>> *Sean Hester* | Senior Staff Software Engineer | m. 404-828-0865
>>>> 3525 Piedmont Rd. NE, Building 6, Suite 500, Atlanta, GA 30305
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> *Sean Hester* | Senior Staff Software Engineer | m. 404-828-0865
>>>> 3525 Piedmont Rd. NE, Building 6, Suite 500, Atlanta, GA 30305
>>>>
>>>>
>>>>
>>>> --
>>>> *Sean Hester* | Senior Staff Software Engineer | m. 404-828-0865
>>>> 3525 Piedmont Rd. NE, Building 6, Suite 500, Atlanta, GA 30305
>>>>
>>>>
