Sorry, I forgot to mention the version: we run Flink 1.7 on YARN in HA mode.

On Fri, Oct 11, 2019 at 12:02 PM Joshua Fan <joshuafat...@gmail.com> wrote:

> Hi Till
>
> After getting your advice, I checked the log again. It seems not wholly
> the same as the condition you mentioned.
>
> I would like to summarize the story told by the log below.
>
> At one point, the ZK connection was unstable, so there were three
> suspended-reconnected cycles.
>
> After the first suspended-reconnected cycle, the MiniDispatcher tried to
> recover all jobs.
>
> Then the second suspended-reconnected cycle came. After that reconnect,
> there was a 'The heartbeat of JobManager with id
> dbad79e0173c5658b029fba4d70e8084 timed out', and this time the
> MiniDispatcher did not try to recover the job.
>
> Because the ZK connection still had not recovered, the third
> suspended-reconnected cycle came. After ZK reconnected, the MiniDispatcher
> did not try to recover the job but just kept throwing
> FencingTokenException. The AM was hanging; our alarm system only found
> that the job was gone but could not get a final state for it. The
> FencingTokenException kept recurring for nearly a day before we killed
> the AM.
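>
> As background for the terms above: the suspended/reconnected events come
> from Curator's connection state notifications, which Flink's
> ZooKeeper-based HA services listen to. The sketch below is only my own
> illustration of the states we saw in the log, not Flink's actual code,
> and the class name is made up.
>
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.state.ConnectionState;
> import org.apache.curator.framework.state.ConnectionStateListener;
>
> public class ZkStateLogger implements ConnectionStateListener {
>     @Override
>     public void stateChanged(CuratorFramework client, ConnectionState newState) {
>         switch (newState) {
>             case SUSPENDED:
>                 // Connection to ZooKeeper lost; leadership is now in doubt.
>                 System.out.println("zk connection suspended");
>                 break;
>             case RECONNECTED:
>                 // Connection came back before the session expired; leadership
>                 // may be confirmed again and jobs recovered.
>                 System.out.println("zk connection reconnected");
>                 break;
>             case LOST:
>                 // Session expired; leadership is definitely revoked.
>                 System.out.println("zk session lost");
>                 break;
>             default:
>                 break;
>         }
>     }
> }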
>
> The whole log is attached.
>
> Thanks
>
> Joshua
>
> On Fri, Oct 11, 2019 at 10:35 AM Hanson, Bruce <bruce.han...@here.com>
> wrote:
>
>> Hi Till and Fabian,
>>
>>
>>
>> My apologies for taking a week to reply; it took some time to reproduce
>> the issue with debug logging. I’ve attached logs from a two-minute period
>> when the problem happened. I’m just sending this to you two to avoid
>> sending the log file all over the place. If you’d like to have our
>> conversation in the user group mailing list, that’s fine.
>>
>>
>>
>> The job was submitted via the Job Manager REST API, starting at
>> 20:33:46.262 and finishing at 20:34:01.547. This worked normally, and the
>> job started running. We then run a monitor that polls the /overview
>> endpoint of the JM REST API. It started polling at 20:34:31.380; the JM
>> threw the FencingTokenException at 20:34:31.393 and returned a 500 to our
>> monitor. This happens on every poll until the monitor times out and we
>> tear down the cluster: even though the job is running, we can’t tell
>> that it is. The problem is somewhat rare, happening maybe 5% of the time.
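>>
>> For reference, the monitor does roughly the equivalent of the sketch
>> below. This is a simplified illustration, not our actual monitor code;
>> the host, port, retry count, and timeouts are placeholders, and /overview
>> is the JM REST endpoint mentioned above.
>>
>> import java.io.IOException;
>> import java.net.HttpURLConnection;
>> import java.net.URL;
>>
>> public class OverviewMonitor {
>>     public static void main(String[] args) throws IOException, InterruptedException {
>>         URL url = new URL("http://flink-jobmanager:8081/overview"); // placeholder host/port
>>         for (int attempt = 0; attempt < 30; attempt++) {
>>             HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>>             conn.setConnectTimeout(5_000);
>>             conn.setReadTimeout(5_000);
>>             int status = conn.getResponseCode();
>>             conn.disconnect();
>>             if (status == 200) {
>>                 System.out.println("cluster overview OK");
>>                 return;
>>             }
>>             // On the affected clusters this is 500 on every attempt, even
>>             // though the job itself keeps running.
>>             System.out.println("overview returned HTTP " + status + ", retrying");
>>             Thread.sleep(10_000);
>>         }
>>         System.out.println("monitor timed out; the cluster will be torn down");
>>     }
>> }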
>>
>>
>>
>> We’re running Flink 1.7.1. This issue only happens when we run in Job
>> Manager High Availability mode. We provision two Job Managers, a 3-node
>> ZooKeeper cluster, task managers, and our monitor all in their own
>> Kubernetes namespace. I can send you Zookeeper logs too if that would be
>> helpful.
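>>
>> In case it matters, "HA mode" here means the standard ZooKeeper-based HA
>> settings in flink-conf.yaml, roughly like the following; the hostnames,
>> storage path, and cluster id below are placeholders, not our real values:
>>
>> high-availability: zookeeper
>> high-availability.zookeeper.quorum: zk-0:2181,zk-1:2181,zk-2:2181
>> high-availability.storageDir: s3://some-bucket/flink/ha/
>> high-availability.cluster-id: /my-job-cluster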
>>
>>
>>
>> Thanks in advance for any help you can provide!
>>
>>
>>
>> -Bruce
>>
>> --
>>
>>
>>
>>
>>
>> From: Till Rohrmann <trohrm...@apache.org>
>> Date: Wednesday, October 2, 2019 at 6:10 AM
>> To: Fabian Hueske <fhue...@gmail.com>
>> Cc: "Hanson, Bruce" <bruce.han...@here.com>, "user@flink.apache.org" <user@flink.apache.org>
>> Subject: Re: Fencing token exceptions from Job Manager High Availability mode
>>
>>
>>
>> Hi Bruce, are you able to provide us with the full debug logs? From the
>> excerpt itself it is hard to tell what is going on.
>>
>>
>>
>> Cheers,
>>
>> Till
>>
>>
>>
>> On Wed, Oct 2, 2019 at 2:24 PM Fabian Hueske <fhue...@gmail.com> wrote:
>>
>> Hi Bruce,
>>
>>
>>
>> I haven't seen such an exception yet, but maybe Till (in CC) can help.
>>
>>
>>
>> Best,
>>
>> Fabian
>>
>>
>>
>> On Tue, Oct 1, 2019 at 5:51 AM Hanson, Bruce <
>> bruce.han...@here.com> wrote:
>>
>> Hi all,
>>
>>
>>
>> We are running some of our Flink jobs with Job Manager High Availability.
>> Occasionally we get a cluster that comes up improperly and doesn’t respond.
>> Attempts to submit the job seem to hang, and when we hit the /overview REST
>> endpoint in the Job Manager we get a 500 error and a fencing token
>> exception like this:
>>
>>
>>
>> 2019-09-21 05:04:07.785 [flink-akka.actor.default-dispatcher-4428]
>> level=ERROR o.a.f.runtime.rest.handler.cluster.ClusterOverviewHandler  -
>> Implementation error: Unhandled exception.
>> org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing
>> token not set: Ignoring message LocalFencedMessage(null,
>> LocalRpcInvocation(requestResourceOverview(Time))) sent to
>> akka.tcp://fl...@job-ef80a156-3350-4e85-8761-b0e42edc346f-jm-0.job-ef80a156-3350-4e85-8761-b0e42edc346f-jm-svc.olp-here-test-j-ef80a156-3350-4e85-8761-b0e42edc346f.svc.cluster.local:6126/user/resourcemanager
>> because the fencing token is null.
>>         at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:59)
>>         at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>>         at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
>>         at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>>         at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>         at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>         at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>         at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>         at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>
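>> From the stack trace, my (simplified) understanding of the check in
>> FencedAkkaRpcActor.handleRpcMessage is roughly the sketch below. This is
>> only a paraphrase for illustration, not the actual Flink source; the
>> class, field, and exception types are generic stand-ins.
>>
>> // Paraphrased fencing check (illustration only, not the Flink source).
>> // The token stays null until the component has been granted leadership.
>> class FencedEndpointSketch {
>>     private volatile Object fencingToken; // null before leadership is confirmed
>>
>>     void handleFencedMessage(Object tokenInMessage, Runnable rpcInvocation) {
>>         if (fencingToken == null) {
>>             // The "Fencing token not set" path in the log above.
>>             throw new IllegalStateException(
>>                     "Fencing token not set: the fencing token is null");
>>         } else if (!fencingToken.equals(tokenInMessage)) {
>>             throw new IllegalStateException("Fencing token mismatch");
>>         } else {
>>             rpcInvocation.run();
>>         }
>>     }
>> }
>>
>> If that reading is right, the resource manager in the Job Manager never
>> gets its fencing token set after startup, which would explain why every
>> /overview call fails the same way.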
>>
>>
>>
>>
>> We are running Flink 1.7.1 in Kubernetes and run each job in its own
>> namespace with a three-node ZooKeeper cluster and two Job Managers, plus
>> one or more Task Managers. I have been able to replicate the issue, but
>> don’t find any difference in the logs between a failing cluster and a good
>> one.
>>
>>
>>
>> Does anyone here have any ideas as to what’s happening, or what I should
>> be looking for?
>>
>>
>>
>> -Bruce
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Bruce Hanson
>>
>> Principal Engineer
>>
>> M: +1 425 681 0422
>>
>>
>>
>> HERE Technologies
>>
>> 701 Pike Street, Suite 2000
>>
>> Seattle, WA 98101 USA
>>
>> 47° 36' 41" N 122° 19' 57" W
>>
>>
>>
>>
>>
>>
>>
>>
>>
