Hi Marcus,

Thanks for reaching out with your problem.
I'm not very experienced with the HA setup, but Till (in CC) might be able
to help you.

Best, Fabian

2017-09-14 16:57 GMT+02:00 Marcus Clendenin <[email protected]>:

> Hi all,
>
>
>
> I am having an issue where one of our TaskManagers running in
> high-availability mode times out on its connection to ZooKeeper. It
> then retries the ZooKeeper connection, which succeeds. However, once
> the TaskManager has reconnected to ZooKeeper, it is unable to connect
> to the JobManager. Does anybody know why this is happening? This is
> on Flink 1.3.1 with checkpointing using RocksDB.
>
>
>
> Stack Trace:
>
> 2017-09-14 09:35:16,033 INFO  org.apache.zookeeper.
> ClientCnxn                               - Client session timed out, have
> not heard from server in 79531ms for sessionid 0x15e428f9953001f, closing
> socket connection and attempting reconnect
>
> 2017-09-14 09:35:17,170 INFO  org.apache.flink.shaded.org.
> apache.curator.framework.state.ConnectionStateManager  - State change:
> SUSPENDED
>
> 2017-09-14 09:35:17,528 WARN  org.apache.flink.runtime.leaderretrieval.
> ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper suspended. Can
> no longer retrieve the leader from ZooKeeper.
>
> 2017-09-14 09:35:17,796 WARN  org.apache.zookeeper.
> ClientCnxn                               - SASL configuration failed:
> javax.security.auth.login.LoginException: unable to find LoginModule
> class: org.apache.kafka.common.security.plain.PlainLoginModule Will
> continue connection to Zookeeper server without SASL authentication, if
> Zookeeper server allows it.
>
> 2017-09-14 09:35:17,796 INFO  org.apache.zookeeper.
> ClientCnxn                               - Opening socket connection to
> server zookeeper21-01/00.000.00.000:2181
>
> 2017-09-14 09:35:17,798 INFO  org.apache.zookeeper.ClientCnxn
>            - Socket connection established to zookeeper21-01/00.000.00.
> 000:2181, initiating session
>
> 2017-09-14 09:35:17,958 ERROR 
> org.apache.flink.shaded.org.apache.curator.ConnectionState
> - Authentication failed
>
> 2017-09-14 09:35:18,261 WARN  akka.remote.RemoteWatcher
>                                 - Detected unreachable:
> [akka.tcp://flink@jobmanager1:36491]
>
> 2017-09-14 09:35:18,433 INFO  org.apache.flink.shaded.org.
> apache.curator.framework.state.ConnectionStateManager  - State change:
> LOST
>
> 2017-09-14 09:35:18,433 INFO  org.apache.zookeeper.
> ClientCnxn                               - Unable to reconnect to
> ZooKeeper service, session 0x15e428f9953001f has expired, closing socket
> connection
>
> 2017-09-14 09:35:18,433 WARN  
> org.apache.flink.shaded.org.apache.curator.ConnectionState
> - Session expired event received
>
> 2017-09-14 09:35:18,433 WARN  org.apache.flink.runtime.leaderretrieval.
> ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper lost. Can no
> longer retrieve the leader from ZooKeeper.
>
> 2017-09-14 09:35:18,693 INFO  org.apache.zookeeper.
> ZooKeeper                                - Initiating client connection,
> connectString=zookeeper21-01:2181,zookeeper21-02:2181,zookeeper21-03:2181,
> zookeeper22-01:2181,zookeeper22-02:2181 sessionTimeout=60000
> watcher=org.apache.flink.shaded.org.apache.curator.
> ConnectionState@781f10f2
>
> 2017-09-14 09:35:18,757 INFO  org.apache.zookeeper.
> ClientCnxn                               - EventThread shut down
>
> 2017-09-14 09:35:19,354 WARN  org.apache.zookeeper.
> ClientCnxn                               - SASL configuration failed:
> javax.security.auth.login.LoginException: unable to find LoginModule
> class: org.apache.kafka.common.security.plain.PlainLoginModule Will
> continue connection to Zookeeper server without SASL authentication, if
> Zookeeper server allows it.
>
> 2017-09-14 09:35:19,354 INFO  org.apache.zookeeper.
> ClientCnxn                               - Opening socket connection to
> server zookeeper1/00.000.00.000:2181
>
> 2017-09-14 09:35:19,354 ERROR 
> org.apache.flink.shaded.org.apache.curator.ConnectionState
> - Authentication failed
>
> 2017-09-14 09:35:19,355 INFO  org.apache.zookeeper.
> ClientCnxn                               - Socket connection established
> to zookeeper1/00.000.00.000:2181, initiating session
>
> 2017-09-14 09:35:19,358 INFO  org.apache.zookeeper.
> ClientCnxn                               - Session establishment complete
> on server zookeeper1/00.000.00.000:2181, sessionid = 0x45e446247000012,
> negotiated timeout = 60000
>
> 2017-09-14 09:35:19,358 INFO  org.apache.flink.shaded.org.
> apache.curator.framework.state.ConnectionStateManager  - State change:
> RECONNECTED
>
> 2017-09-14 09:35:19,359 INFO  org.apache.flink.runtime.leaderretrieval.
> ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper was
> reconnected. Leader retrieval can be restarted.
>
> 2017-09-14 09:35:21,494 INFO  org.apache.flink.runtime.
> taskmanager.TaskManager              - TaskManager
> akka://flink/user/taskmanager disconnects from JobManager
> akka.tcp://flink@jobmanager1:36491/user/jobmanager: JobManager is no
> longer reachable
>
> 2017-09-14 09:35:21,724 INFO  org.apache.flink.runtime.
> taskmanager.TaskManager              - Cancelling all computations and
> discarding all cached data.
>
> 2017-09-14 09:35:21,856 INFO  org.apache.flink.runtime.
> taskmanager.Task                     - Attempting to fail task externally
> Map (2/3) (13599aa15283f8c5af1df477cd290629).
>
> 2017-09-14 09:35:21,856 INFO  org.apache.flink.runtime.
> taskmanager.Task                     - Map (2/3) (
> 13599aa15283f8c5af1df477cd290629) switched from RUNNING to FAILED.
>
> java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects
> from JobManager akka.tcp://flink@jobmanager1:36491/user/jobmanager:
> JobManager is no longer reachable
>
>         at org.apache.flink.runtime.taskmanager.TaskManager.
> handleJobManagerDisconnect(TaskManager.scala:1095)
>
>         at org.apache.flink.runtime.taskmanager.TaskManager$$
> anonfun$handleMessage$1.applyOrElse(TaskManager.scala:311)
>
>         at scala.runtime.AbstractPartialFunction.apply(
> AbstractPartialFunction.scala:36)
>
>         at org.apache.flink.runtime.LeaderSessionMessageFilter$$
> anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>
>         at scala.runtime.AbstractPartialFunction.apply(
> AbstractPartialFunction.scala:36)
>
>         at org.apache.flink.runtime.LogMessages$$anon$1.apply(
> LogMessages.scala:33)
>
>         at org.apache.flink.runtime.LogMessages$$anon$1.apply(
> LogMessages.scala:28)
>
>         at scala.PartialFunction$class.applyOrElse(PartialFunction.
> scala:123)
>
>         at org.apache.flink.runtime.LogMessages$$anon$1.
> applyOrElse(LogMessages.scala:28)
>
>         at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>
>         at org.apache.flink.runtime.taskmanager.TaskManager.
> aroundReceive(TaskManager.scala:120)
>
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>
>         at akka.actor.dungeon.DeathWatch$class.receivedTerminated(
> DeathWatch.scala:44)
>
>         at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>
>         at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>
>         at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>
>         at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>
>         at akka.dispatch.ForkJoinExecutorConfigurator$
> AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(
> ForkJoinTask.java:260)
>
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.
> runTask(ForkJoinPool.java:1339)
>
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(
> ForkJoinPool.java:1979)
>
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(
> ForkJoinWorkerThread.java:107)
>
> 2017-09-14 09:35:21,861 INFO  org.apache.flink.runtime.
> taskmanager.Task                     - Triggering cancellation of task
> code Map (2/3) (13599aa15283f8c5af1df477cd290629).
>
> 2017-09-14 09:35:21,861 INFO  org.apache.flink.runtime.
> taskmanager.Task                     - Attempting to fail task externally
> Timestamps/Watermarks (2/3) (9cf3d208a85e4d88fffd93d0b8152d83).
>
> 2017-09-14 09:35:21,861 INFO  org.apache.flink.runtime.
> taskmanager.Task                     - Timestamps/Watermarks (2/3) (
> 9cf3d208a85e4d88fffd93d0b8152d83) switched from RUNNING to FAILED.
>
> java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects
> from JobManager akka.tcp://flink@jobmanager1:36491/user/jobmanager:
> JobManager is no longer reachable
>
>         at org.apache.flink.runtime.taskmanager.TaskManager.
> handleJobManagerDisconnect(TaskManager.scala:1095)
>
