Hi Harshith,

The truncated logs are not enough to diagnose this. Can you share the complete logs? If that is not possible, I'd like to see at least the beginning of the log files, where the cluster configuration is logged.

The TaskManager tries to connect to the leader address that is advertised in ZooKeeper. In your case the hostname "cluster" is advertised, which hints at a problem in your Flink configuration.

Best,
Gary

On Thu, Mar 14, 2019 at 4:54 PM Kumar Bolar, Harshith <hk...@arity.com> wrote:

> Hi Gary,
>
> I’ve attached the relevant portions of the JM and TM logs.
>
> *Job Manager Logs:*
>
> 2019-03-14 11:38:28,257 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED
> 2019-03-14 11:38:28,309 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of main cluster component log file: /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.log
> 2019-03-14 11:38:28,309 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of main cluster component stdout file: /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.out
> 2019-03-14 11:38:28,527 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Rest endpoint listening at cluster:8080
> 2019-03-14 11:38:28,527 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
> 2019-03-14 11:38:28,574 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Web frontend listening at http://cluster:8080.
> 2019-03-14 11:38:28,613 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
> 2019-03-14 11:38:28,674 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
> 2019-03-14 11:38:28,691 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
> 2019-03-14 11:38:28,694 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-03-14 11:38:28,698 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
> 2019-03-14 11:38:28,700 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
> 2019-03-14 11:38:28,818 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@cluster:22671] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:22671]] Caused by: [cluster]
> 2019-03-14 11:39:09,010 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - http://cluster:8080 was granted leadership with leaderSessionID=bbe408fc-ef93-4328-abeb-85323db7aef7
> 2019-03-14 11:39:09,010 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager was granted leadership with fencing token ae4c0d30d0d65a0c41565360667e48fb
> 2019-03-14 11:39:09,011 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Starting the SlotManager.
> 2019-03-14 11:39:09,012 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher akka.tcp://flink@cluster:31794/user/dispatcher was granted leadership with fencing token c852ada2-5fd4-4ff8-80ab-c2cdd85a75d9
> 2019-03-14 11:39:09,017 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Recovering all persisted jobs.
>
> *Task Manager Logs:*
>
> 2019-03-14 11:42:35,790 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f for spill files.
> 2019-03-14 11:42:35,820 INFO org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration - Messages have a max timeout of 10000 ms
> 2019-03-14 11:42:35,839 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/taskmanager_0 .
> 2019-03-14 11:42:35,853 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-03-14 11:42:35,854 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Start job leader service.
> 2019-03-14 11:42:35,855 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-a7f67948-ab57-4cd9-b2a6-0361b53ecd26
> 2019-03-14 11:42:35,871 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager(ae4c0d30d0d65a0c41565360667e48fb).
> 2019-03-14 11:42:35,963 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@cluster:31794] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:31794]] Caused by: [cluster: Name or service not known]
> 2019-03-14 11:42:35,964 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@cluster:31794/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@cluster:31794/user/resourcemanager..
> 2019-03-14 11:47:35,895 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor - Fatal error occurred in TaskExecutor akka.tcp://fl...@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>     at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2019-03-14 11:47:35,897 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Fatal error occurred while executing the TaskManager. Shutting it down...
> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>     at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2019-03-14 11:47:35,904 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Stopping TaskExecutor akka.tcp://fl...@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
> 2019-03-14 11:47:35,904 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-03-14 11:47:35,904 INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager - Shutting down TaskExecutorLocalStateStoresManager.
> 2019-03-14 11:47:35,908 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager removed spill file directory /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f
> 2019-03-14 11:47:35,908 INFO org.apache.flink.runtime.io.network.NetworkEnvironment - Shutting down the network environment and its components.
> 2019-03-14 11:47:35,914 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful shutdown (took 5 ms).
> 2019-03-14 11:47:35,917 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful shutdown (took 2 ms).
> 2019-03-14 11:47:35,925 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Stop job leader service.
> 2019-03-14 11:47:35,931 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Stopped TaskExecutor akka.tcp://fl...@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
> 2019-03-14 11:47:35,931 INFO org.apache.flink.runtime.blob.PermanentBlobCache - Shutting down BLOB cache
> 2019-03-14 11:47:35,933 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
> 2019-03-14 11:47:35,943 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting
> 2019-03-14 11:47:35,950 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Session: 0x26977a24c4e0018 closed
> 2019-03-14 11:47:35,950 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x26977a24c4e0018
> 2019-03-14 11:47:35,950 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopping Akka RPC service.
> 2019-03-14 11:47:35,952 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
> 2019-03-14 11:47:35,952 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
> 2019-03-14 11:47:35,959 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
> 2019-03-14 11:47:35,966 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
> 2019-03-14 11:47:35,983 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
> 2019-03-14 11:47:35,984 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
> 2019-03-14 11:47:35,992 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopped Akka RPC service.
>
> *From: *Gary Yao <g...@ververica.com>
> *Date: *Thursday, 14 March 2019 at 9:06 PM
> *To: *Harshith Kumar Bolar <hk...@arity.com>
> *Cc: *user <user@flink.apache.org>
> *Subject: *[External] Re: Flink 1.7.2: Task Manager not able to connect to Job Manager
>
> Hi Harshith,
>
> Can you share JM and TM logs?
>
> Best,
> Gary
>
> On Thu, Mar 14, 2019 at 3:42 PM Kumar Bolar, Harshith <hk...@arity.com> wrote:
>
> Hi all,
>
> I'm trying to upgrade our Flink cluster from 1.4.2 to 1.7.2.
>
> When I bring up the cluster, the task managers refuse to connect to the job managers with the following error.
>
> 2019-03-14 10:34:41,551 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@cluster:22671] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:22671]] Caused by: [cluster: Name or service not known]
>
> Now, this works correctly if I add the following line into the /etc/hosts file.
>
> x.x.x.x job-manager-address.com cluster
>
> Why is Flink 1.7.2 connecting to JM using cluster in the address? Flink 1.4.2 used to have the job manager's address instead of the word cluster.
>
> Thanks,
> Harshith
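
For anyone who wants to reproduce the symptom outside of Flink: the TaskManager log above fails on resolving the hostname in the advertised address ("cluster: Name or service not known"). Below is a minimal sketch, assuming Python 3 is available on the TaskManager host, that pulls the host and port out of such an address and checks whether the host resolves. The function names advertised_endpoint and resolves are made up for this illustration and are not part of Flink.

```python
import socket
from typing import Tuple
from urllib.parse import urlsplit


def advertised_endpoint(akka_url: str) -> Tuple[str, int]:
    """Extract (host, port) from an address as it appears in the logs,
    e.g. akka.tcp://flink@cluster:31794/user/resourcemanager."""
    parts = urlsplit(akka_url)
    if parts.hostname is None or parts.port is None:
        raise ValueError(f"no host and port in {akka_url!r}")
    return parts.hostname, parts.port


def resolves(host: str) -> bool:
    """Return True if the local resolver (/etc/hosts, DNS) knows the host."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False
```

Calling resolves() on the host returned by advertised_endpoint("akka.tcp://flink@cluster:31794/user/resourcemanager") on the TaskManager machine should return False until the /etc/hosts workaround is in place. This only confirms the name-resolution symptom; it does not explain why "cluster" is advertised in the first place.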