Hi, I am trying to build a Flink cluster with 1 master and 2 workers. The program works fine locally: messages are read from Kafka and simply printed to STDOUT.
The cluster is created successfully and the UI shows the full configuration, but the job fails to execute on the cluster. Here are a few exceptions I see in the log files.

File: flink-root-standalonesession

2020-01-29 19:55:00,348 INFO akka.remote.transport.ProtocolStateActor - No response from remote for outbound association. Associate timed out after [20000 ms].
2020-01-29 19:55:00,350 INFO akka.remote.transport.ProtocolStateActor - No response from remote for outbound association. Associate timed out after [20000 ms].
2020-01-29 19:55:00,350 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink-metrics@ip:39493] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@ip:39493]] Caused by: [No response from remote for outbound association. Associate timed out after [20000 ms].]
2020-01-29 19:55:00,350 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink-metrics@ip:34094] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@ip:34094]] Caused by: [No response from remote for outbound association. Associate timed out after [20000 ms].]
2020-01-29 19:55:00,359 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException: connection timed out: /ip:39493
2020-01-29 19:55:00,359 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException: connection timed out: /ip:34094
2020-01-29 19:58:21,880 ERROR org.apache.flink.runtime.rest.handler.taskmanager.TaskManagerLogFileHandler - Failed to transfer file from TaskExecutor a7abe6e294fa3ae4129fd695f7309a36.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/resourcemanager#5385019]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.

File: flink-root-client-ip

2020-01-29 19:48:10,566 WARN org.apache.flink.client.cli.CliFrontend - Could not load CLI class org.apache.flink.yarn.cli.FlinkYarnSessionCli.
java.lang.NoClassDefFoundError: org/apache/hadoop/yarn/exceptions/YarnException
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
    at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLine(CliFrontend.java:1187)
    at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLines(CliFrontend.java:1147)
    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1072)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.yarn.exceptions.YarnException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 5 more
2020-01-29 19:48:10,663 INFO org.apache.flink.core.fs.FileSystem - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2020-01-29 19:48:10,856 INFO org.apache.flink.runtime.security.modules.HadoopModuleFactory - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2020-01-29 19:48:10,874 INFO org.apache.flink.runtime.security.SecurityUtils - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2020-01-29 19:48:10,875 INFO org.apache.flink.client.cli.CliFrontend - Running 'run' command.
2020-01-29 19:48:10,881 INFO org.apache.flink.client.cli.CliFrontend - Building program from JAR file
2020-01-29 19:48:10,965 INFO org.apache.flink.configuration.Configuration - Config uses fallback configuration key 'jobmanager.rpc.address' instead of key 'rest.address'
2020-01-29 19:48:11,160 INFO org.apache.flink.runtime.rest.RestClient - Rest client endpoint started.
2020-01-29 19:48:11,163 INFO org.apache.flink.client.cli.CliFrontend - Starting execution of program
2020-01-29 19:48:11,163 INFO org.apache.flink.client.program.rest.RestClusterClient - Starting program in interactive mode (detached: false)
2020-01-29 19:48:11,306 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, ip
2020-01-29 19:48:11,306 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2020-01-29 19:48:11,307 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 1024m
2020-01-29 19:48:11,307 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 1024m
2020-01-29 19:48:11,307 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 2
2020-01-29 19:48:11,307 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 4
2020-01-29 19:48:11,307 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.execution.failover-strategy, region
2020-01-29 19:48:11,307 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: io.tmp.dirs, /tmp/flink
2020-01-29 19:48:11,311 INFO org.apache.flink.client.program.rest.RestClusterClient - Submitting job 4f4cce35db3f37cae310f272ec88a303 (detached: false).
2020-01-29 20:05:13,170 INFO org.apache.flink.runtime.rest.RestClient - Shutting down rest endpoint.
2020-01-29 20:05:13,172 INFO org.apache.flink.runtime.rest.RestClient - Rest endpoint shutdown complete.
2020-01-29 20:05:13,172 ERROR org.apache.flink.client.cli.CliFrontend - Error while running the command.
org.apache.flink.client.program.ProgramInvocationException: Job failed. (JobID: 4f4cce35db3f37cae310f272ec88a303)
    at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:262)
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
    at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:60)
    at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1507)
    at com.saavn.flink.SongCountStreamingJob.main(SongCountStreamingJob.java:79)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:576)
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:438)
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274)
    at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746)
    at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273)
    at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205)
    at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1010)
    at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1083)
    at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1083)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
    at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
    at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:259)
    ... 18 more
Caused by: java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id a7abe6e294fa3ae4129fd695f7309a36 timed out.
    at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1149)
    at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:109)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)

Flink version: flink-1.9.1
OS: CentOS Linux release 7.6.1810 (Core)

Is this related to this issue: https://issues.apache.org/jira/browse/FLINK-11143 ?

Can somebody shed some light on this?