Hey Milind, can you additionally also set
metrics.internal.query-service.port to the range? Best, Robert On Fri, Feb 7, 2020 at 8:35 PM Milind Vaidya <kava...@gmail.com> wrote: > I tried setting that option but did not work. > > 2020-02-07 19:28:45,999 INFO > org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - > Registering TaskManager with ResourceID 32fb9e7dcc9d41917bce38a2d5bb0093 > (akka.tcp://flink@ip-1:34718/user/taskmanager_0) at ResourceManager > 2020-02-07 19:28:46,425 INFO > org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - > Registering TaskManager with ResourceID 8c402a9c039d3c33466631510c48b552 > (akka.tcp://flink@ip-2:37120/user/taskmanager_0) at ResourceManager > > I have setting as follows > > taskmanager.rpc.port : 50100-50200 > blob.server.port : 50201-50300 > > So how to control the port for TaskManager ? Inspite above setting the > task managers are being scheduled at ports 34718 and 37120. > > Thanks, > Milind > > > > On Thu, Feb 6, 2020 at 5:25 PM Yang Wang <danrtsey...@gmail.com> wrote: > >> Maybe you forget to limit the blob server port(blob.server.port) to the >> range. >> >> >> Best, >> Yang >> >> Milind Vaidya <kava...@gmail.com> 于2020年2月7日周五 上午7:03写道: >> >>> I figured out that it was problem with the ports. 39493/34094 were not >>> accessible. So to get this working I opened all the ports 0-65535 for the >>> security group. >>> >>> How do I control that if I want to open only certain range of ports ? >>> >>> Is "taskmanager.rpc.port" the right parameter to set ? I did try and >>> set this to certain port range, but did not work. >>> >>> Thanks >>> Milind >>> >>> On Wed, Feb 5, 2020 at 11:22 AM Milind Vaidya <kava...@gmail.com> wrote: >>> >>>> >>>> >>>> >>>> The cluster is set up on AWS with 1 Job manager and 2 task managers. >>>> They all belong to same security group with 6123, 8081, 50100 - 50200 >>>> ports having access granted >>>> >>>> Job manager config is as follows : >>>> >>>> FLINK_PLUGINS_DIR : >>>> /usr/local/flink-1.9.1/plugins >>>> io.tmp.dirs : /tmp/flink >>>> jobmanager.execution.failover-strategy : region >>>> jobmanager.heap.size : 1024m >>>> jobmanager.rpc.address : 10.0.16.10 >>>> jobmanager.rpc.port : 6123 >>>> jobstore.cache-size : 52428800 >>>> jobstore.expiration-time : 3600 >>>> parallelism.default : 4 >>>> slot.idle.timeout : 50000 >>>> slot.request.timeout : 300000 >>>> task.cancellation.interval : 30000 >>>> task.cancellation.timeout : 180000 >>>> task.cancellation.timers.timeout : 7500 >>>> taskmanager.exit-on-fatal-akka-error : false >>>> taskmanager.heap.size : 1024m >>>> taskmanager.network.bind-policy : "ip" >>>> taskmanager.numberOfTaskSlots : 2 >>>> taskmanager.registration.initial-backoff: 500ms >>>> taskmanager.registration.timeout : 5min >>>> taskmanager.rpc.port : 50100-50200 >>>> web.tmpdir : >>>> /tmp/flink-web-74cce811-17c0-411e-9d11-6d91edd2e9b0 >>>> >>>> >>>> >>>> I have summarised the more details in a stack overflow question where >>>> it is easier to put the various details. >>>> >>>> https://stackoverflow.com/questions/60082479/flink-1-9-standalone-cluster-failed-to-transfer-file-from-taskexecutor-id >>>> >>>> On Wed, Feb 5, 2020 at 2:25 AM Robert Metzger <rmetz...@apache.org> >>>> wrote: >>>> >>>>> Hi, >>>>> >>>>> I don't think this is a bug. It looks like the machines can not talk >>>>> to each other. Can you validate that all the machines can talk to each >>>>> other on the ports used by Flink (6123, 8081, ...) >>>>> If that doesn't help: >>>>> - How is the network set up? >>>>> - Are you running physical machines / VMs / containers? >>>>> - Is there a firewall involved? >>>>> >>>>> Best, >>>>> Robert >>>>> >>>>> >>>>> On Fri, Jan 31, 2020 at 7:25 PM Milind Vaidya <kava...@gmail.com> >>>>> wrote: >>>>> >>>>>> Hi >>>>>> >>>>>> I am trying to build a cluster for flink with 1 master and 2 workers. >>>>>> The program is working fine locally. The messages are read from Kafka >>>>>> and just printed on STDOUT. >>>>>> >>>>>> The cluster is successfully created and UI is also shows all config. >>>>>> But the job fails to execute on the cluster. >>>>>> >>>>>> Here are few exceptions I see in the log files >>>>>> >>>>>> File : flink-root-standalonesession >>>>>> >>>>>> 2020-01-29 19:55:00,348 INFO >>>>>> akka.remote.transport.ProtocolStateActor - No >>>>>> response from remote for outbound association. Associate timed out after >>>>>> [20000 ms]. >>>>>> 2020-01-29 19:55:00,350 INFO >>>>>> akka.remote.transport.ProtocolStateActor - No >>>>>> response from remote for outbound association. Associate timed out after >>>>>> [20000 ms]. >>>>>> 2020-01-29 19:55:00,350 WARN akka.remote.ReliableDeliverySupervisor >>>>>> - Association with remote system >>>>>> [akka.tcp://flink-metrics@ip:39493] has failed, address is now gated >>>>>> for [50] ms. Reason: [Association failed with >>>>>> [akka.tcp://flink-metrics@ip:39493]] >>>>>> Caused by: [No response >>>>>> from remote for outbound association. Associate timed out after >>>>>> [20000 ms].] >>>>>> 2020-01-29 19:55:00,350 WARN akka.remote.ReliableDeliverySupervisor >>>>>> - Association with remote system >>>>>> [akka.tcp://flink-metrics@ip:34094] has failed, address is now gated >>>>>> for [50] ms. Reason: [Association failed with >>>>>> [akka.tcp://flink-metrics@ip:34094]] >>>>>> Caused by: [No response f >>>>>> rom remote for outbound association. Associate timed out after [20000 >>>>>> ms].] >>>>>> 2020-01-29 19:55:00,359 WARN >>>>>> akka.remote.transport.netty.NettyTransport - Remote >>>>>> connection to [null] failed with >>>>>> org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException: >>>>>> connection timed out: /ip:39493 >>>>>> 2020-01-29 19:55:00,359 WARN >>>>>> akka.remote.transport.netty.NettyTransport - Remote >>>>>> connection to [null] failed with >>>>>> org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException: >>>>>> connection timed out: /ip:34094 >>>>>> 2020-01-29 19:58:21,880 ERROR >>>>>> org.apache.flink.runtime.rest.handler.taskmanager.TaskManagerLogFileHandler >>>>>> - Failed to transfer file from TaskExecutor >>>>>> a7abe6e294fa3ae4129fd695f7309a36. >>>>>> java.util.concurrent.CompletionException: >>>>>> akka.pattern.AskTimeoutException: Ask timed out on >>>>>> [Actor[akka://flink/user/resourcemanager#5385019]] after [10000 ms]. >>>>>> Message of type >>>>>> [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. >>>>>> A typical reason for `AskTimeoutException` is that the recipient actor >>>>>> didn't send a reply. >>>>>> >>>>>> >>>>>> File : flink-root-client-ip >>>>>> >>>>>> >>>>>> 2020-01-29 19:48:10,566 WARN org.apache.flink.client.cli.CliFrontend >>>>>> - Could not load CLI class >>>>>> org.apache.flink.yarn.cli.FlinkYarnSessionCli. >>>>>> java.lang.NoClassDefFoundError: >>>>>> org/apache/hadoop/yarn/exceptions/YarnException >>>>>> at java.lang.Class.forName0(Native Method) >>>>>> at java.lang.Class.forName(Class.java:264) >>>>>> at >>>>>> org.apache.flink.client.cli.CliFrontend.loadCustomCommandLine(CliFrontend.java:1187) >>>>>> at >>>>>> org.apache.flink.client.cli.CliFrontend.loadCustomCommandLines(CliFrontend.java:1147) >>>>>> at >>>>>> org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1072) >>>>>> Caused by: java.lang.ClassNotFoundException: >>>>>> org.apache.hadoop.yarn.exceptions.YarnException >>>>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:382) >>>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424) >>>>>> at >>>>>> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) >>>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357) >>>>>> ... 5 more >>>>>> 2020-01-29 19:48:10,663 INFO org.apache.flink.core.fs.FileSystem >>>>>> - Hadoop is not in the classpath/dependencies. The >>>>>> extended set of supported File Systems via Hadoop is not available. >>>>>> 2020-01-29 19:48:10,856 INFO >>>>>> org.apache.flink.runtime.security.modules.HadoopModuleFactory - Cannot >>>>>> create Hadoop Security Module because Hadoop cannot be found in the >>>>>> Classpath. >>>>>> 2020-01-29 19:48:10,874 INFO >>>>>> org.apache.flink.runtime.security.SecurityUtils - Cannot >>>>>> install HadoopSecurityContext because Hadoop cannot be found in the >>>>>> Classpath. >>>>>> 2020-01-29 19:48:10,875 INFO org.apache.flink.client.cli.CliFrontend >>>>>> - Running 'run' command. >>>>>> 2020-01-29 19:48:10,881 INFO org.apache.flink.client.cli.CliFrontend >>>>>> - Building program from JAR file >>>>>> 2020-01-29 19:48:10,965 INFO >>>>>> org.apache.flink.configuration.Configuration - Config >>>>>> uses fallback configuration key 'jobmanager.rpc.address' instead of key >>>>>> 'rest.address' >>>>>> 2020-01-29 19:48:11,160 INFO >>>>>> org.apache.flink.runtime.rest.RestClient - Rest >>>>>> client endpoint started. >>>>>> 2020-01-29 19:48:11,163 INFO org.apache.flink.client.cli.CliFrontend >>>>>> - Starting execution of program >>>>>> 2020-01-29 19:48:11,163 INFO >>>>>> org.apache.flink.client.program.rest.RestClusterClient - Starting >>>>>> program in interactive mode (detached: false) >>>>>> 2020-01-29 19:48:11,306 INFO >>>>>> org.apache.flink.configuration.GlobalConfiguration - Loading >>>>>> configuration property: jobmanager.rpc.address, ip >>>>>> 2020-01-29 19:48:11,306 INFO >>>>>> org.apache.flink.configuration.GlobalConfiguration - Loading >>>>>> configuration property: jobmanager.rpc.port, 6123 >>>>>> 2020-01-29 19:48:11,307 INFO >>>>>> org.apache.flink.configuration.GlobalConfiguration - Loading >>>>>> configuration property: jobmanager.heap.size, 1024m >>>>>> 2020-01-29 19:48:11,307 INFO >>>>>> org.apache.flink.configuration.GlobalConfiguration - Loading >>>>>> configuration property: taskmanager.heap.size, 1024m >>>>>> 2020-01-29 19:48:11,307 INFO >>>>>> org.apache.flink.configuration.GlobalConfiguration - Loading >>>>>> configuration property: taskmanager.numberOfTaskSlots, 2 >>>>>> 2020-01-29 19:48:11,307 INFO >>>>>> org.apache.flink.configuration.GlobalConfiguration - Loading >>>>>> configuration property: parallelism.default, 4 >>>>>> 2020-01-29 19:48:11,307 INFO >>>>>> org.apache.flink.configuration.GlobalConfiguration - Loading >>>>>> configuration property: jobmanager.execution.failover-strategy, region >>>>>> 2020-01-29 19:48:11,307 INFO >>>>>> org.apache.flink.configuration.GlobalConfiguration - Loading >>>>>> configuration property: io.tmp.dirs, /tmp/flink >>>>>> 2020-01-29 19:48:11,311 INFO >>>>>> org.apache.flink.client.program.rest.RestClusterClient - >>>>>> Submitting >>>>>> job 4f4cce35db3f37cae310f272ec88a303 (detached: false). >>>>>> 2020-01-29 20:05:13,170 INFO >>>>>> org.apache.flink.runtime.rest.RestClient - Shutting >>>>>> down rest endpoint. >>>>>> 2020-01-29 20:05:13,172 INFO >>>>>> org.apache.flink.runtime.rest.RestClient - Rest >>>>>> endpoint shutdown complete. >>>>>> 2020-01-29 20:05:13,172 ERROR org.apache.flink.client.cli.CliFrontend >>>>>> - Error while running the command. >>>>>> org.apache.flink.client.program.ProgramInvocationException: Job >>>>>> failed. (JobID: 4f4cce35db3f37cae310f272ec88a303) >>>>>> at >>>>>> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:262) >>>>>> at >>>>>> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338) >>>>>> at >>>>>> org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:60) >>>>>> at >>>>>> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1507) >>>>>> at >>>>>> com.saavn.flink.SongCountStreamingJob.main(SongCountStreamingJob.java:79) >>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>>>>> at >>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) >>>>>> at >>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>>>>> at java.lang.reflect.Method.invoke(Method.java:498) >>>>>> at >>>>>> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:576) >>>>>> at >>>>>> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:438) >>>>>> at >>>>>> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274) >>>>>> at >>>>>> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746) >>>>>> at >>>>>> org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273) >>>>>> at >>>>>> org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205) >>>>>> at >>>>>> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1010) >>>>>> at >>>>>> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1083) >>>>>> at >>>>>> org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30) >>>>>> at >>>>>> org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1083) >>>>>> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job >>>>>> execution failed. >>>>>> at >>>>>> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146) >>>>>> at >>>>>> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:259) >>>>>> ... 18 more >>>>>> Caused by: java.util.concurrent.TimeoutException: Heartbeat of >>>>>> TaskManager with id a7abe6e294fa3ae4129fd695f7309a36 timed out. >>>>>> at >>>>>> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1149) >>>>>> at >>>>>> org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:109) >>>>>> at >>>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) >>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:266) >>>>>> at >>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397) >>>>>> at >>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190) >>>>>> at >>>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) >>>>>> at >>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) >>>>>> at >>>>>> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) >>>>>> at >>>>>> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) >>>>>> at >>>>>> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) >>>>>> at >>>>>> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) >>>>>> >>>>>> >>>>>> Flink version : flink-1.9.1 >>>>>> OS : CentOS Linux release 7.6.1810 (Core) >>>>>> >>>>>> Is this related to this issue : >>>>>> https://issues.apache.org/jira/browse/FLINK-11143 >>>>>> >>>>>> Can somebody throw some light on this ? >>>>>> >>>>>> >>>>>> >>>>>>