liuzhuo created FLINK-9072:
------------------------------

             Summary: Host name with "_" causes cluster exception
                 Key: FLINK-9072
                 URL: https://issues.apache.org/jira/browse/FLINK-9072
             Project: Flink
          Issue Type: Bug
          Components: Core
    Affects Versions: 1.3.2
         Environment: linux:
Linux version 3.10.0-693.2.2.el7.x86_64 (buil...@kbuilder.dev.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Sep 12 22:26:13 UTC 2017
Java: 1.8.0_121-b13
Flink: flink-1.3.2-bin-hadoop26-scala_2.11
            Reporter: liuzhuo


In my production environment, starting the cluster fails with the following errors.
{code:java}
2018-03-21 09:50:42,437 ERROR org.apache.flink.runtime.webmonitor.files.StaticFileServerHandler - Caught exception
akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka://flink/deadLetters), Path(/)]
    at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
    at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
    at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:73)
    at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
    at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:120)
    at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
    at scala.concurrent.impl.Promise$DefaultPromise.scala$concurrent$impl$Promise$DefaultPromise$$dispatchOrAddCallback(Promise.scala:280)
    at scala.concurrent.impl.Promise$DefaultPromise.onComplete(Promise.scala:270)
    at akka.actor.ActorSelection.resolveOne(ActorSelection.scala:63)
    at org.apache.flink.runtime.akka.AkkaUtils$.getActorRefFuture(AkkaUtils.scala:498)
    at org.apache.flink.runtime.akka.AkkaUtils.getActorRefFuture(AkkaUtils.scala)
    at org.apache.flink.runtime.webmonitor.JobManagerRetriever.notifyLeaderAddress(JobManagerRetriever.java:141)
    at org.apache.flink.runtime.leaderretrieval.StandaloneLeaderRetrievalService.start(StandaloneLeaderRetrievalService.java:85)
    at org.apache.flink.runtime.webmonitor.WebRuntimeMonitor.start(WebRuntimeMonitor.java:434)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$startJobManagerActors$6.apply(JobManager.scala:2352)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$startJobManagerActors$6.apply(JobManager.scala:2344)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2343)
    at org.apache.flink.runtime.jobmanager.JobManager$.liftedTree3$1(JobManager.scala:2053)
    at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2052)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:2139)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:2172)
    at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2117)
    at org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1992)
    at org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1990)
    at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
    at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1990)
    at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
2018-03-21 09:51:23,993 ERROR org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Resource manager could not register at JobManager
akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka://flink/deadLetters), Path(/)]] after [100000 ms]
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
    at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474)
    at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425)
    at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429)
    at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381)
    at java.lang.Thread.run(Thread.java:748)
{code}
Both errors point at "akka://flink/deadLetters". When I searched for this on Google, most answers blamed a broken network, port 6123 being unavailable, or iptables rules. I ruled out all of these.

Finally, I found the difference between the production environment and the development environment.

In my development environment, the hosts look like this:
192.168.xx.xx master1
192.168.xx.xx slave1
192.168.xx.xx slave2

In the production environment, the hosts look like this:
192.168.xx.xx Flink_master
192.168.xx.xx slaves_01
192.168.xx.xx slaves_02

When I changed the production host names to match the development environment, removing the "_", the cluster went back to normal.

So I suspect that host names containing "_" do not work in a Flink cluster.
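One way to catch this before starting the cluster is to check the host names against the RFC 952 / RFC 1123 rule that each label may contain only letters, digits, and hyphens; an underscore is not permitted. As far as I know, Java's java.net.URI does not return a host for names containing "_", which may be why the actor address ends up at deadLetters, but that is an assumption and has not been verified against the Flink or Akka source. The Python sketch below lists the offending names in hosts-file style text. The function names is_valid_hostname and invalid_hosts are hypothetical and exist only for this illustration.

```python
import re
from typing import List

# One DNS label per RFC 952 / RFC 1123: letters, digits and hyphens only,
# 1 to 63 characters, not starting or ending with a hyphen.
_LABEL = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_valid_hostname(name: str) -> bool:
    """Return True if every dot-separated label of `name` is a valid label."""
    if not name or len(name) > 253:
        return False
    labels = name.rstrip(".").split(".")
    return all(_LABEL.fullmatch(label) is not None for label in labels)


def invalid_hosts(hosts_text: str) -> List[str]:
    """Collect host names from hosts-file style text that fail the check.

    Each non-comment line is assumed to be "<address> <name> [<alias> ...]".
    The address field itself is not validated.
    """
    bad = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        for name in line.split()[1:]:  # skip the address field
            if not is_valid_hostname(name):
                bad.append(name)
    return bad


if __name__ == "__main__":
    # Host entries copied from this report (addresses are masked).
    production_hosts = """
    192.168.xx.xx Flink_master
    192.168.xx.xx slaves_01
    192.168.xx.xx slaves_02
    """
    print(invalid_hosts(production_hosts))
```

Run against the development hosts (master1, slave1, slave2), the same check reports nothing, which matches the observation that only the production names cause the failure.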