This looks like the reason: java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
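The exception means the TaskManager could not turn the configured jobmanager.rpc.address into an IP address. A minimal diagnostic sketch in Java can repeat that lookup outside Flink, run on the worker. It rests on several assumptions. The class and helper names are hypothetical and not part of Flink. It reads flink-conf.yaml with a simplified `key: value` reader rather than Flink's own loader. It uses plain `InetAddress.getByName`, which is presumably similar to the lookup in the stack trace but is not guaranteed to match it.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/**
 * Hypothetical diagnostic helper (not part of Flink): reads jobmanager.rpc.address
 * from flink-conf.yaml and checks whether this machine can resolve it.
 */
public class CheckJobManagerAddress {

    /**
     * Simplified "key: value" reader (an assumption, not Flink's own parser).
     * Skips comment lines and returns the last value found for the key, or null.
     */
    static String rpcAddress(List<String> lines) {
        String value = null;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.startsWith("#")) {
                continue; // commented-out settings are ignored
            }
            int colon = trimmed.indexOf(':');
            if (colon > 0 && trimmed.substring(0, colon).trim().equals("jobmanager.rpc.address")) {
                value = trimmed.substring(colon + 1).trim();
            }
        }
        return value;
    }

    /** Returns the resolved IP address as a string, or null if the name cannot be resolved. */
    static String resolve(String hostname) {
        try {
            return InetAddress.getByName(hostname).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to conf/flink-conf.yaml relative to the working directory
        // (e.g. when run from the Flink home directory); pass another path as the first argument.
        String confPath = args.length > 0 ? args[0] : "conf/flink-conf.yaml";
        List<String> lines = Files.readAllLines(Paths.get(confPath), StandardCharsets.UTF_8);

        String host = rpcAddress(lines);
        if (host == null || host.isEmpty()) {
            System.out.println("jobmanager.rpc.address is not set in " + confPath);
            return;
        }
        String ip = resolve(host);
        if (ip == null) {
            System.out.println("Cannot resolve '" + host + "' on this machine");
        } else {
            System.out.println(host + " resolves to " + ip);
        }
    }
}
```

If the name does not resolve on the worker, the usual remedies are adding the master to /etc/hosts on the worker, fixing DNS, or using the master's IP address as jobmanager.rpc.address.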
On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <[email protected]> wrote:
> Hello,
>
> The log file of the TaskManager now shows the following:
>
> 18:27:10,082 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 18:27:10,244 INFO org.apache.flink.runtime.taskmanager.TaskManager - --------------------------------------------------------------------------------
> 18:27:10,244 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager (Version: 0.10.1, Rev:2e9b231, Date:22.11.2015 @ 12:41:12 CET)
> 18:27:10,244 INFO org.apache.flink.runtime.taskmanager.TaskManager - Current user: flink
> 18:27:10,245 INFO org.apache.flink.runtime.taskmanager.TaskManager - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01
> 18:27:10,245 INFO org.apache.flink.runtime.taskmanager.TaskManager - Maximum heap size: 491 MiBytes
> 18:27:10,245 INFO org.apache.flink.runtime.taskmanager.TaskManager - JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
> 18:27:10,247 INFO org.apache.flink.runtime.taskmanager.TaskManager - Hadoop version: 2.7.0
> 18:27:10,247 INFO org.apache.flink.runtime.taskmanager.TaskManager - JVM Options:
> 18:27:10,247 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Xms512M
> 18:27:10,247 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Xmx512M
> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:MaxDirectMemorySize=8388607T
> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:MaxPermSize=256m
> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - Program Arguments:
> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - --configDir
> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - /home/flink/flink-0.10.1/conf
> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - --streamingMode
> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - batch
> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - Classpath: /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - --------------------------------------------------------------------------------
> 18:27:10,252 INFO org.apache.flink.runtime.taskmanager.TaskManager - Maximum number of open file descriptors is 4096
> 18:27:10,277 INFO org.apache.flink.runtime.taskmanager.TaskManager - Loading configuration from /home/flink/flink-0.10.1/conf
> 18:27:10,356 INFO org.apache.flink.runtime.taskmanager.TaskManager - Security is not enabled. Starting non-authenticated TaskManager.
> 18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager - Failed to run TaskManager.
> java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
>     at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
>     at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
>     at org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
>     at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
>     at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
>     at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
>     at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)
>
> Kind Regards,
> Ravinder Kaur
>
> On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <[email protected]> wrote:
>
>> What do the TaskManager logs say?
>>
>> On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> Thanks for the quick reply. I tried setting jobmanager.rpc.address in flink-conf.yaml to the hostname of the master node on both nodes.
>>>
>>> Now the TaskManager does not start on the worker node at all. When I start the cluster using ./bin/start-cluster.sh on the master, it shows the normal output of starting the JobManager and TaskManager, but when I run jps on the nodes, the TaskManager is not running on the slave.
>>>
>>> Running the WordCount example again fails with the same error. Stopping the cluster says there is no TaskManager to stop.
>>>
>>> Kind Regards,
>>> Ravinder Kaur
>>>
>>> On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <[email protected]> wrote:
>>>
>>>> Looks like the network configuration is not correct.
>>>>
>>>> I would try setting the full host name (like "master.abc.xyz.com") as jobmanager.rpc.address.
>>>>
>>>> Greetings,
>>>> Stephan
>>>>
>>>>
>>>> On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <[email protected]> wrote:
>>>>
>>>>>
>>>>> Hello Community,
>>>>>
>>>>> I'm a student and new to Apache Flink. I'm trying to learn it and have set up a 2-node standalone Flink (0.10.1) cluster (one master and one worker). I'm facing the following issue.
>>>>>
>>>>> Cluster: consists of 2 VMs (one master and one worker)
>>>>>
>>>>> The configuration was done as per https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/cluster_setup.html
>>>>>
>>>>> When I start the cluster, the JobManager and the TaskManager are started on the master and worker respectively.
>>>>>
>>>>> Command to start the cluster: bin/start-cluster.sh
>>>>>
>>>>> jps shows all the processes running.
>>>>>
>>>>> Then I run the following command to run the WordCount example job: ./bin/flink run ./examples/WordCount.jar
>>>>>
>>>>> The result is attached to this mail.
>>>>>
>>>>> The error is org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job ....................... Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
>>>>>
>>>>> Therefore I suspected that the JobManager does not find the TaskManager, so I checked the TaskManager logs. They indeed show that the TaskManager is unable to register at the JobManager for quite a long time. The TaskManager log contains messages such as
>>>>>
>>>>>   org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from localhost: Connect timed out
>>>>>   org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from address localhost: Network is Unreachable
>>>>>
>>>>> When the TaskManager finally starts up after a number of attempts, it tries to register at the JobManager, which also fails after many attempts with the following messages:
>>>>>
>>>>>   org.apache.flink.runtime.taskmanager.TaskManager: Trying to register at JobManager akka.tcp://flink@master:6123/user/jobmanager (attempt:92, timeout:30seconds)
>>>>>   org.apache.flink.runtime.taskmanager.TaskManager: Tried to associate with unreachable remote host [akka.tcp://flink@master:6123/user/jobmanager]. Address is now gated for 5000ms, all messages to this address will be delivered to dead letters. Reason: Connection timed out: /master:6123
>>>>>
>>>>> I searched the internet for these messages and found http://stackoverflow.com/questions/33601020/flink-job-wont-run-with-higher-taskmanager-heap-mb and https://issues.apache.org/jira/browse/FLINK-1119 helpful. Stephan Ewen, who provided the solution in both, explains well that the TaskManagers take quite some time to register at the JobManager, so I waited as long as 20 minutes after starting the cluster before running the job. But even after waiting that long, I get the same error.
>>>>>
>>>>> Another suggestion was to run the cluster in streaming mode, so I tried the command bin/start-cluster-streaming.sh and ran the job, but I get the same error.
>>>>>
>>>>> I re-checked all the configurations but could not find anything wrong. I also checked port 6123 on the master, which is in LISTEN state, while the TCP connection from the worker to the master shows SYN_SENT state (checked with the netstat -na and lsof -i commands).
>>>>>
>>>>> I opened the web page on the master at http://localhost:8081, but it shows nothing, and localhost:8080 says connection refused.
>>>>>
>>>>> Kindly help me out, as this is very important for me. Let me know if you have any questions.
>>>>>
>>>>> Kind Regards,
>>>>> Ravinder Kaur
>>>>>
>>>>>
>>>>
>>>
>>
>
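The original question notes that port 6123 is in LISTEN state on the master while the worker's connection stays in SYN_SENT. A connection stuck in SYN_SENT typically means the SYN is dropped or not routed on the way, for example by a firewall or cloud security group, rather than refused by the master. A hypothetical Java probe, run on the worker against the master's address, can tell a timeout apart from an outright connection failure. The class name, the default host "master", and the 5-second timeout are assumptions for illustration; none of this is a Flink tool.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

/**
 * Hypothetical TCP probe (not part of Flink): attempts one connection to the
 * JobManager RPC port and reports how the attempt ended.
 */
public class ProbeJobManagerPort {

    static String probe(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return "OPEN";
        } catch (SocketTimeoutException e) {
            // No SYN-ACK within the timeout: consistent with a connection stuck in SYN_SENT.
            return "TIMEOUT";
        } catch (ConnectException e) {
            // The connect call failed outright, e.g. "Connection refused" (nothing listening
            // on that address and port) or "Network is unreachable".
            return "CONNECT_FAILED: " + e.getMessage();
        } catch (IOException e) {
            // Anything else, e.g. an UnknownHostException for an unresolvable hostname.
            return "ERROR: " + e;
        }
    }

    public static void main(String[] args) {
        // Host and port default to the values seen in the thread; pass the real ones as arguments.
        String host = args.length > 0 ? args[0] : "master";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 6123;
        System.out.println(host + ":" + port + " -> " + probe(host, port, 5000));
    }
}
```

A TIMEOUT from the worker while the port is in LISTEN state on the master would point at the network path between the VMs rather than at the Flink configuration.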
