Re: TaskManager unable to register with JobManager

Stephan Ewen Wed, 03 Feb 2016 13:10:40 -0800

Can machines connect to port 6123? The firewall may block that port, put
permit SSH.


On Wed, Feb 3, 2016 at 9:52 PM, Ravinder Kaur <[email protected]> wrote:

> Hello,
>
> Here is the log file of Jobmanager. I did not see some thing suspicious
> and as it suggests the ports are also listening.
>
> 20:58:46,906 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Starting JobManager on IP-of-master:6123 with execution mode
> CLUSTER and streaming mode BATCH_ONLY
> 20:58:46,978 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Security is not enabled. Starting non-authenticated JobManager.
> 20:58:46,979 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Starting JobManager
> 20:58:46,980 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Starting JobManager actor system at 10.155.208.138:6123
> 20:58:48,196 INFO  akka.event.slf4j.Slf4jLogger
>        - Slf4jLogger started
> 20:58:48,295 INFO  Remoting
>        - Starting remoting
> 20:58:48,541 INFO  Remoting
>        - Remoting started; listening on addresses
> :[akka.tcp://flink@IP-of-master:6123]
> 20:58:48,549 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Starting JobManger web frontend
> 20:58:48,690 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>       - Using directory /tmp/flink-web-876a4755-4f38-4ff7-8202-f263afa9b986
> for the web interface files
> 20:58:48,691 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>       - Serving job manager log from
> /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.log
> 20:58:48,691 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>       - Serving job manager stdout from
> /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.out
> 20:58:49,044 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>       - Web frontend listening at 0:0:0:0:0:0:0:0:8081
> 20:58:49,045 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Starting JobManager actor
> 20:58:49,052 INFO  org.apache.flink.runtime.blob.BlobServer
>        - Created BLOB server storage directory
> /tmp/blobStore-e0c52bfb-2411-4a83-ac8d-5664a5894258
> 20:58:49,054 INFO  org.apache.flink.runtime.blob.BlobServer
>        - Started BLOB server at 0.0.0.0:43683 - max concurrent requests:
> 50 - max backlog: 1000
> 20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist
>       - Started memory archivist akka://flink/user/archive
> 20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Starting JobManager at akka.tcp://flink@IP-of-master
> :6123/user/jobmanager.
> 20:58:49,081 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager
> was granted leadership with leader session ID None.
> 20:58:49,082 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>       - Starting with JobManager 
> akka.tcp://flink@IP-of-master:6123/user/jobmanager
> on port 8081
> 20:58:49,083 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever
>       - New leader reachable under akka.tcp://flink@IP-of-master
> :6123/user/jobmanager:null.
> 20:59:22,794 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Submitting job 72733d69588678ec224003ab5577cab8 (Flink Java Job at
> Wed Feb 03 20:59:22 CET 2016).
> 20:59:22,853 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Scheduling job 72733d69588678ec224003ab5577cab8 (Flink Java Job at
> Wed Feb 03 20:59:22 CET 2016).
> 20:59:22,857 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at
> Wed Feb 03 20:59:22 CET 2016) changed to RUNNING.
> 20:59:22,859 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>        - CHAIN DataSource (at
> getDefaultTextLineDataSet(WordCountData.java:70)
> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap
> at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)
> (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from CREATED to SCHEDULED
> 20:59:22,881 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>        - CHAIN DataSource (at
> getDefaultTextLineDataSet(WordCountData.java:70)
> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap
> at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)
> (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from SCHEDULED to CANCELED
> 20:59:22,881 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at
> Wed Feb 03 20:59:22 CET 2016) changed to FAILING.
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Not enough free slots available to run the job. You can decrease the
> operator parallelism or increase the number of slots per TaskManager in the
> configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at
> getDefaultTextLineDataSet(WordCountData.java:70)
> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap
> at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)
> (1/1)) @ (unassigned) - [SCHEDULED] > with groupID <
> 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup
> [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce,
> 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler:
> Number of instances=0, total number of slots=0, available slots=0
>         at
> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
>         at
> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
>         at
> org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
>         at
> org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
>         at
> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
>         at
> org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
>         at
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
>         at
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>         at
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>         at
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>         at
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>         at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
>         at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
>         at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 20:59:22,886 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>        - CHAIN Reduce (SUM(1), at main(WordCount.java:72) -> FlatMap
> (collect()) (1/1) (824b6e3771304cd0f92aea4ab763a11d) switched from CREATED
> to CANCELED
> 20:59:22,887 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>        - DataSink (collect() sink) (1/1) (1bb64a2edc6f68ad716acd9f8d2d7d67)
> switched from CREATED to CANCELED
> 20:59:22,890 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at
> Wed Feb 03 20:59:22 CET 2016) changed to FAILED.
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Not enough free slots available to run the job. You can decrease the
> operator parallelism or increase the number of slots per TaskManager in the
> configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at
> getDefaultTextLineDataSet(WordCountData.java:70)
> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap
> at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)
> (1/1)) @ (unassigned) - [SCHEDULED] > with groupID <
> 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup
> [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce,
> 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler:
> Number of instances=0, total number of slots=0, available slots=0
>         at
> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
>         at
> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
>         at
> org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
>         at
> org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
>         at
> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
>         at
> org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
>         at
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
>         at
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>         at
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>         at
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>         at
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>
>
> On Wed, Feb 3, 2016 at 9:27 PM, Robert Metzger <[email protected]>
> wrote:
>
>> Hi,
>>
>> the TaskManager is starting up, but its not able to register at the job
>> manager. Did you check the JobManager log? Do you see anything suspicious
>> there? Are the ports matching?
>>
>>
>> On Wed, Feb 3, 2016 at 9:23 PM, Ravinder Kaur <[email protected]>
>> wrote:
>>
>>> Hello,
>>>
>>> Thank you for pointing it out. I had a little typo while I edited the
>>> hostname in flink-conf.yaml. I've reset it and the TaskManager started up.
>>> But I still can't run the WordCount example and it throws the same
>>> NoResourceAvaliableException.
>>>
>>> Caused by:
>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableExce
>>>
>>>      ption: Not enough free slots available to run the job. You can
>>> decrease the oper
>>>                              ator parallelism or increase the number of
>>> slots per TaskManager in the configur
>>>                                                  ation. Task to schedule: <
>>> Attempt #0 (CHAIN DataSource (at getDefaultTextLineDa
>>>
>>>  taSet(WordCountData.java:70)
>>> (org.apache.flink.api.java.io.CollectionInputFormat
>>>                                                                )) ->
>>> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(Wo
>>>
>>>            rdCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with
>>> groupID < 31e497f2f6
>>>                                  8c9cee5864c8fddaff3d59 > in sharing group
>>> < SlotSharingGroup [f9ed1aab933e061a8c
>>>                                                    e1ecaa3534f18c,
>>> 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d
>>>
>>>  59] >. Resources available to scheduler: Number of instances=0, total
>>> number of
>>>                       slots=0, available slots=0
>>>         at
>>> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(
>>>
>>>      Scheduler.java:256)
>>>         at
>>> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmed
>>>
>>>      iately(Scheduler.java:131)
>>>         at
>>> org.apache.flink.runtime.executiongraph.Execution.scheduleForExecutio
>>>
>>>      n(Execution.java:298)
>>>         at
>>> org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForEx
>>>
>>>      ecution(ExecutionVertex.java:458)
>>>         at
>>> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAl
>>>
>>>      l(ExecutionJobVertex.java:322)
>>>         at
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExe
>>>
>>>      cution(ExecutionGraph.java:679)
>>>         at
>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$fl
>>>
>>>
>>>  
>>> ink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982
>>>
>>>            )
>>>         at
>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$fl
>>>
>>>
>>>  ink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>         at
>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$fl
>>>
>>>
>>>  ink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>         ... 8 more
>>>
>>> The log of TaskManager again has the same errors as before.
>>>
>>> 20:58:58,457 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>>          - Failed to connect from address '/slave-IP': connect timed out
>>> 20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>>          - Failed to connect from address '/0:0:0:0:0:0:0:1%1': Network is
>>> unreachable
>>> 20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>>          - Failed to connect from address '/127.0.0.1': Invalid argument
>>> 20:58:59,048 WARN  org.apache.flink.runtime.net.ConnectionUtils
>>>          - Could not connect to /master-IP:6123. Selecting a local address
>>> using heuristics.
>>> 20:58:59,050 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          - TaskManager will use hostname/address 'hostname-of-slave'
>>> (slave-IP) for communication.
>>> 20:58:59,051 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          - Starting TaskManager in streaming mode BATCH_ONLY
>>> 20:58:59,052 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          - Starting TaskManager actor system at slave_IP:0
>>> 20:58:59,776 INFO  akka.event.slf4j.Slf4jLogger
>>>          - Slf4jLogger started
>>> 20:58:59,842 INFO  Remoting
>>>          - Starting remoting
>>> 20:59:00,094 INFO  Remoting
>>>          - Remoting started; listening on addresses
>>> :[akka.tcp://flink@slave-IP:33813]
>>> 20:59:00,100 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          - Starting TaskManager actor
>>> 20:59:00,125 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig
>>>         - NettyConfig [server address: hostname-of-master/master-IP, server
>>> port: 49030, memory segment size (bytes): 32768, transport type: NIO,
>>> number of server threads: 0 (use Netty's default), number of client
>>> threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's
>>> default), client connect timeout (sec): 120, send/receive buffer size
>>> (bytes): 0 (use Netty's default)]
>>> 20:59:00,131 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          - Messages between TaskManager and JobManager have a max timeout
>>> of 100000 milliseconds
>>> 20:59:00,142 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          - Temporary file directory '/tmp': total 4 GB, usable 1 GB (25.00%
>>> usable)
>>> 20:59:00,210 INFO
>>>  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated
>>> 64 MB for network buffer pool (number of memory segments: 2048, bytes per
>>> segment: 32768).
>>> 20:59:00,323 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          - Using 0.7 of the currently free heap space for Flink managed
>>> heap memory (293 MB).
>>> 20:59:00,565 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager
>>>          - I/O manager uses directory
>>> /tmp/flink-io-c7796b82-6676-4604-97fd-df09001a84e8 for spill files.
>>> 20:59:00,578 INFO  org.apache.flink.runtime.filecache.FileCache
>>>          - User file cache uses directory
>>> /tmp/flink-dist-cache-13ed3e76-cf1e-46fa-9ba2-5177e801429e
>>> 20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          - Starting TaskManager actor at
>>> akka://flink/user/taskmanager#-157676733.
>>> 20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          - TaskManager data connection information: hostname-of-master
>>> (dataPort=49030)
>>> 20:59:00,909 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          - TaskManager has 1 task slot(s).
>>> 20:59:00,910 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          - Memory usage stats: [HEAP: 376/491/491 MB, NON HEAP: 24/49/304
>>> MB (used/committed/max)]
>>> 20:59:00,917 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          - Trying to register at JobManager 
>>> akka.tcp://flink@master-IP:6123/user/jobmanager
>>> (attempt 1, timeout: 500 milliseconds)
>>> 20:59:01,443 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          - Trying to register at JobManager 
>>> akka.tcp://flink@master-IP:6123/user/jobmanager
>>> (attempt 2, timeout: 1000 milliseconds)
>>> 20:59:02,873 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          - Trying to register at JobManager 
>>> akka.tcp://flink@master-IP:6123/user/jobmanager
>>> (attempt 3, timeout: 2000 milliseconds)
>>> 20:59:04,893 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          - Trying to register at JobManager 
>>> akka.tcp://flink@master-IP:6123/user/jobmanager
>>> (attempt 4, timeout: 4000 milliseconds)
>>> 20:59:08,914 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          - Trying to register at JobManager 
>>> akka.tcp://flink@master-IP:6123/user/jobmanager
>>> (attempt 5, timeout: 8000 milliseconds)
>>>
>>>
>>> Kind Regards,
>>> Ravinder Kaur
>>>
>>> On Wed, Feb 3, 2016 at 8:12 PM, Stephan Ewen <[email protected]> wrote:
>>>
>>>> This looks like the reason:
>>>>
>>>> java.net.UnknownHostException: Cannot resolve the JobManager hostname
>>>> 'hostname-of-master' specified in the configuration
>>>>
>>>> On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <[email protected]>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> The log file of the Taskmanager now shows the following
>>>>>
>>>>> 18:27:10,082 WARN  org.apache.hadoop.util.NativeCodeLoader
>>>>>           - Unable to load native-hadoop library for your platform... 
>>>>> using
>>>>> builtin-java classes where applicable
>>>>> 18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -
>>>>> --------------------------------------------------------------------------------
>>>>> 18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231,
>>>>> Date:22.11.2015 @ 12:41:12 CET)
>>>>> 18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -  Current user: flink
>>>>> 18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation -
>>>>> 1.7/24.91-b01
>>>>> 18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -  Maximum heap size: 491 MiBytes
>>>>> 18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
>>>>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -  Hadoop version: 2.7.0
>>>>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -  JVM Options:
>>>>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -     -Xms512M
>>>>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -     -Xmx512M
>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -     -XX:MaxDirectMemorySize=8388607T
>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -     -XX:MaxPermSize=256m
>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -
>>>>> -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -
>>>>> -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -
>>>>> -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -  Program Arguments:
>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -     --configDir
>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -     /home/flink/flink-0.10.1/conf
>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -     --streamingMode
>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -     batch
>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -  Classpath:
>>>>> /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            -
>>>>> --------------------------------------------------------------------------------
>>>>> 18:27:10,252 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            - Maximum number of open file descriptors is 4096
>>>>> 18:27:10,277 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            - Loading configuration from /home/flink/flink-0.10.1/conf
>>>>> 18:27:10,356 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            - Security is not enabled. Starting non-authenticated
>>>>> TaskManager.
>>>>> 18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager
>>>>>            - Failed to run TaskManager.
>>>>> java.net.UnknownHostException: Cannot resolve the JobManager hostname
>>>>> 'hostname-of-master' specified in the configuration
>>>>>         at
>>>>> org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
>>>>>         at
>>>>> org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
>>>>>         at
>>>>> org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
>>>>>         at
>>>>> org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
>>>>>         at
>>>>> org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
>>>>>         at
>>>>> org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
>>>>>         at
>>>>> org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)
>>>>>
>>>>> Kind Regards,
>>>>> Ravinder Kaur
>>>>>
>>>>> On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <[email protected]> wrote:
>>>>>
>>>>>> What do the TaskManger logs say?
>>>>>>
>>>>>> On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Thanks for the quick reply. I tried to set jobmanager.rpc.address in
>>>>>>> flink-conf.yaml to the hostname of master node on both the nodes.
>>>>>>>
>>>>>>> Now it does not start the Taskmanager at the worker node at all.
>>>>>>> When I start the cluster using ./bin/start-cluster.sh on master it shows
>>>>>>> the normal output of starting the Jobmanager and Taskmanager but when I 
>>>>>>> run
>>>>>>> jps on the nodes the slave does not have the Taskmanager running.
>>>>>>>
>>>>>>> Running the WordCount example again fails showing the same error.
>>>>>>> Stopping the cluster says no taskmanager to stop.
>>>>>>>
>>>>>>> Kind Regards,
>>>>>>> Ravinder Kaur
>>>>>>>
>>>>>>> On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Looks like the network configuration is not correct.
>>>>>>>>
>>>>>>>> I would try setting the full host name (like "master.abc.xyz.com")
>>>>>>>> as jobmanager.rpc.address.
>>>>>>>>
>>>>>>>> Greetings,
>>>>>>>> Stephan
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hello Community,
>>>>>>>>>
>>>>>>>>> I'm a student and new to Apache Flink. I'm trying to learn and
>>>>>>>>> have setup a 2- node standalone Flink(0.10.1) cluster (one master and 
>>>>>>>>> one
>>>>>>>>> worker). I'm facing the following issue.
>>>>>>>>>
>>>>>>>>> Cluster: consists of 2 vms (one master and one worker)
>>>>>>>>>
>>>>>>>>> The configurations are done as per
>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/cluster_setup.html
>>>>>>>>>
>>>>>>>>> When I start the cluster both the JobManager and the TaskManager
>>>>>>>>> are started on the master and worker respectively.
>>>>>>>>>
>>>>>>>>> Command to start the cluster : bin/start-cluster.sh
>>>>>>>>>
>>>>>>>>> JPS shows all the processes running.
>>>>>>>>>
>>>>>>>>> Then I run the following command to run a WordCount example job: 
>>>>>>>>> ./bin/flink
>>>>>>>>> run ./examples/WordCount.jar
>>>>>>>>>
>>>>>>>>> the result is attached with the mail.
>>>>>>>>>
>>>>>>>>> The error is
>>>>>>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailabeException:
>>>>>>>>> Not enough free slots available to run to run the job
>>>>>>>>> ....................... Resources available to scheduler: Number of
>>>>>>>>> instances=0, total number of slots= 0, available slots=0
>>>>>>>>>
>>>>>>>>> Therefore I suppose that the JobManager does not find the
>>>>>>>>> TaskManager and checked the logs of the TaskManager which indeed 
>>>>>>>>> shows that
>>>>>>>>> the TaskManager is unable to register at the JobManager for quite a 
>>>>>>>>> long
>>>>>>>>> time. There are org.apache.flink.runtime.net.ConnectionUtils:
>>>>>>>>> Failed to connect from localhost: Connect timed out and 
>>>>>>>>> org.apache.flink.runtime.net.ConnectionUtils:
>>>>>>>>> Failed to connect from address localhost: Network is Unreachable 
>>>>>>>>> messages
>>>>>>>>> in the log of the TaskManager. Later when it starts up after a number 
>>>>>>>>> of
>>>>>>>>> attempts and tries to register at the JobManager, which also fails 
>>>>>>>>> after a
>>>>>>>>> lot of attempts showing the following message 
>>>>>>>>> org.apache.flink.runtime.taskmanager.Taskmanager:
>>>>>>>>> Trying to register at JobManager 
>>>>>>>>> akka.tcp://flink@master:6123/user'/jobmanager
>>>>>>>>> (attempt:92, timeout:30seconds) and 
>>>>>>>>> org.apache.flink.runtime.taskmanager.Taskmanager:
>>>>>>>>> Tried to associate with unreachable remote host 
>>>>>>>>> [akka.tcp://flink@master:6123/user/jobmanager].
>>>>>>>>> Address is now gated for 5000ms, all messages to this address will be
>>>>>>>>> delivered to dead letters. Reason: Connection timed out: /master:6123
>>>>>>>>>
>>>>>>>>> I browsed the internet for these and found
>>>>>>>>>  
>>>>>>>>> http://stackoverflow.com/questions/33601020/flink-job-wont-run-with-higher-taskmanager-heap-mb
>>>>>>>>> <http://stackoverflow.com/questions/33601020/flink-job-wont-run-with-higher-taskmanager-heap-mb>
>>>>>>>>> and https://issues.apache.org/jira/browse/FLINK-1119 these links
>>>>>>>>> helpful. Stephan Ewen the guy who provided the solution in both the 
>>>>>>>>> links
>>>>>>>>> gives a good explanation that the TaskManagers take quite some time to
>>>>>>>>> register at the JobManager and therefore I waited for as long as 20 
>>>>>>>>> mins
>>>>>>>>> after starting the cluster to run the job. But even after waiting so 
>>>>>>>>> long I
>>>>>>>>> get the same error.
>>>>>>>>>
>>>>>>>>> Another suggestion was to run the cluster in streaming mode. So I
>>>>>>>>> tried it with the command : bin/start-cluster-streaming.sh and
>>>>>>>>> ran the job but I get the same error. I have rechecked all the
>>>>>>>>> configurations but I'm unable to find out the fault.
>>>>>>>>>
>>>>>>>>> I re-checked all the configurations but could not find anything
>>>>>>>>> wrong. Also checked the port 6123 on master which is in LISTEN state 
>>>>>>>>> and
>>>>>>>>> tcp request from worker to master shows SYN_SENT state using netstat 
>>>>>>>>> -na
>>>>>>>>> and lsof -i commands.
>>>>>>>>>
>>>>>>>>> I opened the webpage on master http://localhost:8081 but it shows
>>>>>>>>> nothing and localhost:8080 says connection refused.
>>>>>>>>>
>>>>>>>>> Kindly help me out as it is very important for me. Let me know if
>>>>>>>>> you have any questions.
>>>>>>>>>
>>>>>>>>> Kind Regards,
>>>>>>>>> Ravinder Kaur
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: TaskManager unable to register with JobManager

Reply via email to