Is that a new observation that it takes so long, or has it always taken so long?
On Fri, Oct 2, 2015 at 5:40 PM, Robert Schmidtke <ro.schmid...@gmail.com> wrote:

> I figured the JM would be waiting for the TMs. Each of my nodes has 64G of
> memory available.
>
> On Fri, Oct 2, 2015 at 5:38 PM, Maximilian Michels <m...@apache.org> wrote:
>
>> Hi Robert,
>>
>> During startup, the task manager allocates the entire managed memory.
>>
>> From the log:
>> 17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager - Using 0.7 of the currently free heap space for Flink managed heap memory (34395 MB).
>>
>> It seems like you are allocating almost 35 GB of memory which might
>> take a bit (40 seconds still seems like too much time). What
>> configuration did you use for the task managers? Do you really have
>> that much memory or is your system swapping?
>>
>> I think the JobManager just appears to take a long time because the
>> TaskManagers register late.
>>
>> Cheers,
>> Max
>>
>> On Fri, Oct 2, 2015 at 5:26 PM, Robert Schmidtke <ro.schmid...@gmail.com> wrote:
>> > Hi everyone,
>> >
>> > I'm wondering about the startup times of the TMs:
>> >
>> > ...
>> > 17:03:33,255 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor
>> > 17:03:33,262 INFO org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: cumu02-05/130.73.144.64, server port: 45731, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
>> > 17:03:33,266 INFO org.apache.flink.runtime.taskmanager.TaskManager - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds
>> > 17:03:33,268 INFO org.apache.flink.runtime.taskmanager.TaskManager - Temporary file directory '/tmp': total 44 GB, usable 37 GB (84.09% usable)
>> > 17:03:33,295 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).
>> > 17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager - Using 0.7 of the currently free heap space for Flink managed heap memory (34395 MB).
>> >
>> > // almost 40 seconds //
>> >
>> > 17:04:12,445 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-922d9bf4-254e-41e7-b151-525157cd5bfe for spill files.
>> > 17:04:12,455 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-792cf7f2-e2be-4950-a39f-d7a21326f054
>> > 17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor at akka://flink/user/taskmanager#1341641688.
>> > 17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: cumu02-05.zib.de (dataPort=45731)
>> > 17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager has 16 task slot(s).
>> > 17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 35502/49216/49216 MB, NON HEAP: 25/52/214 MB (used/committed/max)]
>> > 17:04:12,623 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@130.73.144.59:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
>> > 17:04:12,773 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@130.73.144.59:6123/user/jobmanager), starting network stack and library cache.
>> > ...
>> >
>> > The same goes for the JM (obviously).
>> >
>> > ...
>> > 17:03:31,632 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManger web frontend
>> > 17:03:31,636 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer - Setting up web info server, using web-root directory jar:file:/nfs/csr/bzcschmi/flink/flink-dist/target/flink-0.10-SNAPSHOT-bin/flink-0.10-SNAPSHOT/lib/flink-dist-0.10-SNAPSHOT.jar!/web-docs-infoserver.
>> > 17:03:31,753 INFO org.eclipse.jetty.util.log - jetty-0.10-SNAPSHOT
>> > 17:03:31,806 INFO org.eclipse.jetty.util.log - Started SelectChannelConnector@0.0.0.0:8081
>> > 17:03:31,806 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer - Started web info server for JobManager on 0.0.0.0:8081
>> >
>> > // almost 35 seconds //
>> >
>> > 17:04:05,091 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at cumu02-02 (akka.tcp://flink@130.73.144.61:53549/user/taskmanager) as e5ae92397a912c7360524524cf2d172a. Current number of registered hosts is 1. Current number of alive task slots is 16.
>> > ...
>> >
>> > Is this to be expected? Any ideas what's happening in the meantime? I'm
>> > asking because I'm running into errors when submitting my job too early
>> > (and not enough TMs have connected).
>> >
>> > Cheers
>> > Robert
>> >
>> > --
>> > My GPG Key ID: 336E2680
>
> --
> My GPG Key ID: 336E2680
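
To compare the startup delay across runs, the gap can be measured from the log files instead of estimated by eye. Below is a minimal sketch, assuming every log entry sits on one line and starts with an `HH:MM:SS,mmm` timestamp as in the TaskManager output quoted in this thread. The helper names `parse_ts` and `largest_gap` are made up for illustration and are not part of Flink.

```python
import re
from datetime import datetime, timedelta

# Matches a leading "HH:MM:SS,mmm" timestamp, the format seen in the quoted logs.
TS = re.compile(r"^(\d{2}:\d{2}:\d{2},\d{3})\s")


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    m = TS.match(line)
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%H:%M:%S,%f")


def largest_gap(lines):
    """Return (gap, line_before, line_after) for the longest pause between
    consecutive timestamped lines. Assumes the log does not cross midnight."""
    prev = None
    best = (timedelta(0), None, None)
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip continuation lines without a timestamp
        if prev is not None and ts - prev[0] > best[0]:
            best = (ts - prev[0], prev[1], line)
        prev = (ts, line)
    return best


# Four entries taken from the TaskManager log quoted in this thread.
sample = [
    "17:03:33,295 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).",
    "17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager - Using 0.7 of the currently free heap space for Flink managed heap memory (34395 MB).",
    "17:04:12,445 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-922d9bf4-254e-41e7-b151-525157cd5bfe for spill files.",
    "17:04:12,455 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-792cf7f2-e2be-4950-a39f-d7a21326f054",
]

gap, before, after = largest_gap(sample)
print(f"{gap.total_seconds():.3f} s between:\n  {before}\n  {after}")
```

On these four entries the longest pause falls between the managed-memory message and the I/O manager message, which is the "almost 40 seconds" window marked in the original post. Running the same function over logs from an older build or a smaller heap would show whether the delay is new.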