Thanks, the off-heap solution is indeed faster. 15s instead of 45s for the amounts of memory I allocate.
On Fri, Oct 2, 2015 at 6:09 PM, Stephan Ewen <se...@apache.org> wrote: > Yeah, registration is fast, JVM heatup is what takes time. > > You can try two things: > > - Use the off-heap memory variant and see if that allocates the memory > faster. Just add the entry "taskmanager.memory.off-heap: true" to the > config. > > - Or start the system in "streaming" mode. Then, it will not > pre-allocate the memory, but allocate it as needed. That may, however, > cause long GC pauses when the jobs start requesting the memory, so I am not > sure if that is better. > > Greetings, > Stephan > > > On Fri, Oct 2, 2015 at 5:53 PM, Robert Schmidtke <ro.schmid...@gmail.com> > wrote: > >> Looking into the logs of each TM it only took about 5 seconds per TM to >> go from "Trying to register" to "Successful registration". >> >> On Fri, Oct 2, 2015 at 5:50 PM, Robert Schmidtke <ro.schmid...@gmail.com> >> wrote: >> >>> I recently switched from running Flink on YARN to running Flink >>> Standalone and I realized I had to add a sleep after ./start-cluster.sh >>> (well, my Slurm adaptation of it). I did not have to explicitly wait before >>> since Flink would wait until all YARN containers became available, so to be >>> honest I don't know whether this is new or not. I just looked into an old >>> log (well, from last Friday) and it took about 1 minute for 31 TMs to >>> connect to 1 JM. They each had -Xms and -Xmx6079m though. >>> >>> On Fri, Oct 2, 2015 at 5:44 PM, Stephan Ewen <se...@apache.org> wrote: >>> >>>> Is that a new observation that it takes so long, or has it always taken >>>> so long? >>>> >>>> On Fri, Oct 2, 2015 at 5:40 PM, Robert Schmidtke < >>>> ro.schmid...@gmail.com> wrote: >>>> >>>>> I figured the JM would be waiting for the TMs. Each of my nodes has >>>>> 64G of memory available. >>>>> >>>>> On Fri, Oct 2, 2015 at 5:38 PM, Maximilian Michels <m...@apache.org> >>>>> wrote: >>>>> >>>>>> Hi Robert, >>>>>> >>>>>> During startup, the task manager allocates the entire managed memory. >>>>>> >>>>>> From the log: >>>>>> 17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager >>>>>> - Using 0.7 of the currently free heap space for Flink >>>>>> managed heap memory (34395 MB). >>>>>> >>>>>> It seems like you are allocating almost 35 GB of memory which might >>>>>> take a bit (40 seconds still seems like too much time). What >>>>>> configuration did you use for the task managers? Do you really have >>>>>> that much memory or is your system swapping? >>>>>> >>>>>> I think the JobManager just appears to take a long time because the >>>>>> TaskManagers register late. >>>>>> >>>>>> Cheers, >>>>>> Max >>>>>> >>>>>> On Fri, Oct 2, 2015 at 5:26 PM, Robert Schmidtke < >>>>>> ro.schmid...@gmail.com> wrote: >>>>>> > Hi everyone, >>>>>> > >>>>>> > I'm wondering about the startup times of the TMs: >>>>>> > >>>>>> > ... >>>>>> > 17:03:33,255 INFO org.apache.flink.runtime.taskmanager.TaskManager >>>>>> > - Starting TaskManager actor >>>>>> > 17:03:33,262 INFO >>>>>> org.apache.flink.runtime.io.network.netty.NettyConfig >>>>>> > - NettyConfig [server address: cumu02-05/130.73.144.64, server >>>>>> port: 45731, >>>>>> > memory segment size (bytes): 32768, transport type: NIO, number of >>>>>> server >>>>>> > threads: 0 (use Netty's default), number of client threads: 0 (use >>>>>> Netty's >>>>>> > default), server connect backlog: 0 (use Netty's default), client >>>>>> connect >>>>>> > timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's >>>>>> > default)] >>>>>> > 17:03:33,266 INFO org.apache.flink.runtime.taskmanager.TaskManager >>>>>> > - Messages between TaskManager and JobManager have a max timeout of >>>>>> 100000 >>>>>> > milliseconds >>>>>> > 17:03:33,268 INFO org.apache.flink.runtime.taskmanager.TaskManager >>>>>> > - Temporary file directory '/tmp': total 44 GB, usable 37 GB >>>>>> (84.09% usable) >>>>>> > 17:03:33,295 INFO >>>>>> > org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - >>>>>> Allocated 64 >>>>>> > MB for network buffer pool (number of memory segments: 2048, bytes >>>>>> per >>>>>> > segment: 32768). >>>>>> > 17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager >>>>>> > - Using 0.7 of the currently free heap space for Flink managed heap >>>>>> memory >>>>>> > (34395 MB). >>>>>> > >>>>>> > // almost 40 seconds // >>>>>> > >>>>>> > 17:04:12,445 INFO >>>>>> org.apache.flink.runtime.io.disk.iomanager.IOManager >>>>>> > - I/O manager uses directory >>>>>> > /tmp/flink-io-922d9bf4-254e-41e7-b151-525157cd5bfe for spill files. >>>>>> > 17:04:12,455 INFO org.apache.flink.runtime.filecache.FileCache >>>>>> > - User file cache uses directory >>>>>> > /tmp/flink-dist-cache-792cf7f2-e2be-4950-a39f-d7a21326f054 >>>>>> > 17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager >>>>>> > - Starting TaskManager actor at >>>>>> akka://flink/user/taskmanager#1341641688. >>>>>> > 17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager >>>>>> > - TaskManager data connection information: cumu02-05.zib.de >>>>>> (dataPort=45731) >>>>>> > 17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager >>>>>> > - TaskManager has 16 task slot(s). >>>>>> > 17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager >>>>>> > - Memory usage stats: [HEAP: 35502/49216/49216 MB, NON HEAP: >>>>>> 25/52/214 MB >>>>>> > (used/committed/max)] >>>>>> > 17:04:12,623 INFO org.apache.flink.runtime.taskmanager.TaskManager >>>>>> > - Trying to register at JobManager >>>>>> > akka.tcp://flink@130.73.144.59:6123/user/jobmanager (attempt 1, >>>>>> timeout: 500 >>>>>> > milliseconds) >>>>>> > 17:04:12,773 INFO org.apache.flink.runtime.taskmanager.TaskManager >>>>>> > - Successful registration at JobManager >>>>>> > (akka.tcp://flink@130.73.144.59:6123/user/jobmanager), starting >>>>>> network >>>>>> > stack and library cache. >>>>>> > ... >>>>>> > >>>>>> > >>>>>> > The same goes for the JM (obviously). >>>>>> > >>>>>> > ... >>>>>> > 17:03:31,632 INFO org.apache.flink.runtime.jobmanager.JobManager >>>>>> > - Starting JobManger web frontend >>>>>> > 17:03:31,636 INFO >>>>>> org.apache.flink.runtime.jobmanager.web.WebInfoServer >>>>>> > - Setting up web info server, using web-root directory >>>>>> > >>>>>> jar:file:/nfs/csr/bzcschmi/flink/flink-dist/target/flink-0.10-SNAPSHOT-bin/flink-0.10-SNAPSHOT/lib/flink-dist-0.10-SNAPSHOT.jar!/web-docs-infoserver. >>>>>> > 17:03:31,753 INFO org.eclipse.jetty.util.log >>>>>> > - jetty-0.10-SNAPSHOT >>>>>> > 17:03:31,806 INFO org.eclipse.jetty.util.log >>>>>> > - Started SelectChannelConnector@0.0.0.0:8081 >>>>>> > 17:03:31,806 INFO >>>>>> org.apache.flink.runtime.jobmanager.web.WebInfoServer >>>>>> > - Started web info server for JobManager on 0.0.0.0:8081 >>>>>> > >>>>>> > // almost 35 seconds // >>>>>> > >>>>>> > 17:04:05,091 INFO org.apache.flink.runtime.instance.InstanceManager >>>>>> > - Registered TaskManager at cumu02-02 >>>>>> > (akka.tcp://flink@130.73.144.61:53549/user/taskmanager) as >>>>>> > e5ae92397a912c7360524524cf2d172a. Current number of registered >>>>>> hosts is 1. >>>>>> > Current number of alive task slots is 16. >>>>>> > ... >>>>>> > >>>>>> > >>>>>> > Is this to be expected? Any ideas what's happening in the meantime? >>>>>> I'm >>>>>> > asking because I'm running into errors when submitting my job too >>>>>> early (and >>>>>> > not enough TMs have connected). >>>>>> > >>>>>> > Cheers >>>>>> > Robert >>>>>> > >>>>>> > -- >>>>>> > My GPG Key ID: 336E2680 >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> My GPG Key ID: 336E2680 >>>>> >>>> >>>> >>> >>> >>> -- >>> My GPG Key ID: 336E2680 >>> >> >> >> >> -- >> My GPG Key ID: 336E2680 >> > > -- My GPG Key ID: 336E2680