Re: JM/TM startup time

Robert Schmidtke Mon, 05 Oct 2015 03:15:30 -0700

Thanks, the off-heap solution is indeed faster: 15s instead of 45s for the
amount of memory I allocate.
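
For anyone scripting around the startup delay discussed below, a fixed sleep
after start-cluster.sh could be replaced by polling the JobManager log until
enough TaskManagers have registered. The following is only a minimal sketch,
not part of Flink: the function names are made up for illustration, and it
assumes each registration produces one "Registered TaskManager at ..." line
like the one quoted further down. A TaskManager that registers twice would be
counted twice.

```python
import re
import time
from pathlib import Path

# Pattern taken from the JobManager log line quoted in this thread.
# Assumption: one such line is written per registered TaskManager.
REGISTERED = re.compile(r"Registered TaskManager at \S+")


def count_registered(log_text: str) -> int:
    """Count 'Registered TaskManager at ...' entries in JobManager log text."""
    return len(REGISTERED.findall(log_text))


def wait_for_taskmanagers(log_path, expected, timeout_s=300.0, poll_s=1.0):
    """Poll the JobManager log until `expected` TaskManagers have registered.

    Returns True once enough registrations are seen, False on timeout.
    A missing log file is treated as an empty log.
    """
    deadline = time.monotonic() + timeout_s
    path = Path(log_path)
    while True:
        text = path.read_text(errors="replace") if path.exists() else ""
        if count_registered(text) >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

With the 31 TMs mentioned in this thread, a call might look like
wait_for_taskmanagers("/path/to/jobmanager.log", expected=31), where the log
path is a placeholder for wherever the JobManager writes its log; the job
would be submitted only if the call returns True.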


On Fri, Oct 2, 2015 at 6:09 PM, Stephan Ewen <se...@apache.org> wrote:

> Yeah, registration is fast, JVM heatup is what takes time.
>
> You can try two things:
>
>   - Use the off-heap memory variant and see if that allocates the memory
> faster. Just add the entry "taskmanager.memory.off-heap: true" to the
> config.
>
>   - Or start the system in "streaming" mode. Then, it will not
> pre-allocate the memory, but allocate it as needed. That may, however,
> cause long GC pauses when the jobs start requesting the memory, so I am not
> sure if that is better.
>
> Greetings,
> Stephan
>
>
> On Fri, Oct 2, 2015 at 5:53 PM, Robert Schmidtke <ro.schmid...@gmail.com>
> wrote:
>
>> Looking into the logs of each TM it only took about 5 seconds per TM to
>> go from "Trying to register" to "Successful registration".
>>
>> On Fri, Oct 2, 2015 at 5:50 PM, Robert Schmidtke <ro.schmid...@gmail.com>
>> wrote:
>>
>>> I recently switched from running Flink on YARN to running Flink
>>> Standalone and I realized I had to add a sleep after ./start-cluster.sh
>>> (well, my Slurm adaptation of it). I did not have to explicitly wait before
>>> since Flink would wait until all YARN containers became available, so to be
>>> honest I don't know whether this is new or not. I just looked into an old
>>> log (well, from last Friday) and it took about 1 minute for 31 TMs to
>>> connect to 1 JM. They each had -Xms and -Xmx6079m though.
>>>
>>> On Fri, Oct 2, 2015 at 5:44 PM, Stephan Ewen <se...@apache.org> wrote:
>>>
>>>> Is that a new observation that it takes so long, or has it always taken
>>>> so long?
>>>>
>>>> On Fri, Oct 2, 2015 at 5:40 PM, Robert Schmidtke <
>>>> ro.schmid...@gmail.com> wrote:
>>>>
>>>>> I figured the JM would be waiting for the TMs. Each of my nodes has
>>>>> 64G of memory available.
>>>>>
>>>>> On Fri, Oct 2, 2015 at 5:38 PM, Maximilian Michels <m...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi Robert,
>>>>>>
>>>>>> During startup, the task manager allocates the entire managed memory.
>>>>>>
>>>>>> From the log:
>>>>>> 17:03:33,554 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>           - Using 0.7 of the currently free heap space for Flink
>>>>>> managed heap memory (34395 MB).
>>>>>>
>>>>>> It seems like you are allocating almost 35 GB of memory which might
>>>>>> take a bit (40 seconds still seems like too much time). What
>>>>>> configuration did you use for the task managers? Do you really have
>>>>>> that much memory or is your system swapping?
>>>>>>
>>>>>> I think the JobManager just appears to take a long time because the
>>>>>> TaskManagers register late.
>>>>>>
>>>>>> Cheers,
>>>>>> Max
>>>>>>
>>>>>> On Fri, Oct 2, 2015 at 5:26 PM, Robert Schmidtke <
>>>>>> ro.schmid...@gmail.com> wrote:
>>>>>> > Hi everyone,
>>>>>> >
>>>>>> > I'm wondering about the startup times of the TMs:
>>>>>> >
>>>>>> > ...
>>>>>> > 17:03:33,255 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> > - Starting TaskManager actor
>>>>>> > 17:03:33,262 INFO
>>>>>> org.apache.flink.runtime.io.network.netty.NettyConfig
>>>>>> > - NettyConfig [server address: cumu02-05/130.73.144.64, server
>>>>>> port: 45731,
>>>>>> > memory segment size (bytes): 32768, transport type: NIO, number of
>>>>>> server
>>>>>> > threads: 0 (use Netty's default), number of client threads: 0 (use
>>>>>> Netty's
>>>>>> > default), server connect backlog: 0 (use Netty's default), client
>>>>>> connect
>>>>>> > timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's
>>>>>> > default)]
>>>>>> > 17:03:33,266 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> > - Messages between TaskManager and JobManager have a max timeout of
>>>>>> 100000
>>>>>> > milliseconds
>>>>>> > 17:03:33,268 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> > - Temporary file directory '/tmp': total 44 GB, usable 37 GB
>>>>>> (84.09% usable)
>>>>>> > 17:03:33,295 INFO
>>>>>> > org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  -
>>>>>> Allocated 64
>>>>>> > MB for network buffer pool (number of memory segments: 2048, bytes
>>>>>> per
>>>>>> > segment: 32768).
>>>>>> > 17:03:33,554 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> > - Using 0.7 of the currently free heap space for Flink managed heap
>>>>>> memory
>>>>>> > (34395 MB).
>>>>>> >
>>>>>> > // almost 40 seconds //
>>>>>> >
>>>>>> > 17:04:12,445 INFO
>>>>>> org.apache.flink.runtime.io.disk.iomanager.IOManager
>>>>>> > - I/O manager uses directory
>>>>>> > /tmp/flink-io-922d9bf4-254e-41e7-b151-525157cd5bfe for spill files.
>>>>>> > 17:04:12,455 INFO  org.apache.flink.runtime.filecache.FileCache
>>>>>> > - User file cache uses directory
>>>>>> > /tmp/flink-dist-cache-792cf7f2-e2be-4950-a39f-d7a21326f054
>>>>>> > 17:04:12,617 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> > - Starting TaskManager actor at
>>>>>> akka://flink/user/taskmanager#1341641688.
>>>>>> > 17:04:12,617 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> > - TaskManager data connection information: cumu02-05.zib.de
>>>>>> (dataPort=45731)
>>>>>> > 17:04:12,618 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> > - TaskManager has 16 task slot(s).
>>>>>> > 17:04:12,618 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> > - Memory usage stats: [HEAP: 35502/49216/49216 MB, NON HEAP:
>>>>>> 25/52/214 MB
>>>>>> > (used/committed/max)]
>>>>>> > 17:04:12,623 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> > - Trying to register at JobManager
>>>>>> > akka.tcp://flink@130.73.144.59:6123/user/jobmanager (attempt 1,
>>>>>> timeout: 500
>>>>>> > milliseconds)
>>>>>> > 17:04:12,773 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> > - Successful registration at JobManager
>>>>>> > (akka.tcp://flink@130.73.144.59:6123/user/jobmanager), starting
>>>>>> network
>>>>>> > stack and library cache.
>>>>>> > ...
>>>>>> >
>>>>>> >
>>>>>> > The same goes for the JM (obviously).
>>>>>> >
>>>>>> > ...
>>>>>> > 17:03:31,632 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>>> > - Starting JobManger web frontend
>>>>>> > 17:03:31,636 INFO
>>>>>> org.apache.flink.runtime.jobmanager.web.WebInfoServer
>>>>>> > - Setting up web info server, using web-root directory
>>>>>> >
>>>>>> jar:file:/nfs/csr/bzcschmi/flink/flink-dist/target/flink-0.10-SNAPSHOT-bin/flink-0.10-SNAPSHOT/lib/flink-dist-0.10-SNAPSHOT.jar!/web-docs-infoserver.
>>>>>> > 17:03:31,753 INFO  org.eclipse.jetty.util.log
>>>>>> > - jetty-0.10-SNAPSHOT
>>>>>> > 17:03:31,806 INFO  org.eclipse.jetty.util.log
>>>>>> > - Started SelectChannelConnector@0.0.0.0:8081
>>>>>> > 17:03:31,806 INFO
>>>>>> org.apache.flink.runtime.jobmanager.web.WebInfoServer
>>>>>> > - Started web info server for JobManager on 0.0.0.0:8081
>>>>>> >
>>>>>> > // almost 35 seconds //
>>>>>> >
>>>>>> > 17:04:05,091 INFO  org.apache.flink.runtime.instance.InstanceManager
>>>>>> > - Registered TaskManager at cumu02-02
>>>>>> > (akka.tcp://flink@130.73.144.61:53549/user/taskmanager) as
>>>>>> > e5ae92397a912c7360524524cf2d172a. Current number of registered
>>>>>> hosts is 1.
>>>>>> > Current number of alive task slots is 16.
>>>>>> > ...
>>>>>> >
>>>>>> >
>>>>>> > Is this to be expected? Any ideas what's happening in the meantime?
>>>>>> I'm
>>>>>> > asking because I'm running into errors when submitting my job too
>>>>>> early (and
>>>>>> > not enough TMs have connected).
>>>>>> >
>>>>>> > Cheers
>>>>>> > Robert
>>>>>> >
>>>>>> > --
>>>>>> > My GPG Key ID: 336E2680
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> My GPG Key ID: 336E2680
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> My GPG Key ID: 336E2680
>>>
>>
>>
>>
>> --
>> My GPG Key ID: 336E2680
>>
>
>


-- 
My GPG Key ID: 336E2680
