For tiny/small clusters (particularly single-tenant), you can set it to a lower value. But for anything reasonably large or multi-tenant, the request storm can be bad if a large enough number of applications start aggressively polling the RM. That is why the interval is configurable.
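For example, on a small single-tenant cluster the interval could be lowered when building the SparkConf. The snippet below is only an illustrative sketch (the app name and the 1000 ms value are arbitrary), not a recommendation for shared clusters:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative only: drop the AM->RM heartbeat from the 5000 ms default to 1 s.
    // Reasonable for a small single-tenant cluster; avoid on large multi-tenant ones,
    // since every running application would poll the ResourceManager at this rate.
    val conf = new SparkConf()
      .setAppName("MyApp") // hypothetical app name
      .set("spark.yarn.scheduler.heartbeat.interval-ms", "1000") // default is 5000

    val sc = new SparkContext(conf)

The same property can also be supplied at submit time with spark-submit --conf spark.yarn.scheduler.heartbeat.interval-ms=1000, which avoids hard-coding it in the application.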
- Mridul

On Mon, May 11, 2015 at 6:54 AM, Zoltán Zvara <zoltan.zv...@gmail.com> wrote:

> Isn't this issue something that should be improved? Based on the following discussion, there are two places where YARN's heartbeat interval is respected on job start-up, but do we really need to respect it on start-up?
>
> On Fri, May 8, 2015 at 12:14 PM Taeyun Kim <taeyun....@innowireless.com> wrote:
>
>> I think so.
>>
>> In fact, the flow is: allocator.allocateResources() -> sleep -> allocator.allocateResources() -> sleep …
>>
>> But I guess that on the first allocateResources() the allocation is not fulfilled, so the sleep occurs.
>>
>> *From:* Zoltán Zvara [mailto:zoltan.zv...@gmail.com]
>> *Sent:* Friday, May 08, 2015 4:25 PM
>> *To:* Taeyun Kim; u...@spark.apache.org
>> *Subject:* Re: YARN mode startup takes too long (10+ secs)
>>
>> So does this sleep occur before allocating resources for the first few executors to start the job?
>>
>> On Fri, May 8, 2015 at 6:23 AM Taeyun Kim <taeyun....@innowireless.com> wrote:
>>
>> I think I’ve found the (maybe partial, but major) reason.
>>
>> It’s between the following lines (it’s newly captured, but essentially the same place that Zoltán Zvara picked):
>>
>> 15/05/08 11:36:32 INFO BlockManagerMaster: Registered BlockManager
>> 15/05/08 11:36:38 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster04:55237/user/Executor#-149550753] with ID 1
>>
>> When I read the logs on the cluster side, the following lines were found (the exact times differ from the lines above, but that is just the clock difference between the machines):
>>
>> 15/05/08 11:36:23 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 5000
>> 15/05/08 11:36:28 INFO impl.AMRMClientImpl: Received new token for : cluster04:45454
>>
>> It seems that Spark deliberately sleeps for 5 secs.
>>
>> I’ve read the Spark source code, and in org.apache.spark.deploy.yarn.ApplicationMaster.scala, launchReporterThread() has the code for that.
>>
>> It loops calling allocator.allocateResources() and Thread.sleep().
>>
>> For the sleep, it reads the configuration variable spark.yarn.scheduler.heartbeat.interval-ms (the default value is 5000, which is 5 secs).
>>
>> According to the comment, “we want to be reasonably responsive without causing too many requests to RM”.
>>
>> So, unless YARN immediately fulfills the allocation request, it seems that 5 secs will be wasted.
>>
>> When I modified the configuration variable to 1000, it only waited for 1 sec.
>>
>> Here are the log lines after the change:
>>
>> 15/05/08 11:47:21 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 1000
>> 15/05/08 11:47:22 INFO impl.AMRMClientImpl: Received new token for : cluster04:45454
>>
>> 4 secs saved.
>>
>> So, when one does not want to wait 5 secs, one can change spark.yarn.scheduler.heartbeat.interval-ms.
>>
>> I hope that the additional overhead it incurs would be negligible.
>>
>> *From:* Zoltán Zvara [mailto:zoltan.zv...@gmail.com]
>> *Sent:* Thursday, May 07, 2015 10:05 PM
>> *To:* Taeyun Kim; u...@spark.apache.org
>> *Subject:* Re: YARN mode startup takes too long (10+ secs)
>>
>> Without considering everything, just a few hints:
>>
>> You are running on YARN. From 09:18:34 to 09:18:37 your application is in state ACCEPTED.
>> There is a noticeable overhead introduced by communicating with YARN's ResourceManager and NodeManager, and by the fact that the YARN scheduler needs time to make a decision. I guess that somewhere from 09:18:38 to 09:18:43 your application JAR gets copied to another container requested by the Spark ApplicationMaster deployed on YARN's container 0. Deploying an executor usually requires further resource negotiation with the ResourceManager. Also, as I said, your JAR and the executor's code need to be copied to the container's local directory - execution is blocked until that is complete.
>>
>> On Thu, May 7, 2015 at 3:09 AM Taeyun Kim <taeyun....@innowireless.com> wrote:
>>
>> Hi,
>>
>> I’m running a Spark application in YARN-client or YARN-cluster mode, but it seems to take too long to start up.
>>
>> It takes 10+ seconds to initialize the Spark context.
>>
>> Is this normal? Or can it be optimized?
>>
>> The environment is as follows:
>> - Hadoop: Hortonworks HDP 2.2 (Hadoop 2.6)
>> - Spark: 1.3.1
>> - Client: Windows 7, but similar result on CentOS 6.6
>>
>> The following is the startup part of the application log. (Some private information was edited.)
>>
>> ‘Main: Initializing context’ on the first line and ‘MainProcessor: Deleting previous output files’ on the last line are logged by the application; everything in between is from Spark itself. The application logic is executed after this log is displayed.
>>
>> ---
>>
>> 15/05/07 09:18:31 INFO Main: Initializing context
>> 15/05/07 09:18:31 INFO SparkContext: Running Spark version 1.3.1
>> 15/05/07 09:18:31 INFO SecurityManager: Changing view acls to: myuser,myapp
>> 15/05/07 09:18:31 INFO SecurityManager: Changing modify acls to: myuser,myapp
>> 15/05/07 09:18:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(myuser, myapp); users with modify permissions: Set(myuser, myapp)
>> 15/05/07 09:18:31 INFO Slf4jLogger: Slf4jLogger started
>> 15/05/07 09:18:31 INFO Remoting: Starting remoting
>> 15/05/07 09:18:31 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@mymachine:54449]
>> 15/05/07 09:18:31 INFO Utils: Successfully started service 'sparkDriver' on port 54449.
>> 15/05/07 09:18:31 INFO SparkEnv: Registering MapOutputTracker
>> 15/05/07 09:18:32 INFO SparkEnv: Registering BlockManagerMaster
>> 15/05/07 09:18:32 INFO DiskBlockManager: Created local directory at C:\Users\myuser\AppData\Local\Temp\spark-2d3db9d6-ea78-438e-956f-be9c1dcf3a9d\blockmgr-e9ade223-a4b8-4d9f-b038-efd66adf9772
>> 15/05/07 09:18:32 INFO MemoryStore: MemoryStore started with capacity 1956.7 MB
>> 15/05/07 09:18:32 INFO HttpFileServer: HTTP File server directory is C:\Users\myuser\AppData\Local\Temp\spark-ff40d73b-e8ab-433e-88c4-35da27fb6278\httpd-def9220f-ac3a-4dd2-9ac1-2c593b94b2d9
>> 15/05/07 09:18:32 INFO HttpServer: Starting HTTP Server
>> 15/05/07 09:18:32 INFO Server: jetty-8.y.z-SNAPSHOT
>> 15/05/07 09:18:32 INFO AbstractConnector: Started SocketConnector@0.0.0.0:54450
>> 15/05/07 09:18:32 INFO Utils: Successfully started service 'HTTP file server' on port 54450.
>> 15/05/07 09:18:32 INFO SparkEnv: Registering OutputCommitCoordinator
>> 15/05/07 09:18:32 INFO Server: jetty-8.y.z-SNAPSHOT
>> 15/05/07 09:18:32 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
>> 15/05/07 09:18:32 INFO Utils: Successfully started service 'SparkUI' on port 4040.
>> 15/05/07 09:18:32 INFO SparkUI: Started SparkUI at http://mymachine:4040
>> 15/05/07 09:18:32 INFO SparkContext: Added JAR file:/D:/Projects/MyApp/MyApp.jar at http://10.111.111.199:54450/jars/MyApp.jar with timestamp 1430957912240
>> 15/05/07 09:18:32 INFO RMProxy: Connecting to ResourceManager at cluster01/10.111.111.11:8050
>> 15/05/07 09:18:32 INFO Client: Requesting a new application from cluster with 3 NodeManagers
>> 15/05/07 09:18:32 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (23040 MB per container)
>> 15/05/07 09:18:32 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
>> 15/05/07 09:18:32 INFO Client: Setting up container launch context for our AM
>> 15/05/07 09:18:32 INFO Client: Preparing resources for our AM container
>> 15/05/07 09:18:32 INFO Client: Source and destination file systems are the same. Not copying hdfs://cluster01/apps/spark/spark-assembly-1.3.1-hadoop2.6.0.jar
>> 15/05/07 09:18:32 INFO Client: Setting up the launch environment for our AM container
>> 15/05/07 09:18:33 INFO SecurityManager: Changing view acls to: myuser,myapp
>> 15/05/07 09:18:33 INFO SecurityManager: Changing modify acls to: myuser,myapp
>> 15/05/07 09:18:33 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(myuser, myapp); users with modify permissions: Set(myuser, myapp)
>> 15/05/07 09:18:33 INFO Client: Submitting application 2 to ResourceManager
>> 15/05/07 09:18:33 INFO YarnClientImpl: Submitted application application_1430956687773_0002
>> 15/05/07 09:18:34 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
>> 15/05/07 09:18:34 INFO Client:
>> client token: N/A
>> diagnostics: N/A
>> ApplicationMaster host: N/A
>> ApplicationMaster RPC port: -1
>> queue: default
>> start time: 1430957906540
>> final status: UNDEFINED
>> tracking URL: http://cluster01:8088/proxy/application_1430956687773_0002/
>> user: myapp
>> 15/05/07 09:18:35 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
>> 15/05/07 09:18:36 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
>> 15/05/07 09:18:37 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
>> 15/05/07 09:18:37 INFO YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkYarnAM@cluster02:39698/user/YarnAM#-1579648782]
>> 15/05/07 09:18:37 INFO YarnClientSchedulerBackend: Add WebUI Filter.
>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> cluster01, PROXY_URI_BASES -> http://cluster01:8088/proxy/application_1430956687773_0002), /proxy/application_1430956687773_0002
>> 15/05/07 09:18:37 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
>> 15/05/07 09:18:38 INFO Client: Application report for application_1430956687773_0002 (state: RUNNING)
>> 15/05/07 09:18:38 INFO Client:
>> client token: N/A
>> diagnostics: N/A
>> ApplicationMaster host: cluster02
>> ApplicationMaster RPC port: 0
>> queue: default
>> start time: 1430957906540
>> final status: UNDEFINED
>> tracking URL: http://cluster01:8088/proxy/application_1430956687773_0002/
>> user: myapp
>> 15/05/07 09:18:38 INFO YarnClientSchedulerBackend: Application application_1430956687773_0002 has started running.
>> 15/05/07 09:18:38 INFO NettyBlockTransferService: Server created on 54491
>> 15/05/07 09:18:38 INFO BlockManagerMaster: Trying to register BlockManager
>> 15/05/07 09:18:38 INFO BlockManagerMasterActor: Registering block manager mymachine:54491 with 1956.7 MB RAM, BlockManagerId(<driver>, mymachine, 54491)
>> 15/05/07 09:18:38 INFO BlockManagerMaster: Registered BlockManager
>> 15/05/07 09:18:43 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster02:44996/user/Executor#-786778979] with ID 1
>> 15/05/07 09:18:43 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
>> 15/05/07 09:18:43 INFO MainProcessor: Deleting previous output files
>>
>> Thanks.
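The allocate-then-sleep behaviour discussed in this thread can be summarised with a small self-contained sketch. This is not the actual org.apache.spark.deploy.yarn.ApplicationMaster code; the names heartbeatIntervalMs, allocateResources and done are illustrative stand-ins:

    object ReporterLoopSketch {
      // Simplified model of the reporter loop described above: keep asking YARN for
      // resources, sleeping for the configured heartbeat interval between requests.
      def run(heartbeatIntervalMs: Long,      // spark.yarn.scheduler.heartbeat.interval-ms
              allocateResources: () => Unit,  // stand-in for allocator.allocateResources()
              done: () => Boolean): Unit = {
        while (!done()) {
          allocateResources()                 // request/receive containers from the RM
          Thread.sleep(heartbeatIntervalMs)   // with the 5000 ms default, each idle
                                              // iteration adds up to 5 s to startup
        }
      }
    }

This matches the gap visible in the log above: the BlockManager is registered at 09:18:38, but the first executor only registers at 09:18:43, one full 5-second heartbeat later.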