Isn't this issue something that should be improved? Based on the following discussion, there are two places where YARN's heartbeat interval is respected on job start-up, but do we really need to respect it at start-up?
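To make the quoted discussion below concrete, here is a minimal Scala sketch of the heartbeat loop as it is described in the thread, together with the kind of change this question is suggesting. Apart from the configuration key spark.yarn.scheduler.heartbeat.interval-ms and the allocateResources() call that the thread mentions, every name here is a hypothetical stand-in, not the actual code in org.apache.spark.deploy.yarn.ApplicationMaster:

    object HeartbeatLoopSketch {

      // Hypothetical stand-in for the YARN allocator discussed in the thread.
      trait Allocator {
        def allocateResources(): Unit   // ask the ResourceManager for containers
        def numExecutorsRunning: Int    // executors that have registered so far
      }

      def runReporterLoop(conf: Map[String, String],
                          allocator: Allocator,
                          initialExecutors: Int,
                          isFinished: () => Boolean): Unit = {
        // Default of 5000 ms matches "sleep time : 5000" in the logs quoted below.
        val heartbeatMs =
          conf.getOrElse("spark.yarn.scheduler.heartbeat.interval-ms", "5000").toLong

        while (!isFinished()) {
          allocator.allocateResources()
          // Idea behind the question above: poll quickly until the initial
          // allocation is fulfilled, and only then respect the full interval.
          val sleepMs =
            if (allocator.numExecutorsRunning < initialExecutors) 200L else heartbeatMs
          Thread.sleep(sleepMs)
        }
      }
    }

Even without such a change, as the thread notes, simply lowering spark.yarn.scheduler.heartbeat.interval-ms (e.g. to 1000) already saves several seconds at start-up, at the cost of somewhat more frequent requests to the ResourceManager. A plausible way to do that per application (the key and the 1000 ms value are from the thread; the rest is ordinary SparkConf usage) would be:

    import org.apache.spark.SparkConf

    val sparkConf = new SparkConf()
      .set("spark.yarn.scheduler.heartbeat.interval-ms", "1000")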
On Fri, May 8, 2015 at 12:14 PM Taeyun Kim <taeyun....@innowireless.com> wrote:

> I think so.
>
> In fact, the flow is: allocator.allocateResources() -> sleep -> allocator.allocateResources() -> sleep …
>
> But I guess that on the first allocateResources() the allocation is not fulfilled, so the sleep occurs.
>
> *From:* Zoltán Zvara [mailto:zoltan.zv...@gmail.com]
> *Sent:* Friday, May 08, 2015 4:25 PM
> *To:* Taeyun Kim; u...@spark.apache.org
> *Subject:* Re: YARN mode startup takes too long (10+ secs)
>
> So does this sleep occur before resources are allocated for the first few executors to start the job?
>
> On Fri, May 8, 2015 at 6:23 AM Taeyun Kim <taeyun....@innowireless.com> wrote:
>
> I think I've found the (maybe partial, but major) reason.
>
> It's between the following lines (newly captured, but essentially the same place that Zoltán Zvara picked):
>
> 15/05/08 11:36:32 INFO BlockManagerMaster: Registered BlockManager
> 15/05/08 11:36:38 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster04:55237/user/Executor#-149550753] with ID 1
>
> When I read the logs on the cluster side, the following lines were found (the exact timestamps differ from the lines above, but that is just the clock difference between machines):
>
> 15/05/08 11:36:23 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 5000
> 15/05/08 11:36:28 INFO impl.AMRMClientImpl: Received new token for : cluster04:45454
>
> It seemed that Spark deliberately sleeps 5 secs.
>
> I've read the Spark source code, and in org.apache.spark.deploy.yarn.ApplicationMaster.scala, launchReporterThread() had the code for that.
>
> It loops calling allocator.allocateResources() and Thread.sleep().
>
> For the sleep, it reads the configuration variable spark.yarn.scheduler.heartbeat.interval-ms (the default value is 5000, which is 5 secs).
>
> According to the comment, "we want to be reasonably responsive without causing too many requests to RM".
>
> So, unless YARN immediately fulfills the allocation request, it seems that 5 secs will be wasted.
>
> When I modified the configuration variable to 1000, it only waited for 1 sec.
>
> Here are the log lines after the change:
>
> 15/05/08 11:47:21 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 1000
> 15/05/08 11:47:22 INFO impl.AMRMClientImpl: Received new token for : cluster04:45454
>
> 4 secs saved.
>
> So, when one does not want to wait 5 secs, one can change spark.yarn.scheduler.heartbeat.interval-ms.
>
> I hope that the additional overhead it incurs would be negligible.
>
> *From:* Zoltán Zvara [mailto:zoltan.zv...@gmail.com]
> *Sent:* Thursday, May 07, 2015 10:05 PM
> *To:* Taeyun Kim; u...@spark.apache.org
> *Subject:* Re: YARN mode startup takes too long (10+ secs)
>
> Without considering everything, just a few hints:
>
> You are running on YARN. From 09:18:34 to 09:18:37 your application is in state ACCEPTED. There is noticeable overhead introduced by communicating with YARN's ResourceManager and NodeManager, and the YARN scheduler needs time to make a decision. I guess that somewhere from 09:18:38 to 09:18:43 your application JAR gets copied to another container requested by the Spark ApplicationMaster, which is deployed in YARN's container 0. Deploying an executor usually needs further resource negotiation with the ResourceManager.
> Also, as I said, your JAR and the Executor's code need to be copied to the container's local directory; execution is blocked until that is complete.
>
> On Thu, May 7, 2015 at 3:09 AM Taeyun Kim <taeyun....@innowireless.com> wrote:
>
> Hi,
>
> I'm running a Spark application in YARN-client or YARN-cluster mode, but it seems to take too long to start up.
>
> It takes 10+ seconds to initialize the Spark context.
>
> Is this normal? Or can it be optimized?
>
> The environment is as follows:
>
> - Hadoop: Hortonworks HDP 2.2 (Hadoop 2.6)
> - Spark: 1.3.1
> - Client: Windows 7, but similar result on CentOS 6.6
>
> The following is the startup part of the application log. (Some private information was edited.)
>
> 'Main: Initializing context' on the first line and 'MainProcessor: Deleting previous output files' on the last line are logged by the application; the lines in between are from Spark itself. The application logic is executed after this log is displayed.
>
> ---
>
> 15/05/07 09:18:31 INFO Main: Initializing context
> 15/05/07 09:18:31 INFO SparkContext: Running Spark version 1.3.1
> 15/05/07 09:18:31 INFO SecurityManager: Changing view acls to: myuser,myapp
> 15/05/07 09:18:31 INFO SecurityManager: Changing modify acls to: myuser,myapp
> 15/05/07 09:18:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(myuser, myapp); users with modify permissions: Set(myuser, myapp)
> 15/05/07 09:18:31 INFO Slf4jLogger: Slf4jLogger started
> 15/05/07 09:18:31 INFO Remoting: Starting remoting
> 15/05/07 09:18:31 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@mymachine:54449]
> 15/05/07 09:18:31 INFO Utils: Successfully started service 'sparkDriver' on port 54449.
> 15/05/07 09:18:31 INFO SparkEnv: Registering MapOutputTracker
> 15/05/07 09:18:32 INFO SparkEnv: Registering BlockManagerMaster
> 15/05/07 09:18:32 INFO DiskBlockManager: Created local directory at C:\Users\myuser\AppData\Local\Temp\spark-2d3db9d6-ea78-438e-956f-be9c1dcf3a9d\blockmgr-e9ade223-a4b8-4d9f-b038-efd66adf9772
> 15/05/07 09:18:32 INFO MemoryStore: MemoryStore started with capacity 1956.7 MB
> 15/05/07 09:18:32 INFO HttpFileServer: HTTP File server directory is C:\Users\myuser\AppData\Local\Temp\spark-ff40d73b-e8ab-433e-88c4-35da27fb6278\httpd-def9220f-ac3a-4dd2-9ac1-2c593b94b2d9
> 15/05/07 09:18:32 INFO HttpServer: Starting HTTP Server
> 15/05/07 09:18:32 INFO Server: jetty-8.y.z-SNAPSHOT
> 15/05/07 09:18:32 INFO AbstractConnector: Started SocketConnector@0.0.0.0:54450
> 15/05/07 09:18:32 INFO Utils: Successfully started service 'HTTP file server' on port 54450.
> 15/05/07 09:18:32 INFO SparkEnv: Registering OutputCommitCoordinator
> 15/05/07 09:18:32 INFO Server: jetty-8.y.z-SNAPSHOT
> 15/05/07 09:18:32 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
> 15/05/07 09:18:32 INFO Utils: Successfully started service 'SparkUI' on port 4040.
> 15/05/07 09:18:32 INFO SparkUI: Started SparkUI at http://mymachine:4040
> 15/05/07 09:18:32 INFO SparkContext: Added JAR file:/D:/Projects/MyApp/MyApp.jar at http://10.111.111.199:54450/jars/MyApp.jar with timestamp 1430957912240
> 15/05/07 09:18:32 INFO RMProxy: Connecting to ResourceManager at cluster01/10.111.111.11:8050
> 15/05/07 09:18:32 INFO Client: Requesting a new application from cluster with 3 NodeManagers
> 15/05/07 09:18:32 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (23040 MB per container)
> 15/05/07 09:18:32 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
> 15/05/07 09:18:32 INFO Client: Setting up container launch context for our AM
> 15/05/07 09:18:32 INFO Client: Preparing resources for our AM container
> 15/05/07 09:18:32 INFO Client: Source and destination file systems are the same. Not copying hdfs://cluster01/apps/spark/spark-assembly-1.3.1-hadoop2.6.0.jar
> 15/05/07 09:18:32 INFO Client: Setting up the launch environment for our AM container
> 15/05/07 09:18:33 INFO SecurityManager: Changing view acls to: myuser,myapp
> 15/05/07 09:18:33 INFO SecurityManager: Changing modify acls to: myuser,myapp
> 15/05/07 09:18:33 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(myuser, myapp); users with modify permissions: Set(myuser, myapp)
> 15/05/07 09:18:33 INFO Client: Submitting application 2 to ResourceManager
> 15/05/07 09:18:33 INFO YarnClientImpl: Submitted application application_1430956687773_0002
> 15/05/07 09:18:34 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
> 15/05/07 09:18:34 INFO Client:
>      client token: N/A
>      diagnostics: N/A
>      ApplicationMaster host: N/A
>      ApplicationMaster RPC port: -1
>      queue: default
>      start time: 1430957906540
>      final status: UNDEFINED
>      tracking URL: http://cluster01:8088/proxy/application_1430956687773_0002/
>      user: myapp
> 15/05/07 09:18:35 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
> 15/05/07 09:18:36 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
> 15/05/07 09:18:37 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
> 15/05/07 09:18:37 INFO YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkYarnAM@cluster02:39698/user/YarnAM#-1579648782]
> 15/05/07 09:18:37 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> cluster01, PROXY_URI_BASES -> http://cluster01:8088/proxy/application_1430956687773_0002), /proxy/application_1430956687773_0002
> 15/05/07 09:18:37 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
> 15/05/07 09:18:38 INFO Client: Application report for application_1430956687773_0002 (state: RUNNING)
> 15/05/07 09:18:38 INFO Client:
>      client token: N/A
>      diagnostics: N/A
>      ApplicationMaster host: cluster02
>      ApplicationMaster RPC port: 0
>      queue: default
>      start time: 1430957906540
>      final status: UNDEFINED
>      tracking URL: http://cluster01:8088/proxy/application_1430956687773_0002/
>      user: myapp
> 15/05/07 09:18:38 INFO YarnClientSchedulerBackend: Application application_1430956687773_0002 has started running.
> 15/05/07 09:18:38 INFO NettyBlockTransferService: Server created on 54491
> 15/05/07 09:18:38 INFO BlockManagerMaster: Trying to register BlockManager
> 15/05/07 09:18:38 INFO BlockManagerMasterActor: Registering block manager mymachine:54491 with 1956.7 MB RAM, BlockManagerId(<driver>, mymachine, 54491)
> 15/05/07 09:18:38 INFO BlockManagerMaster: Registered BlockManager
> 15/05/07 09:18:43 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster02:44996/user/Executor#-786778979] with ID 1
> 15/05/07 09:18:43 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
> 15/05/07 09:18:43 INFO MainProcessor: Deleting previous output files
>
> Thanks.