Isn't this issue something that should be improved? Based on the following discussion, there are two places where YARN's heartbeat interval is respected on job start-up, but do we really need to respect it at start-up?
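To make the quoted discussion below concrete, here is a minimal Scala sketch of the heartbeat loop as it is described in the thread, together with the kind of change this question is suggesting. Apart from the configuration key spark.yarn.scheduler.heartbeat.interval-ms and the allocateResources() call that the thread mentions, every name here is a hypothetical stand-in, not the actual code in org.apache.spark.deploy.yarn.ApplicationMaster:

    object HeartbeatLoopSketch {

      // Hypothetical stand-in for the YARN allocator discussed in the thread.
      trait Allocator {
        def allocateResources(): Unit   // ask the ResourceManager for containers
        def numExecutorsRunning: Int    // executors that have registered so far
      }

      def runReporterLoop(conf: Map[String, String],
                          allocator: Allocator,
                          initialExecutors: Int,
                          isFinished: () => Boolean): Unit = {
        // Default of 5000 ms matches "sleep time : 5000" in the logs quoted below.
        val heartbeatMs =
          conf.getOrElse("spark.yarn.scheduler.heartbeat.interval-ms", "5000").toLong

        while (!isFinished()) {
          allocator.allocateResources()
          // Idea behind the question above: poll quickly until the initial
          // allocation is fulfilled, and only then respect the full interval.
          val sleepMs =
            if (allocator.numExecutorsRunning < initialExecutors) 200L else heartbeatMs
          Thread.sleep(sleepMs)
        }
      }
    }

Even without such a change, as the thread notes, simply lowering spark.yarn.scheduler.heartbeat.interval-ms (e.g. to 1000) already saves several seconds at start-up, at the cost of somewhat more frequent requests to the ResourceManager. A plausible way to do that per application (the key and the 1000 ms value are from the thread; the rest is ordinary SparkConf usage) would be:

    import org.apache.spark.SparkConf

    val sparkConf = new SparkConf()
      .set("spark.yarn.scheduler.heartbeat.interval-ms", "1000")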
On Fri, May 8, 2015 at 12:14 PM Taeyun Kim <taeyun....@innowireless.com> wrote:

> I think so.
>
> In fact, the flow is: allocator.allocateResources() -> sleep -> allocator.allocateResources() -> sleep …
>
> But I guess that on the first allocateResources() the allocation is not fulfilled, so the sleep occurs.
>
> *From:* Zoltán Zvara [mailto:zoltan.zv...@gmail.com]
> *Sent:* Friday, May 08, 2015 4:25 PM
> *To:* Taeyun Kim; u...@spark.apache.org
> *Subject:* Re: YARN mode startup takes too long (10+ secs)
>
> So does this sleep occur before resources are allocated for the first few executors to start the job?
>
> On Fri, May 8, 2015 at 6:23 AM Taeyun Kim <taeyun....@innowireless.com> wrote:
>
> I think I've found the (maybe partial, but major) reason.
>
> It's between the following lines (newly captured, but essentially the same place that Zoltán Zvara picked):
>
> 15/05/08 11:36:32 INFO BlockManagerMaster: Registered BlockManager
> 15/05/08 11:36:38 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster04:55237/user/Executor#-149550753] with ID 1
>
> When I read the logs on the cluster side, the following lines were found (the exact timestamps differ from the lines above, but that is just the clock difference between machines):
>
> 15/05/08 11:36:23 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 5000
> 15/05/08 11:36:28 INFO impl.AMRMClientImpl: Received new token for : cluster04:45454
>
> It seemed that Spark deliberately sleeps 5 secs.
>
> I've read the Spark source code, and in org.apache.spark.deploy.yarn.ApplicationMaster.scala, launchReporterThread() had the code for that.
>
> It loops calling allocator.allocateResources() and Thread.sleep().
>
> For the sleep, it reads the configuration variable spark.yarn.scheduler.heartbeat.interval-ms (the default value is 5000, which is 5 secs).
>
> According to the comment, "we want to be reasonably responsive without causing too many requests to RM".
>
> So, unless YARN immediately fulfills the allocation request, it seems that 5 secs will be wasted.
>
> When I modified the configuration variable to 1000, it only waited for 1 sec.
>
> Here are the log lines after the change:
>
> 15/05/08 11:47:21 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 1000
> 15/05/08 11:47:22 INFO impl.AMRMClientImpl: Received new token for : cluster04:45454
>
> 4 secs saved.
>
> So, when one does not want to wait 5 secs, one can change spark.yarn.scheduler.heartbeat.interval-ms.
>
> I hope that the additional overhead it incurs would be negligible.
>
> *From:* Zoltán Zvara [mailto:zoltan.zv...@gmail.com]
> *Sent:* Thursday, May 07, 2015 10:05 PM
> *To:* Taeyun Kim; u...@spark.apache.org
> *Subject:* Re: YARN mode startup takes too long (10+ secs)
>
> Without considering everything, just a few hints:
>
> You are running on YARN. From 09:18:34 to 09:18:37 your application is in state ACCEPTED. There is noticeable overhead introduced by communicating with YARN's ResourceManager and NodeManager, and the YARN scheduler needs time to make a decision. I guess that somewhere from 09:18:38 to 09:18:43 your application JAR gets copied to another container requested by the Spark ApplicationMaster, which is deployed in YARN's container 0. Deploying an executor usually needs further resource negotiation with the ResourceManager.
> Also, as I said, your JAR and the Executor's code need to be copied to the container's local directory; execution is blocked until that is complete.
>
> On Thu, May 7, 2015 at 3:09 AM Taeyun Kim <taeyun....@innowireless.com> wrote:
>
> Hi,
>
> I'm running a Spark application in YARN-client or YARN-cluster mode, but it seems to take too long to start up.
>
> It takes 10+ seconds to initialize the Spark context.
>
> Is this normal? Or can it be optimized?
>
> The environment is as follows:
>
> - Hadoop: Hortonworks HDP 2.2 (Hadoop 2.6)
> - Spark: 1.3.1
> - Client: Windows 7, but similar result on CentOS 6.6
>
> The following is the startup part of the application log. (Some private information was edited.)
>
> 'Main: Initializing context' on the first line and 'MainProcessor: Deleting previous output files' on the last line are logged by the application; the lines in between are from Spark itself. The application logic is executed after this log is displayed.
>
> ---
>
> 15/05/07 09:18:31 INFO Main: Initializing context
> 15/05/07 09:18:31 INFO SparkContext: Running Spark version 1.3.1
> 15/05/07 09:18:31 INFO SecurityManager: Changing view acls to: myuser,myapp
> 15/05/07 09:18:31 INFO SecurityManager: Changing modify acls to: myuser,myapp
> 15/05/07 09:18:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(myuser, myapp); users with modify permissions: Set(myuser, myapp)
> 15/05/07 09:18:31 INFO Slf4jLogger: Slf4jLogger started
> 15/05/07 09:18:31 INFO Remoting: Starting remoting
> 15/05/07 09:18:31 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@mymachine:54449]
> 15/05/07 09:18:31 INFO Utils: Successfully started service 'sparkDriver' on port 54449.
> 15/05/07 09:18:31 INFO SparkEnv: Registering MapOutputTracker
> 15/05/07 09:18:32 INFO SparkEnv: Registering BlockManagerMaster
> 15/05/07 09:18:32 INFO DiskBlockManager: Created local directory at C:\Users\myuser\AppData\Local\Temp\spark-2d3db9d6-ea78-438e-956f-be9c1dcf3a9d\blockmgr-e9ade223-a4b8-4d9f-b038-efd66adf9772
> 15/05/07 09:18:32 INFO MemoryStore: MemoryStore started with capacity 1956.7 MB
> 15/05/07 09:18:32 INFO HttpFileServer: HTTP File server directory is C:\Users\myuser\AppData\Local\Temp\spark-ff40d73b-e8ab-433e-88c4-35da27fb6278\httpd-def9220f-ac3a-4dd2-9ac1-2c593b94b2d9
> 15/05/07 09:18:32 INFO HttpServer: Starting HTTP Server
> 15/05/07 09:18:32 INFO Server: jetty-8.y.z-SNAPSHOT
> 15/05/07 09:18:32 INFO AbstractConnector: Started SocketConnector@0.0.0.0:54450
> 15/05/07 09:18:32 INFO Utils: Successfully started service 'HTTP file server' on port 54450.
> 15/05/07 09:18:32 INFO SparkEnv: Registering OutputCommitCoordinator
> 15/05/07 09:18:32 INFO Server: jetty-8.y.z-SNAPSHOT
> 15/05/07 09:18:32 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
> 15/05/07 09:18:32 INFO Utils: Successfully started service 'SparkUI' on port 4040.
> 15/05/07 09:18:32 INFO SparkUI: Started SparkUI at http://mymachine:4040
> 15/05/07 09:18:32 INFO SparkContext: Added JAR file:/D:/Projects/MyApp/MyApp.jar at http://10.111.111.199:54450/jars/MyApp.jar with timestamp 1430957912240
> 15/05/07 09:18:32 INFO RMProxy: Connecting to ResourceManager at cluster01/10.111.111.11:8050
> 15/05/07 09:18:32 INFO Client: Requesting a new application from cluster with 3 NodeManagers
> 15/05/07 09:18:32 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (23040 MB per container)
> 15/05/07 09:18:32 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
> 15/05/07 09:18:32 INFO Client: Setting up container launch context for our AM
> 15/05/07 09:18:32 INFO Client: Preparing resources for our AM container
> 15/05/07 09:18:32 INFO Client: Source and destination file systems are the same. Not copying hdfs://cluster01/apps/spark/spark-assembly-1.3.1-hadoop2.6.0.jar
> 15/05/07 09:18:32 INFO Client: Setting up the launch environment for our AM container
> 15/05/07 09:18:33 INFO SecurityManager: Changing view acls to: myuser,myapp
> 15/05/07 09:18:33 INFO SecurityManager: Changing modify acls to: myuser,myapp
> 15/05/07 09:18:33 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(myuser, myapp); users with modify permissions: Set(myuser, myapp)
> 15/05/07 09:18:33 INFO Client: Submitting application 2 to ResourceManager
> 15/05/07 09:18:33 INFO YarnClientImpl: Submitted application application_1430956687773_0002
> 15/05/07 09:18:34 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
> 15/05/07 09:18:34 INFO Client:
>      client token: N/A
>      diagnostics: N/A
>      ApplicationMaster host: N/A
>      ApplicationMaster RPC port: -1
>      queue: default
>      start time: 1430957906540
>      final status: UNDEFINED
>      tracking URL: http://cluster01:8088/proxy/application_1430956687773_0002/
>      user: myapp
> 15/05/07 09:18:35 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
> 15/05/07 09:18:36 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
> 15/05/07 09:18:37 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
> 15/05/07 09:18:37 INFO YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkYarnAM@cluster02:39698/user/YarnAM#-1579648782]
> 15/05/07 09:18:37 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> cluster01, PROXY_URI_BASES -> http://cluster01:8088/proxy/application_1430956687773_0002), /proxy/application_1430956687773_0002
> 15/05/07 09:18:37 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
> 15/05/07 09:18:38 INFO Client: Application report for application_1430956687773_0002 (state: RUNNING)
> 15/05/07 09:18:38 INFO Client:
>      client token: N/A
>      diagnostics: N/A
>      ApplicationMaster host: cluster02
>      ApplicationMaster RPC port: 0
>      queue: default
>      start time: 1430957906540
>      final status: UNDEFINED
>      tracking URL: http://cluster01:8088/proxy/application_1430956687773_0002/
>      user: myapp
> 15/05/07 09:18:38 INFO YarnClientSchedulerBackend: Application application_1430956687773_0002 has started running.
> 15/05/07 09:18:38 INFO NettyBlockTransferService: Server created on 54491
> 15/05/07 09:18:38 INFO BlockManagerMaster: Trying to register BlockManager
> 15/05/07 09:18:38 INFO BlockManagerMasterActor: Registering block manager mymachine:54491 with 1956.7 MB RAM, BlockManagerId(<driver>, mymachine, 54491)
> 15/05/07 09:18:38 INFO BlockManagerMaster: Registered BlockManager
> 15/05/07 09:18:43 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster02:44996/user/Executor#-786778979] with ID 1
> 15/05/07 09:18:43 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
> 15/05/07 09:18:43 INFO MainProcessor: Deleting previous output files
>
> Thanks.