For tiny/small clusters (particularly single-tenant), you can set it to a lower value. But for anything reasonably large or multi-tenant, the request storm can be bad if a large enough number of applications start aggressively polling the RM. That is why the interval is configurable.
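For example, on a small single-tenant cluster the interval could be lowered when building the SparkConf. The snippet below is only an illustrative sketch (the app name and the 1000 ms value are arbitrary), not a recommendation for shared clusters:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative only: drop the AM->RM heartbeat from the 5000 ms default to 1 s.
    // Reasonable for a small single-tenant cluster; avoid on large multi-tenant ones,
    // since every running application would poll the ResourceManager at this rate.
    val conf = new SparkConf()
      .setAppName("MyApp") // hypothetical app name
      .set("spark.yarn.scheduler.heartbeat.interval-ms", "1000") // default is 5000

    val sc = new SparkContext(conf)

The same property can also be supplied at submit time with spark-submit --conf spark.yarn.scheduler.heartbeat.interval-ms=1000, which avoids hard-coding it in the application.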
- Mridul

On Mon, May 11, 2015 at 6:54 AM, Zoltán Zvara <zoltan.zv...@gmail.com> wrote:

> Isn't this issue something that should be improved? Based on the following discussion, there are two places where YARN's heartbeat interval is respected on job start-up, but do we really need to respect it on start-up?
>
> On Fri, May 8, 2015 at 12:14 PM Taeyun Kim <taeyun....@innowireless.com> wrote:
>
>> I think so.
>>
>> In fact, the flow is: allocator.allocateResources() -> sleep -> allocator.allocateResources() -> sleep …
>>
>> But I guess that on the first allocateResources() the allocation is not fulfilled, so the sleep occurs.
>>
>> *From:* Zoltán Zvara [mailto:zoltan.zv...@gmail.com]
>> *Sent:* Friday, May 08, 2015 4:25 PM
>> *To:* Taeyun Kim; u...@spark.apache.org
>> *Subject:* Re: YARN mode startup takes too long (10+ secs)
>>
>> So does this sleep occur before allocating resources for the first few executors to start the job?
>>
>> On Fri, May 8, 2015 at 6:23 AM Taeyun Kim <taeyun....@innowireless.com> wrote:
>>
>> I think I’ve found the (maybe partial, but major) reason.
>>
>> It’s between the following lines (it’s newly captured, but essentially the same place that Zoltán Zvara picked):
>>
>> 15/05/08 11:36:32 INFO BlockManagerMaster: Registered BlockManager
>> 15/05/08 11:36:38 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster04:55237/user/Executor#-149550753] with ID 1
>>
>> When I read the logs on the cluster side, the following lines were found (the exact times differ from the lines above, but that is just the clock difference between the machines):
>>
>> 15/05/08 11:36:23 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 5000
>> 15/05/08 11:36:28 INFO impl.AMRMClientImpl: Received new token for : cluster04:45454
>>
>> It seems that Spark deliberately sleeps for 5 secs.
>>
>> I’ve read the Spark source code, and in org.apache.spark.deploy.yarn.ApplicationMaster.scala, launchReporterThread() has the code for that.
>>
>> It loops calling allocator.allocateResources() and Thread.sleep().
>>
>> For the sleep, it reads the configuration variable spark.yarn.scheduler.heartbeat.interval-ms (the default value is 5000, which is 5 secs).
>>
>> According to the comment, “we want to be reasonably responsive without causing too many requests to RM”.
>>
>> So, unless YARN immediately fulfills the allocation request, it seems that 5 secs will be wasted.
>>
>> When I modified the configuration variable to 1000, it only waited for 1 sec.
>>
>> Here are the log lines after the change:
>>
>> 15/05/08 11:47:21 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 1000
>> 15/05/08 11:47:22 INFO impl.AMRMClientImpl: Received new token for : cluster04:45454
>>
>> 4 secs saved.
>>
>> So, when one does not want to wait 5 secs, one can change spark.yarn.scheduler.heartbeat.interval-ms.
>>
>> I hope that the additional overhead it incurs would be negligible.
>>
>> *From:* Zoltán Zvara [mailto:zoltan.zv...@gmail.com]
>> *Sent:* Thursday, May 07, 2015 10:05 PM
>> *To:* Taeyun Kim; u...@spark.apache.org
>> *Subject:* Re: YARN mode startup takes too long (10+ secs)
>>
>> Without considering everything, just a few hints:
>>
>> You are running on YARN. From 09:18:34 to 09:18:37 your application is in state ACCEPTED.
>> There is a noticeable overhead introduced by communicating with YARN's ResourceManager and NodeManager, and by the fact that the YARN scheduler needs time to make a decision. I guess that somewhere from 09:18:38 to 09:18:43 your application JAR gets copied to another container requested by the Spark ApplicationMaster deployed on YARN's container 0. Deploying an executor usually requires further resource negotiation with the ResourceManager. Also, as I said, your JAR and the executor's code need to be copied to the container's local directory - execution is blocked until that is complete.
>>
>> On Thu, May 7, 2015 at 3:09 AM Taeyun Kim <taeyun....@innowireless.com> wrote:
>>
>> Hi,
>>
>> I’m running a Spark application in YARN-client or YARN-cluster mode, but it seems to take too long to start up.
>>
>> It takes 10+ seconds to initialize the Spark context.
>>
>> Is this normal? Or can it be optimized?
>>
>> The environment is as follows:
>> - Hadoop: Hortonworks HDP 2.2 (Hadoop 2.6)
>> - Spark: 1.3.1
>> - Client: Windows 7, but similar result on CentOS 6.6
>>
>> The following is the startup part of the application log. (Some private information was edited.)
>>
>> ‘Main: Initializing context’ on the first line and ‘MainProcessor: Deleting previous output files’ on the last line are logged by the application; everything in between is from Spark itself. The application logic is executed after this log is displayed.
>>
>> ---
>>
>> 15/05/07 09:18:31 INFO Main: Initializing context
>> 15/05/07 09:18:31 INFO SparkContext: Running Spark version 1.3.1
>> 15/05/07 09:18:31 INFO SecurityManager: Changing view acls to: myuser,myapp
>> 15/05/07 09:18:31 INFO SecurityManager: Changing modify acls to: myuser,myapp
>> 15/05/07 09:18:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(myuser, myapp); users with modify permissions: Set(myuser, myapp)
>> 15/05/07 09:18:31 INFO Slf4jLogger: Slf4jLogger started
>> 15/05/07 09:18:31 INFO Remoting: Starting remoting
>> 15/05/07 09:18:31 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@mymachine:54449]
>> 15/05/07 09:18:31 INFO Utils: Successfully started service 'sparkDriver' on port 54449.
>> 15/05/07 09:18:31 INFO SparkEnv: Registering MapOutputTracker
>> 15/05/07 09:18:32 INFO SparkEnv: Registering BlockManagerMaster
>> 15/05/07 09:18:32 INFO DiskBlockManager: Created local directory at C:\Users\myuser\AppData\Local\Temp\spark-2d3db9d6-ea78-438e-956f-be9c1dcf3a9d\blockmgr-e9ade223-a4b8-4d9f-b038-efd66adf9772
>> 15/05/07 09:18:32 INFO MemoryStore: MemoryStore started with capacity 1956.7 MB
>> 15/05/07 09:18:32 INFO HttpFileServer: HTTP File server directory is C:\Users\myuser\AppData\Local\Temp\spark-ff40d73b-e8ab-433e-88c4-35da27fb6278\httpd-def9220f-ac3a-4dd2-9ac1-2c593b94b2d9
>> 15/05/07 09:18:32 INFO HttpServer: Starting HTTP Server
>> 15/05/07 09:18:32 INFO Server: jetty-8.y.z-SNAPSHOT
>> 15/05/07 09:18:32 INFO AbstractConnector: Started SocketConnector@0.0.0.0:54450
>> 15/05/07 09:18:32 INFO Utils: Successfully started service 'HTTP file server' on port 54450.
>> 15/05/07 09:18:32 INFO SparkEnv: Registering OutputCommitCoordinator
>> 15/05/07 09:18:32 INFO Server: jetty-8.y.z-SNAPSHOT
>> 15/05/07 09:18:32 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
>> 15/05/07 09:18:32 INFO Utils: Successfully started service 'SparkUI' on port 4040.
>> 15/05/07 09:18:32 INFO SparkUI: Started SparkUI at http://mymachine:4040
>> 15/05/07 09:18:32 INFO SparkContext: Added JAR file:/D:/Projects/MyApp/MyApp.jar at http://10.111.111.199:54450/jars/MyApp.jar with timestamp 1430957912240
>> 15/05/07 09:18:32 INFO RMProxy: Connecting to ResourceManager at cluster01/10.111.111.11:8050
>> 15/05/07 09:18:32 INFO Client: Requesting a new application from cluster with 3 NodeManagers
>> 15/05/07 09:18:32 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (23040 MB per container)
>> 15/05/07 09:18:32 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
>> 15/05/07 09:18:32 INFO Client: Setting up container launch context for our AM
>> 15/05/07 09:18:32 INFO Client: Preparing resources for our AM container
>> 15/05/07 09:18:32 INFO Client: Source and destination file systems are the same. Not copying hdfs://cluster01/apps/spark/spark-assembly-1.3.1-hadoop2.6.0.jar
>> 15/05/07 09:18:32 INFO Client: Setting up the launch environment for our AM container
>> 15/05/07 09:18:33 INFO SecurityManager: Changing view acls to: myuser,myapp
>> 15/05/07 09:18:33 INFO SecurityManager: Changing modify acls to: myuser,myapp
>> 15/05/07 09:18:33 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(myuser, myapp); users with modify permissions: Set(myuser, myapp)
>> 15/05/07 09:18:33 INFO Client: Submitting application 2 to ResourceManager
>> 15/05/07 09:18:33 INFO YarnClientImpl: Submitted application application_1430956687773_0002
>> 15/05/07 09:18:34 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
>> 15/05/07 09:18:34 INFO Client:
>> client token: N/A
>> diagnostics: N/A
>> ApplicationMaster host: N/A
>> ApplicationMaster RPC port: -1
>> queue: default
>> start time: 1430957906540
>> final status: UNDEFINED
>> tracking URL: http://cluster01:8088/proxy/application_1430956687773_0002/
>> user: myapp
>> 15/05/07 09:18:35 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
>> 15/05/07 09:18:36 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
>> 15/05/07 09:18:37 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
>> 15/05/07 09:18:37 INFO YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkYarnAM@cluster02:39698/user/YarnAM#-1579648782]
>> 15/05/07 09:18:37 INFO YarnClientSchedulerBackend: Add WebUI Filter.
>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> cluster01, PROXY_URI_BASES -> http://cluster01:8088/proxy/application_1430956687773_0002), /proxy/application_1430956687773_0002
>> 15/05/07 09:18:37 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
>> 15/05/07 09:18:38 INFO Client: Application report for application_1430956687773_0002 (state: RUNNING)
>> 15/05/07 09:18:38 INFO Client:
>> client token: N/A
>> diagnostics: N/A
>> ApplicationMaster host: cluster02
>> ApplicationMaster RPC port: 0
>> queue: default
>> start time: 1430957906540
>> final status: UNDEFINED
>> tracking URL: http://cluster01:8088/proxy/application_1430956687773_0002/
>> user: myapp
>> 15/05/07 09:18:38 INFO YarnClientSchedulerBackend: Application application_1430956687773_0002 has started running.
>> 15/05/07 09:18:38 INFO NettyBlockTransferService: Server created on 54491
>> 15/05/07 09:18:38 INFO BlockManagerMaster: Trying to register BlockManager
>> 15/05/07 09:18:38 INFO BlockManagerMasterActor: Registering block manager mymachine:54491 with 1956.7 MB RAM, BlockManagerId(<driver>, mymachine, 54491)
>> 15/05/07 09:18:38 INFO BlockManagerMaster: Registered BlockManager
>> 15/05/07 09:18:43 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster02:44996/user/Executor#-786778979] with ID 1
>> 15/05/07 09:18:43 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
>> 15/05/07 09:18:43 INFO MainProcessor: Deleting previous output files
>>
>> Thanks.
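The allocate-then-sleep behaviour discussed in this thread can be summarised with a small self-contained sketch. This is not the actual org.apache.spark.deploy.yarn.ApplicationMaster code; the names heartbeatIntervalMs, allocateResources and done are illustrative stand-ins:

    object ReporterLoopSketch {
      // Simplified model of the reporter loop described above: keep asking YARN for
      // resources, sleeping for the configured heartbeat interval between requests.
      def run(heartbeatIntervalMs: Long,      // spark.yarn.scheduler.heartbeat.interval-ms
              allocateResources: () => Unit,  // stand-in for allocator.allocateResources()
              done: () => Boolean): Unit = {
        while (!done()) {
          allocateResources()                 // request/receive containers from the RM
          Thread.sleep(heartbeatIntervalMs)   // with the 5000 ms default, each idle
                                              // iteration adds up to 5 s to startup
        }
      }
    }

This matches the gap visible in the log above: the BlockManager is registered at 09:18:38, but the first executor only registers at 09:18:43, one full 5-second heartbeat later.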