Hi Andrew,

Thanks for the advice. I didn't see those log lines in the NodeManager logs, so apparently something was wrong with the yarn-site.xml configuration. After digging in more, I realized it was a user error. I'm sharing it here so that others know what mistake I made.

When I reviewed the configurations, I noticed that there was another setting for the property "yarn.nodemanager.aux-services" in mapred-site.xml. It turns out that mapred-site.xml overrides the property "yarn.nodemanager.aux-services" in yarn-site.xml, and because of this the spark_shuffle service was never enabled. :( err......
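For anyone who wants to check for the same kind of duplicate, here is a minimal sketch in Python that lists properties defined in more than one Hadoop-style config file. The function names and the sample file contents are made up for illustration (they are not my actual files), and the script only reports duplicates; it does not try to decide which file wins.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def read_properties(xml_text):
    """Return {name: value} for a Hadoop-style <configuration> document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def find_conflicts(files):
    """Given {filename: xml_text}, return {property: {filename: value}}
    for every property that is defined in more than one file."""
    seen = defaultdict(dict)
    for filename, text in files.items():
        for name, value in read_properties(text).items():
            seen[name][filename] = value
    return {name: where for name, where in seen.items() if len(where) > 1}


# Illustrative contents only (assumed values, not taken from a real cluster).
YARN_SITE = """<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>
</configuration>"""

MAPRED_SITE = """<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>"""

conflicts = find_conflicts(
    {"yarn-site.xml": YARN_SITE, "mapred-site.xml": MAPRED_SITE}
)
for name, where in conflicts.items():
    print(name)
    for filename, value in where.items():
        print(f"  {filename}: {value}")
```

To run it against real files, read each file's text (for example with `open(path).read()`) and pass it in place of the sample strings.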
After deleting the redundant, invalid properties in mapred-site.xml, it started working. I now see the following logs from the NodeManager:

2015-07-21 21:24:44,046 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing YARN shuffle service for Spark
2015-07-21 21:24:44,046 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Adding auxiliary service spark_shuffle, "spark_shuffle"
2015-07-21 21:24:44,264 INFO org.apache.spark.network.yarn.YarnShuffleService: Started YARN shuffle service for Spark on port 7337. Authentication is not enabled.

I appreciate all the help and the pointers on where to look. Thanks, problem solved.

Date: Tue, 21 Jul 2015 09:31:50 -0700
Subject: Re: The auxService:spark_shuffle does not exist
From: and...@databricks.com
To: alee...@hotmail.com
CC: zjf...@gmail.com; rp...@njit.edu; user@spark.apache.org

Hi Andrew,

Based on your driver logs, it seems the issue is that the shuffle service is actually not running on the NodeManagers, but your application is trying to provide a "spark_shuffle" secret anyway. One way to verify whether the shuffle service is actually started is to look at the NodeManager logs for the following lines:

Initializing YARN shuffle service for Spark
Started YARN shuffle service for Spark on port X

These should be logged at the INFO level. Also, could you verify whether all the executors have this problem, or just a subset? If even one of the NMs doesn't have the shuffle service, you'll see the stack trace that you ran into. If the log statements above are missing, it would be good to confirm whether the yarn-site.xml change is actually reflected on all NMs.

Let me know if you can get it working. I've run the shuffle service myself on the master branch (which will become Spark 1.5.0) recently following the instructions and have not encountered any problems.

-Andrew
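
The log check described above could be scripted roughly as follows. This is only a sketch: the function name is hypothetical, the two message strings are the ones quoted in this thread, and the sample text is the NodeManager output posted above.

```python
import re

# Message strings as quoted in this thread.
INIT_MSG = "Initializing YARN shuffle service for Spark"
STARTED_RE = re.compile(r"Started YARN shuffle service for Spark on port (\d+)")


def shuffle_service_status(log_text):
    """Report whether the two shuffle-service lines appear in a NodeManager log."""
    match = STARTED_RE.search(log_text)
    return {
        "initialized": INIT_MSG in log_text,
        "started": match is not None,
        "port": int(match.group(1)) if match else None,
    }


# Sample: the NodeManager lines posted earlier in this thread.
SAMPLE_LOG = '''2015-07-21 21:24:44,046 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing YARN shuffle service for Spark
2015-07-21 21:24:44,046 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Adding auxiliary service spark_shuffle, "spark_shuffle"
2015-07-21 21:24:44,264 INFO org.apache.spark.network.yarn.YarnShuffleService: Started YARN shuffle service for Spark on port 7337. Authentication is not enabled.
'''

status = shuffle_service_status(SAMPLE_LOG)
print(status)
```

Since a single NodeManager without the shuffle service is enough to trigger the error, a check like this would need to be run against the log of every NodeManager, not just one.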