Hi Andrew,

Based on your driver logs, it seems the issue is that the shuffle service is not actually running on the NodeManagers (NMs), yet your application is still trying to provide a "spark_shuffle" secret. One way to verify whether the shuffle service has started is to look in the NodeManager logs for the following lines, which are logged at INFO level:

*Initializing YARN shuffle service for Spark*
*Started YARN shuffle service for Spark on port X*

Also, could you check whether *all* the executors have this problem, or just a subset? If even one of the NMs doesn't have the shuffle service running, you'll see the stack trace you ran into. If the log lines above are missing, it would be good to confirm that the yarn-site.xml change is actually reflected on all NMs.

Let me know if you can get it working. I recently ran the shuffle service myself on the master branch (which will become Spark 1.5.0), following the instructions, and did not encounter any problems.

-Andrew
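P.S. If there are many NMs to check, a small script can save some time. Below is a minimal Python sketch, not something that ships with Spark or YARN. It assumes you have copied each NM's log file to one machine (the paths passed on the command line are hypothetical) and that the two messages appear as quoted above. The helper `yarn_site_has_spark_shuffle` (also hypothetical) checks one host's yarn-site.xml for `spark_shuffle` in `yarn.nodemanager.aux-services` and for the `org.apache.spark.network.yarn.YarnShuffleService` class, as described in the shuffle service setup instructions.

```python
import re
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumption: the NM log contains the two INFO messages quoted above;
# the port number varies, so it is captured rather than hard-coded.
INIT_PATTERN = re.compile(r"Initializing YARN shuffle service for Spark")
STARTED_PATTERN = re.compile(r"Started YARN shuffle service for Spark on port (\d+)")

# Property names and class from Spark's YARN shuffle service setup instructions.
AUX_SERVICES_KEY = "yarn.nodemanager.aux-services"
SPARK_SHUFFLE_CLASS_KEY = "yarn.nodemanager.aux-services.spark_shuffle.class"
EXPECTED_CLASS = "org.apache.spark.network.yarn.YarnShuffleService"


def shuffle_service_status(log_text):
    """Return (initialized, port) for one NM log; port is None if it never started."""
    initialized = INIT_PATTERN.search(log_text) is not None
    match = STARTED_PATTERN.search(log_text)
    port = int(match.group(1)) if match else None
    return initialized, port


def yarn_site_has_spark_shuffle(xml_text):
    """Hypothetical helper: True if yarn-site.xml registers the spark_shuffle aux service."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        props[name] = value
    services = [s.strip() for s in props.get(AUX_SERVICES_KEY, "").split(",")]
    return ("spark_shuffle" in services
            and props.get(SPARK_SHUFFLE_CLASS_KEY) == EXPECTED_CLASS)


def main(log_paths):
    """Print one status line per NM log; return 1 if any NM lacks the 'Started' line."""
    missing = []
    for path in log_paths:
        text = Path(path).read_text(errors="replace")
        initialized, port = shuffle_service_status(text)
        if initialized and port is not None:
            print(f"{path}: shuffle service started on port {port}")
        else:
            missing.append(path)
            print(f"{path}: shuffle service NOT started")
    return 1 if missing else 0


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

For example, running it as `python check_shuffle.py nm1.log nm2.log` (hypothetical file names) prints one line per NM and exits non-zero if any of them is missing the "Started" message, so it can be dropped into a shell loop over hosts.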