Hi all, I am quite new to Spark and need some help troubleshooting an application running on a Spark cluster...
My Spark environment is deployed with Ambari (HDP); YARN is the resource scheduler and HDFS is the file system. The application I am trying to run is a Python script (test.py). The worker nodes only have Python 2.6, so I am asking Spark to spin up a virtual environment based on Python 2.7. I can successfully run this test app on a single node (see below):

-bash-4.1$ spark-submit \
>   --conf spark.pyspark.virtualenv.type=native \
>   --conf spark.pyspark.virtualenv.requirements=/home/mansop/requirements.txt \
>   --conf spark.pyspark.virtualenv.bin.path=/home/mansop/hail-test/python-2.7.2/bin/activate \
>   --conf spark.pyspark.python=/home/mansop/hail-test/python-2.7.2/bin/python \
>   --jars $HAIL_HOME/build/libs/hail-all-spark.jar \
>   --py-files $HAIL_HOME/build/distributions/hail-python.zip \
>   test.py
hail: info: SparkUI: http://192.168.10.201:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.1-0320a61
[Stage 2:==================================================>   (91 + 4) / 100]
Summary(samples=3, variants=308, call_rate=1.000000, contigs=['1'], multiallelics=0, snps=308, mnps=0, insertions=0, deletions=0, complex=0, star=0, max_alleles=2)

However, Spark crashes when I try to run the same script on the cluster (full output below). The error it eventually reports is that this file cannot be found:

/d0/hadoop/yarn/local/usercache/mansop/appcache/application_1512016123441_0032/container_1512016123441_0032_02_000001/tmp/1515989862748-0/bin/python

-bash-4.1$ spark-submit --master yarn \
>   --deploy-mode cluster \
>   --driver-memory 4g \
>   --executor-memory 2g \
>   --executor-cores 4 \
>   --queue default \
>   --conf spark.pyspark.virtualenv.type=native \
>   --conf spark.pyspark.virtualenv.requirements=/home/mansop/requirements.txt \
>   --conf spark.pyspark.virtualenv.bin.path=/home/mansop/hail-test/python-2.7.2/bin/activate \
>   --jars $HAIL_HOME/build/libs/hail-all-spark.jar \
>   --py-files $HAIL_HOME/build/distributions/hail-python.zip \
>   test.py
18/01/16 09:55:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/01/16 09:55:18 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
18/01/16 09:55:18 INFO RMProxy: Connecting to ResourceManager at wp-hdp-ctrl03-mlx.mlx/10.0.1.206:8050
18/01/16 09:55:18 INFO Client: Requesting a new application from cluster with 4 NodeManagers
18/01/16 09:55:18 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (450560 MB per container)
18/01/16 09:55:18 INFO Client: Will allocate AM container, with 4505 MB memory including 409 MB overhead
18/01/16 09:55:18 INFO Client: Setting up container launch context for our AM
18/01/16 09:55:18 INFO Client: Setting up the launch environment for our AM container
18/01/16 09:55:18 INFO Client: Preparing resources for our AM container
18/01/16 09:55:19 INFO Client: Use hdfs cache file as spark.yarn.archive for HDP, hdfsCacheFile:hdfs://wp-hdp-ctrl01-mlx.mlx:8020/hdp/apps/2.6.3.0-235/spark2/spark2-hdp-yarn-archive.tar.gz
18/01/16 09:55:19 INFO Client: Source and destination file systems are the same. Not copying hdfs://wp-hdp-ctrl01-mlx.mlx:8020/hdp/apps/2.6.3.0-235/spark2/spark2-hdp-yarn-archive.tar.gz
18/01/16 09:55:19 INFO Client: Uploading resource file:/home/mansop/hail-test2/hail/build/libs/hail-all-spark.jar -> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043/hail-all-spark.jar
18/01/16 09:55:20 INFO Client: Uploading resource file:/home/mansop/requirements.txt -> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043/requirements.txt
18/01/16 09:55:20 INFO Client: Uploading resource file:/home/mansop/test.py -> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043/test.py
18/01/16 09:55:20 INFO Client: Uploading resource file:/usr/hdp/2.6.3.0-235/spark2/python/lib/pyspark.zip -> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043/pyspark.zip
18/01/16 09:55:20 INFO Client: Uploading resource file:/usr/hdp/2.6.3.0-235/spark2/python/lib/py4j-0.10.4-src.zip -> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043/py4j-0.10.4-src.zip
18/01/16 09:55:20 INFO Client: Uploading resource file:/home/mansop/hail-test2/hail/build/distributions/hail-python.zip -> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043/hail-python.zip
18/01/16 09:55:20 INFO Client: Uploading resource file:/tmp/spark-888af623-c81d-4ff1-ac8a-15f25112cc4a/__spark_conf__1173722187739681647.zip -> hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043/__spark_conf__.zip
18/01/16 09:55:20 INFO SecurityManager: Changing view acls to: mansop
18/01/16 09:55:20 INFO SecurityManager: Changing modify acls to: mansop
18/01/16 09:55:20 INFO SecurityManager: Changing view acls groups to:
18/01/16 09:55:20 INFO SecurityManager: Changing modify acls groups to:
18/01/16 09:55:20 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mansop); groups with view permissions: Set(); users with modify permissions: Set(mansop); groups with modify permissions: Set()
18/01/16 09:55:20 INFO Client: Submitting application application_1512016123441_0043 to ResourceManager
18/01/16 09:55:20 INFO YarnClientImpl: Submitted application application_1512016123441_0043
18/01/16 09:55:21 INFO Client: Application report for application_1512016123441_0043 (state: ACCEPTED)
18/01/16 09:55:21 INFO Client:
     client token: N/A
     diagnostics: AM container is launched, waiting for AM container to Register with RM
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1516056920515
     final status: UNDEFINED
     tracking URL: http://wp-hdp-ctrl03-mlx.mlx:8088/proxy/application_1512016123441_0043/
     user: mansop
18/01/16 09:55:22 INFO Client: Application report for application_1512016123441_0043 (state: ACCEPTED)
18/01/16 09:55:23 INFO Client: Application report for application_1512016123441_0043 (state: ACCEPTED)
18/01/16 09:55:24 INFO Client: Application report for application_1512016123441_0043 (state: ACCEPTED)
18/01/16 09:55:25 INFO Client: Application report for application_1512016123441_0043 (state: ACCEPTED)
18/01/16 09:55:26 INFO Client: Application report for application_1512016123441_0043 (state: ACCEPTED)
18/01/16 09:55:27 INFO Client: Application report for application_1512016123441_0043 (state: ACCEPTED)
18/01/16 09:55:28 INFO Client: Application report for application_1512016123441_0043 (state: ACCEPTED)
18/01/16 09:55:29 INFO Client: Application report for application_1512016123441_0043 (state: ACCEPTED)
18/01/16 09:55:30 INFO Client: Application report for application_1512016123441_0043 (state: FAILED)
18/01/16 09:55:30 INFO Client:
     client token: N/A
     diagnostics: Application application_1512016123441_0043 failed 2 times due to AM Container for appattempt_1512016123441_0043_000002 exited with exitCode: 1
For more detailed output, check the application tracking page: http://wp-hdp-ctrl03-mlx.mlx:8088/cluster/app/application_1512016123441_0043 Then click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1512016123441_0043_02_000001
Exit code: 1
Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/d1/hadoop/yarn/local/filecache/11/spark2-hdp-yarn-archive.tar.gz/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.3.0-235/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/01/16 09:55:27 INFO SignalUtils: Registered signal handler for TERM
18/01/16 09:55:27 INFO SignalUtils: Registered signal handler for HUP
18/01/16 09:55:27 INFO SignalUtils: Registered signal handler for INT
18/01/16 09:55:28 INFO ApplicationMaster: Preparing Local resources
18/01/16 09:55:28 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1512016123441_0043_000002
18/01/16 09:55:28 INFO SecurityManager: Changing view acls to: yarn,mansop
18/01/16 09:55:28 INFO SecurityManager: Changing modify acls to: yarn,mansop
18/01/16 09:55:28 INFO SecurityManager: Changing view acls groups to:
18/01/16 09:55:28 INFO SecurityManager: Changing modify acls groups to:
18/01/16 09:55:28 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, mansop); groups with view permissions: Set(); users with modify permissions: Set(yarn, mansop); groups with modify permissions: Set()
18/01/16 09:55:28 INFO ApplicationMaster: Starting the user application in a separate Thread
18/01/16 09:55:28 INFO ApplicationMaster: Waiting for spark context initialization...
18/01/16 09:55:29 ERROR ApplicationMaster: User application exited with status 1
18/01/16 09:55:29 INFO ApplicationMaster: Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
18/01/16 09:55:29 ERROR ApplicationMaster: Uncaught exception:
org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:423)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:282)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:768)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:766)
    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: org.apache.spark.SparkUserAppException: User application exited with 1
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:105)
    at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:646)
18/01/16 09:55:29 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User application exited with status 1)
18/01/16 09:55:29 INFO ApplicationMaster: Deleting staging directory hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043
18/01/16 09:55:29 INFO ShutdownHookManager: Shutdown hook called
Failing this attempt. Failing the application.
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1516056920515
     final status: FAILED
     tracking URL: http://wp-hdp-ctrl03-mlx.mlx:8088/cluster/app/application_1512016123441_0043
     user: mansop
Exception in thread "main" org.apache.spark.SparkException: Application application_1512016123441_0043 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1187)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1233)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/01/16 09:55:30 INFO ShutdownHookManager: Shutdown hook called
18/01/16 09:55:30 INFO ShutdownHookManager: Deleting directory /tmp/spark-888af623-c81d-4ff1-ac8a-15f25112cc4a

QUESTION: Why can't Spark/YARN find this file:
/d0/hadoop/yarn/local/usercache/mansop/appcache/application_1512016123441_0032/container_1512016123441_0032_02_000001/tmp/1515989862748-0/bin/python?
Who is supposed to copy it there, and from where? And what do I need to do to make my spark-submit job run?

Thank you very much,

Manuel Sopena Ballesteros | Big Data Engineer
Garvan Institute of Medical Research
The Kinghorn Cancer Centre, 370 Victoria Street, Darlinghurst, NSW 2010
T: +61 (0)2 9355 5760 | F: +61 (0)2 9295 8507 | E: manuel...@garvan.org.au
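PS: in case it helps, test.py is essentially a minimal Hail 0.1 script along the lines of the sketch below (a sketch only; the VCF path is a placeholder for my real input, and the script does nothing more than import a small VCF and print the summary shown above):

# minimal Hail 0.1 test script (sketch; 'test.vcf' is a placeholder path)
from hail import HailContext

# HailContext attaches to the SparkContext created by spark-submit
hc = HailContext()

# import a small test VCF and print its dataset summary
# (this is what produces the Summary(samples=3, variants=308, ...) line above)
vds = hc.import_vcf('test.vcf')
print(vds.summarize())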