Hi all,

I am quite new to Spark and need some help troubleshooting the execution of an 
application running on a Spark cluster.

My Spark environment is deployed using Ambari (HDP), with YARN as the resource 
scheduler and HDFS as the file system.

The application I am trying to run is a Python script (test.py).
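
I have not pasted test.py here, but it is essentially a minimal Hail 0.1 smoke 
test along the lines of the sketch below (the VCF path is just a placeholder); 
something of this shape is what produces the Summary(...) output shown further 
down:

    from hail import HailContext

    # Start Hail on top of the Spark context and summarise a small test VCF
    hc = HailContext()
    vds = hc.import_vcf('/home/mansop/sample.vcf')  # placeholder path
    print(vds.summarize())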

The worker nodes have Python 2.6, so I am asking Spark to spin up a virtual 
environment based on Python 2.7 (a quick interpreter check is sketched below).
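
(As a sanity check of the virtualenv setup, a small snippet like the one below, 
run through spark-submit or a pyspark shell, should report which Python 
executable the driver and the executors actually end up using. This is only an 
illustrative check, not part of test.py.)

    import sys
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    # Interpreter used by the driver
    print('driver python: %s' % sys.executable)
    # Interpreters picked up by the executors
    print(sc.parallelize(range(100), 4)
            .map(lambda _: sys.executable)
            .distinct()
            .collect())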

I can successfully run this test app on a single node (see below):

-bash-4.1$ spark-submit \
> --conf spark.pyspark.virtualenv.type=native \
> --conf spark.pyspark.virtualenv.requirements=/home/mansop/requirements.txt \
> --conf spark.pyspark.virtualenv.bin.path=/home/mansop/hail-test/python-2.7.2/bin/activate \
> --conf spark.pyspark.python=/home/mansop/hail-test/python-2.7.2/bin/python \
> --jars $HAIL_HOME/build/libs/hail-all-spark.jar \
> --py-files $HAIL_HOME/build/distributions/hail-python.zip \
> test.py
hail: info: SparkUI: http://192.168.10.201:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.1-0320a61
[Stage 2:==================================================>     (91 + 4) / 100]
Summary(samples=3, variants=308, call_rate=1.000000, contigs=['1'], multiallelics=0, snps=308, mnps=0, insertions=0, deletions=0, complex=0, star=0, max_alleles=2)


However, Spark crashes while trying to run the same test script on the cluster 
(full output below), complaining about this file:
/d0/hadoop/yarn/local/usercache/mansop/appcache/application_1512016123441_0032/container_1512016123441_0032_02_000001/tmp/1515989862748-0/bin/python

-bash-4.1$ spark-submit --master yarn \
>     --deploy-mode cluster \
>     --driver-memory 4g \
>     --executor-memory 2g \
>     --executor-cores 4 \
>     --queue default \
>     --conf spark.pyspark.virtualenv.type=native \
>     --conf spark.pyspark.virtualenv.requirements=/home/mansop/requirements.txt \
>     --conf spark.pyspark.virtualenv.bin.path=/home/mansop/hail-test/python-2.7.2/bin/activate \
>     --jars $HAIL_HOME/build/libs/hail-all-spark.jar \
>     --py-files $HAIL_HOME/build/distributions/hail-python.zip \
>     test.py
18/01/16 09:55:17 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
18/01/16 09:55:18 WARN DomainSocketFactory: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.
18/01/16 09:55:18 INFO RMProxy: Connecting to ResourceManager at 
wp-hdp-ctrl03-mlx.mlx/10.0.1.206:8050
18/01/16 09:55:18 INFO Client: Requesting a new application from cluster with 4 
NodeManagers
18/01/16 09:55:18 INFO Client: Verifying our application has not requested more 
than the maximum memory capability of the cluster (450560 MB per container)
18/01/16 09:55:18 INFO Client: Will allocate AM container, with 4505 MB memory 
including 409 MB overhead
18/01/16 09:55:18 INFO Client: Setting up container launch context for our AM
18/01/16 09:55:18 INFO Client: Setting up the launch environment for our AM 
container
18/01/16 09:55:18 INFO Client: Preparing resources for our AM container
18/01/16 09:55:19 INFO Client: Use hdfs cache file as spark.yarn.archive for 
HDP, 
hdfsCacheFile:hdfs://wp-hdp-ctrl01-mlx.mlx:8020/hdp/apps/2.6.3.0-235/spark2/spark2-hdp-yarn-archive.tar.gz
18/01/16 09:55:19 INFO Client: Source and destination file systems are the 
same. Not copying 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/hdp/apps/2.6.3.0-235/spark2/spark2-hdp-yarn-archive.tar.gz
18/01/16 09:55:19 INFO Client: Uploading resource 
file:/home/mansop/hail-test2/hail/build/libs/hail-all-spark.jar -> 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043/hail-all-spark.jar
18/01/16 09:55:20 INFO Client: Uploading resource 
file:/home/mansop/requirements.txt -> 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043/requirements.txt
18/01/16 09:55:20 INFO Client: Uploading resource file:/home/mansop/test.py -> 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043/test.py
18/01/16 09:55:20 INFO Client: Uploading resource 
file:/usr/hdp/2.6.3.0-235/spark2/python/lib/pyspark.zip -> 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043/pyspark.zip
18/01/16 09:55:20 INFO Client: Uploading resource 
file:/usr/hdp/2.6.3.0-235/spark2/python/lib/py4j-0.10.4-src.zip -> 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043/py4j-0.10.4-src.zip
18/01/16 09:55:20 INFO Client: Uploading resource 
file:/home/mansop/hail-test2/hail/build/distributions/hail-python.zip -> 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043/hail-python.zip
18/01/16 09:55:20 INFO Client: Uploading resource 
file:/tmp/spark-888af623-c81d-4ff1-ac8a-15f25112cc4a/__spark_conf__1173722187739681647.zip
 -> 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043/__spark_conf__.zip
18/01/16 09:55:20 INFO SecurityManager: Changing view acls to: mansop
18/01/16 09:55:20 INFO SecurityManager: Changing modify acls to: mansop
18/01/16 09:55:20 INFO SecurityManager: Changing view acls groups to:
18/01/16 09:55:20 INFO SecurityManager: Changing modify acls groups to:
18/01/16 09:55:20 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users  with view permissions: Set(mansop); groups 
with view permissions: Set(); users  with modify permissions: Set(mansop); 
groups with modify permissions: Set()
18/01/16 09:55:20 INFO Client: Submitting application 
application_1512016123441_0043 to ResourceManager
18/01/16 09:55:20 INFO YarnClientImpl: Submitted application 
application_1512016123441_0043
18/01/16 09:55:21 INFO Client: Application report for 
application_1512016123441_0043 (state: ACCEPTED)
18/01/16 09:55:21 INFO Client:
         client token: N/A
         diagnostics: AM container is launched, waiting for AM container to 
Register with RM
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1516056920515
         final status: UNDEFINED
         tracking URL: 
http://wp-hdp-ctrl03-mlx.mlx:8088/proxy/application_1512016123441_0043/
         user: mansop
18/01/16 09:55:22 INFO Client: Application report for 
application_1512016123441_0043 (state: ACCEPTED)
18/01/16 09:55:23 INFO Client: Application report for 
application_1512016123441_0043 (state: ACCEPTED)
18/01/16 09:55:24 INFO Client: Application report for 
application_1512016123441_0043 (state: ACCEPTED)
18/01/16 09:55:25 INFO Client: Application report for 
application_1512016123441_0043 (state: ACCEPTED)
18/01/16 09:55:26 INFO Client: Application report for 
application_1512016123441_0043 (state: ACCEPTED)
18/01/16 09:55:27 INFO Client: Application report for 
application_1512016123441_0043 (state: ACCEPTED)
18/01/16 09:55:28 INFO Client: Application report for 
application_1512016123441_0043 (state: ACCEPTED)
18/01/16 09:55:29 INFO Client: Application report for 
application_1512016123441_0043 (state: ACCEPTED)
18/01/16 09:55:30 INFO Client: Application report for 
application_1512016123441_0043 (state: FAILED)
18/01/16 09:55:30 INFO Client:
         client token: N/A
         diagnostics: Application application_1512016123441_0043 failed 2 times 
due to AM Container for appattempt_1512016123441_0043_000002 exited with  
exitCode: 1
For more detailed output, check the application tracking page: 
http://wp-hdp-ctrl03-mlx.mlx:8088/cluster/app/application_1512016123441_0043 
Then click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1512016123441_0043_02_000001
Exit code: 1

Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/d1/hadoop/yarn/local/filecache/11/spark2-hdp-yarn-archive.tar.gz/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/usr/hdp/2.6.3.0-235/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/01/16 09:55:27 INFO SignalUtils: Registered signal handler for TERM
18/01/16 09:55:27 INFO SignalUtils: Registered signal handler for HUP
18/01/16 09:55:27 INFO SignalUtils: Registered signal handler for INT
18/01/16 09:55:28 INFO ApplicationMaster: Preparing Local resources
18/01/16 09:55:28 INFO ApplicationMaster: ApplicationAttemptId: 
appattempt_1512016123441_0043_000002
18/01/16 09:55:28 INFO SecurityManager: Changing view acls to: yarn,mansop
18/01/16 09:55:28 INFO SecurityManager: Changing modify acls to: yarn,mansop
18/01/16 09:55:28 INFO SecurityManager: Changing view acls groups to:
18/01/16 09:55:28 INFO SecurityManager: Changing modify acls groups to:
18/01/16 09:55:28 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users  with view permissions: Set(yarn, mansop); 
groups with view permissions: Set(); users  with modify permissions: Set(yarn, 
mansop); groups with modify permissions: Set()
18/01/16 09:55:28 INFO ApplicationMaster: Starting the user application in a 
separate Thread
18/01/16 09:55:28 INFO ApplicationMaster: Waiting for spark context 
initialization...
18/01/16 09:55:29 ERROR ApplicationMaster: User application exited with status 1
18/01/16 09:55:29 INFO ApplicationMaster: Final app status: FAILED, exitCode: 
1, (reason: User application exited with status 1)
18/01/16 09:55:29 ERROR ApplicationMaster: Uncaught exception:
org.apache.spark.SparkException: Exception thrown in awaitResult:
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
        at 
org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:423)
        at 
org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:282)
        at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:768)
        at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
        at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
        at 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
        at 
org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:766)
        at 
org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: org.apache.spark.SparkUserAppException: User application exited with 
1
        at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:105)
        at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:646)
18/01/16 09:55:29 INFO ApplicationMaster: Unregistering ApplicationMaster with 
FAILED (diag message: User application exited with status 1)
18/01/16 09:55:29 INFO ApplicationMaster: Deleting staging directory 
hdfs://wp-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1512016123441_0043
18/01/16 09:55:29 INFO ShutdownHookManager: Shutdown hook called

Failing this attempt. Failing the application.
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1516056920515
         final status: FAILED
         tracking URL: 
http://wp-hdp-ctrl03-mlx.mlx:8088/cluster/app/application_1512016123441_0043
         user: mansop
Exception in thread "main" org.apache.spark.SparkException: Application 
application_1512016123441_0043 finished with failed status
        at org.apache.spark.deploy.yarn.Client.run(Client.scala:1187)
        at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1233)
        at org.apache.spark.deploy.yarn.Client.main(Client.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
        at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/01/16 09:55:30 INFO ShutdownHookManager: Shutdown hook called
18/01/16 09:55:30 INFO ShutdownHookManager: Deleting directory 
/tmp/spark-888af623-c81d-4ff1-ac8a-15f25112cc4a

QUESTION:
Why can't Spark/YARN find this file: 
/d0/hadoop/yarn/local/usercache/mansop/appcache/application_1512016123441_0032/container_1512016123441_0032_02_000001/tmp/1515989862748-0/bin/python?
Who copies it there, and from where? And what do I need to do to make my 
spark-submit job run?

Thank you very much


Manuel Sopena Ballesteros | Big data Engineer
Garvan Institute of Medical Research
The Kinghorn Cancer Centre, 370 Victoria Street, Darlinghurst, NSW 2010
T: + 61 (0)2 9355 5760 | F: +61 (0)2 9295 8507 | E: manuel...@garvan.org.au

