Hi Sandy:
Thank you. I have not tried that mechanism (I wasn't aware of it). I will
try that instead.
Is it possible to also represent '--driver-memory' and
'--executor-memory' (and basically all properties)
using the '--conf' directive?
The Reason: I actually discovered the below issue while writing a custom
PYTHONSTARTUP script that I use
to launch *bpython* or *python* or my *WING python IDE*. That
script reads a python *dict* (from a file)
containing key/value pairs, from which it constructs the
"--driver-java-options ..." string; I will now
switch to generating '--conf key1=val1 --conf key2=val2 --conf key3=val3'
(and so on) instead.
If all of the properties could be represented in this way, it would make
the code cleaner (everything in
the dict file, and no one-offs).
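For illustration, here is a minimal sketch of the dict-to-flags
conversion I have in mind (the file name and dict contents are made up,
and whether every property -- presumably spark.driver.memory and
spark.executor.memory for the two flags above -- actually takes effect
when passed via '--conf' is exactly what I'm asking):

    # build_conf_args.py -- minimal sketch; assumes the file holds a
    # python dict literal, e.g.:
    #   {'spark.ui.port': '8468', 'spark.executor.memory': '1G'}
    import ast

    def build_conf_args(path):
        with open(path) as f:
            props = ast.literal_eval(f.read())
        args = []
        for key, val in sorted(props.items()):
            args.extend(['--conf', '%s=%s' % (key, val)])
        return args

    # -> --conf spark.executor.memory=1G --conf spark.ui.port=8468
    print(' '.join(build_conf_args('spark_props.dict')))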
Either way, thank you. =:)
Noel,
team didata
On 09/16/2014 08:03 PM, Sandy Ryza wrote:
Hi team didata,
This doesn't directly answer your question, but with Spark 1.1,
instead of using the driver options, it's better to pass your Spark
properties using the "--conf" option.
E.g.
pyspark --master yarn-client --conf spark.shuffle.spill=true --conf
spark.yarn.executor.memoryOverhead=512M
Additionally, driver memory and executor memory have dedicated options:
pyspark --master yarn-client --conf spark.shuffle.spill=true --conf
spark.yarn.executor.memoryOverhead=512M --driver-memory 3G
--executor-memory 5G
-Sandy
On Tue, Sep 16, 2014 at 6:22 PM, Dimension Data, LLC.
<subscripti...@didata.us> wrote:
Hello friends:
Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN
distribution. Everything went fine, and everything seems
to work, but for the following.
Following are two invocations of the 'pyspark' script: one with
enclosing quotes around the options passed to
'--driver-java-options', and one without them. I added the
following one-liner to the 'pyspark' script to
show my problem...
ADDED: echo "xxx${PYSPARK_SUBMIT_ARGS}xxx" # Added after the line
that exports this variable.
=========================================================
FIRST:
[ without enclosing quotes ]:
user@linux$ pyspark --master yarn-client --driver-java-options
-Dspark.executor.memory=1G -Dspark.ui.port=8468
-Dspark.driver.memory=512M
-Dspark.yarn.executor.memoryOverhead=512M
-Dspark.executor.instances=3
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar
xxx --master yarn-client --driver-java-options
-Dspark.executor.memory=1Gxxx <--- echo statement shows option
truncation.
While this succeeds in getting to a pyspark shell prompt (sc), the
context isn't set up properly because, as seen
in the echo output above and the log below, only the first option took
effect. (Note spark.executor.memory looks correct, but that's only
because my spark defaults happen to coincide with it.)
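(As an aside, the truncation is consistent with plain shell
word-splitting; here is a minimal python sketch using shlex, which
tokenizes the same way the shell does, just to illustrate why the
unquoted form only carries the first -D option:)

    # word_splitting_demo.py -- why the unquoted invocation truncates.
    import shlex

    cmd = ('pyspark --master yarn-client --driver-java-options '
           '-Dspark.executor.memory=1G -Dspark.ui.port=8468')
    # Without enclosing quotes, each -D token becomes its own argv
    # element, so '--driver-java-options' consumes only the first one.
    print(shlex.split(cmd))
    # ['pyspark', '--master', 'yarn-client', '--driver-java-options',
    #  '-Dspark.executor.memory=1G', '-Dspark.ui.port=8468']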
14/09/16 17:35:32 INFO yarn.Client: command: $JAVA_HOME/bin/java
-server -Xmx512m -Djava.io.tmpdir=$PWD/tmp
'-Dspark.tachyonStore.folderName=spark-e225c04d-5333-4ca6-9a78-1c3392438d89'
'-Dspark.serializer.objectStreamReset=100'
'-Dspark.executor.memory=1G' '-Dspark.rdd.compress=True'
'-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles='
'-Dspark.serializer=org.apache.spark.serializer.KryoSerializer'
'-Dspark.driver.host=dstorm' '-Dspark.driver.appUIHistoryAddress='
'-Dspark.app.name=PySparkShell'
'-Dspark.driver.appUIAddress=dstorm:4040'
'-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G'
'-Dspark.fileserver.uri=http://192.168.0.16:60305'
'-Dspark.driver.port=44616' '-Dspark.master=yarn-client'
org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused'
--jar null --arg 'dstorm:44616' --executor-memory 1024
--executor-cores 1 --num-executors 2 1> <LOG_DIR>/stdout 2>
<LOG_DIR>/stderr
(Note: I happened to notice that 'spark.driver.memory' is missing as
well.)
===========================================
NEXT:
[ So let's try with enclosing quotes ]
user@linux$ pyspark --master yarn-client --driver-java-options
'-Dspark.executor.memory=1G -Dspark.ui.port=8468
-Dspark.driver.memory=512M
-Dspark.yarn.executor.memoryOverhead=512M
-Dspark.executor.instances=3
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
xxx --master yarn-client --driver-java-options
"-Dspark.executor.memory=1G -Dspark.ui.port=8468
-Dspark.driver.memory=512M
-Dspark.yarn.executor.memoryOverhead=512M
-Dspark.executor.instances=3
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar"xxx
While this does carry all of the options (shown in the echo output
above and in the command executed below), the pyspark invocation fails,
indicating
that the application ended before I got to a shell prompt.
See the snippet below.
14/09/16 17:44:12 INFO yarn.Client: command: $JAVA_HOME/bin/java
-server -Xmx512m -Djava.io.tmpdir=$PWD/tmp
'-Dspark.tachyonStore.folderName=spark-3b62ece7-a22a-4d0a-b773-1f5601e5eada'
'-Dspark.executor.memory=1G' '-Dspark.driver.memory=512M'
'-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
'-Dspark.serializer.objectStreamReset=100'
'-Dspark.executor.instances=3' '-Dspark.rdd.compress=True'
'-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles='
'-Dspark.ui.port=8468' '-Dspark.driver.host=dstorm'
'-Dspark.serializer=org.apache.spark.serializer.KryoSerializer'
'-Dspark.driver.appUIHistoryAddress=' '-Dspark.app.name=PySparkShell'
'-Dspark.driver.appUIAddress=dstorm:8468'
'-Dspark.yarn.executor.memoryOverhead=512M'
'-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G
-Dspark.ui.port=8468 -Dspark.driver.memory=512M
-Dspark.yarn.executor.memoryOverhead=512M
-Dspark.executor.instances=3
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
'-Dspark.fileserver.uri=http://192.168.0.16:54171'
'-Dspark.master=yarn-client' '-Dspark.driver.port=58542'
org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused'
--jar null --arg 'dstorm:58542' --executor-memory 1024
--executor-cores 1 --num-executors 3 1> <LOG_DIR>/stdout 2>
<LOG_DIR>/stderr
[ ... SNIP ... ]
14/09/16 17:44:12 INFO cluster.YarnClientSchedulerBackend:
Application report from ASM:
appMasterRpcPort: -1
appStartTime: 1410903852044
yarnAppState: ACCEPTED
14/09/16 17:44:13 INFO cluster.YarnClientSchedulerBackend:
Application report from ASM:
appMasterRpcPort: -1
appStartTime: 1410903852044
yarnAppState: ACCEPTED
14/09/16 17:44:14 INFO cluster.YarnClientSchedulerBackend:
Application report from ASM:
appMasterRpcPort: -1
appStartTime: 1410903852044
yarnAppState: ACCEPTED
14/09/16 17:44:15 INFO cluster.YarnClientSchedulerBackend:
Application report from ASM:
appMasterRpcPort: 0
appStartTime: 1410903852044
yarnAppState: RUNNING
14/09/16 17:44:19 ERROR cluster.YarnClientSchedulerBackend: Yarn
application already ended: FAILED
Am I doing something wrong?
Thank you in advance!
Team didata