Hi Sandy:

Thank you. I have not tried that mechanism (I wasn't aware of it). I will try that instead.

Is it possible to also represent '--driver-memory' and '--executor-memory' (and basically all properties)
using the '--conf' directive?

The reason: I actually discovered the issue below while writing a custom PYTHONSTARTUP script that I use to launch *bpython*, *python*, or my *WING python IDE*. That script reads a python *dict* (from a file) containing key/value pairs, from which it constructs the "--driver-java-options ..." string; I will now switch to generating '--conf key1=val1 --conf key2=val2 --conf key3=val3' (and so on) instead.

If all of the properties could be represented this way, it would make the code cleaner (everything
in the dict file, and no one-offs).
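For what it's worth, here is the kind of helper I have in mind (a minimal sketch; the function name and the dict contents are made up for illustration):

```python
def conf_args(props):
    """Flatten a properties dict into repeated --conf key=value flags."""
    args = []
    for key in sorted(props):
        args += ["--conf", "%s=%s" % (key, props[key])]
    return args

# Example dict, as would be read from the file:
props = {"spark.ui.port": "8468",
         "spark.yarn.executor.memoryOverhead": "512M"}

print(" ".join(conf_args(props)))
# → --conf spark.ui.port=8468 --conf spark.yarn.executor.memoryOverhead=512M
```

The resulting list can be appended directly to the pyspark/spark-submit argument vector, with no quoting concerns since each key=value pair is its own argument.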

Either way, thank you. =:)

Noel,
team didata


On 09/16/2014 08:03 PM, Sandy Ryza wrote:
Hi team didata,

This doesn't directly answer your question, but with Spark 1.1, instead of using the driver options, it's better to pass your Spark properties using the "--conf" option.

E.g.
pyspark --master yarn-client --conf spark.shuffle.spill=true --conf spark.yarn.executor.memoryOverhead=512M

Additionally, driver memory and executor memory have dedicated options:

pyspark --master yarn-client --conf spark.shuffle.spill=true --conf spark.yarn.executor.memoryOverhead=512M --driver-memory 3G --executor-memory 5G

-Sandy


On Tue, Sep 16, 2014 at 6:22 PM, Dimension Data, LLC. <subscripti...@didata.us> wrote:



    Hello friends:

    Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN
    distribution. Everything went fine, and everything seems
    to work, but for the following.

    Following are two invocations of the 'pyspark' script: one without
    enclosing quotes around the options passed to
    '--driver-java-options', and one with them. I added the
    following one-liner to the 'pyspark' script to
    show my problem...

    ADDED: echo "xxx${PYSPARK_SUBMIT_ARGS}xxx" # Added after the line
    that exports this variable.

    =========================================================

    FIRST:
    [ without enclosing quotes ]:

    user@linux$ pyspark --master yarn-client --driver-java-options
    -Dspark.executor.memory=1G -Dspark.ui.port=8468
    -Dspark.driver.memory=512M
    -Dspark.yarn.executor.memoryOverhead=512M
    -Dspark.executor.instances=3
    
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar
    xxx --master yarn-client --driver-java-options
    -Dspark.executor.memory=1Gxxx  <--- echo statement shows option
    truncation.

    While this succeeds in getting to a pyspark shell prompt (sc), the
    context isn't set up properly because, as seen
    in red above and below, only the first option took effect.
    (Note: spark.executor.memory is correct, but that's only because
    my spark defaults happen to coincide with it.)
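    (To illustrate what I believe is happening: the shell word-splits the
    unquoted string, so the option parser only ever sees the first token.
    A toy sketch, nothing Spark-specific:)

```shell
# Hypothetical demo (not the actual pyspark parser): count how many
# arguments the shell actually delivers to a command.
count_args() {
  echo "$#"
}

count_args --driver-java-options -Dspark.a=1 -Dspark.b=2    # → 3 (three separate words)
count_args --driver-java-options '-Dspark.a=1 -Dspark.b=2'  # → 2 (flag plus ONE string)
```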

    14/09/16 17:35:32 INFO yarn.Client:   command: $JAVA_HOME/bin/java
    -server -Xmx512m -Djava.io.tmpdir=$PWD/tmp
    '-Dspark.tachyonStore.folderName=spark-e225c04d-5333-4ca6-9a78-1c3392438d89'
    '-Dspark.serializer.objectStreamReset=100'
    '-Dspark.executor.memory=1G' '-Dspark.rdd.compress=True'
    '-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles='
    '-Dspark.serializer=org.apache.spark.serializer.KryoSerializer'
    '-Dspark.driver.host=dstorm' '-Dspark.driver.appUIHistoryAddress='
    '-Dspark.app.name=PySparkShell'
    '-Dspark.driver.appUIAddress=dstorm:4040'
    '-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G'
    '-Dspark.fileserver.uri=http://192.168.0.16:60305'
    '-Dspark.driver.port=44616' '-Dspark.master=yarn-client'
    org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused'
    --jar  null  --arg 'dstorm:44616' --executor-memory 1024
    --executor-cores 1 --num-executors 2 1> <LOG_DIR>/stdout 2>
    <LOG_DIR>/stderr

    (Note: I also happened to notice that 'spark.driver.memory' is
    missing.)

    ===========================================

    NEXT:

    [ So let's try with enclosing quotes ]:

    user@linux$ pyspark --master yarn-client --driver-java-options
    '-Dspark.executor.memory=1G -Dspark.ui.port=8468
    -Dspark.driver.memory=512M
    -Dspark.yarn.executor.memoryOverhead=512M
    -Dspark.executor.instances=3
    
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
    xxx --master yarn-client --driver-java-options
    "-Dspark.executor.memory=1G -Dspark.ui.port=8468
    -Dspark.driver.memory=512M
    -Dspark.yarn.executor.memoryOverhead=512M
    -Dspark.executor.instances=3
    
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar"xxx

    While this does have all the options (shown in the red echo output
    above and in the command executed below), the pyspark invocation
    fails, indicating that the application ended before I got to a
    shell prompt. See the snippet below.

    14/09/16 17:44:12 INFO yarn.Client:   command: $JAVA_HOME/bin/java
    -server -Xmx512m -Djava.io.tmpdir=$PWD/tmp
    '-Dspark.tachyonStore.folderName=spark-3b62ece7-a22a-4d0a-b773-1f5601e5eada'
    '-Dspark.executor.memory=1G' '-Dspark.driver.memory=512M'
    
'-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
    '-Dspark.serializer.objectStreamReset=100'
    '-Dspark.executor.instances=3' '-Dspark.rdd.compress=True'
    '-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles='
    '-Dspark.ui.port=8468' '-Dspark.driver.host=dstorm'
    '-Dspark.serializer=org.apache.spark.serializer.KryoSerializer'
    '-Dspark.driver.appUIHistoryAddress='
    '-Dspark.app.name=PySparkShell'
    '-Dspark.driver.appUIAddress=dstorm:8468'
    '-Dspark.yarn.executor.memoryOverhead=512M'
    '-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G
    -Dspark.ui.port=8468 -Dspark.driver.memory=512M
    -Dspark.yarn.executor.memoryOverhead=512M
    -Dspark.executor.instances=3
    
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
    '-Dspark.fileserver.uri=http://192.168.0.16:54171'
    '-Dspark.master=yarn-client' '-Dspark.driver.port=58542'
    org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused'
    --jar  null  --arg  'dstorm:58542' --executor-memory 1024
    --executor-cores 1 --num-executors  3 1> <LOG_DIR>/stdout 2>
    <LOG_DIR>/stderr


    [ ... SNIP ... ]
    14/09/16 17:44:12 INFO cluster.YarnClientSchedulerBackend:
    Application report from ASM:
         appMasterRpcPort: -1
         appStartTime: 1410903852044
         yarnAppState: ACCEPTED

    14/09/16 17:44:13 INFO cluster.YarnClientSchedulerBackend:
    Application report from ASM:
         appMasterRpcPort: -1
         appStartTime: 1410903852044
         yarnAppState: ACCEPTED

    14/09/16 17:44:14 INFO cluster.YarnClientSchedulerBackend:
    Application report from ASM:
         appMasterRpcPort: -1
         appStartTime: 1410903852044
         yarnAppState: ACCEPTED

    14/09/16 17:44:15 INFO cluster.YarnClientSchedulerBackend:
    Application report from ASM:
         appMasterRpcPort: 0
         appStartTime: 1410903852044
         yarnAppState: RUNNING

    14/09/16 17:44:19 ERROR cluster.YarnClientSchedulerBackend: Yarn
    application already ended: FAILED


    Am I doing something wrong?

    Thank you in advance!
    Team didata





