Hi Sandy:

Thank you. I have not tried that mechanism (I wasn't aware of it). I will try that instead.

Is it possible to also represent '--driver-memory' and '--executor-memory' (and basically all properties)
using the '--conf' directive?

The reason: I actually discovered the issue below while writing a custom PYTHONSTARTUP script that I use to launch *bpython*, *python*, or my *WING python IDE*. That script reads a python *dict* (from a file) containing key/value pairs, from which it constructs the "--driver-java-options ..." string; I will now switch to generating '--conf key1=val1 --conf key2=val2 --conf key3=val3' (and so on) instead.

If all of the properties could be represented this way, it would make the code cleaner (everything
in the dict file, and no one-offs).
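For what it's worth, here is the kind of helper I have in mind (a minimal sketch; the function name and the dict contents are made up for illustration):

```python
def conf_args(props):
    """Flatten a properties dict into repeated --conf key=value flags."""
    args = []
    for key in sorted(props):
        args += ["--conf", "%s=%s" % (key, props[key])]
    return args

# Example dict, as would be read from the file:
props = {"spark.ui.port": "8468",
         "spark.yarn.executor.memoryOverhead": "512M"}

print(" ".join(conf_args(props)))
# → --conf spark.ui.port=8468 --conf spark.yarn.executor.memoryOverhead=512M
```

The resulting list can be appended directly to the pyspark/spark-submit argument vector, with no quoting concerns since each key=value pair is its own argument.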

Either way, thank you. =:)

Noel,
team didata


On 09/16/2014 08:03 PM, Sandy Ryza wrote:
Hi team didata,

This doesn't directly answer your question, but with Spark 1.1, instead of using the driver options, it's better to pass your Spark properties using the "--conf" option.

E.g.
pyspark --master yarn-client --conf spark.shuffle.spill=true --conf spark.yarn.executor.memoryOverhead=512M

Additionally, driver memory and executor memory have dedicated options:

pyspark --master yarn-client --conf spark.shuffle.spill=true --conf spark.yarn.executor.memoryOverhead=512M --driver-memory 3G --executor-memory 5G

-Sandy


On Tue, Sep 16, 2014 at 6:22 PM, Dimension Data, LLC. <subscripti...@didata.us> wrote:



    Hello friends:

    Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN
    distribution. Everything went fine, and everything seems
    to work, but for the following.

    Following are two invocations of the 'pyspark' script: one without
    enclosing quotes around the options passed to
    '--driver-java-options', and one with them. I added the
    following one-liner to the 'pyspark' script to
    show my problem...

    ADDED: echo "xxx${PYSPARK_SUBMIT_ARGS}xxx" # Added after the line
    that exports this variable.

    =========================================================

    FIRST:
    [ without enclosing quotes ]:

    user@linux$ pyspark --master yarn-client --driver-java-options
    -Dspark.executor.memory=1G -Dspark.ui.port=8468
    -Dspark.driver.memory=512M
    -Dspark.yarn.executor.memoryOverhead=512M
    -Dspark.executor.instances=3
    
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar
    xxx --master yarn-client --driver-java-options
    -Dspark.executor.memory=1Gxxx  <--- echo statement shows option
    truncation.

    While this succeeds in getting to a pyspark shell prompt (sc), the
    context isn't set up properly because, as seen
    in red above and below, only the first option took effect.
    (Note: spark.executor.memory is correct, but that's only because
    my spark defaults happen to coincide with it.)
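    (To illustrate what I believe is happening: the shell word-splits the
    unquoted string, so the option parser only ever sees the first token.
    A toy sketch, nothing Spark-specific:)

```shell
# Hypothetical demo (not the actual pyspark parser): count how many
# arguments the shell actually delivers to a command.
count_args() {
  echo "$#"
}

count_args --driver-java-options -Dspark.a=1 -Dspark.b=2    # → 3 (three separate words)
count_args --driver-java-options '-Dspark.a=1 -Dspark.b=2'  # → 2 (flag plus ONE string)
```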

    14/09/16 17:35:32 INFO yarn.Client:   command: $JAVA_HOME/bin/java
    -server -Xmx512m -Djava.io.tmpdir=$PWD/tmp
    '-Dspark.tachyonStore.folderName=spark-e225c04d-5333-4ca6-9a78-1c3392438d89'
    '-Dspark.serializer.objectStreamReset=100'
    '-Dspark.executor.memory=1G' '-Dspark.rdd.compress=True'
    '-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles='
    '-Dspark.serializer=org.apache.spark.serializer.KryoSerializer'
    '-Dspark.driver.host=dstorm' '-Dspark.driver.appUIHistoryAddress='
    '-Dspark.app.name=PySparkShell'
    '-Dspark.driver.appUIAddress=dstorm:4040'
    '-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G'
    '-Dspark.fileserver.uri=http://192.168.0.16:60305'
    '-Dspark.driver.port=44616' '-Dspark.master=yarn-client'
    org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused'
    --jar  null  --arg 'dstorm:44616' --executor-memory 1024
    --executor-cores 1 --num-executors 2 1> <LOG_DIR>/stdout 2>
    <LOG_DIR>/stderr

    (Note: I also happened to notice that 'spark.driver.memory' is
    missing.)

    ===========================================

    NEXT:

    [ So let's try with enclosing quotes ]:

    user@linux$ pyspark --master yarn-client --driver-java-options
    '-Dspark.executor.memory=1G -Dspark.ui.port=8468
    -Dspark.driver.memory=512M
    -Dspark.yarn.executor.memoryOverhead=512M
    -Dspark.executor.instances=3
    
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
    xxx --master yarn-client --driver-java-options
    "-Dspark.executor.memory=1G -Dspark.ui.port=8468
    -Dspark.driver.memory=512M
    -Dspark.yarn.executor.memoryOverhead=512M
    -Dspark.executor.instances=3
    
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar"xxx

    While this does have all the options (shown in the red echo output
    above and in the command executed below), the pyspark invocation
    fails, indicating that the application ended before I got to a
    shell prompt. See the snippet below.

    14/09/16 17:44:12 INFO yarn.Client:   command: $JAVA_HOME/bin/java
    -server -Xmx512m -Djava.io.tmpdir=$PWD/tmp
    '-Dspark.tachyonStore.folderName=spark-3b62ece7-a22a-4d0a-b773-1f5601e5eada'
    '-Dspark.executor.memory=1G' '-Dspark.driver.memory=512M'
    
'-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
    '-Dspark.serializer.objectStreamReset=100'
    '-Dspark.executor.instances=3' '-Dspark.rdd.compress=True'
    '-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles='
    '-Dspark.ui.port=8468' '-Dspark.driver.host=dstorm'
    '-Dspark.serializer=org.apache.spark.serializer.KryoSerializer'
    '-Dspark.driver.appUIHistoryAddress='
    '-Dspark.app.name=PySparkShell'
    '-Dspark.driver.appUIAddress=dstorm:8468'
    '-Dspark.yarn.executor.memoryOverhead=512M'
    '-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G
    -Dspark.ui.port=8468 -Dspark.driver.memory=512M
    -Dspark.yarn.executor.memoryOverhead=512M
    -Dspark.executor.instances=3
    
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
    '-Dspark.fileserver.uri=http://192.168.0.16:54171'
    '-Dspark.master=yarn-client' '-Dspark.driver.port=58542'
    org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused'
    --jar  null  --arg  'dstorm:58542' --executor-memory 1024
    --executor-cores 1 --num-executors  3 1> <LOG_DIR>/stdout 2>
    <LOG_DIR>/stderr


    [ ... SNIP ... ]
    14/09/16 17:44:12 INFO cluster.YarnClientSchedulerBackend:
    Application report from ASM:
         appMasterRpcPort: -1
         appStartTime: 1410903852044
         yarnAppState: ACCEPTED

    14/09/16 17:44:13 INFO cluster.YarnClientSchedulerBackend:
    Application report from ASM:
         appMasterRpcPort: -1
         appStartTime: 1410903852044
         yarnAppState: ACCEPTED

    14/09/16 17:44:14 INFO cluster.YarnClientSchedulerBackend:
    Application report from ASM:
         appMasterRpcPort: -1
         appStartTime: 1410903852044
         yarnAppState: ACCEPTED

    14/09/16 17:44:15 INFO cluster.YarnClientSchedulerBackend:
    Application report from ASM:
         appMasterRpcPort: 0
         appStartTime: 1410903852044
         yarnAppState: RUNNING

    14/09/16 17:44:19 ERROR cluster.YarnClientSchedulerBackend: Yarn
    application already ended: FAILED


    Am I doing something wrong?

    Thank you in advance!
    Team didata





