Hi
I am working with a Cloudera 5 cluster with 192 nodes and can’t work out how to
get the Spark REPL to use more than 2 nodes in an interactive session.
So, this works, but is non-interactive (using yarn-client as MASTER)
/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/bin/spark-class \
org.apache.spark.deploy.yarn.Client \
--jar /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/examples/lib/spark-examples_2.10-0.9.0-cdh5.0.0.jar \
--class org.apache.spark.examples.SparkPi \
--args yarn-standalone \
--args 10 \
--num-workers 100
There does not appear to be an (obvious?) way to get more than 2 nodes involved
from the repl.
I am running the REPL like this:
#!/bin/sh
. /etc/spark/conf.cloudera.spark/spark-env.sh
export SPARK_JAR=hdfs://nameservice1/user/spark/share/lib/spark-assembly.jar
export SPARK_WORKER_MEMORY=512m
export MASTER=yarn-client
exec $SPARK_HOME/bin/spark-shell
Now if I comment out the line with 'export SPARK_JAR=…' and run this again, I
get an error like this:
14/05/19 08:03:41 ERROR Client: Error: You must set SPARK_JAR environment variable!
Usage: org.apache.spark.deploy.yarn.Client [options]
Options:
  --jar JAR_PATH       Path to your application's JAR file (required in yarn-cluster mode)
  --class CLASS_NAME   Name of your application's main class (required)
  --args ARGS          Arguments to be passed to your application's main class.
                       Mutliple invocations are possible, each will be passed in order.
  --num-workers NUM    Number of workers to start (Default: 2)
[…]
But none of those options are exposed at the 'spark-shell' level.
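The only avenue I can think of is environment variables, since SPARK_WORKER_MEMORY in my launcher script is already passed that way. Below is a sketch of the lines I would add to that script before the exec line. It assumes that yarn-client mode in this Spark 0.9 build reads SPARK_WORKER_INSTANCES and SPARK_WORKER_CORES, as the Spark 0.9 "Running on YARN" notes suggest. I have not confirmed that the CDH 5.0.0 parcel honours them, and the value 100 is simply copied from the --num-workers setting in the non-interactive run above.

```shell
# Assumption: yarn-client mode takes its worker count from the environment,
# in place of the --num-workers option (whose default is 2).
export SPARK_WORKER_INSTANCES=100

# Assumption: cores per worker; 1 is what the docs describe as the default.
export SPARK_WORKER_CORES=1
```

If these variables are picked up, the YARN application started by spark-shell should request 100 containers instead of 2.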
Thanks in advance for your guidance.
Eric