Yes, I tried setting YARN_CONF_DIR, but with no luck. I will play around with environment variables and system properties and post back in case of success. Thanks for your help so far!
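Concretely, what I plan to try is roughly the following sketch (only a sketch: the extra property is an assumption about what spark-submit would normally prepare, and it presumably only works if YARN_CONF_DIR / HADOOP_CONF_DIR point this JVM at the cluster's yarn-site.xml, since the 0.0.0.0:8032 retry below is just the default resource manager address):

    import org.apache.spark.{SparkConf, SparkContext}

    object YarnClientDirect {
      def main(args: Array[String]): Unit = {
        // Assumes YARN_CONF_DIR (or HADOOP_CONF_DIR) is set in the environment of
        // THIS JVM, so the YARN client can find yarn-site.xml instead of falling
        // back to the default resource manager address 0.0.0.0:8032.
        val conf = new SparkConf()
          .setAppName("My App")
          .setMaster("yarn-client")
          // Assumption: one of the properties spark-submit would normally fill in,
          // telling YARN where to find the Spark assembly (the path is made up).
          .set("spark.yarn.jar", "hdfs:///apps/spark/spark-assembly.jar")
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 10).collect().mkString(", "))
        sc.stop()
      }
    }

If this still loops on 0.0.0.0:8032, that would point to the Hadoop/YARN configuration not being visible to the JVM rather than to anything spark-submit does.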
On Thu, Apr 14, 2016 at 5:48 AM, Sun, Rui <rui....@intel.com> wrote:

> In SparkSubmit, there is less work for yarn-client than for yarn-cluster.
> Basically it prepares some Spark configurations as system properties, for
> example, information on additional resources required by the application
> that need to be distributed to the cluster. These configurations will be
> used in SparkContext initialization later.
>
> So generally for yarn-client, maybe you can skip spark-submit and directly
> launch the Spark application with those configurations set up before
> creating the SparkContext.
>
> Not sure about your error. Have you set up YARN_CONF_DIR?
>
> From: Andrei [mailto:faithlessfri...@gmail.com]
> Sent: Thursday, April 14, 2016 5:45 AM
> To: Sun, Rui <rui....@intel.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: How does spark-submit handle Python scripts (and how to repeat it)?
>
> > Julia can pick up the env var, and set the system properties or directly
> > fill the configurations into a SparkConf, and then create a SparkContext.
>
> That's the point - just setting the master to "yarn-client" doesn't work,
> even in Java/Scala. E.g. the following code in Scala:
>
>     val conf = new SparkConf().setAppName("My App").setMaster("yarn-client")
>     val sc = new SparkContext(conf)
>     sc.parallelize(1 to 10).collect()
>     sc.stop()
>
> results in an error:
>
>     Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032
>
> I think for now we can even put Julia aside and concentrate on the
> following question: how does submitting an application via `spark-submit`
> in "yarn-client" mode differ from setting the same mode directly in
> `SparkConf`?
>
> On Wed, Apr 13, 2016 at 5:06 AM, Sun, Rui <rui....@intel.com> wrote:
>
> Spark configurations specified at the command line for spark-submit should
> be passed to the JVM inside the Julia process. You can refer to
> https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L267
> and
> https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L295
>
> Generally:
>
>     spark-submit JVM -> JuliaRunner -> env var like "JULIA_SUBMIT_ARGS"
>       -> julia process -> new JVM with SparkContext
>
> Julia can pick up the env var, and set the system properties or directly
> fill the configurations into a SparkConf, and then create a SparkContext.
>
> Yes, you are right, `spark-submit` creates a new Python/R process that
> connects back to that same JVM and creates the SparkContext in it.
> Refer to
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L47
> and
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/RRunner.scala#L65
>
> From: Andrei [mailto:faithlessfri...@gmail.com]
> Sent: Wednesday, April 13, 2016 4:32 AM
> To: Sun, Rui <rui....@intel.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: How does spark-submit handle Python scripts (and how to repeat it)?
>
> > One part is passing the command line options, like "--master", from the
> > JVM launched by spark-submit to the JVM where SparkContext resides
>
> Since I have full control over both - the JVM and Julia parts - I can pass
> whatever options to both. But what exactly should be passed?
> Currently the pipeline looks like this:
>
>     spark-submit JVM -> JuliaRunner -> julia process -> new JVM with SparkContext
>
> I want to make the last JVM's SparkContext understand that it should run
> on YARN. Obviously, I can't pass the `--master yarn` option to the JVM
> itself. Instead, I can pass the system property "spark.master" =
> "yarn-client", but this results in an error:
>
>     Retrying connect to server: 0.0.0.0/0.0.0.0:8032
>
> So it's definitely not enough. I tried to set manually all the system
> properties that `spark-submit` adds to the JVM (including
> "spark-submit=true", "spark.submit.deployMode=client", etc.), but it
> didn't help either. Source code is always good, but for a stranger like me
> it's a little bit hard to grasp the control flow in the SparkSubmit class.
>
> > For pySpark & SparkR, when running scripts in client deployment modes
> > (standalone client and yarn client), the JVM is the same (py4j/RBackend
> > running as a thread in the JVM launched by spark-submit)
>
> Can you elaborate on this? Does it mean that `spark-submit` creates a new
> Python/R process that connects back to that same JVM and creates the
> SparkContext in it?
>
> On Tue, Apr 12, 2016 at 2:04 PM, Sun, Rui <rui....@intel.com> wrote:
>
> There is much deployment preparation work handling the different
> deployment modes for pyspark and SparkR in SparkSubmit. It is difficult to
> summarize it briefly; you had better refer to the source code.
>
> Supporting running Julia scripts in SparkSubmit is more than implementing
> a 'JuliaRunner'. One part is passing the command line options, like
> "--master", from the JVM launched by spark-submit to the JVM where the
> SparkContext resides, in the case that the two JVMs are not the same. For
> pySpark & SparkR, when running scripts in client deployment modes
> (standalone client and yarn client), the JVM is the same (py4j/RBackend
> running as a thread in the JVM launched by spark-submit), so there is no
> need to pass the command line options around. However, in your case, the
> Julia interpreter launches an in-process JVM for the SparkContext, which
> is a separate JVM from the one launched by spark-submit. So you need a
> way, typically an environment variable, like "SPARKR_SUBMIT_ARGS" for
> SparkR or "PYSPARK_SUBMIT_ARGS" for pyspark, to pass the command line args
> to the in-process JVM in the Julia interpreter so that SparkConf can pick
> up the options.
>
> From: Andrei [mailto:faithlessfri...@gmail.com]
> Sent: Tuesday, April 12, 2016 3:48 AM
> To: user <user@spark.apache.org>
> Subject: How does spark-submit handle Python scripts (and how to repeat it)?
>
> I'm working on a wrapper [1] around Spark for the Julia programming
> language [2], similar to PySpark. I've got it working with the Spark
> Standalone server by creating a local JVM and setting the master
> programmatically. However, this approach doesn't work with YARN (and
> probably Mesos), which require running via `spark-submit`.
>
> In the `SparkSubmit` class I see that for Python a special class
> `PythonRunner` is launched, so I tried to do a similar `JuliaRunner`,
> which essentially does the following:
>
>     val pb = new ProcessBuilder("julia", juliaScript)
>     val process = pb.start()
>     process.waitFor()
>
> where `juliaScript` itself creates a new JVM and a `SparkContext` inside
> it WITHOUT setting the master URL.
> I then tried to launch this class using
>
>     spark-submit --master yarn \
>       --class o.a.s.a.j.JuliaRunner \
>       project.jar my_script.jl
>
> I expected that `spark-submit` would set environment variables or
> something that the SparkContext would then read and use to connect to the
> appropriate master. This didn't happen, however, and the process failed
> while trying to instantiate the `SparkContext`, saying that the master is
> not specified.
>
> So what am I missing? How can I use `spark-submit` to run the driver in a
> non-JVM language?
>
> [1]: https://github.com/dfdx/Sparta.jl
> [2]: http://julialang.org/
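Putting Rui's suggestions together, here is a rough sketch of the PYSPARK_SUBMIT_ARGS / SPARKR_SUBMIT_ARGS-style handoff on the spark-submit side. JuliaRunner and JULIA_SUBMIT_ARGS are just the names proposed in this thread, not existing Spark classes, and exactly which properties are worth forwarding is an assumption:

    import scala.collection.JavaConverters._

    // Hypothetical runner launched via `spark-submit --class JuliaRunner ...`,
    // loosely mirroring what PythonRunner/RRunner do for their languages.
    object JuliaRunner {
      def main(args: Array[String]): Unit = {
        val juliaScript = args(0)

        // In client mode spark-submit has already turned its command line options
        // into spark.* system properties in this JVM; forward them to Julia as
        // JVM options for the in-process JVM it will create.
        val submitArgs = sys.props.collect {
          case (k, v) if k.startsWith("spark.") => s"-D$k=$v"
        }.mkString(" ")

        val pb = new ProcessBuilder(Seq("julia", juliaScript).asJava)
        pb.environment().put("JULIA_SUBMIT_ARGS", submitArgs) // env var name suggested above
        pb.inheritIO()                                        // share stdout/stderr with spark-submit
        sys.exit(pb.start().waitFor())
      }
    }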
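On the Julia-process side of that handoff, the wrapper would read JULIA_SUBMIT_ARGS and pass its contents as options when it starts the in-process JVM; after that, very little is needed on the JVM side, because a default SparkConf picks up every spark.* system property. The helper below is hypothetical (not part of any existing API), just to show that no master URL has to be hard-coded:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical entry point the Julia wrapper could call (e.g. through JavaCall)
    // once its in-process JVM is up. The forwarded spark.* settings arrived as -D
    // options, so new SparkConf() loads them automatically, including spark.master.
    object JuliaSparkContext {
      def create(appName: String): SparkContext = {
        val conf = new SparkConf().setAppName(appName) // spark.* system properties already loaded
        new SparkContext(conf)
      }
    }

Whether this is enough for yarn-client in practice still hinges on the earlier point about YARN_CONF_DIR: if yarn-site.xml is not visible to the Julia-spawned JVM, the 0.0.0.0:8032 retries will come back no matter how the options are forwarded.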