SparkSubmit does a significant amount of deployment-preparation work to handle the different deploy modes for PySpark and SparkR. It is hard to summarize briefly; you are better off reading the source code.
Supporting Julia scripts in SparkSubmit takes more than implementing a 'JuliaRunner'. One part is passing the command-line options, such as "--master", from the JVM launched by spark-submit to the JVM where the SparkContext lives, in the case where the two are not the same JVM. For PySpark and SparkR, when scripts run in the client deploy modes (standalone client and YARN client), the JVM is the same (py4j/RBackend runs as a thread inside the JVM launched by spark-submit), so there is no need to pass the command-line options around. In your case, however, the Julia interpreter launches an in-process JVM for the SparkContext, which is a separate JVM from the one launched by spark-submit. So you need a way, typically an environment variable such as "SPARKR_SUBMIT_ARGS" for SparkR or "PYSPARK_SUBMIT_ARGS" for PySpark, to pass the command-line args to the in-process JVM inside the Julia interpreter so that SparkConf can pick the options up (a sketch of this hand-off is appended after the quoted message below).

From: Andrei [mailto:faithlessfri...@gmail.com]
Sent: Tuesday, April 12, 2016 3:48 AM
To: user <user@spark.apache.org>
Subject: How does spark-submit handle Python scripts (and how to repeat it)?

I'm working on a wrapper [1] around Spark for the Julia programming language [2], similar to PySpark. I've got it working with the Spark Standalone server by creating a local JVM and setting the master programmatically. However, this approach doesn't work with YARN (and probably Mesos), which require running via `spark-submit`.

In the `SparkSubmit` class I see that for Python a special class, `PythonRunner`, is launched, so I tried to write a similar `JuliaRunner`, which essentially does the following:

    val pb = new ProcessBuilder(Seq("julia", juliaScript))
    val process = pb.start()
    process.waitFor()

where `juliaScript` itself creates a new JVM and a `SparkContext` inside it WITHOUT setting the master URL.

I then tried to launch this class using:

    spark-submit --master yarn \
        --class o.a.s.a.j.JuliaRunner \
        project.jar my_script.jl

I expected that `spark-submit` would set environment variables or something that the SparkContext would then read and connect to the appropriate master. This didn't happen, however, and the process failed while trying to instantiate the `SparkContext`, saying that the master is not specified.

So what am I missing? How can I use `spark-submit` to run the driver in a non-JVM language?

[1]: https://github.com/dfdx/Sparta.jl
[2]: http://julialang.org/
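
P.S. Here is a minimal sketch of the hand-off described above, in the spirit of PythonRunner. Everything in it is an assumption rather than an existing Spark API: the variable name JULIA_SUBMIT_ARGS and the argument layout are made up for illustration, and the Julia side would have to agree on the same variable and feed its contents into the SparkConf of the in-process JVM it creates.

    // Hypothetical runner: forwards the options SparkSubmit hands to its main()
    // to the Julia process through an environment variable, analogous to
    // PYSPARK_SUBMIT_ARGS / SPARKR_SUBMIT_ARGS.
    object JuliaRunner {
      def main(args: Array[String]): Unit = {
        // Assumed layout: first arg is the Julia script, the rest are the
        // submit-time options the driver needs (e.g. "--master yarn").
        val juliaScript = args.head
        val submitArgs  = args.tail.mkString(" ")

        val pb = new ProcessBuilder("julia", juliaScript)
        // Made-up variable name; the Julia wrapper must read it and apply
        // its contents to SparkConf when it launches the in-process JVM.
        pb.environment().put("JULIA_SUBMIT_ARGS", submitArgs)
        pb.inheritIO()  // let the Julia process share stdout/stderr

        val exitCode = pb.start().waitFor()
        sys.exit(exitCode)
      }
    }

On the Julia side, something like ENV["JULIA_SUBMIT_ARGS"] would then be split and applied to the SparkConf (or to the JVM launcher) before the SparkContext is created, which is essentially what PySpark's java_gateway.py does with PYSPARK_SUBMIT_ARGS.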