Spark configurations specified at the command line for spark-submit should be passed to the JVM inside the Julia process. You can refer to
https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L267
and
https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L295

Generally, the flow is:

spark-submit JVM -> JuliaRunner -> env var like “JULIA_SUBMIT_ARGS” -> julia process -> new JVM with SparkContext

Julia can pick up the env var and either set the corresponding system properties or fill the configurations directly into a SparkConf, and then create a SparkContext.
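For illustration, here is a minimal Scala sketch of what the JVM created inside the Julia process could do with such a variable. The name JULIA_SUBMIT_ARGS and the simple "--master url" / "--conf key=value" format are only the convention suggested in this thread, not an existing Spark mechanism, and the factory object itself is hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    object JuliaSparkContextFactory {
      // Hypothetical helper, called from Julia (e.g. via JavaCall) instead of
      // constructing a SparkContext with a hard-coded master.
      def createContext(appName: String): SparkContext = {
        val conf = new SparkConf().setAppName(appName)
        val tokens = sys.env.getOrElse("JULIA_SUBMIT_ARGS", "")   // assumed env var name
          .split("\\s+").filter(_.nonEmpty)
        // Walk the tokens pairwise and translate the few flags this sketch understands.
        tokens.sliding(2, 2).foreach {
          case Array("--master", master)               => conf.setMaster(master)
          case Array("--conf", kv) if kv.contains("=") =>
            val Array(k, v) = kv.split("=", 2)
            conf.set(k, v)
          case other                                   =>
            // Ignore anything else here; a real implementation would handle more options.
        }
        new SparkContext(conf)
      }
    }

With something like this, the Julia wrapper only has to call the factory; all master/deploy settings stay on the spark-submit command line.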
Yes, you are right, `spark-submit` creates a new Python/R process that connects back to that same JVM and creates the SparkContext in it. Refer to
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L47
and
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/RRunner.scala#L65

From: Andrei [mailto:faithlessfri...@gmail.com]
Sent: Wednesday, April 13, 2016 4:32 AM
To: Sun, Rui <rui....@intel.com>
Cc: user <user@spark.apache.org>
Subject: Re: How does spark-submit handle Python scripts (and how to repeat it)?

> One part is passing the command line options, like “--master”, from the JVM launched by spark-submit to the JVM where SparkContext resides

Since I have full control over both - the JVM and Julia parts - I can pass whatever options to both. But what exactly should be passed? Currently the pipeline looks like this:

spark-submit JVM -> JuliaRunner -> julia process -> new JVM with SparkContext

I want to make the last JVM's SparkContext understand that it should run on YARN. Obviously, I can't pass the `--master yarn` option to the JVM itself. Instead, I can set the system property "spark.master" = "yarn-client", but this results in an error:

Retrying connect to server: 0.0.0.0/0.0.0.0:8032

So it's definitely not enough. I tried to manually set all the system properties that `spark-submit` adds to the JVM (including "spark-submit=true", "spark.submit.deployMode=client", etc.), but that didn't help either. Source code is always good, but for a stranger like me it's a little bit hard to grasp the control flow in the SparkSubmit class.

> For pySpark & SparkR, when running scripts in client deployment modes (standalone client and yarn client), the JVM is the same (py4j/RBackend running as a thread in the JVM launched by spark-submit)

Can you elaborate on this? Does it mean that `spark-submit` creates a new Python/R process that connects back to that same JVM and creates the SparkContext in it?

On Tue, Apr 12, 2016 at 2:04 PM, Sun, Rui <rui....@intel.com> wrote:

There is much deployment preparation work handling the different deployment modes for pyspark and SparkR in SparkSubmit. It is difficult to summarize it briefly; you had better refer to the source code.

Supporting running Julia scripts in SparkSubmit is more than implementing a ‘JuliaRunner’. One part is passing the command line options, like “--master”, from the JVM launched by spark-submit to the JVM where SparkContext resides, in the case that the two JVMs are not the same. For pySpark & SparkR, when running scripts in client deployment modes (standalone client and yarn client), the JVM is the same (py4j/RBackend running as a thread in the JVM launched by spark-submit), so there is no need to pass the command line options around.

However, in your case, the Julia interpreter launches an in-process JVM for SparkContext, which is a separate JVM from the one launched by spark-submit. So you need a way, typically an environment variable, like “SPARKR_SUBMIT_ARGS” for SparkR or “PYSPARK_SUBMIT_ARGS” for pyspark, to pass command line args to the in-process JVM in the Julia interpreter so that SparkConf can pick up the options.
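To make that concrete, here is a minimal Scala sketch of a runner that forwards options to the julia child process through such a variable. JULIA_SUBMIT_ARGS is an assumed name mirroring PYSPARK_SUBMIT_ARGS/SPARKR_SUBMIT_ARGS, and re-serializing the spark.* system properties as "--conf key=value" pairs is an assumption of this sketch, not existing Spark behavior:

    import scala.collection.JavaConverters._

    object JuliaRunner {
      // Hypothetical runner, loosely modeled on PythonRunner/RRunner: re-serialize
      // the spark.* settings that spark-submit has already turned into system
      // properties and hand them to the julia child process via an env var.
      def main(args: Array[String]): Unit = {
        val juliaScript = args.head                  // first argument: the .jl file
        val sparkOpts = sys.props.toSeq
          .filter { case (k, _) => k.startsWith("spark.") }
          .map { case (k, v) => s"--conf $k=$v" }    // naive: no quoting of spaces

        val builder = new ProcessBuilder(Seq("julia", juliaScript).asJava)
        builder.environment().put("JULIA_SUBMIT_ARGS", sparkOpts.mkString(" "))
        builder.inheritIO()                          // pass julia's stdout/stderr through

        val exitCode = builder.start().waitFor()
        sys.exit(exitCode)
      }
    }

Paired with an env-var-reading SparkConf on the Julia side (as in the earlier sketch), this would carry “--master yarn” and friends across the two JVMs as a --conf spark.master=... entry.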
From: Andrei [mailto:faithlessfri...@gmail.com]
Sent: Tuesday, April 12, 2016 3:48 AM
To: user <user@spark.apache.org>
Subject: How does spark-submit handle Python scripts (and how to repeat it)?

I'm working on a wrapper [1] around Spark for the Julia programming language [2], similar to PySpark. I've got it working with the Spark Standalone server by creating a local JVM and setting the master programmatically. However, this approach doesn't work with YARN (and probably Mesos), which require running via `spark-submit`.

In the `SparkSubmit` class I see that for Python a special class `PythonRunner` is launched, so I tried to write a similar `JuliaRunner`, which essentially does the following:

    // requires: import scala.collection.JavaConverters._
    val pb = new ProcessBuilder(Seq("julia", juliaScript).asJava)
    val process = pb.start()
    process.waitFor()

where `juliaScript` itself creates a new JVM and a `SparkContext` inside it WITHOUT setting the master URL. I then tried to launch this class using

    spark-submit --master yarn \
        --class o.a.s.a.j.JuliaRunner \
        project.jar my_script.jl

I expected that `spark-submit` would set environment variables or something that SparkContext would then read and connect to the appropriate master. This didn't happen, however, and the process failed while trying to instantiate `SparkContext`, saying that the master is not specified.

So what am I missing? How can I use `spark-submit` to run the driver in a non-JVM language?

[1]: https://github.com/dfdx/Sparta.jl
[2]: http://julialang.org/
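One empirical way to answer the "environment variables or something" question is to temporarily point --class at a small diagnostic object and print what spark-submit actually hands to the JVM it launches. A hypothetical sketch (the object name is made up; nothing here is Spark API beyond plain Scala):

    object SubmitEnvDump {
      // Hypothetical diagnostic entry point: run it with
      //   spark-submit --master yarn --class SubmitEnvDump project.jar
      // to see which spark.* system properties and SPARK-related environment
      // variables actually reach the launched JVM.
      def main(args: Array[String]): Unit = {
        println("== application args ==")
        args.foreach(println)

        println("== spark.* system properties ==")
        sys.props.toSeq
          .filter { case (k, _) => k.startsWith("spark.") }
          .foreach { case (k, v) => println(s"$k=$v") }

        println("== SPARK-related environment variables ==")
        sys.env.toSeq
          .filter { case (k, _) => k.contains("SPARK") }
          .foreach { case (k, v) => println(s"$k=$v") }
      }
    }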