Yes, I tried setting YARN_CONF_DIR, but with no luck. I will play around with environment variables and system properties and post back in case of success. Thanks for your help so far!
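Concretely, what I plan to try is roughly the following sketch (only a sketch: the extra property is an assumption about what spark-submit would normally prepare, and it presumably only works if YARN_CONF_DIR / HADOOP_CONF_DIR point this JVM at the cluster's yarn-site.xml, since the 0.0.0.0:8032 retry below is just the default resource manager address):

    import org.apache.spark.{SparkConf, SparkContext}

    object YarnClientDirect {
      def main(args: Array[String]): Unit = {
        // Assumes YARN_CONF_DIR (or HADOOP_CONF_DIR) is set in the environment of
        // THIS JVM, so the YARN client can find yarn-site.xml instead of falling
        // back to the default resource manager address 0.0.0.0:8032.
        val conf = new SparkConf()
          .setAppName("My App")
          .setMaster("yarn-client")
          // Assumption: one of the properties spark-submit would normally fill in,
          // telling YARN where to find the Spark assembly (the path is made up).
          .set("spark.yarn.jar", "hdfs:///apps/spark/spark-assembly.jar")
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 10).collect().mkString(", "))
        sc.stop()
      }
    }

If this still loops on 0.0.0.0:8032, that would point to the Hadoop/YARN configuration not being visible to the JVM rather than to anything spark-submit does.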
On Thu, Apr 14, 2016 at 5:48 AM, Sun, Rui <rui....@intel.com> wrote:

> In SparkSubmit, there is less work for yarn-client than for yarn-cluster.
> Basically it prepares some Spark configurations as system properties, for
> example, information on additional resources required by the application
> that need to be distributed to the cluster. These configurations will be
> used in SparkContext initialization later.
>
> So generally for yarn-client, maybe you can skip spark-submit and directly
> launch the Spark application with those configurations set up before
> creating the SparkContext.
>
> Not sure about your error. Have you set up YARN_CONF_DIR?
>
> From: Andrei [mailto:faithlessfri...@gmail.com]
> Sent: Thursday, April 14, 2016 5:45 AM
> To: Sun, Rui <rui....@intel.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: How does spark-submit handle Python scripts (and how to repeat it)?
>
> > Julia can pick up the env var, and set the system properties or directly
> > fill the configurations into a SparkConf, and then create a SparkContext.
>
> That's the point - just setting the master to "yarn-client" doesn't work,
> even in Java/Scala. E.g. the following code in Scala:
>
>     val conf = new SparkConf().setAppName("My App").setMaster("yarn-client")
>     val sc = new SparkContext(conf)
>     sc.parallelize(1 to 10).collect()
>     sc.stop()
>
> results in an error:
>
>     Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032
>
> I think for now we can even put Julia aside and concentrate on the
> following question: how does submitting an application via `spark-submit`
> in "yarn-client" mode differ from setting the same mode directly in
> `SparkConf`?
>
> On Wed, Apr 13, 2016 at 5:06 AM, Sun, Rui <rui....@intel.com> wrote:
>
> Spark configurations specified at the command line for spark-submit should
> be passed to the JVM inside the Julia process. You can refer to
> https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L267
> and
> https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L295
>
> Generally:
>
>     spark-submit JVM -> JuliaRunner -> env var like "JULIA_SUBMIT_ARGS"
>       -> julia process -> new JVM with SparkContext
>
> Julia can pick up the env var, and set the system properties or directly
> fill the configurations into a SparkConf, and then create a SparkContext.
>
> Yes, you are right, `spark-submit` creates a new Python/R process that
> connects back to that same JVM and creates the SparkContext in it.
> Refer to
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L47
> and
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/RRunner.scala#L65
>
> From: Andrei [mailto:faithlessfri...@gmail.com]
> Sent: Wednesday, April 13, 2016 4:32 AM
> To: Sun, Rui <rui....@intel.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: How does spark-submit handle Python scripts (and how to repeat it)?
>
> > One part is passing the command line options, like "--master", from the
> > JVM launched by spark-submit to the JVM where SparkContext resides
>
> Since I have full control over both - the JVM and Julia parts - I can pass
> whatever options to both. But what exactly should be passed?
> Currently the pipeline looks like this:
>
>     spark-submit JVM -> JuliaRunner -> julia process -> new JVM with SparkContext
>
> I want to make the last JVM's SparkContext understand that it should run
> on YARN. Obviously, I can't pass the `--master yarn` option to the JVM
> itself. Instead, I can pass the system property "spark.master" =
> "yarn-client", but this results in an error:
>
>     Retrying connect to server: 0.0.0.0/0.0.0.0:8032
>
> So it's definitely not enough. I tried to set manually all the system
> properties that `spark-submit` adds to the JVM (including
> "spark-submit=true", "spark.submit.deployMode=client", etc.), but it
> didn't help either. Source code is always good, but for a stranger like me
> it's a little bit hard to grasp the control flow in the SparkSubmit class.
>
> > For pySpark & SparkR, when running scripts in client deployment modes
> > (standalone client and yarn client), the JVM is the same (py4j/RBackend
> > running as a thread in the JVM launched by spark-submit)
>
> Can you elaborate on this? Does it mean that `spark-submit` creates a new
> Python/R process that connects back to that same JVM and creates the
> SparkContext in it?
>
> On Tue, Apr 12, 2016 at 2:04 PM, Sun, Rui <rui....@intel.com> wrote:
>
> There is much deployment preparation work handling the different
> deployment modes for pyspark and SparkR in SparkSubmit. It is difficult to
> summarize it briefly; you had better refer to the source code.
>
> Supporting running Julia scripts in SparkSubmit is more than implementing
> a 'JuliaRunner'. One part is passing the command line options, like
> "--master", from the JVM launched by spark-submit to the JVM where the
> SparkContext resides, in the case that the two JVMs are not the same. For
> pySpark & SparkR, when running scripts in client deployment modes
> (standalone client and yarn client), the JVM is the same (py4j/RBackend
> running as a thread in the JVM launched by spark-submit), so there is no
> need to pass the command line options around. However, in your case, the
> Julia interpreter launches an in-process JVM for the SparkContext, which
> is a separate JVM from the one launched by spark-submit. So you need a
> way, typically an environment variable, like "SPARKR_SUBMIT_ARGS" for
> SparkR or "PYSPARK_SUBMIT_ARGS" for pyspark, to pass the command line args
> to the in-process JVM in the Julia interpreter so that SparkConf can pick
> up the options.
>
> From: Andrei [mailto:faithlessfri...@gmail.com]
> Sent: Tuesday, April 12, 2016 3:48 AM
> To: user <user@spark.apache.org>
> Subject: How does spark-submit handle Python scripts (and how to repeat it)?
>
> I'm working on a wrapper [1] around Spark for the Julia programming
> language [2], similar to PySpark. I've got it working with the Spark
> Standalone server by creating a local JVM and setting the master
> programmatically. However, this approach doesn't work with YARN (and
> probably Mesos), which require running via `spark-submit`.
>
> In the `SparkSubmit` class I see that for Python a special class
> `PythonRunner` is launched, so I tried to do a similar `JuliaRunner`,
> which essentially does the following:
>
>     val pb = new ProcessBuilder("julia", juliaScript)
>     val process = pb.start()
>     process.waitFor()
>
> where `juliaScript` itself creates a new JVM and a `SparkContext` inside
> it WITHOUT setting the master URL.
> I then tried to launch this class using
>
>     spark-submit --master yarn \
>       --class o.a.s.a.j.JuliaRunner \
>       project.jar my_script.jl
>
> I expected that `spark-submit` would set environment variables or
> something that the SparkContext would then read and use to connect to the
> appropriate master. This didn't happen, however, and the process failed
> while trying to instantiate the `SparkContext`, saying that the master is
> not specified.
>
> So what am I missing? How can I use `spark-submit` to run the driver in a
> non-JVM language?
>
> [1]: https://github.com/dfdx/Sparta.jl
> [2]: http://julialang.org/
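Putting Rui's suggestions together, here is a rough sketch of the PYSPARK_SUBMIT_ARGS / SPARKR_SUBMIT_ARGS-style handoff on the spark-submit side. JuliaRunner and JULIA_SUBMIT_ARGS are just the names proposed in this thread, not existing Spark classes, and exactly which properties are worth forwarding is an assumption:

    import scala.collection.JavaConverters._

    // Hypothetical runner launched via `spark-submit --class JuliaRunner ...`,
    // loosely mirroring what PythonRunner/RRunner do for their languages.
    object JuliaRunner {
      def main(args: Array[String]): Unit = {
        val juliaScript = args(0)

        // In client mode spark-submit has already turned its command line options
        // into spark.* system properties in this JVM; forward them to Julia as
        // JVM options for the in-process JVM it will create.
        val submitArgs = sys.props.collect {
          case (k, v) if k.startsWith("spark.") => s"-D$k=$v"
        }.mkString(" ")

        val pb = new ProcessBuilder(Seq("julia", juliaScript).asJava)
        pb.environment().put("JULIA_SUBMIT_ARGS", submitArgs) // env var name suggested above
        pb.inheritIO()                                        // share stdout/stderr with spark-submit
        sys.exit(pb.start().waitFor())
      }
    }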
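On the Julia-process side of that handoff, the wrapper would read JULIA_SUBMIT_ARGS and pass its contents as options when it starts the in-process JVM; after that, very little is needed on the JVM side, because a default SparkConf picks up every spark.* system property. The helper below is hypothetical (not part of any existing API), just to show that no master URL has to be hard-coded:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical entry point the Julia wrapper could call (e.g. through JavaCall)
    // once its in-process JVM is up. The forwarded spark.* settings arrived as -D
    // options, so new SparkConf() loads them automatically, including spark.master.
    object JuliaSparkContext {
      def create(appName: String): SparkContext = {
        val conf = new SparkConf().setAppName(appName) // spark.* system properties already loaded
        new SparkContext(conf)
      }
    }

Whether this is enough for yarn-client in practice still hinges on the earlier point about YARN_CONF_DIR: if yarn-site.xml is not visible to the Julia-spawned JVM, the 0.0.0.0:8032 retries will come back no matter how the options are forwarded.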