Hi Kevin, I read through your e-mail and I see two main things you're talking about.
- You want a public YARN "Client" class and don't really care about anything else.

In your message you already mention why that's not a good idea. It's much better to have a standardized submission API. As you noticed by working with the previous Client API, it's not for the faint of heart; SparkSubmit hides a lot of the complexity, and does so in a way that is transparent to the caller. Whether you're submitting a Scala, Python, or R app against standalone, yarn, mesos, or local, the interface is the same. You may argue that for your particular use case you don't care about anything other than Scala apps in YARN cluster mode, but Spark does need to care about more than that.

I still think that once we have a way to expose more information about the application being launched (more specifically, the app id), then doing anything else you may want to do that is specific to YARN is up to you and pretty easy to do. But I strongly believe that having different ways to launch apps in Spark is not good design.

- You have some restriction that your app servers cannot fork processes.

Honestly, I didn't really understand what that is about. Why can't you fork processes? Is it a restriction on what you can deploy on the server (e.g. you cannot have a full Spark installation; everything needs to be contained in a jar that is deployed in the app server)? I really don't believe this is about the inability to fork a process, so it must be something else.

The unfortunate reality is that Spark is really not friendly to multiple things being launched from the same JVM. Doing that is prone to apps running all over each other and overwriting configs and other things, which would lead to many, many tears. Once the limitations around that are fixed, then we can study adding a way to launch multiple Spark apps from the same JVM, but right now that's just asking for (hard to debug) trouble.

It might be possible to add support for launching subprocesses without having to invoke the shell scripts; that would have limitations (e.g. no "spark-env.sh" support). In fact I did something like that in the first implementation of the launcher library, but was asked to go through the shell scripts during code review. (I even had a different method that launched in the same VM, but that one suffered from all the problems described in the paragraph above.)

On Thu, May 21, 2015 at 5:21 PM, Kevin Markey <kevin.mar...@oracle.com> wrote:

> This is an excellent discussion. As mentioned in an earlier email, we agree with a number of Chester's suggestions, but we have yet other concerns. I've researched this further in the past several days, and I've queried my team. This email attempts to capture those other concerns.
>
> Making *yarn.Client* private has prevented us from moving from Spark 1.0.x to Spark 1.2 or 1.3 despite many alluring new features. The SparkLauncher, which provides “support for programmatically running Spark jobs” (SPARK-3733 and SPARK-4924), will not work in our environment or for our use case -- which requires programmatically initiating and monitoring Spark jobs on Yarn in cluster mode *from a cloud-based application server*.
>
> It is not just that the Yarn *ApplicationId* is no longer directly or indirectly available. More critically, the new approach violates constraints imposed by any application server, and additional constraints imposed by security, process, and dynamic resource allocation requirements in our cloud services environment.
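[For reference, the launcher-based submission being debated above looks roughly like the sketch below. This is a minimal example assuming the Spark 1.4 org.apache.spark.launcher.SparkLauncher API; the Spark home, jar path, and main class are placeholders, not anything from this thread.]

    import org.apache.spark.launcher.SparkLauncher

    // Build and fork a spark-submit child process. The returned java.lang.Process
    // exposes only the child's exit code and output streams -- the 1.4 launcher
    // does not hand back the YARN ApplicationId, which is the gap discussed here.
    val process = new SparkLauncher()
      .setSparkHome("/opt/spark")                    // placeholder install location
      .setAppResource("/path/to/analytics-job.jar")  // placeholder application jar
      .setMainClass("com.example.AnalyticsJob")      // placeholder main class
      .setMaster("yarn-cluster")                     // 1.x-style master string for YARN cluster mode
      .setConf(SparkLauncher.EXECUTOR_MEMORY, "4g")
      .launch()

    val exitCode = process.waitFor()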
> In Spark 1.0 and 1.1, with *yarn.Client* *public*, our applications' *job scheduler* marshals configuration and environmental resources necessary for any Spark job, including cluster-, data-, or job-specific parameters, and makes the appropriate calls to initialize and run *yarn.Client*, which together with the other classes in the spark-yarn module requests the Yarn resource manager to start and monitor a job (see Figure 1) on the cluster. (Our job scheduler is not a Yarn replacement; it leverages Yarn to coordinate a variety of different Spark analytic and data enrichment jobs.)
>
> More recent Spark versions make *yarn.Client* *private* and thus remove that capability, but the *SparkLauncher*, scheduled for Spark 1.4, replaces this simple programmatic solution with one considerably more complicated. Based on our understanding, in this scenario our job scheduler marshals configuration and environmental resources for the *SparkLauncher* much as it did for *yarn.Client*. It then calls *launch()* to initialize a new Linux process to execute the *spark-submit* shell script with the specified configuration and environment, which in turn starts a new JVM (with the Spark assembly jar in its class path) that executes *launcher.Main*. This ultimately calls *yarn.Client* (see Figure 2). This is more than an arm's-length transaction. There are three legs: job scheduler *SparkLauncher.launch()* call → *spark-submit* bash execution → *launcher.Main* call to *yarn.Client* → Yarn resource manager allocation and execution of job driver and executors.
>
> Not only is this scenario unnecessarily complicated, it will simply not work. The “programmatic” call to *SparkLauncher.launch()* starts a *new JVM*, which is not allowed in any application server, since an application server must own all its JVMs. Perhaps *spark-submit* and the *launcher.Main* JVM process could be hosted outside the application server, but only in violation of security and multi-tenant cloud architectural constraints.
>
> We appreciate that yarn.Client was perhaps never intended to be public. Configuring it is not for the faint of heart, and some of its methods should indeed be private. We wonder whether there is another option.
>
> In researching and discussing these issues with Cloudera and others, we've been told that only one mechanism is supported for starting Spark jobs: the *spark-submit* scripts. We also have gathered (perhaps mistakenly) from discussions reaching back 20 months that Spark's intention is to have a unified job submission interface for all supported platforms. Unfortunately this doesn't recognize the asymmetries among those platforms. Submitting a local Spark job, or a job to a Spark master in cluster mode, may indeed require initializing a separate process in order to pass configuration parameters via the environment and command line. But Spark's *yarn.Client* in cluster mode already has an arm's-length relationship with the Yarn resource manager. Configuration may be passed from the job scheduling application to *yarn.Client* as Strings or property map variables and method parameters.
>
> Our request is for a *public* *yarn.Client* or some reasonable facsimile.
>
> Thanks.
>
> On 05/13/2015 08:22 PM, Patrick Wendell wrote:
>
> Hey Chester,
>
> Thanks for sending this. It's very helpful to have this list.
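[For comparison with the launcher sketch above, the direct invocation Kevin describes for Spark 1.0/1.1 was along these lines. This is a rough sketch only; the exact org.apache.spark.deploy.yarn.Client constructor and the argument flags varied between 1.0 and 1.1, and the jar, class, and app argument shown are placeholders.]

    import org.apache.spark.SparkConf
    import org.apache.spark.deploy.yarn.{Client, ClientArguments}

    // The job scheduler marshals configuration in-process and hands it to
    // yarn.Client, which talks to the YARN ResourceManager directly -- no
    // spark-submit fork, no extra JVM.
    val sparkConf = new SparkConf()
      .set("spark.executor.memory", "4g")

    val args = Array(
      "--jar", "/path/to/analytics-job.jar",      // placeholder application jar
      "--class", "com.example.AnalyticsJob",      // placeholder main class
      "--arg", "--input=/data/batch-2015-05-21"   // placeholder app argument
    )

    val client = new Client(new ClientArguments(args, sparkConf), sparkConf)
    client.run()   // submits to the ResourceManager and monitors the application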
> The reason we made the Client API private was that it was never intended to be used by third parties programmatically, and we don't intend to support it in its current form as a stable API. We thought the fact that it was for internal use would be obvious, since it accepts arguments as a string array of CL args. It was always intended for command line use, and the stable API was the command line.
>
> When we migrated to the Launcher library, we figured we covered most of the use cases in the off chance someone was using the Client. It appears we regressed one feature, which was a clean way to get the app ID.
>
> The items you list here, 2-6, all seem like new feature requests rather than a regression caused by us making that API private.
>
> I think the way to move forward is for someone to design a proper long-term stable API for the things you mentioned here. That could be done by extending the Launcher library. Marcelo would be a natural fit to help with this effort, since he was heavily involved in both YARN support and the launcher. So I'm curious to hear his opinion on how best to move forward.
>
> I do see how apps that run Spark would benefit from having a control plane for querying status, both on YARN and elsewhere.
>
> - Patrick
>
> On Wed, May 13, 2015 at 5:44 AM, Chester At Work <ches...@alpinenow.com> wrote:
>
> Patrick,
> There are several things we need; some of them have already been mentioned on the mailing list before.
>
> I haven't looked at the SparkLauncher code, but here are a few things we need from our perspective for the Spark Yarn Client:
>
> 1) Client should not be private (unless an alternative is provided) so we can call it directly.
>
> 2) We need a way to stop the running yarn app programmatically (the PR is already submitted).
>
> 3) Before we start the spark job, we should have a callback to the application, which will provide the yarn container capacity (number of cores and max memory), so the spark program will not set values beyond the max values (PR submitted).
>
> 4) Callbacks could be in the form of yarn app listeners, which are invoked on yarn status changes (start, in progress, failure, complete, etc.); the application can react based on these events (in PR).
>
> 5) The yarn client passes arguments to the spark program through its main program, and we have experienced problems when we pass a very large argument due to the length limit. For example, we use json to serialize the arguments and encode them, then parse them as an argument. For wide-column datasets, we will run into the limit. Therefore, an alternative way of passing additional, larger arguments is needed. We are experimenting with passing the args via an established akka messaging channel.
>
> 6) The spark yarn client in yarn-cluster mode right now is essentially a batch job with no communication once it is launched. We need to establish a communication channel so that logs, errors, status updates, progress bars, execution stages, etc. can be displayed on the application side. We added an akka communication channel for this (working on a PR).
>
> Combined with other items in this list, we are able to redirect print and error statements to the application log (outside of the hadoop cluster), and to show a spark-UI-equivalent progress bar via a spark listener. We can show yarn progress via a yarn app listener before spark has started, and status can be updated during job execution.
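[Several of the items above, notably 2 and 4, are things a caller can already do against YARN directly once it knows the ApplicationId, which is why exposing the id keeps coming up in this thread. A minimal sketch using Hadoop's YarnClient API; the application id string is a placeholder, and ConverterUtils is the Hadoop 2.x way to parse it.]

    import org.apache.hadoop.yarn.client.api.YarnClient
    import org.apache.hadoop.yarn.conf.YarnConfiguration
    import org.apache.hadoop.yarn.util.ConverterUtils

    // Once the caller has the ApplicationId, polling status or killing the app
    // is plain YARN client API -- no Spark involvement needed.
    val yarn = YarnClient.createYarnClient()
    yarn.init(new YarnConfiguration())
    yarn.start()

    val appId = ConverterUtils.toApplicationId("application_1432250000000_0042") // placeholder id

    val report = yarn.getApplicationReport(appId)
    println(s"state=${report.getYarnApplicationState}, progress=${report.getProgress}")

    // Item 2 above: stop the running YARN app programmatically.
    // yarn.killApplication(appId)

    yarn.stop()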
> We are also experimenting with long-running jobs with additional spark commands and interactions via this channel.
>
> Chester
>
> Sent from my iPad
>
> On May 12, 2015, at 20:54, Patrick Wendell <pwend...@gmail.com> wrote:
>
> Hey Kevin and Ron,
>
> So is the main shortcoming of the launcher library the inability to get an app ID back from YARN? Or are there other issues here that fundamentally regress things for you?
>
> It seems like adding a way to get back the appID would be a reasonable addition to the launcher.
>
> - Patrick
>
> On Tue, May 12, 2015 at 12:51 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>
> On Tue, May 12, 2015 at 11:34 AM, Kevin Markey <kevin.mar...@oracle.com> wrote:
>
> > I understand that SparkLauncher was supposed to address these issues, but it really doesn't. Yarn already provides indirection and an arm's-length transaction for starting Spark on a cluster. The launcher introduces yet another layer of indirection and dissociates the Yarn Client from the application that launches it.
>
> Well, not fully. The launcher was supposed to solve "how to launch a Spark app programmatically", but in the first version nothing was added to actually gather information about the running app. It's also limited in the way it works because of Spark's limitations (one context per JVM, etc.).
>
> Still, adding things like this is definitely in scope for the launcher library; information such as the app id can be useful for the code launching the app, not just in yarn mode. We just have to find a clean way to provide that information to the caller.
>
> > I am still reading the newest code, and we are still researching options to move forward. If there are alternatives, we'd like to know.
>
> Super hacky, but if you launch Spark as a child process you could parse the stderr and get the app ID.
>
> --
> Marcelo

-- 
Marcelo
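[For what it's worth, the "super hacky" workaround mentioned above -- launching Spark as a child process and scraping its stderr for the app ID -- can be sketched roughly as follows, assuming the Spark 1.4 launcher API. The jar, main class, and the idea of matching YARN's application_<timestamp>_<id> pattern are illustrative only, and the approach is fragile since it depends on log format and log levels.]

    import scala.io.Source
    import org.apache.spark.launcher.SparkLauncher

    // Fork spark-submit via the launcher, then scan the child's stderr for the
    // YARN application id that yarn.Client logs while submitting.
    val process = new SparkLauncher()
      .setAppResource("/path/to/analytics-job.jar")   // placeholder jar
      .setMainClass("com.example.AnalyticsJob")       // placeholder main class
      .setMaster("yarn-cluster")
      .launch()

    val appIdPattern = "application_\\d+_\\d+".r

    // Note: a real caller should also drain process.getInputStream (e.g. in a
    // separate thread) so the child does not block on a full output buffer.
    val appId = Source.fromInputStream(process.getErrorStream)
      .getLines()
      .map(line => appIdPattern.findFirstIn(line))
      .collectFirst { case Some(id) => id }

    println(s"YARN application id: ${appId.getOrElse("not found")}")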