Marcelo, thanks for the comments. All my requirements come from our work over the last year in yarn-cluster mode, so I am biased toward the YARN side.
It's true that some of these tasks could be accomplished with separate YARN API calls, but the API no longer feels natural if we have to do it that way. I had a great face-to-face discussion at Databricks today with Andrew Or about how to address these requirements.

For #3, Andrew points out that the recent dynamic resource allocation feature makes this requirement less important. Once dynamic allocation is enabled, the user doesn't need to specify the number of executors or the memory up front as before. In Spark 1.x the user has to specify these numbers, and on a small cluster the job gets killed immediately if the memory requested is larger than the YARN maximum. We were also hoping to determine the executors and memory dynamically based on the data size, while making sure they do not exceed the maximum. With dynamic resource allocation, I think we can just let Spark handle this.

For #4, the SparkContext status tracker can give this information, but you have to poll it at some time interval. Some kind of event-based callback would be nice.

For #5, yes, it's about the command-line args. These args are the input for the Spark jobs. Creating a file just to specify job args seems like a bit much, but the args could be a few thousand columns in machine learning jobs.

For #6, we were thinking that our need for communication is not special to us; other applications may need this as well. But it may require too many changes in Spark. In our case, we did the following:

1) We modified the YARN client to expose a YARN app listener, so it calls back on events at the Spark YARN report interval (default 1 sec). This gives us the container-start, app-in-progress, failed, and killed events.

2) In our own Spark job, we wrap the main method with an akka actor which communicates with an actor in the application that submits the job. A logger and a Spark job listener are created; the job listener sends messages to the logger, and the logger relays them to the application via the akka actor. Stdout and stderr are redirected to the logger as well. Depending on the type of message, the application updates the UI (which shows the progress bar), logs the message directly to the log file, or updates the job state.

We are using log4j; the issue is that in yarn-cluster mode the logs stay inside the cluster, not in the application, which sits outside the cluster. We want to capture the cluster logs and error messages directly in the application log.

I will put a design doc and the actual code in my pull request later, as Andrew requested. The PR is unlikely to get merged, but it will show the idea I am talking about here; a rough sketch of the listener piece is below.
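To make 2) a bit more concrete, here is a rough sketch only, not the code from our PR: a SparkListener that forwards a few coarse events straight to an application-side akka actor (in our real code the messages go through the logger first). The ForwardingListener name and the "reporter" ActorRef are just placeholders.

import akka.actor.ActorRef
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd,
  SparkListenerJobStart, SparkListenerStageCompleted}

// Forwards coarse progress events to the actor that talks back to the
// application (the job submitter side).
class ForwardingListener(reporter: ActorRef) extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    reporter ! s"job ${jobStart.jobId} started"

  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
    reporter ! s"stage ${stage.stageInfo.stageId} completed"

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    reporter ! s"job ${jobEnd.jobId} ended: ${jobEnd.jobResult}"
}

// Inside the wrapped main(), after the SparkContext is created:
//   sc.addSparkListener(new ForwardingListener(reporterRef))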
Thanks for listening and responding,

Chester

Sent from my iPad

On May 14, 2015, at 18:41, Marcelo Vanzin <van...@cloudera.com> wrote:

> Hi Chester,
>
> Thanks for the feedback. A few of those are great candidates for improvements
> to the launcher library.
>
> On Wed, May 13, 2015 at 5:44 AM, Chester At Work <ches...@alpinenow.com>
> wrote:
>     1) client should not be private ( unless alternative is provided) so we
> can call it directly.
>
> Patrick already touched on this subject, but I believe Client should be kept
> private. If we want to expose functionality for code launching Spark apps,
> Spark should provide an interface for that so that other cluster managers can
> benefit. It also keeps the API more consistent (everybody uses the same API
> regardless of what's the underlying cluster manager).
>
>     2) we need a way to stop the running yarn app programmatically ( the PR
> is already submitted)
>
> My first reaction to this was "with the app id, you can talk to YARN directly
> and do that". But given what I wrote above, I guess it would make sense for
> something like this to be exposed through the library too.
>
>     3) before we start the spark job, we should have a call back to the
> application, which will provide the yarn container capacity (number of cores
> and max memory ), so spark program will not set values beyond max values (PR
> submitted)
>
> I'm not sure exactly what you mean here, but it feels like we're starting to
> get into "wrapping the YARN API" territory. Someone who really cares about
> that information can easily fetch it from YARN, the same way Spark would.
>
>     4) call back could be in form of yarn app listeners, which call back
> based on yarn status changes ( start, in progress, failure, complete etc),
> application can react based on these events in PR)
>
> Exposing some sort of status for the running application does sound useful.
>
>     5) yarn client passing arguments to spark program in the form of main
> program, we had experience problems when we pass a very large argument due
> the length limit. For example, we use json to serialize the argument and
> encoded, then parse them as argument. For wide columns datasets, we will run
> into limit. Therefore, an alternative way of passing additional larger
> argument is needed. We are experimenting with passing the args via a
> established akka messaging channel.
>
> I believe you're talking about command line arguments to your application
> here? I'm not sure what Spark can do to alleviate this. With YARN, if you
> need to pass a lot of information to your application, I'd recommend creating
> a file and adding it to "--files" when calling spark-submit. It's not
> optimal, since the file will be distributed to all executors (not just the
> AM), but it should help if you're really running into problems, and is
> simpler than using akka.
>
>     6) spark yarn client in yarn-cluster mode right now is essentially a
> batch job with no communication once it launched. Need to establish the
> communication channel so that logs, errors, status updates, progress bars,
> execution stages etc can be displayed on the application side. We added an
> akka communication channel for this (working on PR ).
>
> This is another thing I'm a little unsure about. It seems that if you really
> need to, you can implement your own SparkListener to do all this, and then
> you can transfer all that data to wherever you need it to be. Exposing
> something like this in Spark would just mean having to support some new RPC
> mechanism as a public API, which is kinda burdensome. You mention yourself
> you've done something already, which is an indication that you can do it with
> existing Spark APIs, which makes it not a great candidate for a core Spark
> feature. I guess you can't get logs with that mechanism, but you can use your
> custom log4j configuration for that if you really want to pipe logs to a
> different location.
>
> But those are a good starting point - knowing what people need is the first
> step in adding a new feature. :-)
>
> --
> Marcelo
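P.S. Regarding the "--files" suggestion for #5, here is a minimal sketch of how I read it (the file name, class name, and jar name are all made up):

// Submit with the large payload as a file instead of a huge argument:
//   spark-submit --master yarn-cluster --files job-args.json \
//     --class com.example.MyJob my-job.jar
// In yarn-cluster mode the file is localized into the driver container's
// working directory, so the driver can read it back by name rather than
// taking a multi-thousand-column argument on the command line.
import scala.io.Source

object MyJob {
  def main(args: Array[String]): Unit = {
    val payload = Source.fromFile("job-args.json").mkString
    // parse the payload (e.g. the JSON-encoded column list) and build the job
  }
}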