Marcelo,
     Thanks for the comments. All my requirements come from our work over the
last year in yarn-cluster mode, so I am biased toward the yarn side.

      It's true that some of the tasks might be accomplished with separate
yarn API calls, but the API no longer feels natural if we go that way.

       I had a great face-to-face discussion at Databricks today with Andrew
Or about how to address these requirements.

       For #3, Andrew points out that the recent dynamic resource allocation
feature makes this requirement less important: once dynamic resource
allocation is enabled, the user doesn't need to specify the number of
executors or the amount of memory up front as before. In Spark 1.x the user
has to specify these numbers, and on a small cluster such jobs get killed
immediately if the memory specified is larger than yarn's maximum. We were
also hoping to determine the executors and memory needed dynamically based on
the data size, while making sure they do not exceed the maximum.
         With dynamic resource allocation, I think we can just let Spark
handle this dynamically.
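For reference, dynamic allocation is turned on with a handful of configuration keys on spark-submit. This is only a sketch: the key names come from the Spark configuration docs, while the jar name and the executor limits here are made-up placeholders:

```shell
spark-submit \
  --master yarn-cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.shuffle.service.enabled=true \
  my-job.jar
```

With maxExecutors acting as the ceiling, Spark scales the executor count with the workload instead of us having to compute it from the data size ourselves.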

       For #4, the SparkContext status tracker can provide this information,
but you have to poll it at some time interval. Some kind of event-based
callbacks would be nice.
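To illustrate the gap, here is a small standalone sketch (plain Python, no Spark; `AppListener` and `poll_to_events` are made-up names, not any Spark API) of wrapping a pull-style status source in state-change callbacks, which is roughly what we would like the library to do for us:

```python
import time

class AppListener:
    """Hypothetical callback interface -- illustrative only."""
    def on_state_changed(self, old, new):
        pass

def poll_to_events(get_state, listener, interval=1.0, terminal=frozenset()):
    """Poll a pull-style status source; fire a callback on every change.

    get_state: zero-arg function returning the current app state string.
    Returns the state once a terminal one is reached.
    """
    last = None
    while True:
        state = get_state()
        if state != last:
            listener.on_state_changed(last, state)
            last = state
        if state in terminal:
            return state
        time.sleep(interval)

# Demo with a canned state sequence standing in for yarn report polling.
states = iter(["ACCEPTED", "RUNNING", "RUNNING", "FINISHED"])
events = []

class Recorder(AppListener):
    def on_state_changed(self, old, new):
        events.append((old, new))

final = poll_to_events(lambda: next(states), Recorder(), interval=0.0,
                       terminal={"FINISHED"})
```

Note that the unchanged "RUNNING" poll produces no callback; the application only hears about transitions.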

       For #5, yes, it's about the command-line args. These args are the
input for the Spark jobs. It seems a bit too much to create a file just to
specify Spark job args, and the args can run to a few thousand columns in
machine learning jobs.

       For #6, we were thinking our need for this communication is not
special to us; other applications may need it as well. But this may require
too many changes in Spark.

       In our case, we did the following:
         1) We modified the yarn client to expose a yarn app listener, so it
calls back on events based on the Spark yarn report interval (default 1 sec).
This gives us the container-start, app-in-progress, failed, and killed events.

         2) In our own Spark job, we wrap the main method with an akka actor
that communicates with an actor in the application's job submitter. A logger
and a Spark job listener are created. The Spark job listener sends messages
to the logger, and the logger relays them to the application via the akka
actor. Stdout and stderr are redirected to the logger as well. Depending on
the type of the message, the application updates the UI (which shows the
progress bar), logs the message directly to the log file, or updates the job
state. We are using log4j; the issue is that in yarn-cluster mode the logs
are inside the cluster, not in the application, which is outside the cluster.
We want to capture the cluster logs or error messages directly in the
application log.
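The stdout/stderr redirection part of item 2 can be sketched in miniature. This is plain Python logging standing in for log4j and the akka relay; `RelayHandler`, `LoggerWriter`, and the `received` list are illustrative names, not anything from our actual code:

```python
import io
import logging
import sys

class RelayHandler(logging.Handler):
    """Forward each formatted log record to a relay callable
    (in our setup, the actor that talks back to the submitter)."""
    def __init__(self, relay):
        super().__init__()
        self.relay = relay
    def emit(self, record):
        self.relay(self.format(record))

class LoggerWriter(io.TextIOBase):
    """File-like object that turns writes into log records, so that
    print() output and stack traces flow through the same channel."""
    def __init__(self, logger):
        self.logger = logger
    def write(self, text):
        text = text.strip()
        if text:
            self.logger.info(text)
        return len(text)

received = []  # stands in for messages arriving on the application side
logger = logging.getLogger("driver")
logger.setLevel(logging.INFO)
logger.addHandler(RelayHandler(received.append))

old_stdout = sys.stdout
sys.stdout = LoggerWriter(logger)  # redirect stdout into the logger
try:
    print("stage 1 of 4 complete")
finally:
    sys.stdout = old_stdout  # always restore the real stdout
```

In the real setup the relay callable is the akka channel, so the application side can decide per message whether to update the progress bar, write the log file, or update job state.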
      
       I will put a design doc and the actual code in my pull request later,
as Andrew requested. This PR is unlikely to get merged, but it will show the
idea I am talking about here.

     Thanks for listening and responding.

Chester

Sent from my iPad

On May 14, 2015, at 18:41, Marcelo Vanzin <van...@cloudera.com> wrote:

> Hi Chester,
> 
> Thanks for the feedback. A few of those are great candidates for improvements 
> to the launcher library.
> 
> On Wed, May 13, 2015 at 5:44 AM, Chester At Work <ches...@alpinenow.com> 
> wrote:
>      1) client should not be private ( unless alternative is provided) so we 
> can call it directly.
> 
> Patrick already touched on this subject, but I believe Client should be kept 
> private. If we want to expose functionality for code launching Spark apps, 
> Spark should provide an interface for that so that other cluster managers can 
> benefit. It also keeps the API more consistent (everybody uses the same API 
> regardless of what's the underlying cluster manager).
>  
>      2) we need a way to stop the running yarn app programmatically ( the PR 
> is already submitted)
> 
> My first reaction to this was "with the app id, you can talk to YARN directly 
> and do that". But given what I wrote above, I guess it would make sense for 
> something like this to be exposed through the library too.
>  
>      3) before we start the spark job, we should have a call back to the 
> application, which will provide the yarn container capacity (number of cores 
> and max memory ), so spark program will not set values beyond max values (PR 
> submitted)
> 
> I'm not sure exactly what you mean here, but it feels like we're starting to 
> get into "wrapping the YARN API" territory. Someone who really cares about 
> that information can easily fetch it from YARN, the same way Spark would.
>  
>      4) call back could be in form of yarn app listeners, which call back 
> based on yarn status changes ( start, in progress, failure, complete etc), 
> application can react based on these events in PR)
> 
> Exposing some sort of status for the running application does sound useful.
>  
>      5) yarn client passing arguments to spark program in the form of main 
> program arguments; we have experienced problems when passing a very large 
> argument due to the length limit. For example, we use json to serialize the 
> argument and encode it, then parse it as an argument. For wide-column 
> datasets, we will run into the limit. Therefore, an alternative way of 
> passing additional larger arguments is needed. We are experimenting with 
> passing the args via an established akka messaging channel.
> 
> I believe you're talking about command line arguments to your application 
> here? I'm not sure what Spark can do to alleviate this. With YARN, if you 
> need to pass a lot of information to your application, I'd recommend creating 
> a file and adding it to "--files" when calling spark-submit. It's not 
> optimal, since the file will be distributed to all executors (not just the 
> AM), but it should help if you're really running into problems, and is 
> simpler than using akka.
>  
>     6) spark yarn client in yarn-cluster mode right now is essentially a 
> batch job with no communication once it launched. Need to establish the 
> communication channel so that logs, errors, status updates, progress bars, 
> execution stages etc can be displayed on the application side. We added an 
> akka communication channel for this (working on PR ).
> 
> This is another thing I'm a little unsure about. It seems that if you really 
> need to, you can implement your own SparkListener to do all this, and then 
> you can transfer all that data to wherever you need it to be. Exposing 
> something like this in Spark would just mean having to support some new RPC 
> mechanism as a public API, which is kinda burdensome. You mention yourself 
> you've done something already, which is an indication that you can do it with 
> existing Spark APIs, which makes it not a great candidate for a core Spark 
> feature. I guess you can't get logs with that mechanism, but you can use your 
> custom log4j configuration for that if you really want to pipe logs to a 
> different location.
> 
> But those are a good starting point - knowing what people need is the first 
> step in adding a new feature. :-) 
> 
> -- 
> Marcelo
