Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

Tom Graves Thu, 21 Mar 2019 10:10:39 -0700

 While I agree with you that it would be ideal to have task-level resources 
and do a deeper redesign of the scheduler, I think that can be a separate 
enhancement, as was discussed earlier in the thread. That feature is useful 
even without GPUs.  I do realize that they overlap somewhat, but I think the 
changes for this will be minimal to the scheduler and follow existing 
conventions, and it is an improvement over what we have now. I know many users 
will be happy to have this even without task-level scheduling, as many of the 
conventions used now to schedule GPUs can easily be broken by one bad user. I 
think from the user's point of view this gives many users an improvement, and 
we can extend it later to cover more use cases. 
Tom
    On Thursday, March 21, 2019, 9:15:05 AM PDT, Mark Hamstra 
<[email protected]> wrote:  
 
 I understand the application-level, static, global nature of 
spark.task.accelerator.gpu.count and its similarity to the existing 
spark.task.cpus, but to me this feels like extending a weakness of Spark's 
scheduler, not building on its strengths. That is because I consider binding 
the number of cores for each task to an application configuration to be far 
from optimal. This is already far from the desired behavior when an application 
is running a wide range of jobs (as in a generic job-runner style of Spark 
application), some of which require or can benefit from multi-core tasks, 
others of which will just waste the extra cores allocated to their tasks. 
Ideally, the number of cores allocated to tasks would get pushed to an even 
finer granularity than jobs, becoming instead a per-stage property.
Now, of course, making allocation of general-purpose cores and domain-specific 
resources work in this finer-grained fashion is a lot more work than just 
trying to extend the existing resource allocation mechanisms to handle 
domain-specific resources, but it does feel to me like we should at least be 
considering doing that deeper redesign.  
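
To put rough numbers on the argument above, here is a minimal sketch in plain Python (not Spark code). The three stage shapes are invented for illustration, and the function name is hypothetical. It compares how many core reservations a job makes when one application-wide cores-per-task value must cover the hungriest stage, versus when each stage could declare its own value.

```python
def reserved_core_slots(stages, per_stage):
    """Total cores reserved across all tasks.

    stages: list of (cores_a_task_can_use, number_of_tasks) pairs.
    per_stage: if False, every task reserves the application-wide value,
               which has to be the largest need of any stage.
    """
    global_setting = max(cores for cores, _ in stages)
    total = 0
    for cores, n_tasks in stages:
        per_task = cores if per_stage else global_setting
        total += per_task * n_tasks
    return total


# Hypothetical job: two single-core stages around one 4-core stage.
stages = [(1, 10), (4, 10), (1, 10)]
print(reserved_core_slots(stages, per_stage=False))  # 120
print(reserved_core_slots(stages, per_stage=True))   # 60
```

With these made-up numbers, half of the cores reserved under the application-wide setting go to tasks that cannot use them.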
On Thu, Mar 21, 2019 at 7:33 AM Tom Graves <[email protected]> wrote:


 The proposal here is that all your resources are static and the GPU-per-task 
config is global per application, meaning you ask for a certain amount of 
memory, CPUs, and GPUs for every executor up front, just like you do today, and 
every executor you get is that size.  This means that both static and dynamic 
allocation still work without explicitly adding more logic at this point. Since 
the config for GPUs per task is global, every task will need the same ratio of 
CPU to GPU.  And since it is global, you can't really have the scenario you 
mentioned: all tasks are assumed to need a GPU.  For instance, say I request 5 
cores and 2 GPUs for each executor and set 1 GPU per task.  That means that I 
could only run 2 tasks and 3 cores would be wasted.  The stage/task-level 
configuration of resources was removed and is something we can do in a separate 
SPIP. We thought erroring would make it more obvious to the user.  We could 
change this to a warning if everyone thinks that is better, but I personally 
like the error until we can implement the lower-level, per-stage 
configuration. 
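
The arithmetic in the example above can be sketched in a few lines of plain Python. This is an illustration only, not Spark code: the function and parameter names are made up, and it assumes 1 CPU per task (the default for spark.task.cpus) and positive per-task amounts.

```python
def task_slots(executor_cores, executor_gpus, cpus_per_task, gpus_per_task):
    """Return (concurrent_tasks, idle_cores, idle_gpus) for one executor.

    Each task needs both its CPU share and its GPU share, so the scarcer
    resource decides how many tasks can run at once.
    """
    by_cpu = executor_cores // cpus_per_task
    by_gpu = executor_gpus // gpus_per_task
    tasks = min(by_cpu, by_gpu)
    idle_cores = executor_cores - tasks * cpus_per_task
    idle_gpus = executor_gpus - tasks * gpus_per_task
    return tasks, idle_cores, idle_gpus


# 5 cores and 2 GPUs per executor, 1 CPU and 1 GPU per task.
print(task_slots(5, 2, 1, 1))  # (2, 3, 0)
```

The result matches the numbers in the message: 2 concurrent tasks, 3 cores that no task can ever claim.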
Tom
    On Thursday, March 21, 2019, 1:45:01 AM PDT, Marco Gaido 
<[email protected]> wrote:  
 
 Thanks for this SPIP. I cannot comment on the docs, but just wanted to 
highlight one thing. On page 5 of the SPIP, when we talk about DRA, I see:
"For instance, if each executor consists 4 CPUs and 2 GPUs, and each task 
requires 1 CPU and 1GPU, then we shall throw an error on application start 
because we shall always have at least 2 idle CPUs per executor"
I am not sure this is the correct behavior. We might have tasks requiring only 
CPU running in parallel as well, so that configuration may make sense. I'd 
rather emit a WARN or something similar. Anyway, we just said we will keep GPU 
scheduling at the task level out of scope for the moment, right?
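
The two options under discussion (fail at application start, or only warn) can be sketched as a single check with a switch. This is a hypothetical illustration in plain Python, not the SPIP's implementation: the function name and the `strict` flag are assumptions, and per-task amounts are assumed to be positive.

```python
import warnings


def check_executor_shape(executor_cores, executor_gpus,
                         cpus_per_task, gpus_per_task, strict=True):
    """Return the number of cores no task can ever use on one executor.

    If that number is non-zero, raise ValueError when strict is True
    (the behavior quoted from the SPIP) or emit a warning otherwise
    (the alternative suggested here).
    """
    tasks = min(executor_cores // cpus_per_task,
                executor_gpus // gpus_per_task)
    idle_cores = executor_cores - tasks * cpus_per_task
    if idle_cores > 0:
        msg = (f"{idle_cores} of {executor_cores} executor cores "
               "can never be used by tasks")
        if strict:
            raise ValueError(msg)
        warnings.warn(msg)
    return idle_cores


# The SPIP example: 4 CPUs and 2 GPUs per executor, 1 CPU and 1 GPU per task.
print(check_executor_shape(4, 2, 1, 1, strict=False))  # warns, then prints 2
```

With `strict=True` the same call raises instead of returning, which is the "error on application start" behavior.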
Thanks,
Marco
On Thu, Mar 21, 2019 at 1:26 AM Xiangrui Meng <[email protected]> wrote:

Steve, the initial work would focus on GPUs, but we will keep the interfaces 
general to support other accelerators in the future. This was mentioned in the 
SPIP and draft design. 
Imran, you should have comment permission now. Thanks for making a pass! I 
don't think the proposed 3.0 features should block the Spark 3.0 release 
either. It is just an estimate of what we could deliver. I will update the doc 
to make that clear.
Felix, it would be great if you can review the updated docs and let us know 
your feedback.
How about setting a tentative vote closing time of next Tue (Mar 26)?
On Wed, Mar 20, 2019 at 11:01 AM Imran Rashid <[email protected]> wrote:

Thanks for sending the updated docs.  Can you please give everyone the ability 
to comment?  I have some comments, but overall I think this is a good proposal 
and addresses my prior concerns.
My only real concern is that I notice some mention of "must dos" for Spark 3.0. 
I don't want to make any commitment to holding Spark 3.0 for parts of this; I 
think that is an entirely separate decision.  However, I'm guessing this is just 
a minor wording issue, and you really mean that's a minimal set of features you 
are aiming for, which is reasonable.
On Mon, Mar 18, 2019 at 12:56 PM Xingbo Jiang <[email protected]> wrote:

Hi all,
I updated the SPIP doc and stories. I hope it now contains a clear scope of the 
changes and enough details for the SPIP vote. Please review the updated docs, 
thanks!
On Wed, Mar 6, 2019 at 8:35 AM Xiangrui Meng <[email protected]> wrote:

How about letting Xingbo make a major revision to the SPIP doc to make it clear 
what is proposed? I like Felix's suggestion to switch to the new Heilmeier 
template, which helps clarify what is proposed and what is not. Then let's 
review the new SPIP and resume the vote.
On Tue, Mar 5, 2019 at 7:54 AM Imran Rashid <[email protected]> wrote:

OK, I suppose we are getting bogged down in what a vote on an SPIP means 
anyway, which I guess we can set aside for now.  With the level of detail 
in this proposal, I feel like there is a reasonable chance I'd still -1 the 
design or implementation.
And the other thing you're implicitly asking the community for is to prioritize 
this feature for continued review and maintenance.  There is already work to be 
done in things like making barrier mode support dynamic allocation 
(SPARK-24942), bugs in failure handling (e.g. SPARK-25250), and general 
efficiency of failure handling (e.g. SPARK-25341, SPARK-20178).  I'm very 
concerned about getting spread too thin.


But if this is really just a vote on (1) is better gpu support important for 
spark, in some form, in some release? and (2) is it *possible* to do this in a 
safe way?  then I will vote +0.
On Tue, Mar 5, 2019 at 8:25 AM Tom Graves <[email protected]> wrote:

 So to me most of the questions here are implementation/design questions. I've 
had this issue in the past with SPIPs, where I expected to have more high-level 
design details but was basically told that belongs in the follow-on design 
JIRA. This makes me think we need to revisit what a SPIP really needs to 
contain, which should be done in a separate thread.  Note that personally I 
would be for having more high-level details in it. But the way I read our 
documentation on a SPIP right now, that detail is all optional. Now maybe we 
could argue it's based on what reviewers request, but really perhaps we should 
make the wording of that more required.  Thoughts?  We should probably separate 
that discussion if people want to talk about that.
For this SPIP in particular the reason I +1 it is because it came down to 2 
questions:
1) Do I think Spark should support this? My answer is yes. I think this would 
improve Spark; users have been requesting both better GPU support and support 
for controlling container requests at a finer granularity for a while.  If 
Spark doesn't support this then users may go to something else, so I think 
we should support it.
2) Do I think it's possible to design and implement it without causing large 
instabilities?  My opinion here again is yes. I agree with Imran and others 
that the scheduler piece needs to be looked at very closely, as we have had a 
lot of issues there, and that is why I was asking for more details in the design 
JIRA:  https://issues.apache.org/jira/browse/SPARK-27005.  But I do believe it's 
possible to do.
If others have reservations on similar questions, then I think we should 
resolve them here, or take the discussion of what a SPIP is to a different 
thread and then come back to this. Thoughts?
Note there is already a high-level design for at least the core piece, which is 
what people seem concerned with, so including it in the SPIP should be 
straightforward.
Tom
    On Monday, March 4, 2019, 2:52:43 PM CST, Imran Rashid 
<[email protected]> wrote:  
 
 On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng <[email protected]> wrote:

On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung <[email protected]> wrote:
IMO upfront allocation is less useful. Specifically too expensive for large 
jobs.

This is also an API/design discussion.

I agree with Felix -- this is more than just an API question.  It has a huge 
impact on the complexity of what you're proposing.  You might be proposing big 
changes to a core and brittle part of spark, which is already short of experts.
I don't see any value in having a vote on "does feature X sound cool?"  We have 
to evaluate the potential benefit against the risks the feature brings and the 
continued maintenance cost.  We don't need super low-level details, but we have 
to have a sketch of the design to be able to make that tradeoff.
