Re: planning & discussion for larger scheduler changes

Tom Graves Thu, 30 Mar 2017 11:25:24 -0700

If we are worried about major changes destabilizing current code (which I can 
understand) only way around that is to make it pluggable or configurable.  For 
major changes it seems like making it pluggable is cleaner from a code being 
cluttered point of view. But it also means you may have to make the same or 
similar change in 2 places.We could make the interfaces more well defined but 
if the major changes would require interfaces changes that doesn't help.  It 
still seems like if we had a list of things we would like to accomplish and get 
an idea of the rough overall design we could see if defining the interfaces 
better or making them pluggable would help.
There seem to be 3 jiras all related to handling fetch failures: 
https://issues.apache.org/jira/browse/SPARK-20091,  SPARK-14649 , and 
SPARK-19753.  It might be nice to create one epic jira where we think about a 
design as a whole and discuss that more.  Any objections to this?  If not I'll 
create an epic and link the others to it.


Tom
    On Monday, March 27, 2017 9:01 PM, Kay Ousterhout <k...@eecs.berkeley.edu> 
wrote:
 

 (1) I'm pretty hesitant to merge these larger changes, even if they're feature 
flagged, because:   (a) For some of these changes, it's not obvious that 
they'll always improve performance. e.g., for SPARK-14649, it's possible that 
the tasks that got re-started (and temporarily are running in two places) are 
going to fail in the first attempt (because they haven't read the missing map 
output yet).  In that case, not re-starting them will lead to worse 
performance.   (b) The scheduler already has some secret flags that aren't 
documented and are used by only a few people.  I'd like to avoid adding more of 
these (e.g., by merging these features, but having them off by default), 
because very few users use them (since it's hard to learn about them), they add 
complexity to the scheduler that we have to maintain, and for users who are 
considering using them, they often hide advanced behavior that's hard to reason 
about anyway (e.g., the point above for SPARK-14649).    (c) The worst 
performance problem is when jobs just hang or crash; we've seen a few cases of 
that in recent bugs, and I'm worried that merging these complex performance 
improvements trades better performance in a small number of cases for the 
possibility of worse performance via job crashes/hangs in other cases.
Roughly I think our standards for merging performance fixes to the scheduler 
should be that the performance improvement either (a) is simple / easy to 
reason about or (b) unambiguously fixes a serious performance problem.  In the 
case of SPARK-14649, for example, it is complex, and improves performance in 
some cases but hurts it in others, so doesn't fit either (a) or (b).
(2) I do think there are some scheduler re-factorings that would improve 
testability and our ability to reason about correctness, but think there are 
some what surgical, smaller things we could do in the vein of Imran's comment 
about reducing shared state.  Right now we have these super wide interfaces 
between different components of the scheduler, and it means you have to reason 
about the TSM, TSI, CGSB, and DAGSched to figure out whether something works.  
I think we could have an effort to make each component have a much narrower 
interface, so that each part hides a bunch of complexity from other components. 
 The most obvious place to do this in the short term is to remove a bunch of 
info tracking from the DAGScheduler; I filed a JIRA for that here.  I suspect 
there are similar things that could be done in other parts of the scheduler.
Tom's comments re: (2) are more about performance improvements rather than 
readability / testability / debuggability, but also seem important and it does 
seem useful to have a JIRA tracking those.
-Kay
On Mon, Mar 27, 2017 at 11:06 AM, Tom Graves <tgraves...@yahoo.com> wrote:

1) I think this depends on individual case by case jira.  I haven't looked in 
detail at spark-14649 seems much larger although more the way I think we want 
to go. While SPARK-13669 seems less risky and easily configurable.
2) I don't know whether it needs an entire rewrite but I think there need to be 
some major changes made especially in the handling of reduces and fetch 
failures.  We could do a much better job of not throwing away existing work and 
more optimally handling the failure cases.  For this would it make sense for us 
to start with a jira that has a bullet list of things we would like to improve 
and get more cohesive view and see really how invasive it would be?
Tom 

    On Friday, March 24, 2017 10:41 AM, Imran Rashid <iras...@cloudera.com> 
wrote:
 

 Kay and I were discussing some of the  bigger scheduler changes getting 
proposed lately, and realized there is a broader discussion to have with the 
community, outside of any single jira.  I'll start by sharing my initial 
thoughts, I know Kay has thoughts on this too, but it would be good to input 
from everyone.
In particular, SPARK-14649 & SPARK-13669 have got me thinking.  These are 
proposed changes in behavior that are not fixes for *correctness* in fault 
tolerance, but to improve the performance when there faults.  The changes make 
some intuitive sense, but its also hard to judge whether they are necessarily 
better; its hard to verify the correctness of the changes; and its hard to even 
know that we haven't broken the old behavior (because of how brittle the 
scheduler seems to be).
So I'm wondering:
1) in the short-term, can we find ways to get these changes merged, but turned 
off by default, in a way that we feel confident won't break existing code?
2) a bit longer-term -- should we be considering bigger rewrites to the 
scheduler?  Particularly, to improve testability?  eg., maybe if it was 
rewritten to more completely follow the actor model and eliminate shared state, 
the code would be cleaner and more testable.  Or maybe this is a crazy idea, 
and we'd just lose everything we'd learned so far and be stuck fixing the as 
many bugs in the new version.
Imran

Re: planning & discussion for larger scheduler changes

Reply via email to