Re: planning & discussion for larger scheduler changes

Kay Ousterhout Mon, 27 Mar 2017 19:24:42 -0700

(1) I'm pretty hesitant to merge these larger changes, even if they're
feature flagged, because:
   (a) For some of these changes, it's not obvious that they'll always
improve performance.  For example, for SPARK-14649, it's possible that the
tasks that got re-started (and are temporarily running in two places) are
going to fail in the first attempt (because they haven't read the missing
map output yet).  In that case, not re-starting them will lead to worse
performance.
   (b) The scheduler already has some secret flags that aren't documented
and are used by only a few people.  I'd like to avoid adding more of these
(e.g., by merging these features, but having them off by default), because
very few users use them (since it's hard to learn about them), they add
complexity to the scheduler that we have to maintain, and for users who are
considering using them, they often hide advanced behavior that's hard to
reason about anyway (e.g., the point above for SPARK-14649).
   (c) The worst performance problem is when jobs just hang or crash; we've
seen a few cases of that in recent bugs, and I'm worried that merging these
complex performance improvements trades better performance in a small
number of cases for the possibility of worse performance via job
crashes/hangs in other cases.

Roughly I think our standards for merging performance fixes to the
scheduler should be that the performance improvement either (a) is simple /
easy to reason about or (b) unambiguously fixes a serious performance
problem.  In the case of SPARK-14649, for example, it is complex, and
improves performance in some cases but hurts it in others, so doesn't fit
either (a) or (b).

(2) I do think there are some scheduler re-factorings that would improve
testability and our ability to reason about correctness, but I think there
are somewhat surgical, smaller things we could do in the vein of Imran's
comment about reducing shared state.  Right now we have these super wide
interfaces between different components of the scheduler, and it means you
have to reason about the TSM, TSI, CGSB, and DAGSched to figure out whether
something works.  I think we could have an effort to make each component
have a much narrower interface, so that each part hides a bunch of
complexity from other components.  The most obvious place to do this in the
short term is to remove a bunch of info tracking from the DAGScheduler; I
filed a JIRA for that here
<https://issues.apache.org/jira/browse/SPARK-20116>.  I suspect there are
similar things that could be done in other parts of the scheduler.
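
As a rough illustration of the narrower-interface idea, here is a minimal
sketch.  It is in Python rather than Scala to keep it short, it is not Spark
code, and every name in it (MissingOutputs, OutputTracker, StageScheduler,
and their methods) is hypothetical.  The assumption is that one component
owns all of the output-location bookkeeping, and the stage-level scheduler
can only ask it one question: which partitions are still missing?

```python
from typing import Dict, List, Protocol, Set


class MissingOutputs(Protocol):
    """Hypothetical narrow interface: the one question the stage-level
    scheduler is allowed to ask about map outputs."""

    def missing_partitions(self, shuffle_id: int) -> Set[int]:
        ...


class OutputTracker:
    """Hypothetical component that owns all output-location bookkeeping."""

    def __init__(self) -> None:
        self._num_partitions: Dict[int, int] = {}
        # shuffle id -> {partition -> executor holding that output}
        self._locations: Dict[int, Dict[int, str]] = {}

    def register_shuffle(self, shuffle_id: int, num_partitions: int) -> None:
        self._num_partitions[shuffle_id] = num_partitions
        self._locations[shuffle_id] = {}

    def record_output(self, shuffle_id: int, partition: int, executor: str) -> None:
        self._locations[shuffle_id][partition] = executor

    def executor_lost(self, executor: str) -> None:
        # Forget every output that lived on the lost executor.
        for outputs in self._locations.values():
            for partition in [p for p, e in outputs.items() if e == executor]:
                del outputs[partition]

    def missing_partitions(self, shuffle_id: int) -> Set[int]:
        expected = set(range(self._num_partitions[shuffle_id]))
        return expected - set(self._locations[shuffle_id])


class StageScheduler:
    """Sees only the narrow interface, never the per-executor state."""

    def __init__(self, outputs: MissingOutputs) -> None:
        self._outputs = outputs

    def partitions_to_run(self, shuffle_id: int) -> List[int]:
        return sorted(self._outputs.missing_partitions(shuffle_id))
```

With this split, a test for StageScheduler can pass in a tiny fake that
implements missing_partitions, and a test for OutputTracker never needs a
scheduler at all, which is the kind of isolation a narrower interface is
meant to buy.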

Tom's comments re: (2) are more about performance improvements than
readability / testability / debuggability, but also seem important and it
does seem useful to have a JIRA tracking those.

-Kay

On Mon, Mar 27, 2017 at 11:06 AM, Tom Graves <tgraves...@yahoo.com> wrote:

> 1) I think this depends on the individual JIRA, case by case.  I haven't
> looked in detail at SPARK-14649, but it seems much larger, although more
> the way I think we want to go, while SPARK-13669 seems less risky and
> easily configurable.
>
> 2) I don't know whether it needs an entire rewrite but I think there need
> to be some major changes made especially in the handling of reduces and
> fetch failures.  We could do a much better job of not throwing away
> existing work and more optimally handling the failure cases.  For this,
> would it make sense for us to start with a JIRA that has a bullet list of
> things we would like to improve, so we get a more cohesive view and see
> how invasive it would really be?
>
> Tom
>
>
> On Friday, March 24, 2017 10:41 AM, Imran Rashid <iras...@cloudera.com>
> wrote:
>
>
> Kay and I were discussing some of the bigger scheduler changes getting
> proposed lately, and realized there is a broader discussion to have with
> the community, outside of any single JIRA.  I'll start by sharing my
> initial thoughts.  I know Kay has thoughts on this too, but it would be
> good to get input from everyone.
>
> In particular, SPARK-14649 & SPARK-13669 have got me thinking.  These are
> proposed changes in behavior that are not fixes for *correctness* in fault
> tolerance, but to improve the performance when there are faults.  The changes
> make some intuitive sense, but it's also hard to judge whether they are
> necessarily better; it's hard to verify the correctness of the changes; and
> it's hard to even know that we haven't broken the old behavior (because of
> how brittle the scheduler seems to be).
>
> So I'm wondering:
>
> 1) in the short-term, can we find ways to get these changes merged, but
> turned off by default, in a way that we feel confident won't break existing
> code?
>
> 2) a bit longer-term -- should we be considering bigger rewrites to the
> scheduler?  Particularly, to improve testability?  e.g., maybe if it was
> rewritten to more completely follow the actor model and eliminate shared
> state, the code would be cleaner and more testable.  Or maybe this is a
> crazy idea, and we'd just lose everything we'd learned so far and be stuck
> fixing just as many bugs in the new version.
>
> Imran
>
>
>
