Kay and I were discussing some of the  bigger scheduler changes getting
proposed lately, and realized there is a broader discussion to have with
the community, outside of any single jira.  I'll start by sharing my
initial thoughts, I know Kay has thoughts on this too, but it would be good
to input from everyone.

In particular, SPARK-14649 & SPARK-13669 have got me thinking.  These are
proposed changes in behavior that are not fixes for *correctness* in fault
tolerance, but to improve the performance when there faults.  The changes
make some intuitive sense, but its also hard to judge whether they are
necessarily better; its hard to verify the correctness of the changes; and
its hard to even know that we haven't broken the old behavior (because of
how brittle the scheduler seems to be).

So I'm wondering:

1) in the short-term, can we find ways to get these changes merged, but
turned off by default, in a way that we feel confident won't break existing
code?

2) a bit longer-term -- should we be considering bigger rewrites to the
scheduler?  Particularly, to improve testability?  eg., maybe if it was
rewritten to more completely follow the actor model and eliminate shared
state, the code would be cleaner and more testable.  Or maybe this is a
crazy idea, and we'd just lose everything we'd learned so far and be stuck
fixing the as many bugs in the new version.

Imran

Reply via email to