On Fri, Mar 24, 2017 at 4:41 PM, Imran Rashid <iras...@cloudera.com> wrote:
> Kay and I were discussing some of the bigger scheduler changes getting
> proposed lately, and realized there is a broader discussion to have with
> the community, outside of any single jira. I'll start by sharing my
> initial thoughts. I know Kay has thoughts on this too, but it would be good
> to get input from everyone.
>
> In particular, SPARK-14649 & SPARK-13669 have got me thinking. These are
> proposed changes in behavior that are not fixes for *correctness* in fault
> tolerance, but to improve the performance when there are faults. The changes
> make some intuitive sense, but it's also hard to judge whether they are
> necessarily better; it's hard to verify the correctness of the changes; and
> it's hard to even know that we haven't broken the old behavior (because of
> how brittle the scheduler seems to be).
>
> So I'm wondering:
>
> 1) in the short-term, can we find ways to get these changes merged, but
> turned off by default, in a way that we feel confident won't break existing
> code?
>

+1

For risky features that's how we often do it. Feature flag it and turn it on later.

>
> 2) a bit longer-term -- should we be considering bigger rewrites to the
> scheduler? Particularly, to improve testability? e.g., maybe if it was
> rewritten to more completely follow the actor model and eliminate shared
> state, the code would be cleaner and more testable. Or maybe this is a
> crazy idea, and we'd just lose everything we'd learned so far and be stuck
> fixing just as many bugs in the new version.
>

This of course depends. Refactoring a large, complicated piece of code is one of the most challenging tasks in engineering. It is extremely difficult to ensure things are still correct afterwards, especially in areas that don't have amazing test coverage.
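
To make the "merged, but turned off by default" idea from (1) a bit more concrete, here is a minimal sketch of a default-off feature flag. It is only an illustration of the pattern: the config key, the two code paths, and the function names are all hypothetical and are not actual Spark settings or scheduler code.

```python
# Hypothetical config key; NOT a real Spark setting.
EXPERIMENTAL_FLAG = "scheduler.experimentalFaultHandling.enabled"


def flag_enabled(conf, key, default=False):
    """Read a boolean flag from a string-valued config map.

    The flag is off unless the key is explicitly set to "true",
    so existing deployments keep the old behavior.
    """
    raw = conf.get(key)
    if raw is None:
        return default
    return raw.strip().lower() == "true"


def legacy_path(event):
    # Stand-in for the existing, well-exercised behavior.
    return ("legacy", event)


def experimental_path(event):
    # Stand-in for the proposed behavior change.
    return ("experimental", event)


def handle_fault(conf, event):
    """Dispatch to the new code path only when the flag is switched on."""
    if flag_enabled(conf, EXPERIMENTAL_FLAG):
        return experimental_path(event)
    return legacy_path(event)
```

With an empty config, `handle_fault` takes the legacy path; the experimental path runs only when the key is explicitly set to `"true"`. The point of keeping the branch at a single dispatch site is that the old behavior stays untouched and can be tested on its own while the flag is off.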