1) I think this depends on individual case by case jira.  I haven't looked in 
detail at spark-14649 seems much larger although more the way I think we want 
to go. While SPARK-13669 seems less risky and easily configurable.
2) I don't know whether it needs an entire rewrite but I think there need to be 
some major changes made especially in the handling of reduces and fetch 
failures.  We could do a much better job of not throwing away existing work and 
more optimally handling the failure cases.  For this would it make sense for us 
to start with a jira that has a bullet list of things we would like to improve 
and get more cohesive view and see really how invasive it would be?
Tom 

    On Friday, March 24, 2017 10:41 AM, Imran Rashid <iras...@cloudera.com> 
wrote:
 

 Kay and I were discussing some of the  bigger scheduler changes getting 
proposed lately, and realized there is a broader discussion to have with the 
community, outside of any single jira.  I'll start by sharing my initial 
thoughts, I know Kay has thoughts on this too, but it would be good to input 
from everyone.
In particular, SPARK-14649 & SPARK-13669 have got me thinking.  These are 
proposed changes in behavior that are not fixes for *correctness* in fault 
tolerance, but to improve the performance when there faults.  The changes make 
some intuitive sense, but its also hard to judge whether they are necessarily 
better; its hard to verify the correctness of the changes; and its hard to even 
know that we haven't broken the old behavior (because of how brittle the 
scheduler seems to be).
So I'm wondering:
1) in the short-term, can we find ways to get these changes merged, but turned 
off by default, in a way that we feel confident won't break existing code?
2) a bit longer-term -- should we be considering bigger rewrites to the 
scheduler?  Particularly, to improve testability?  eg., maybe if it was 
rewritten to more completely follow the actor model and eliminate shared state, 
the code would be cleaner and more testable.  Or maybe this is a crazy idea, 
and we'd just lose everything we'd learned so far and be stuck fixing the as 
many bugs in the new version.
Imran


   

Reply via email to