Sun Rui created SPARK-17519:
-------------------------------

             Summary: [MESOS] Enhance robustness when ExternalShuffleService is broken
                 Key: SPARK-17519
                 URL: https://issues.apache.org/jira/browse/SPARK-17519
             Project: Spark
          Issue Type: Improvement
          Components: Mesos
    Affects Versions: 2.0.0
            Reporter: Sun Rui


This is intended to complement SPARK-17370, which addressed Standalone mode 
only.
For Mesos, we could enhance MesosExternalShuffleClient to detect, when sending 
heartbeats, whether any of the external shuffle services has been lost. In 
that case, MesosCoarseGrainedSchedulerBackend can report ExecutorLost with 
workerLost=true. It can also add the slave on which the lost external shuffle 
service was running to the blacklist, so that no further tasks are launched 
on it.
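
The detection logic proposed above could be sketched roughly as follows. This 
is only an illustration under stated assumptions, not Spark code: the class 
ShuffleServiceHeartbeatTracker, its methods, and the consecutive-failure 
threshold are all hypothetical. The callback stands in for whatever the 
scheduler backend would do on loss (report ExecutorLost with workerLost=true 
and blacklist the slave).

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical helper, not part of Spark: tracks heartbeat results per
// shuffle service host and reports a host as lost exactly once.
class ShuffleServiceHeartbeatTracker {
  // Assumed policy: a service counts as lost after this many
  // consecutive failed heartbeats.
  private final int maxConsecutiveFailures;
  // Invoked once per lost host; stands in for the scheduler backend.
  private final Consumer<String> onServiceLost;
  private final Map<String, Integer> failures = new ConcurrentHashMap<>();
  private final Set<String> lostHosts = ConcurrentHashMap.newKeySet();

  ShuffleServiceHeartbeatTracker(int maxConsecutiveFailures,
                                 Consumer<String> onServiceLost) {
    this.maxConsecutiveFailures = maxConsecutiveFailures;
    this.onServiceLost = onServiceLost;
  }

  // A successful heartbeat resets the failure count for that host.
  void heartbeatSucceeded(String host) {
    failures.remove(host);
  }

  // A failed heartbeat increments the count; once the threshold is
  // reached the host is recorded as lost and the callback fires once.
  void heartbeatFailed(String host) {
    int count = failures.merge(host, 1, Integer::sum);
    if (count >= maxConsecutiveFailures && lostHosts.add(host)) {
      onServiceLost.accept(host);
    }
  }

  // The scheduler would consult this before launching tasks on a host.
  boolean isBlacklisted(String host) {
    return lostHosts.contains(host);
  }
}
```

In this sketch a host stays blacklisted once it is marked lost. Whether a 
recovered shuffle service should be removed from the blacklist is a policy 
question that the proposal leaves open.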



