Thanks Aaron.

From: Aaron Davidson <ilike...@gmail.com>
Reply-To: <user@spark.apache.org>
Date: Monday, April 14, 2014 at 10:30 AM
To: <user@spark.apache.org>
Subject: Re: Spark resilience
Master and slave are somewhat overloaded terms in the Spark ecosystem (see the glossary: http://spark.apache.org/docs/latest/cluster-overview.html#glossary). Are you actually asking about the Spark "driver" and "executors", or the standalone cluster "master" and "workers"? To briefly answer for either possibility:

(1) Drivers are not fault tolerant, but they can be restarted automatically. Executors may be removed at any point without failing the job (though losing an Executor may slow the job significantly), and Executors may be added at any point and will be used immediately.

(2) Standalone cluster Masters are fault tolerant: a Master failure only temporarily stalls new jobs from starting or acquiring new resources, and does not affect currently-running jobs. Workers can fail, which simply causes jobs to lose their current Executors. New Workers can be added at any point.

On Mon, Apr 14, 2014 at 11:00 AM, Ian Ferreira <ianferre...@hotmail.com> wrote:
> Folks,
>
> I was wondering what the failure support modes were for Spark while
> running jobs.
>
> 1. What happens when a master fails?
> 2. What happens when a slave fails?
> 3. Can you add and remove slaves mid-job?
>
> Regarding the install on Mesos, if I understand correctly the Spark master is
> behind a ZooKeeper quorum, so that isolates the slaves from a master failure,
> but what about the masters behind the quorum?
>
> Cheers
> - Ian
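For the standalone-cluster case Aaron describes, both behaviors are configurable: Master failover is enabled through ZooKeeper recovery mode, and driver auto-restart through the `--supervise` flag of `spark-submit`. A minimal sketch follows; the ZooKeeper hostnames (`zk1`–`zk3`), the Master hostnames, and the application class/jar names are hypothetical placeholders, not values from this thread:

```shell
# conf/spark-env.sh on each Master: enable standby-Master failover
# backed by a ZooKeeper quorum (zk1/zk2/zk3 are placeholder hosts).
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"

# Submit with both Masters listed so the driver can find whichever is
# active, and with --supervise so a failed driver is restarted
# automatically (cluster deploy mode only). App class/jar are placeholders.
./bin/spark-submit \
  --master spark://master1:7077,master2:7077 \
  --deploy-mode cluster \
  --supervise \
  --class com.example.MyApp \
  myapp.jar
```

With this setup, losing the active Master stalls only new scheduling until a standby takes over, matching the behavior described above; running jobs keep their Executors.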