I use spot instances for a 100-slave cluster (r3.2xlarge in us-west-1). The jobs I run usually take about 15 hours, and the cluster is stable and fast. One or two machines may get terminated, but that is a very rare event and Spark can handle it.
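
To put rough numbers on why losing one or two machines is cheap while losing the whole fleet is not, here is a toy Python model of the recomputation Sven describes in the quoted message. It is only a sketch and not Spark code: the function name `tasks_to_rebuild`, the 4-partition, 3-stage layout, and the assumption that every stage after the first is a shuffle are all made up for illustration.

```python
def tasks_to_rebuild(stage, part, available, num_parts):
    """Count the tasks that must rerun to rebuild one partition.

    Toy assumptions (not Spark's real scheduler):
      - stage 0 reads durable input data, so it can always be rerun;
      - every later stage is a shuffle, so each of its partitions
        depends on ALL partitions of the previous stage.
    `available` is a set of (stage, partition) pairs that still exist;
    it is updated in place as partitions are rebuilt.
    """
    if (stage, part) in available:
        return 0
    rerun = 1  # the task for this partition itself
    if stage > 0:
        for parent in range(num_parts):
            rerun += tasks_to_rebuild(stage - 1, parent, available, num_parts)
    available.add((stage, part))
    return rerun


NUM_PARTS, NUM_STAGES = 4, 3

# One node lost: assume it held partition 3 of every stage.
one_lost = {(s, p) for s in range(NUM_STAGES) for p in range(NUM_PARTS) if p != 3}
print(tasks_to_rebuild(2, 3, one_lost, NUM_PARTS))  # 3

# All nodes lost: only the input data survives.
nothing_left = set()
print(sum(tasks_to_rebuild(2, p, nothing_left, NUM_PARTS)
          for p in range(NUM_PARTS)))  # 12
```

In this toy setup, losing one node costs 3 of the 12 tasks, while losing every node means rerunning all 12, i.e. the whole job.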

On Fri, Mar 25, 2016 at 6:28 PM, Sven Krasser <kras...@gmail.com> wrote:

> When a spot instance terminates, you lose all data (RDD partitions) stored
> in the executors that ran on that instance. Spark can recreate the
> partitions from input data, but if that requires going through multiple
> preceding shuffles, a good chunk of the job will need to be redone.
> -Sven
>
> On Thu, Mar 24, 2016 at 10:15 PM, Dillian Murphey <crackshotm...@gmail.com>
> wrote:
>
>> I'm very new to Apache Spark. I'm just a user, not a developer.
>>
>> I'm running a cluster with many spot instances. Am I correct in
>> understanding that Spark can handle an unlimited number of spot instance
>> failures and restarts? Sometimes all the spot instances will disappear
>> without warning, and then they come back. Can I trust Spark to pick up all
>> jobs where it left off?
>>
>> I'm noticing some instability with my system. I suspect it could be
>> disk or RAM issues. When I add a lot of slaves I run low on RAM on my
>> master. Maybe that's part of the problem. But I just want to confirm my
>> understanding.
>>
>
> --
> www.skrasser.com <http://www.skrasser.com/?utm_source=sig>