Which version is this with? I haven’t seen standalone masters lose workers. Is 
there other stuff on the machines that’s killing them, or what errors do you 
see?

Matei

On May 16, 2014, at 9:53 AM, Josh Marcus <jmar...@meetup.com> wrote:

> Hey folks,
> 
> I'm wondering what strategies other folks are using for maintaining and 
> monitoring the stability of standalone Spark clusters.
> 
> Our master very regularly loses workers, and they (as expected) never rejoin 
> the cluster.  This is the same behavior I've seen using Akka cluster (if 
> that's what Spark is using in standalone mode) -- are there configuration 
> options we could be setting to make the cluster more robust?
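> 
> (For concreteness, the sort of thing I imagine tuning -- purely a guess on 
> our part, since I'm not sure which properties apply to our version -- would 
> be the master's worker timeout and the Akka timeout in spark-env.sh:
> 
>     # Hypothetical sketch: give workers longer before the master declares
>     # them lost, and allow slower Akka replies before timing out.
>     export SPARK_DAEMON_JAVA_OPTS="$SPARK_DAEMON_JAVA_OPTS \
>       -Dspark.worker.timeout=120 \
>       -Dspark.akka.timeout=200"
> 
> but I don't know whether those are actually the right knobs.)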
> 
> We have a custom script which monitors the number of workers (through the 
> web interface), restarts the cluster when necessary, and resolves other 
> issues we face (like Spark shells left open, permanently claiming 
> resources). It works, but it's nowhere close to a great solution.
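> 
> For what it's worth, the core of that script is just polling the master's 
> status page and restarting if workers go missing. A minimal sketch (assuming 
> the master's JSON status endpoint at http://<master>:8080/json, an expected 
> worker count of 10, and a hypothetical restart-spark-cluster.sh helper) 
> looks roughly like:
> 
>     import json
>     import subprocess
>     from urllib.request import urlopen
> 
>     EXPECTED_WORKERS = 10                           # hypothetical cluster size
>     MASTER_JSON = "http://spark-master:8080/json"   # hypothetical master host
> 
>     # Ask the standalone master for its status and count workers still ALIVE.
>     status = json.loads(urlopen(MASTER_JSON).read().decode("utf-8"))
>     alive = [w for w in status["workers"] if w["state"] == "ALIVE"]
> 
>     if len(alive) < EXPECTED_WORKERS:
>         # Bounce the whole cluster (wraps sbin/stop-all.sh + start-all.sh).
>         subprocess.check_call(["./restart-spark-cluster.sh"])
> 
> run from cron every few minutes, plus extra checks for stale shells.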
> 
> What are other folks doing?  Is this something that other folks observe as 
> well?  I suspect that the loss of workers is tied to jobs that run out of 
> memory on the client side, or to our use of very large broadcast variables, 
> but I don't have an isolated test case. I'm open to general answers here: 
> for example, perhaps we should simply be using Mesos or YARN instead of 
> standalone mode.
> 
> --j
> 
