I'd just like to point out that, like Matei, I have not seen workers drop, even under the most exotic job failures. We're running pretty close to master, though, so perhaps the issue is related to an uncaught exception in the Worker in a prior version of Spark.
On Tue, May 20, 2014 at 11:36 AM, Arun Ahuja <aahuj...@gmail.com> wrote:

> Hi Matei,
>
> Unfortunately, I don't have more detailed information, but we have seen
> the loss of workers in standalone mode as well. If a job is killed with
> CTRL-C, we will often see the number of workers and cores decrease on the
> Spark master page. The workers are still alive and well on the Cloudera
> Manager page, but they are no longer visible to the Spark master. Simply
> restarting the workers usually resolves this, but we often see workers
> disappear after a failed or killed job.
>
> If we see this occur again, I'll try to provide some logs.
>
>
> On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
>> Which version is this with? I haven’t seen standalone masters lose
>> workers. Is there other stuff on the machines that’s killing them, or
>> what errors do you see?
>>
>> Matei
>>
>> On May 16, 2014, at 9:53 AM, Josh Marcus <jmar...@meetup.com> wrote:
>>
>> > Hey folks,
>> >
>> > I'm wondering what strategies other folks are using for maintaining
>> > and monitoring the stability of stand-alone Spark clusters.
>> >
>> > Our master very regularly loses workers, and they (as expected) never
>> > rejoin the cluster. This is the same behavior I've seen using Akka
>> > cluster (if that's what Spark is using in stand-alone mode) -- are
>> > there configuration options we could be setting to make the cluster
>> > more robust?
>> >
>> > We have a custom script which monitors the number of workers (through
>> > the web interface) and restarts the cluster when necessary, as well as
>> > resolving other issues we face (like Spark shells left open,
>> > permanently claiming resources). It works, but it's nowhere close to a
>> > great solution.
>> >
>> > What are other folks doing? Is this something others observe as well?
>> > I suspect that the loss of workers is tied to jobs that run out of
>> > memory on the client side, or to our use of very large broadcast
>> > variables, but I don't have an isolated test case. I'm open to general
>> > answers here: for example, perhaps we should simply be using Mesos or
>> > YARN instead of stand-alone mode.
>> >
>> > --j
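For concreteness, here is a minimal sketch (Python 3) of the kind of watchdog script Josh describes: poll the standalone master's web UI for the number of live workers and alert or restart when it drops below what you expect. The master hostname, the expected worker count, and the exact shape of the master's /json response (a "workers" list with a "state" field) are assumptions about a typical standalone deployment, not something confirmed in this thread; adjust for your setup.

import json
import time
import urllib.request

MASTER_UI = "http://spark-master:8080/json"  # assumed master host/port
EXPECTED_WORKERS = 8                         # assumed cluster size

def alive_worker_count():
    # Fetch the master's JSON status page and count workers reporting ALIVE.
    # The "workers"/"state" field names are assumptions about the /json output.
    with urllib.request.urlopen(MASTER_UI, timeout=10) as resp:
        status = json.load(resp)
    return sum(1 for w in status.get("workers", []) if w.get("state") == "ALIVE")

while True:
    try:
        alive = alive_worker_count()
        if alive < EXPECTED_WORKERS:
            print("only %d/%d workers alive; restart needed" % (alive, EXPECTED_WORKERS))
            # hook in your restart / paging logic here,
            # e.g. ssh to the master and run sbin/stop-all.sh then sbin/start-all.sh
    except (OSError, ValueError) as exc:
        print("could not query master UI: %s" % exc)
    time.sleep(60)

In practice you would run this under cron or a supervisor rather than as a bare loop, and the restart hook is where cluster-specific logic (and cleanup of things like abandoned shells) would go.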