Actually, it looks like even when the job shuts down cleanly, there can be a few minutes of overlap between the time the next job launches and the time the first job's process actually terminates. Here are some relevant lines from my log:
14/10/09 20:49:20 INFO Worker: Asked to kill executor app-20141009204127-0029/1
14/10/09 20:49:20 INFO ExecutorRunner: Runner thread for executor app-20141009204127-0029/1 interrupted
14/10/09 20:49:20 INFO ExecutorRunner: Killing process!
14/10/09 20:49:20 INFO Worker: Asked to launch executor app-20141009204508-0030/1 for Job
... More lines about launching the new job ...
14/10/09 20:51:17 INFO Worker: Executor app-20141009204127-0029/1 finished with state KILLED

As you can see, the first app didn't actually shut down until two minutes after the new job launched. During that time, the worker was at double its configured memory limit.

Keith

On Thu, Oct 9, 2014 at 5:06 PM, Keith Simmons <ke...@pulse.io> wrote:
> Hi Folks,
>
> We have a Spark job that is occasionally running out of memory and hanging
> (I believe in GC). That is its own issue we're debugging, but in the
> meantime, there's another unfortunate side effect. When the job is killed
> (most often because of GC errors), each worker attempts to kill its
> respective executor. However, several of the executors fail to shut
> themselves down (I actually have to kill -9 them). And even though the
> worker fails to clean up the executor, it starts the next job as though
> all the resources have been freed. This causes the Spark worker to exceed
> its configured memory limit, which in turn runs our boxes out of memory.
> Is there a setting I can configure to prevent this issue? Perhaps by
> having the worker force-kill the executor, or not start the next job until
> it has confirmed the executor has exited? Let me know if there's any
> additional information I can provide.
>
> Keith
>
> P.S. We're running Spark 1.0.2
>
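In case it helps anyone else hitting this, below is a rough sketch of the stopgap reaper we've been considering running on each worker box until we find a proper setting. It isn't Spark API, just a process sweep, and it leans on a few assumptions about our particular setup: a Linux /proc filesystem, executors showing up as CoarseGrainedExecutorBackend JVMs (which matches our ps output), and each executor's working directory living under the worker's work dir with the app id in its path, so a stray process can be tied back to the finished application. The script name and the 60-second grace period are just placeholders.

#!/usr/bin/env python
# reap_executors.py (hypothetical name): after a job is torn down, find any
# executor JVMs still running for the old app id and force-kill them if they
# don't exit within a grace period. Linux-only (walks /proc).
import os
import signal
import sys
import time

def executor_pids(app_id):
    """Return pids of executor JVMs whose working directory mentions app_id."""
    pids = []
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            cmdline = open("/proc/%s/cmdline" % pid).read()
            cwd = os.readlink("/proc/%s/cwd" % pid)
        except (IOError, OSError):
            continue  # process already exited, or we lack permission
        # Assumption: executors run as CoarseGrainedExecutorBackend with their
        # cwd set to <worker work dir>/<app id>/<executor id>.
        if "CoarseGrainedExecutorBackend" in cmdline and app_id in cwd:
            pids.append(int(pid))
    return pids

def force_kill(app_id, grace_seconds=60):
    """Ask nicely with SIGTERM, then SIGKILL anything still alive afterwards."""
    for pid in executor_pids(app_id):
        os.kill(pid, signal.SIGTERM)
    time.sleep(grace_seconds)
    for pid in executor_pids(app_id):
        print("executor pid %d for %s ignored SIGTERM, sending SIGKILL" % (pid, app_id))
        os.kill(pid, signal.SIGKILL)

if __name__ == "__main__":
    force_kill(sys.argv[1])

The idea would be to run it (e.g. ./reap_executors.py app-20141009204127-0029) right after the worker logs that it asked to kill an app's executors, so anything that ignores the polite shutdown gets the kill -9 we're currently doing by hand. A real setting on the worker side would obviously be preferable.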