Hi,

We have been experiencing issues in production over the past couple of weeks
with Spark Standalone Worker JVMs that appear to have memory leaks: Old Gen
usage accumulates until it reaches the maximum, at which point the Worker
enters a failed state and some applications running against the cluster
start failing critically.

I've explored the parts of the Spark codebase related to the Worker in
search of potential sources of this problem, and I'm looking for commentary
on a couple of theories I have:

Observation 1: The `finishedExecutors` HashMap seems to only accumulate new
entries over time, unbounded. It appears to only ever be appended to and is
never periodically purged or cleaned of older executors by something like
the worker cleanup scheduler.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L473

I'm fairly confident that over time this will exhibit a "leak". I put it in
quotes only because holding these references may be intentional to support
some functionality, as opposed to a true leak where memory is held onto by
accident.
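
To make the pattern concrete, here is a minimal sketch of what I mean. This
is not the actual Worker code; `ExecInfo`, the tracker class, and the
retained-count policy are stand-ins I made up, but they show the difference
between append-only retention and a capped map that trims the oldest entries:

    import scala.collection.mutable

    // Minimal sketch, not the actual Worker code: ExecInfo and the
    // retained-count policy are illustrative stand-ins.
    case class ExecInfo(appId: String, execId: Int, state: String)

    class FinishedExecutorTracker(retainedExecutors: Int) {
      // Insertion-ordered so the oldest finished executors drop first.
      private val finishedExecutors = mutable.LinkedHashMap[String, ExecInfo]()

      def add(fullId: String, info: ExecInfo): Unit = {
        finishedExecutors(fullId) = info
        // Without a trim step like this, the map only ever grows.
        if (finishedExecutors.size > retainedExecutors) {
          val toDrop = finishedExecutors.size - retainedExecutors
          finishedExecutors.keysIterator.take(toDrop).toList
            .foreach(finishedExecutors.remove)
        }
      }

      def size: Int = finishedExecutors.size
    }

    object TrackerDemo extends App {
      val tracker = new FinishedExecutorTracker(retainedExecutors = 1000)
      // Simulate roughly a day of our load: ~1000 apps x 16 executors each.
      for (app <- 1 to 1000; exec <- 1 to 16) {
        tracker.add(s"app-$app/$exec", ExecInfo(s"app-$app", exec, "EXITED"))
      }
      println(tracker.size) // stays capped at 1000 instead of reaching 16000
    }

Something along the lines of the existing worker cleanup scheduler, or a
retained-count cap like the sketch above, would keep that map bounded.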

Observation 2: I feel much less certain about this one, but it looks like
when the Worker receives a `KillExecutor` message it only kills the
`ExecutorRunner` and does not remove it from the executor map.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L492

I haven't been able to work out whether I'm missing an indirect path that
removes that executor from the map before or after the kill. If there isn't
one, then this map may also be leaking references.
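
To show what I'm trying to confirm, here is a rough, simplified sketch,
again not the actual Worker code (`FakeRunner` and the handler names are
stand-ins of mine), of the kill-without-remove pattern I think I'm seeing
versus the indirect state-change cleanup I'd expect to exist somewhere:

    import scala.collection.mutable

    // Simplified sketch, not the actual Worker code; the message, runner,
    // and handler names are made up to illustrate the question.
    case class KillExecutorMsg(appId: String, execId: Int)

    class FakeRunner(val fullId: String) {
      @volatile var killed = false
      def kill(): Unit = { killed = true }
    }

    class WorkerSketch {
      val executors = mutable.HashMap[String, FakeRunner]()
      val finishedExecutors = mutable.HashMap[String, FakeRunner]()

      // The pattern I think I'm seeing: the runner is killed, but its entry
      // stays in `executors`, presumably until some later state-change
      // message cleans it up. If that message never arrives, it lingers.
      def handleKill(msg: KillExecutorMsg): Unit = {
        val fullId = s"${msg.appId}/${msg.execId}"
        executors.get(fullId).foreach(_.kill())
        // Note: no executors.remove(fullId) here -- does it happen elsewhere?
      }

      // The indirect path I'd expect to exist somewhere: a state-change
      // handler that moves the runner out of the live map once it exits.
      def handleStateChanged(fullId: String): Unit = {
        executors.remove(fullId).foreach(r => finishedExecutors(fullId) = r)
      }
    }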

One final observation relates to our production metrics rather than the
codebase itself. We used to periodically see completed applications whose
executors all had a status of "Killed" instead of "Exited". Now, however,
every completed application shows a final state of "Killed" for all of its
executors. Speculatively, I'd correlate this with Observation 2 as a
potential reason we have started seeing this issue more recently.

Our workload has also been larger and growing over the past few weeks, and
there may be code changes to the application description that are
exacerbating these potential underlying issues. We run a lot of smaller
applications, somewhere in the range of hundreds to maybe 1000 applications
per day with 16 executors per application, so if nothing ever purges those
maps we could be accumulating on the order of up to ~16,000 executor entries
per day.

Thanks
-- 
Richard Marscher
Software Engineer
Localytics
Localytics.com <http://localytics.com/> | Our Blog <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> | Facebook <http://facebook.com/localytics> | LinkedIn <http://www.linkedin.com/company/1148792?trk=tyah>
