Hi Thomas,

I have a question about the class loader issue, as it seems interesting.

My understanding is that at least the user class loader is unregistered from and re-registered to the library cache on the TM across a task restart. If I understand it correctly, the unregistered one should be GCed as long as no object loaded by it lingers across the restart. However, there is no guarantee that a UDF cleans up everything on close(). I've seen some libraries used in UDFs let a daemon thread outlive the task, so any object in that thread that was loaded by the unregistered user class loader causes the class loader to leak. The daemon threads leak as well: even if the library treats the thread as a singleton, a new one is spawned for each newly registered class loader. If a job keeps restarting, this behavior leads to a metaspace OOM or to running out of threads/OOM.

So my first question is whether the memory issue you've seen is caused by Flink itself or is a side effect of the UDF (as I described). My second question is whether there is anything else besides the class loader issue. I'd also like to know whether any prior discussion is going on.
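To make the leak pattern concrete, here is a minimal standalone Java sketch. It is not Flink code: the class name, the thread name, and the empty URLClassLoader standing in for the user class loader are all made up for illustration. It also assumes the lingering thread references the loader through its context class loader, whereas in a real UDF the reference could just as well come from a class or object the loader defined.

```java
import java.lang.ref.WeakReference;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.concurrent.CountDownLatch;

public class ClassLoaderLeakSketch {

    /** Weak handle to the stand-in "user" class loader plus the thread that outlives the task. */
    static final class Task {
        final WeakReference<ClassLoader> loaderRef;
        final CountDownLatch stop;
        Thread lingering;

        Task(WeakReference<ClassLoader> loaderRef, CountDownLatch stop, Thread lingering) {
            this.loaderRef = loaderRef;
            this.stop = stop;
            this.lingering = lingering;
        }

        /** Stops the daemon thread and drops our reference to it. */
        void stopThread() {
            stop.countDown();
            try {
                lingering.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            // A finished Thread object can still reference its context class loader.
            lingering = null;
        }
    }

    /** Simulates a task whose UDF library starts a daemon thread and never stops it. */
    static Task runTask() {
        // Hypothetical stand-in for the user class loader (no URLs, nothing to load).
        ClassLoader userLoader =
                new URLClassLoader(new URL[0], ClassLoaderLeakSketch.class.getClassLoader());
        CountDownLatch stop = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            try {
                stop.await(); // idles like a background worker that close() forgot about
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "udf-library-daemon");
        t.setDaemon(true);
        t.setContextClassLoader(userLoader); // this reference keeps the loader reachable
        t.start();
        return new Task(new WeakReference<>(userLoader), stop, t);
    }

    /** Requests GC a few times; returns true only if the referent was actually collected. */
    static boolean isCollected(WeakReference<?> ref) {
        for (int i = 0; i < 10 && ref.get() != null; i++) {
            System.gc();
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return ref.get() == null;
    }

    public static void main(String[] args) {
        Task task = runTask();
        System.out.println("collected while daemon thread alive: " + isCollected(task.loaderRef));
        task.stopThread();
        System.out.println("collected after daemon thread exits: " + isCollected(task.loaderRef));
    }
}
```

While the thread is alive the loader stays strongly reachable, so the first check cannot report a collection. The second check depends on the JVM honoring System.gc(), so it usually reports true but is not guaranteed to.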
Best,
Hwanju

On 5/16/19, 8:01 AM, "Thomas Weise" <t...@apache.org> wrote:

Hi,

When a job fails and is recovered by Flink, task manager JVMs are reused. That can cause problems when the failed job wasn't cleaned up properly, for example leaving behind the user class loader. This would manifest in rising base for memory usage, leading to a death spiral.

It would be good to provide an option that guarantees isolation, by restarting the task manager processes. Managing the processes would depend on how Flink is deployed, but the recovery sequence would need to provide a hook for the user.

Has there been prior discussion or related work?

Thanks,
Thomas