Re: Job recovery with task manager restart

Kim, Hwanju Thu, 16 May 2019 10:28:48 -0700

Hi Thomas,

I have a question about the class loader issue, as it seems interesting.

My understanding is that at least the user class loader is unregistered from,
and then re-registered to, the library cache on the TM across a task restart.
If I understand correctly, the unregistered class loader should be GCed as long
as no object loaded by it lingers across the restart. However, there is no
guarantee that a UDF cleans up everything on close(). I've seen some libraries
used in UDFs let a daemon thread outlive a task, so any object held by that
thread and loaded by the unregistered user class loader causes the class loader
to leak. The daemon threads leak as well: even if the library treats the thread
as a singleton, every newly registered class loader spawns another one. If a
job keeps restarting, this leads to a metaspace OOM or an out-of-threads OOM.

So my first question is whether the memory issue you've seen is due to a Flink
issue or to a side effect caused by the UDF (as described above). My second
question is whether there is anything else besides the class loader issue. I'd
also like to know whether there has been any prior discussion.
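
To make the daemon thread pattern concrete, here is a minimal sketch in plain
Java. It is not Flink code: HeartbeatUdf, simulateRestarts and the thread name
are hypothetical names made up for illustration, and open()/close() only mimic
a UDF lifecycle. It assumes the background thread is started in open(); each
"restart" that skips close() leaves one more live thread behind, and in a real
TM each such thread would also pin the class loader that loaded its Runnable.

```java
import java.util.ArrayList;
import java.util.List;

public class DaemonThreadLeakSketch {

    /** Stand-in for a UDF with an open()/close() lifecycle; not a Flink class. */
    public static class HeartbeatUdf {
        private Thread worker;

        /** Starts a daemon thread, as some client libraries do internally. */
        public void open() {
            worker = new Thread(() -> {
                try {
                    while (true) {
                        Thread.sleep(50); // pretend to do periodic work
                    }
                } catch (InterruptedException e) {
                    // Interrupted by close(): fall through and let the thread end.
                }
            }, "udf-heartbeat"); // hypothetical thread name
            worker.setDaemon(true);
            worker.start();
        }

        /** Stops the thread and waits for it, so nothing outlives the task. */
        public void close() throws InterruptedException {
            worker.interrupt();
            worker.join(5000);
        }

        public boolean workerAlive() {
            return worker != null && worker.isAlive();
        }
    }

    /**
     * Simulates task restarts inside one JVM and returns how many background
     * threads are still alive afterwards. With cleanUp == false, close() is
     * skipped, which models a UDF that does not clean up after itself.
     */
    public static long simulateRestarts(int restarts, boolean cleanUp)
            throws InterruptedException {
        List<HeartbeatUdf> attempts = new ArrayList<>();
        for (int i = 0; i < restarts; i++) {
            HeartbeatUdf udf = new HeartbeatUdf();
            udf.open();
            if (cleanUp) {
                udf.close();
            }
            attempts.add(udf);
        }
        long leaked = attempts.stream().filter(HeartbeatUdf::workerAlive).count();
        for (HeartbeatUdf udf : attempts) {
            udf.close(); // tidy up so the sketch itself leaves nothing behind
        }
        return leaked;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("alive without close(): " + simulateRestarts(3, false));
        System.out.println("alive with close():    " + simulateRestarts(3, true));
    }
}
```

The sketch only counts threads. It does not try to show class loader
collection itself, since that depends on GC timing.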


Best,
Hwanju

On 5/16/19, 8:01 AM, "Thomas Weise" <t...@apache.org> wrote:

    Hi,
    
    When a job fails and is recovered by Flink, task manager JVMs are reused.
    That can cause problems when the failed job wasn't cleaned up properly, for
    example leaving behind the user class loader. This would manifest in rising
    base for memory usage, leading to a death spiral.
    
    It would be good to provide an option that guarantees isolation, by
    restarting the task manager processes. Managing the processes would depend
    on how Flink is deployed, but the recovery sequence would need to provide a
    hook for the user.
    
    Has there been prior discussion or related work?
    
    Thanks,
    Thomas
