Hi Flink users!

TL;DR: My Flink taskmanagers frequently hang permanently in a shutdown 
handler’s Thread.sleep() call when I issue a stop. I've hit a wall trying to 
debug it.  https://issues.apache.org/jira/browse/FLINK-17470

I’m really scratching my head over this issue. In one particular environment 
where we have set up Flink 1.10 (on AWS boxes, CentOS 7) with HA job managers, 
the Flink taskmanagers will sometimes (fairly often) enter a permanent hang 
when we try to stop them with the taskmanager script. They seem to get stuck in 
org.apache.flink.runtime.util.JvmShutdownSafeguard$DelayedTerminator.run, 
inside a Thread.sleep() call. My googling turned up reports of hangs in 
Thread.sleep() being caused by deadlocks at an OS (?) level 
<https://blogs.oracle.com/poonam/hung-jvm-due-to-the-threads-stuck-in-pthreadcondtimedwait>.
The most obvious difference from that report, to me, is that in our case every 
thread in the JVM is blocked on the pthread_wait() syscall.
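
For anyone unfamiliar with the safeguard, here is a minimal sketch of the 
general "delayed terminator" pattern as I understand it: a shutdown hook starts 
a daemon thread that sleeps for a grace period and then force-halts the JVM if 
it is still alive. This is NOT Flink's actual source; the class name, method 
names, and parameters below are hypothetical and only meant to show where the 
Thread.sleep() frame comes from.

```java
// Illustrative sketch only: not Flink's source. All names here are hypothetical.
final class DelayedTerminatorSketch {

    /** Starts a daemon thread that sleeps for delayMillis, then runs haltAction. */
    static Thread startTerminator(long delayMillis, Runnable haltAction) {
        Thread t = new Thread(() -> {
            try {
                // This is the kind of frame in which our taskmanagers appear stuck.
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                // Interrupted early: fall through and terminate anyway.
            }
            haltAction.run();
        }, "delayed-terminator-sketch");
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Registers a shutdown hook that arms the terminator when shutdown begins. */
    static void install(long delayMillis, int exitCode) {
        Runtime.getRuntime().addShutdownHook(new Thread(
            () -> startTerminator(delayMillis, () -> Runtime.getRuntime().halt(exitCode)),
            "shutdown-safeguard-sketch"));
    }
}
```

With something like DelayedTerminatorSketch.install(5000, 1) called at startup 
(both values arbitrary here), a stop signal should end the process at most 
about five seconds after the hooks start running, which is why a sleep that 
never returns is so surprising.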

Anyway, I’m at a loss here. If anyone in the Flink community has ever seen an 
issue like this, I would love to hear your insight! Stack traces & OS version 
information are in the linked ticket if anyone's curious.

Thanks!
-Hunter Herman

