Hi Flink users!

TL;DR: My Flink taskmanagers frequently hang permanently in a shutdown handler's Thread.sleep() call when I issue a stop. I'm hitting a wall trying to debug it. https://issues.apache.org/jira/browse/FLINK-17470
I'm really scratching my head at this issue. In one particular environment where we have set up Flink 1.10 (on AWS boxes running CentOS 7) with HA job managers, the Flink taskmanagers will sometimes (fairly often) enter a permanent hang when we try to stop them with the taskmanager script. The hang seems to occur in org.apache.flink.runtime.util.JvmShutdownSafeguard$DelayedTerminator.run, inside a Thread.sleep() call.

My googling turned up reports of hangs in Thread.sleep() being caused by deadlocks at an OS (?) level <https://blogs.oracle.com/poonam/hung-jvm-due-to-the-threads-stuck-in-pthreadcondtimedwait>. The most obvious difference to me is that in our case every thread in the JVM is blocked on the pthread_wait() syscall.

Anyway, I'm at a loss here. If anyone in the Flink community has ever seen an issue like this, I would love to hear your insight! Stack traces and OS version information are in the linked ticket if anyone's curious.

Thanks!
-Hunter Herman