[ https://issues.apache.org/jira/browse/FLINK-17470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Metzger updated FLINK-17470:
-----------------------------------
    Priority: Blocker  (was: Critical)

> Flink task executor process permanently hangs on `flink-daemon.sh stop`, deletes PID file
> -----------------------------------------------------------------------------------------
>
>                 Key: FLINK-17470
>                 URL: https://issues.apache.org/jira/browse/FLINK-17470
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>         Environment:
> {code:java}
> $ uname -a
> Linux hostname.local 3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
> $ lsb_release -a
> LSB Version:    :core-4.1-amd64:core-4.1-noarch
> Distributor ID: CentOS
> Description:    CentOS Linux release 7.7.1908 (Core)
> Release:        7.7.1908
> Codename:       Core
> {code}
> Flink version 1.10
>
>            Reporter: Hunter Herman
>            Assignee: Robert Metzger
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
>         Attachments: flink_jstack.log, flink_mixed_jstack.log
>
>
> Hi Flink team!
> We've attempted to upgrade our Flink 1.9 cluster to 1.10, but are
> experiencing reproducible instability on shutdown. Specifically, it appears
> that the `kill` issued in the `stop` case of flink-daemon.sh is causing the
> task executor process to hang permanently. The process seems to be hanging in
> `org.apache.flink.runtime.util.JvmShutdownSafeguard$DelayedTerminator.run`,
> in a `Thread.sleep()` call, which strikes me as bizarre behavior. Also note
> that every thread in the process is BLOCKED on a `pthread_cond_wait` call. Is
> this an OS-level issue? Banging my head on a wall here. See attached stack
> traces for details.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)