Scenario
* Take a savepoint with cancel, then restore the job from it. This brings down the JM and relaunches it on a different IP, so the DNS name now resolves to a new IP.
* The TM deployment is not rolled (recreated).
* Note that `flink-conf.yaml:metrics.internal.query-service.port` is hardcoded.
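A rough way to check whether a TM is still holding a stale JM address is to resolve the JM hostname again from inside the TM pod and compare the result with the IP the TM keeps trying to reach. This is only a sketch: the hostname `flink-jobmanager` and the "new" IP `172.17.9.20` are made up for illustration, and it assumes the address shown in the connection error is the JM's previous IP. The resolver is injectable so the example runs without a real cluster.

```python
import socket


def is_stale(hostname, cached_ip, resolve=socket.gethostbyname):
    """Return (stale, current_ip): stale is True when DNS no longer
    resolves hostname to the IP the client is still using."""
    current_ip = resolve(hostname)
    return current_ip != cached_ip, current_ip


# Offline illustration with a fake resolver (hypothetical hostname and new IP).
fake_dns = {"flink-jobmanager": "172.17.9.20"}
stale, current_ip = is_stale(
    "flink-jobmanager",          # hypothetical JM service name
    "172.17.6.135",              # IP taken from the connection error
    resolve=lambda host: fake_dns[host],
)
print(f"stale={stale} current_ip={current_ip}")
```

Inside a real TM pod, dropping the `resolve` argument makes the function use `socket.gethostbyname`, so it reports whatever the cluster DNS returns at that moment.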
Error:

Remote connection to [null] failed with org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException: connection timed out: [dns]/172.17.6.135:6666

Solution: Restart the TM deployment (though that should not be necessary, and it will cause latency issues on a shared resource manager such as k8s).

PS: I am sure that a cancel/restart, or a restart of the JM because of any issue, will cause the same problem (not tested).

Regards