Hey all,

We are experimenting with large keyed state in Flink 1.2 on a single task manager, and the task manager keeps getting killed by the job manager after about 10 minutes. The exception we see is:

    Fetching metrics failed.
    akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244/user/MetricQueryService_0f7bba0b16b18e83b69c4a50e657bb1f]] after [10000 ms]

The task manager logs show nothing out of the ordinary, but the job manager logs show this:

    2017-04-19 20:56:52,230 Association with remote system [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
    2017-04-19 20:56:53,986 Fetching metrics failed.
    2017-04-19 20:57:43,584 Association with remote system [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244]] Caused by: [Connection refused: flink-s-load-uscen-a-c001-n011/10.34.48.40:37244]
    2017-04-19 20:57:49,517 Detected unreachable: [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244]
    2017-04-19 20:57:49,517 Task manager akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244/user/taskmanager terminated.

The weird part is that we have not set up any metrics reporters, so I am not sure why the job manager is asking the task manager for metrics at all. Is there a way to disable these metrics requests, or does anyone know what is causing them?

Thanks,

--
Jason Brelloch | Product Developer