Hey all,

We are experimenting with large keyed state in Flink 1.2 on a single task
manager, and the job manager keeps killing the task manager after about
10 minutes. This is the exception we see:

Fetching metrics failed.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244/user/MetricQueryService_0f7bba0b16b18e83b69c4a50e657bb1f]] after [10000 ms]

The task manager logs show nothing out of the ordinary, but the job manager
log shows this:

2017-04-19 20:56:52,230 Association with remote system [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2017-04-19 20:56:53,986 Fetching metrics failed.
2017-04-19 20:57:43,584 Association with remote system [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244]] Caused by: [Connection refused: flink-s-load-uscen-a-c001-n011/10.34.48.40:37244]
2017-04-19 20:57:49,517 Detected unreachable: [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244]
2017-04-19 20:57:49,517 Task manager akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244/user/taskmanager terminated.
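
To make the timing in that excerpt easier to read, here is a minimal Python sketch that turns the five timestamps into offsets from the first disassociation. The messages are abbreviated by hand (the "..." is my shortening, not log output), and the function names are hypothetical helpers, not anything from Flink.

```python
from datetime import datetime

# Job manager log lines from the excerpt above. Messages are shortened by
# hand for readability; only the timestamps are used for the arithmetic.
LOG_LINES = [
    "2017-04-19 20:56:52,230 Association with remote system ... Reason: [Disassociated]",
    "2017-04-19 20:56:53,986 Fetching metrics failed.",
    "2017-04-19 20:57:43,584 Association with remote system ... Connection refused",
    "2017-04-19 20:57:49,517 Detected unreachable",
    "2017-04-19 20:57:49,517 Task manager ... terminated.",
]

TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = len("2017-04-19 20:56:52,230")


def parse_line(line):
    """Split a log line into (timestamp, message)."""
    stamp = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    return stamp, line[TS_LENGTH:].strip()


def seconds_since_first(lines):
    """Return (offset_seconds, message) pairs relative to the first line."""
    events = [parse_line(line) for line in lines]
    start = events[0][0]
    return [((stamp - start).total_seconds(), msg) for stamp, msg in events]


if __name__ == "__main__":
    for offset, msg in seconds_since_first(LOG_LINES):
        print(f"+{offset:7.3f}s  {msg}")
```

Read this way, the metrics fetch failure is logged about 1.8 seconds after the first disassociation, and the task manager is declared terminated about 57 seconds after it.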

The weird part is that we have not set up any metrics reporters, so I am
not sure why the job manager is asking the task manager for metrics at
all. Is there a way to disable these metrics requests, or does anyone
know what is causing them?

Thanks,
-- 
*Jason Brelloch* | Product Developer
3405 Piedmont Rd. NE, Suite 325, Atlanta, GA 30305
<http://www.bettercloud.com/>