[ https://issues.apache.org/jira/browse/FLINK-14328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Metzger updated FLINK-14328:
-----------------------------------
    Component/s: Deployment / Kubernetes

> JobCluster cannot reach TaskManager in K8s
> ------------------------------------------
>
>                 Key: FLINK-14328
>                 URL: https://issues.apache.org/jira/browse/FLINK-14328
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>            Reporter: Tim
>            Priority: Major
>             Fix For: 1.9.2
>
>
> I have a Job Cluster which I am running in K8s. It consists of
> * job manager deployment (1)
> * task manager deployment (1)
> * service
> This is more or less following the standard "Job Cluster" setup.
> Additionally (due to known issues with TMs talking to JMs), I have set
> taskmanager.network.bind-policy to "ip", so that the task manager binds to
> the IP of the pod rather than the pod name (which is not reachable via DNS).
> So far so good.
>
> Once the cluster is started, I can see the job running. I can also see that
> the JM's resource manager has registered the TM.
> {code:java}
> 2019-10-05 20:37:14.554 [flink-akka.actor.default-dispatcher-4] DEBUG org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Slot Pool Status:
> status: connected to akka.tcp://flink@data-capture-enrichedtrans-raw-jobcluster:6123/user/resourcemanager
> registered TaskManagers: [f34656491b8dfae726d992d276dc6d39]
> available slots: []
> allocated slots: [[AllocatedSlot a00f44d19f38ca36da3ae5083c2d02ae @ f34656491b8dfae726d992d276dc6d39 @ data-capture-enrichedtrans-raw-taskmanager-674476f57c-26kxr (dataPort=35815) - 0]]
> pending requests: []
> }
> {code}
> However, I see several errors like the one below before the job eventually
> fails (after maybe 5 minutes) and goes into recovery. This repeats until all
> restarts are exhausted, at which point the cluster fails completely.
> {code:java}
> 2019-10-05 20:42:14.768 [flink-akka.actor.default-dispatcher-19] WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-6 - Association with remote system [akka.tcp://flink@10.107.38.92:50100] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@10.107.38.92:50100]] Caused by: [java.net.ConnectException: Connection refused: /10.107.38.92:50100]
> {code}
> {{To me it looks like the JM is not able to make a connection on the RPC port
> of the taskmanager (50100 is the taskmanager.rpc.port setting, and
> 10.107.38.92 is the IP address of the task manager pod as seen by "kubectl
> describe pod".)}}
> {{Has anyone come across this issue?}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)