[ 
https://issues.apache.org/jira/browse/FLINK-14328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-14328:
-----------------------------------
    Component/s: Deployment / Kubernetes

> JobCluster cannot reach TaskManager in K8s
> ------------------------------------------
>
>                 Key: FLINK-14328
>                 URL: https://issues.apache.org/jira/browse/FLINK-14328
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>            Reporter: Tim
>            Priority: Major
>             Fix For: 1.9.2
>
>
> I have a Job Cluster running in K8s.  It consists of:
>  * job manager deployment (1)
>  * task manager deployment (1)
>  * service
> This more or less follows the standard "Job Cluster" setup.
> Additionally, because of known issues with TMs talking to JMs, I have set
> taskmanager.network.bind-policy to "ip", so that the task manager binds to
> the IP of the pod rather than the pod name (which is not resolvable via
> DNS).  So far so good.
>  
> Once the cluster is started, I can see the job running.  I also see that the
> JM's resource manager has registered the TM.
> {code:java}
> 2019-10-05 20:37:14.554 [flink-akka.actor.default-dispatcher-4] DEBUG 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Slot Pool Status:
>         status: connected to 
> akka.tcp://flink@data-capture-enrichedtrans-raw-jobcluster:6123/user/resourcemanager
>         registered TaskManagers: [f34656491b8dfae726d992d276dc6d39]
>         available slots: []
>         allocated slots: [[AllocatedSlot a00f44d19f38ca36da3ae5083c2d02ae @ 
> f34656491b8dfae726d992d276dc6d39 @ 
> data-capture-enrichedtrans-raw-taskmanager-674476f57c-26kxr (dataPort=35815) 
> - 0]]
>         pending requests: []
>         }
> {code}
> However, I see several errors like the one below before the job eventually
> fails (after about 5 minutes) and goes into recovery.  This repeats until
> all restarts are exhausted, at which point the cluster fails completely.
> {code:java}
> 2019-10-05 20:42:14.768 [flink-akka.actor.default-dispatcher-19] WARN  
> akka.remote.ReliableDeliverySupervisor 
> flink-akka.remote.default-remote-dispatcher-6 - Association with remote 
> system [akka.tcp://flink@10.107.38.92:50100] has failed, address is now gated 
> for [50] ms. Reason: [Association failed with 
> [akka.tcp://flink@10.107.38.92:50100]] Caused by: [java.net.ConnectException: 
> Connection refused: /10.107.38.92:50100]
> {code}
> To me it looks like the JM is not able to make a connection on the RPC port
> of the taskmanager (50100 is the taskmanager.rpc.port setting, and
> 10.107.38.92 is the IP address of the task manager pod as seen by "kubectl
> describe pod").
> Has anyone come across this issue?
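
One way to narrow down a "Connection refused" like the one in the log excerpt above is to test plain TCP reachability of the TaskManager RPC port from inside the JobManager pod. The following is only a minimal sketch: it assumes a Python 3 interpreter is available in the container (which may not be the case for a given Flink image), and `can_connect` is a hypothetical helper written for illustration, not part of Flink.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only checks that something
    accepts TCP connections on the port, not that it speaks Akka RPC.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For the values reported in this issue, the call would be `can_connect("10.107.38.92", 50100)`. A result of `False` from the JobManager pod would be consistent with the error in the log, but it would not by itself say whether the TaskManager is listening on a different address or port, or whether something in the cluster network is blocking the connection.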



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
