Could you share the JobManager logs so that we can check whether it received the registration from the TaskManager?
In a non-HA Flink cluster, the TaskManager talks to the JobManager through a Kubernetes service. Currently, Flink creates a headless service for the JobManager. You can use `kubectl get svc` to find it, and then start a busybox pod to check the network connectivity to that service.

Could you also share more information about your environment? I could not reproduce your issue in a typical K8s cluster.

Best,
Yang

Yun Gao <yungao...@aliyun.com> wrote on Fri, Oct 30, 2020 at 11:53 AM:

> Hi Liangde,
>
> I pull in Yang Wang who is the expert for Flink on K8s.
>
> Best,
> Yun
>
> ------------------Original Mail ------------------
> *Sender:*Chen Liangde <lian...@gmail.com>
> *Send Date:*Fri Oct 30 05:30:40 2020
> *Recipients:*Flink ML <user@flink.apache.org>
> *Subject:*Native kubernetes setup failed to start job
>
>> I created a flink cluster in kubernetes following this guide:
>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html
>>
>> The job manager was running. When a job was submitted to the job manager,
>> it spawned a task manager pod, but the task manager failed to connect to
>> the job manager. And in the job manager web ui I can't find the task
>> manager.
>>
>> This error is suspicious:
>> org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException:
>> Adjusted frame length exceeds 10485760: 352518404 - discarded
>>
>> 2020-10-29 13:22:51,069 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Connecting to ResourceManager akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000).
>> 2020-10-29 13:22:51,176 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with java.io.IOException: Connection reset by peer
>> 2020-10-29 13:22:51,176 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 352518404 - discarded
>> 2020-10-29 13:22:51,180 WARN  akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123]] Caused by: [The remote system explicitly disassociated (reason unknown).]
>> 2020-10-29 13:22:51,183 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*.
>> 2020-10-29 13:23:01,203 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with java.io.IOException: Connection reset by peer
>>
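
For the connectivity check suggested at the top of this reply, here is a rough sketch of a TCP probe. It is written in Python rather than run from busybox, so it assumes a pod in the same namespace whose image has Python 3; that is my substitution, not something Flink provides. The host name and port are copied from the TaskManager log quoted above, and the function name `can_connect` is only illustrative. It checks DNS resolution and TCP reachability only, so a successful result does not prove that the Akka RPC handshake works.

```python
import socket

# Host and port as they appear in the TaskManager log quoted above.
JOBMANAGER_HOST = "detection-engine-dev.team-anti-cheat"
JOBMANAGER_RPC_PORT = 6123


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and connects in one step.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


if __name__ == "__main__":
    ok = can_connect(JOBMANAGER_HOST, JOBMANAGER_RPC_PORT)
    print(f"{JOBMANAGER_HOST}:{JOBMANAGER_RPC_PORT} reachable: {ok}")
```

If this reports that the address is not reachable, the problem is in name resolution or the service itself; if it is reachable, the JobManager logs are the next place to look.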