Please find the logs attached. The Kubernetes cluster is an AWS EKS cluster, but it is managed by our infra team. I created a service account "flink" for it, which has permission to create, list, and delete pods, along with some other resource types, in the "team-anti-cheat" namespace.
The following command was used to create the Flink cluster:

./bin/kubernetes-session.sh \
  -Dexecution.attached=true \
  -Dkubernetes.cluster-id=detection-engine-dev \
  -Dkubernetes.namespace=team-anti-cheat \
  -Dkubernetes.container-start-command-template="%java% %classpath% %jvmmem% %jvmopts% %logging% %class% %args%" \
  -Dkubernetes.jobmanager.service-account=flink

Thanks
Liangde Chen

On Mon, 2 Nov 2020 at 08:20, Yang Wang <danrtsey...@gmail.com> wrote:

> Could you share the JobManager logs so that we could check whether it
> received the registration from the TaskManager?
>
> In a non-HA Flink cluster, the TaskManager uses the service to talk to
> the JobManager. Currently, Flink creates a headless service for the
> JobManager. You could use `kubectl get svc` to find it, and then start a
> busybox to check the network connectivity.
>
> Maybe you could also share more information about the environment. I could
> not reproduce your issue in a typical K8s cluster.
>
> Best,
> Yang
>
> Yun Gao <yungao...@aliyun.com> wrote on Fri, 30 Oct 2020 at 11:53 AM:
>
>> Hi Liangde,
>>
>> I'm pulling in Yang Wang, who is the expert on Flink on K8s.
>>
>> Best,
>> Yun
>>
>> ------------------Original Mail ------------------
>> *Sender:*Chen Liangde <lian...@gmail.com>
>> *Send Date:*Fri Oct 30 05:30:40 2020
>> *Recipients:*Flink ML <user@flink.apache.org>
>> *Subject:*Native kubernetes setup failed to start job
>>
>>> I created a flink cluster in kubernetes following this guide:
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html
>>>
>>> The job manager was running. When a job was submitted to the job
>>> manager, it spawned a task manager pod, but the task manager failed to
>>> connect to the job manager. And in the job manager web ui I can't find the
>>> task manager.
>>>
>>> This error is suspicious:
>>> org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException:
>>> Adjusted frame length exceeds 10485760: 352518404 - discarded
>>>
>>> 2020-10-29 13:22:51,069 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Connecting to ResourceManager akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000).
>>> 2020-10-29 13:22:51,176 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with java.io.IOException: Connection reset by peer
>>> 2020-10-29 13:22:51,176 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 352518404 - discarded
>>> 2020-10-29 13:22:51,180 WARN  akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123]] Caused by: [The remote system explicitly disassociated (reason unknown).]
>>> 2020-10-29 13:22:51,183 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*.
>>> 2020-10-29 13:23:01,203 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with java.io.IOException: Connection reset by peer
>>>
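One way to read the TooLongFrameException in the TaskManager log above is to look at the bytes behind the number. The sketch below assumes the reported "adjusted" length is a 4-byte big-endian length prefix plus the 4 bytes of the prefix itself, which is how Netty's LengthFieldBasedFrameDecoder is commonly set up for Akka remoting; the logs do not confirm this. The function names are hypothetical and only serve the illustration.

```python
import struct

# Assumption: the frame starts with a 4-byte length prefix, and the
# "adjusted" length in the exception includes those 4 bytes.
LENGTH_FIELD_SIZE = 4


def leading_bytes(adjusted_frame_length: int) -> bytes:
    """Return the 4 bytes that the decoder interpreted as a length prefix."""
    raw_length = adjusted_frame_length - LENGTH_FIELD_SIZE
    return struct.pack(">I", raw_length)  # big-endian unsigned 32-bit integer


def as_hex(data: bytes) -> str:
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)


# Value taken from the TaskManager log in this thread.
print(as_hex(leading_bytes(352518404)))  # prints "15 03 01 00"
```

Under that assumption, the first four bytes the TaskManager received were 15 03 01 00. That matches the start of a TLS alert record (content type 0x15, protocol version 0x0301), so it may be worth checking whether something in front of port 6123, for example a proxy or a service-mesh sidecar, is answering with TLS while Flink expects plain Akka traffic. This is only a hypothesis to verify against the environment.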
NAME                        TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)             AGE
detection-engine-dev        ClusterIP      None             <none>                                                                    6123/TCP,6124/TCP   5m28s
detection-engine-dev-rest   LoadBalancer   172.20.210.124   a375eab7dc75f42fc9935d5940107811-1454167660.us-east-1.elb.amazonaws.com   8081:32256/TCP      5m28s
jobmanager.log
Description: Binary data
taskmanager.log
Description: Binary data