Re: Native kubernetes setup failed to start job

Yang Wang Mon, 02 Nov 2020 01:42:16 -0800

Hi Liangde Chen,

Thanks for providing the logs. After checking them, I am afraid that
something is wrong with your K8s cluster, since
detection-engine-dev-taskmanager-1-2 started and registered with the
JobManager successfully.


I suggest finding out which K8s node detection-engine-dev-taskmanager-1-1 is
running on and disabling scheduling on that node. Then restart the Flink K8s
session and try again.
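
The node lookup and cordon step might look like the sketch below. It
assumes kubectl is already configured for the cluster and that your account
is allowed to cordon nodes, which may not be the case on a cluster managed by
another team. The function name cordon_node_of_pod is only illustrative.

```shell
#!/bin/sh
# Sketch: find the node hosting a pod and mark that node unschedulable.
# cordon_node_of_pod is a hypothetical helper name, not part of Flink or kubectl.
cordon_node_of_pod() {
    ns="$1"
    pod="$2"
    # .spec.nodeName holds the node the pod was scheduled onto.
    node="$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeName}')"
    if [ -z "$node" ]; then
        echo "could not determine node for pod $pod" >&2
        return 1
    fi
    echo "cordoning node $node"
    # cordon only blocks new pods from being scheduled; running pods stay put.
    kubectl cordon "$node"
}

# Usage, with the names from this thread:
# cordon_node_of_pod team-anti-cheat detection-engine-dev-taskmanager-1-1
```

Once the test is done, `kubectl uncordon <node>` makes the node schedulable
again.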

Best,
Yang

Chen Liangde <lian...@gmail.com> wrote on Mon, Nov 2, 2020 at 3:55 PM:

> Please find attached logs.
>
> The Kubernetes cluster is an AWS EKS cluster, but it is managed by our
> infra team.
> I created a service account "flink" for it, and it has permission to
> create, list, and delete pods, along with some other types of resources, in
> the "team-anti-cheat" namespace.
>
> The command below was used to create the Flink cluster:
> ./bin/kubernetes-session.sh \
>         -Dexecution.attached=true \
>         -Dkubernetes.cluster-id=detection-engine-dev \
>         -Dkubernetes.namespace=team-anti-cheat \
>         -Dkubernetes.container-start-command-template="%java% %classpath% %jvmmem% %jvmopts% %logging% %class% %args%" \
>         -Dkubernetes.jobmanager.service-account=flink
>
> Thanks
> Liangde Chen
>
>
> On Mon, 2 Nov 2020 at 08:20, Yang Wang <danrtsey...@gmail.com> wrote:
>
>> Could you share the JobManager logs so that we can check whether it
>> received the registration from the TaskManager?
>>
>> In a non-HA Flink cluster, the TaskManager uses the service to talk to
>> the JobManager. Currently, Flink creates a headless service for the
>> JobManager; you can find it with `kubectl get svc`. You can then start a
>> busybox pod to check the network connectivity.
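
A minimal sketch of that connectivity check. It assumes the JobManager
service is named after the cluster-id and listens on RPC port 6123 (both
appear in the TaskManager log in this thread), and that the nc shipped in the
busybox image is good enough for a TCP probe. The function name
jm_connectivity_check and the pod name jm-net-check are illustrative only.

```shell
#!/bin/sh
# Sketch: list services, then probe the JobManager RPC port from a throwaway pod.
# jm_connectivity_check is a hypothetical helper name.
jm_connectivity_check() {
    ns="$1"
    svc="$2"
    port="${3:-6123}"   # assumed default, taken from the logs in this thread
    # Confirm that the JobManager service exists and which ports it exposes.
    kubectl get svc -n "$ns" || return 1
    # Temporary busybox pod: resolve the service name, then attempt a TCP
    # connection. nc exit codes can differ between BusyBox builds, so the
    # raw status is printed rather than interpreted.
    kubectl run jm-net-check --rm -i --restart=Never --image=busybox -n "$ns" \
        -- sh -c "nslookup $svc; nc -w 5 $svc $port </dev/null; echo nc-exit=\$?"
}

# Usage, with the names from this thread:
# jm_connectivity_check team-anti-cheat detection-engine-dev
```

If the name does not resolve or the connection is refused or reset from the
probe pod as well, the problem is more likely in the cluster network than in
Flink itself.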
>>
>> It would also help if you could share more information about the
>> environment. I could not reproduce your issue in a typical K8s cluster.
>>
>> Best,
>> Yang
>>
>> Yun Gao <yungao...@aliyun.com> wrote on Fri, Oct 30, 2020 at 11:53 AM:
>>
>>> Hi Liangde,
>>>
>>>    I am pulling in Yang Wang, who is the expert on Flink on K8s.
>>>
>>> Best,
>>>  Yun
>>>
>>> ------------------Original Mail ------------------
>>> *Sender:*Chen Liangde <lian...@gmail.com>
>>> *Send Date:*Fri Oct 30 05:30:40 2020
>>> *Recipients:*Flink ML <user@flink.apache.org>
>>> *Subject:*Native kubernetes setup failed to start job
>>>
>>>> I created a flink cluster in kubernetes following this guide:
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html
>>>>
>>>> The job manager was running. When a job was submitted to the job
>>>> manager, it spawned a task manager pod, but the task manager failed to
>>>> connect to the job manager. And in the job manager web ui I can't find the
>>>> task manager.
>>>>
>>>> This error is suspicious:
>>>> org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException:
>>>> Adjusted frame length exceeds 10485760: 352518404 - discarded
>>>>
>>>> 2020-10-29 13:22:51,069 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Connecting to ResourceManager akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000).
>>>> 2020-10-29 13:22:51,176 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with java.io.IOException: Connection reset by peer
>>>> 2020-10-29 13:22:51,176 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 352518404 - discarded
>>>> 2020-10-29 13:22:51,180 WARN  akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123]] Caused by: [The remote system explicitly disassociated (reason unknown).]
>>>> 2020-10-29 13:22:51,183 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*.
>>>> 2020-10-29 13:23:01,203 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with java.io.IOException: Connection reset by peer
>>>>
>>>>
