Re: Native kubernetes setup failed to start job

Chen Liangde Sun, 01 Nov 2020 23:56:32 -0800

Please find the logs attached.

The Kubernetes cluster is an AWS EKS cluster, but it is managed by our
infra team.
I created a service account "flink" for it, with permission to create,
list, and delete pods, along with some other resource types, in the
"team-anti-cheat" namespace.


The command below was used to create the Flink cluster:
./bin/kubernetes-session.sh \
        -Dexecution.attached=true \
        -Dkubernetes.cluster-id=detection-engine-dev \
        -Dkubernetes.namespace=team-anti-cheat \
        -Dkubernetes.container-start-command-template="%java% %classpath% %jvmmem% %jvmopts% %logging% %class% %args%" \
        -Dkubernetes.jobmanager.service-account=flink

Thanks
Liangde Chen


On Mon, 2 Nov 2020 at 08:20, Yang Wang <danrtsey...@gmail.com> wrote:

> Could you share the JobManager logs so that we can check whether it
> received the registration from the TaskManager?
>
> In a non-HA Flink cluster, the TaskManager uses the service to talk to
> the JobManager. Currently, Flink creates a headless service for the
> JobManager. You can find it with `kubectl get svc`, and then start a
> busybox pod to check the network connectivity.
>
> Could you also share more information about the environment? I could
> not reproduce your issue in a typical K8s cluster.
>
> Best,
> Yang
>
> On Fri, Oct 30, 2020 at 11:53 AM, Yun Gao <yungao...@aliyun.com> wrote:
>
>> Hi Liangde,
>>
>>    I'm pulling in Yang Wang, who is the expert on Flink on K8s.
>>
>> Best,
>>  Yun
>>
>> ------------------Original Mail ------------------
>> *Sender:*Chen Liangde <lian...@gmail.com>
>> *Send Date:*Fri Oct 30 05:30:40 2020
>> *Recipients:*Flink ML <user@flink.apache.org>
>> *Subject:*Native kubernetes setup failed to start job
>>
>>> I created a flink cluster in kubernetes following this guide:
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html
>>>
>>> The job manager was running. When a job was submitted to the job
>>> manager, it spawned a task manager pod, but the task manager failed to
>>> connect to the job manager, and the task manager does not appear in the
>>> job manager web UI.
>>>
>>> This error is suspicious:
>>> org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 352518404 - discarded
>>>
>>> 2020-10-29 13:22:51,069 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Connecting to ResourceManager akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000).
>>> 2020-10-29 13:22:51,176 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with java.io.IOException: Connection reset by peer
>>> 2020-10-29 13:22:51,176 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 352518404 - discarded
>>> 2020-10-29 13:22:51,180 WARN  akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123]] Caused by: [The remote system explicitly disassociated (reason unknown).]
>>> 2020-10-29 13:22:51,183 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://fl...@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*.
>>> 2020-10-29 13:23:01,203 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with java.io.IOException: Connection reset by peer
>>>
>>>
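
A side note on the TooLongFrameException in the log above: the rejected length, 352518404, can be decoded back into the bytes that were read off the wire. The sketch below assumes the reported value is a 4-byte big-endian length prefix plus the 4 bytes of the prefix itself. That offset is an assumption about how the frame decoder adjusts the length, and the first three bytes come out the same either way. The variable names are illustrative only.

```shell
#!/bin/sh
# Value reported in "Adjusted frame length exceeds 10485760: 352518404".
adjusted=352518404

# Assumption: the decoder adds the 4-byte length field to the raw length,
# so subtract it to recover the first four bytes read from the socket.
length_field_bytes=4
raw=$((adjusted - length_field_bytes))

# Print the raw value as eight hex digits (four bytes, big-endian).
hex=$(printf '%08x' "$raw")
echo "$hex"
```

The result, 15 03 01 00, has the shape of a TLS record header: content type 0x15 (alert) followed by record version 0x0301. If that reading is right, something on port 6123 may be answering the plaintext Akka connection with TLS, for example a proxy or sidecar in front of the JobManager, or an SSL setting that differs between the two sides. This is only a hypothesis; the JobManager log would be needed to confirm it.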

NAME                        TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)             AGE
detection-engine-dev        ClusterIP      None             <none>                                                                    6123/TCP,6124/TCP   5m28s
detection-engine-dev-rest   LoadBalancer   172.20.210.124   a375eab7dc75f42fc9935d5940107811-1454167660.us-east-1.elb.amazonaws.com   8081:32256/TCP      5m28s

jobmanager.log
Description: Binary data

taskmanager.log
Description: Binary data
