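The TM logs cannot be found, because the pod dies before the TM has even started.
I will check whether this is the cause. We have indeed not added the -Dkubernetes.taskmanager.service-account parameter so far.
Is the -Dkubernetes.taskmanager.service-account parameter supposed to be added when starting the session cluster with ./bin/kubernetes-session.sh?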
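
For reference, a minimal sketch of how the flag might be passed, assuming it is supplied as a -D dynamic property alongside the existing -Dkubernetes.jobmanager.service-account option in the session command quoted below. The helper name session_args is hypothetical, only a subset of the original options is shown, and the command is printed rather than executed.

```shell
# Dry-run sketch: print the session options one per line instead of launching.
# session_args is an illustrative helper name, not part of Flink.
session_args() {
  printf '%s\n' \
    -Dkubernetes.cluster-id=realtime-monitor \
    -Dkubernetes.namespace=monitor \
    -Dkubernetes.jobmanager.service-account=wuzhiheng \
    -Dkubernetes.taskmanager.service-account=wuzhiheng
}

# Show the full command; drop the leading "echo" to actually launch it.
echo ./bin/kubernetes-session.sh $(session_args)
```

The same key could presumably also be set in the Flink configuration file instead of on the command line; that is an assumption and has not been verified in this thread.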

On 2022/8/31 at 4:10 PM, "Yang Wang" <danrtsey...@gmail.com> wrote:

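    My guess is that you have not set a service account for the TM, so the TM has no permission to read the leader information from the K8s ConfigMap and therefore cannot register with the RM and JM.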

    -Dkubernetes.taskmanager.service-account=wuzhiheng \


    Best,
    Yang

    Xuyang <xyzhong...@163.com> wrote on Tue, Aug 30, 2022 at 23:22:

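    > Hi, could you post the TM logs? Judging from the WARN log, it looks like the TM keeps failing to start.
    > On 2022-08-30 03:45:43, "Wu,Zhiheng" <wuzhih...@baidu.com> wrote:
    > >[Problem description]
    > >After enabling the HA configuration, the taskmanager pods keep cycling through create-stop-create, and the job cannot start.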
    > >
    > >1.     Job configuration and startup procedure
    > >
    > >a)      Modify the conf/flink.yaml configuration file to add the HA configuration
    > >kubernetes.cluster-id: realtime-monitor
    > >high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    > >high-availability.storageDir: file:///opt/flink/checkpoint/recovery/monitor            // This is an NFS path, mounted into the pod via a PVC
    > >
    > >b)      First, create a stateless deployment with the following command to set up a session cluster
    > >
    > >./bin/kubernetes-session.sh \
    > >
    > >-Dkubernetes.secrets=cdn-res-bd-keystore:/opt/flink/kafka/res/keystore/bd,cdn-res-bd-truststore:/opt/flink/kafka/res/truststore/bd,cdn-res-bj-keystore://opt/flink/kafka/res/keystore/bj,cdn-res-bj-truststore:/opt/flink/kafka/res/truststore/bj \
    > >
    > >-Dkubernetes.pod-template-file=./conf/pod-template.yaml \
    > >
    > >-Dkubernetes.cluster-id=realtime-monitor \
    > >
    > >-Dkubernetes.jobmanager.service-account=wuzhiheng \
    > >
    > >-Dkubernetes.namespace=monitor \
    > >
    > >-Dtaskmanager.numberOfTaskSlots=6 \
    > >
    > >-Dtaskmanager.memory.process.size=8192m \
    > >
    > >-Djobmanager.memory.process.size=2048m
    > >
    > >c)      Finally, submit a jar job through the web UI; the jobmanager shows the following logs
    > >
    > >2022-08-29 23:49:04,150 INFO
    > org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Pod
    > realtime-monitor-taskmanager-1-13 is created.
    > >
    > >2022-08-29 23:49:04,152 INFO
    > org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Pod
    > realtime-monitor-taskmanager-1-12 is created.
    > >
    > >2022-08-29 23:49:04,161 INFO
    > org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Received
    > new TaskManager pod: realtime-monitor-taskmanager-1-12
    > >
    > >2022-08-29 23:49:04,162 INFO
    > org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
    > Requested worker realtime-monitor-taskmanager-1-12 with resource spec
    > WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes),
    > taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes),
    > managedMemSize=0 bytes, numSlots=6}.
    > >
    > >2022-08-29 23:49:04,162 INFO
    > org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Received
    > new TaskManager pod: realtime-monitor-taskmanager-1-13
    > >
    > >2022-08-29 23:49:04,162 INFO
    > org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
    > Requested worker realtime-monitor-taskmanager-1-13 with resource spec
    > WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes),
    > taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes),
    > managedMemSize=0 bytes, numSlots=6}.
    > >
    > >2022-08-29 23:49:07,176 WARN
    > org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
    > Reaching max start worker failure rate: 12 events detected in the recent
    > interval, reaching the threshold 10.000000.
    > >
    > >2022-08-29 23:49:07,176 INFO
    > org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
    > Will not retry creating worker in 3000 ms.
    > >
    > >2022-08-29 23:49:07,176 INFO
    > org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
    > Worker realtime-monitor-taskmanager-1-12 with resource spec
    > WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes),
    > taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes),
    > managedMemSize=0 bytes, numSlots=6} was requested in current attempt and
    > has not registered. Current pending count after removing: 1.
    > >
    > >2022-08-29 23:49:07,176 INFO
    > org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
    > Worker realtime-monitor-taskmanager-1-12 is terminated. Diagnostics: Pod
    > terminated, container termination statuses:
    > [flink-main-container(exitCode=1, reason=Error, message=null)], pod status:
    > Failed(reason=null, message=null)
    > >
    > >2022-08-29 23:49:07,176 INFO
    > org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
    > Requesting new worker with resource spec WorkerResourceSpec {cpuCores=6.0,
    > taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes,
    > networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes,
    > numSlots=6}, current pending count: 2.
    > >
    > >2022-08-29 23:49:07,514 WARN
    > org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
    > Reaching max start worker failure rate: 13 events detected in the recent
    > interval, reaching the threshold 10.000000.
    > >
    > >2022-08-29 23:49:07,514 INFO
    > org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
    > Worker realtime-monitor-taskmanager-1-13 with resource spec
    > WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes),
    > taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes),
    > managedMemSize=0 bytes, numSlots=6} was requested in current attempt and
    > has not registered. Current pending count after removing: 1.
    > >
    > >2022-08-29 23:49:07,514 INFO
    > org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
    > Worker realtime-monitor-taskmanager-1-13 is terminated. Diagnostics: Pod
    > terminated, container termination statuses:
    > [flink-main-container(exitCode=1, reason=Error, message=null)], pod status:
    > Failed(reason=null, message=null)
    > >
    > >2022-08-29 23:49:07,515 INFO
    > org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
    > Requesting new worker with resource spec WorkerResourceSpec {cpuCores=6.0,
    > taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes,
    > networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes,
    > numSlots=6}, current pending count: 2.
    > >
    > >
    > >
    > >2022-08-29 23:49:10,190 INFO
    > org.apache.flink.runtime.externalresource.ExternalResourceUtils [] -
    > Enabled external resources: []
    > >
    > >2022-08-29 23:49:10,192 INFO
    > org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Creating
    > new TaskManager pod with name realtime-monitor-taskmanager-1-14 and
    > resource <8192,6.0>.
    > >
    > >2022-08-29 23:49:10,192 INFO
    > org.apache.flink.runtime.externalresource.ExternalResourceUtils [] -
    > Enabled external resources: []
    > >
    > >2022-08-29 23:49:10,194 INFO
    > org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Creating
    > new TaskManager pod with name realtime-monitor-taskmanager-1-15 and
    > resource <8192,6.0>.
    > >
    > >2022-08-29 23:49:10,214 INFO
    > org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Pod
    > realtime-monitor-taskmanager-1-15 is created.
    > >
    > >2022-08-29 23:49:10,215 INFO
    > org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Pod
    > realtime-monitor-taskmanager-1-14 is created.
    > >
    > >2022-08-29 23:49:10,237 INFO
    > org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Received
    > new TaskManager pod: realtime-monitor-taskmanager-1-14
    > >
    > >2022-08-29 23:49:10,238 INFO
    > org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
    > Requested worker realtime-monitor-taskmanager-1-14 with resource spec
    > WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes),
    > taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes),
    > managedMemSize=0 bytes, numSlots=6}
    > >
    > >2022-08-29 23:49:10,238 INFO
    > org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Received
    > new TaskManager pod: realtime-monitor-taskmanager-1-15
    > >
    > >2022-08-29 23:49:10,238 INFO
    > org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
    > Requested worker realtime-monitor-taskmanager-1-15 with resource spec
    > WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes),
    > taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes),
    > managedMemSize=0 bytes, numSlots=6}.
    > >
    > >2022-08-29 23:49:13,239 INFO
    > org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
    > Will not retry creating worker in 3000 ms.
    > >
    > >2022-08-29 23:49:13,239 INFO
    > org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
    > Worker realtime-monitor-taskmanager-1-14 with resource spec
    > WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes),
    > taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes),
    > managedMemSize=0 bytes, numSlots=6} was requested in current attempt and
    > has not registered. Current pending count after removing: 1.
    > >
    > >2022-08-29 23:49:13,239 INFO
    > org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
    > Worker realtime-monitor-taskmanager-1-14 is terminated. Diagnostics: Pod
    > terminated, container termination statuses:
    > [flink-main-container(exitCode=1, reason=Error, message=null)], pod status:
    > Failed(reason=null, message=null)
    > >
    > >2022-08-29 23:49:13,239 INFO
    > org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
    > Requesting new worker with resource spec WorkerResourceSpec {cpuCores=6.0,
    > taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes,
    > networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes,
    > numSlots=6}, current pending count: 2.
    > >
    > >2.     Without the HA configuration there is no problem. I tried both flink 1.13.6 and 1.14.5, and both show this problem when HA is enabled.
    > >
    > >3.     The problem looks similar to:
    > https://www.mail-archive.com/user-zh@flink.apache.org/msg11942.html
    > >
    > >Could you tell me where the problem might be? Has anyone run into this before?
    >
