Hello everybody!
We're using Flink 1.14 and Kubernetes Operator 1.2.0. The pod template adds an 
haproxy sidecar container to load-balance checkpoint persistence to S3 storage.
Occasionally this haproxy sidecar exits, and Flink restarts the entire 
TaskManager pod and the running job.
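
For context, the pod template looks roughly like the sketch below. The haproxy 
image tag, config volume, and ConfigMap name are illustrative assumptions, not 
our exact manifest; only the `flink-main-container` name is the fixed Flink 
convention for merging the template with the generated TaskManager spec.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: taskmanager-pod-template
spec:
  containers:
    # Flink merges its TaskManager settings into the container with this name.
    - name: flink-main-container
    # Sidecar that load-balances checkpoint traffic to S3 (details assumed).
    - name: haproxy
      image: haproxy:2.6
      volumeMounts:
        - name: haproxy-config
          mountPath: /usr/local/etc/haproxy
  volumes:
    - name: haproxy-config
      configMap:
        name: haproxy-config
```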

2023-03-20 04:59:59,526 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Worker my-job-taskmanager-5-15 is terminated. Diagnostics: Pod terminated, 
container termination statuses: [haproxy(exitCode=139, reason=Error, 
message=null)]
2023-03-20 04:59:59,526 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Closing TaskExecutor connection my-job-taskmanager-5-15 because: Pod 
terminated, container termination statuses: [haproxy(exitCode=139, 
reason=Error, message=null)]
2023-03-20 04:59:59,527 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Requesting new worker with resource spec WorkerResourceSpec {cpuCores=10.0, 
taskHeapSize=5.400gb (5798205768 bytes), taskOffHeapSize=1024.000mb (1073741824 
bytes), networkMemSize=1024.000mb (1073741824 bytes), managedMemSize=5.100gb 
(5476083384 bytes), numSlots=3}, current pending count: 1.
2023-03-20 04:59:59,527 INFO  
org.apache.flink.runtime.externalresource.ExternalResourceUtils [] - Enabled 
external resources: []
2023-03-20 04:59:59,529 INFO  org.apache.flink.configuration.Configuration      
           [] - Config uses fallback configuration key 
'kubernetes.service-account' instead of key 
'kubernetes.taskmanager.service-account'
2023-03-20 04:59:59,529 INFO  org.apache.flink.configuration.Configuration      
           [] - Config uses fallback configuration key 
'kubernetes.service-account' instead of key 
'kubernetes.taskmanager.service-account'
2023-03-20 04:59:59,529 INFO  org.apache.flink.kubernetes.utils.KubernetesUtils 
           [] - The service account configured in pod template will be 
overwritten to 'flink' because of explicitly configured options.
2023-03-20 04:59:59,531 INFO  
org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Creating new 
TaskManager pod with name my-job-taskmanager-5-45 and resource <14336,10.0>.
2023-03-20 04:59:59,560 WARN  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Discard registration from TaskExecutor my-job-taskmanager-5-15 at 
(akka.tcp://flink@10.68.15.205:6122/user/rpc/taskmanager_0) because the 
framework did not recognize it
2023-03-20 04:59:59,607 INFO  
org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Pod 
my-job-taskmanager-5-45 is created.
2023-03-20 04:59:59,617 INFO  
org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Received new 
TaskManager pod: my-job-taskmanager-5-45
2023-03-20 04:59:59,617 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Requested worker my-job-taskmanager-5-45 with resource spec WorkerResourceSpec 
{cpuCores=10.0, taskHeapSize=5.400gb (5798205768 bytes), 
taskOffHeapSize=1024.000mb (1073741824 bytes), networkMemSize=1024.000mb 
(1073741824 bytes), managedMemSize=5.100gb (5476083384 bytes), numSlots=3}.

Is there a way to specify that only the flink-main-container's status should be 
monitored, so that Flink does not react to sidecar crashes?
