Yun Tang created FLINK-36965:
--------------------------------

             Summary: Enable to allow re-create the pod watch with many retries 
on k8s cluster failure
                 Key: FLINK-36965
                 URL: https://issues.apache.org/jira/browse/FLINK-36965
             Project: Flink
          Issue Type: Improvement
          Components: Deployment / Kubernetes
    Affects Versions: 1.20.0
            Reporter: Yun Tang


FLINK-33728 introduced a backoff strategy for creating the pod watch. With it, 
we can set {{kubernetes.transactional-operation.max-retries}} to a very large 
value so that Flink tolerates a long k8s cluster downtime. 
However, two problems remain:
1. If we set {{kubernetes.transactional-operation.max-retries}} to {{100}} or 
more so that the JobMaster tolerates more than one hour of k8s cluster 
downtime without crashing, {{FlinkKubeClient#checkAndUpdateConfigMap}} will 
also retry for much longer, which is unnecessary.
2. Moreover, creating the pod watch is not a transactional operation, so the 
names of the current config options 
{{kubernetes.transactional-operation.initial-retry-delay}} and 
{{kubernetes.transactional-operation.max-retry-delay}} are misleading.
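
To make the first point concrete, here is a rough sketch of how long a capped 
exponential backoff keeps retrying. The 50 ms initial delay, the 1 min cap and 
the doubling factor are assumptions chosen for illustration and are not stated 
in this issue; {{total_backoff_seconds}} is a hypothetical helper, not part of 
Flink.

```python
def total_backoff_seconds(max_retries, initial_delay_s=0.05,
                          max_delay_s=60.0, factor=2.0):
    """Sum the waits of a capped exponential backoff over max_retries retries.

    All default values are illustrative assumptions, not confirmed Flink defaults.
    """
    total = 0.0
    delay = initial_delay_s
    for _ in range(max_retries):
        total += delay
        # Grow the delay geometrically, but never beyond the cap.
        delay = min(delay * factor, max_delay_s)
    return total


if __name__ == "__main__":
    minutes = total_backoff_seconds(100) / 60
    print(f"100 retries wait for about {minutes:.1f} minutes in total")
```

Under these assumptions, 100 retries add up to well over an hour of waiting, 
and any other operation sharing the same {{max-retries}} setting inherits the 
same budget.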

Thus, I think we should introduce a new 
{{kubernetes.watch-operation.max-retries}} option, together with 
{{kubernetes.watch-operation.initial-retry-delay}} and 
{{kubernetes.watch-operation.max-retry-delay}}, and deprecate the previous two 
options.
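
The idea of giving the pod watch its own retry budget could look roughly like 
the following sketch. It is not Flink code: {{RetryPolicy}}, {{retry}} and all 
numeric values are hypothetical and only illustrate why separate settings keep 
ConfigMap updates short while the watch keeps retrying for a long time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical retry settings for one kind of operation."""
    max_retries: int
    initial_delay_s: float
    max_delay_s: float

    def delays(self):
        """Yield the wait before each retry, doubling up to the cap."""
        delay = self.initial_delay_s
        for _ in range(self.max_retries):
            yield delay
            delay = min(delay * 2, self.max_delay_s)


# Illustrative values only: a short budget for transactional ConfigMap
# updates and a long one for re-creating the pod watch.
TRANSACTIONAL = RetryPolicy(max_retries=15, initial_delay_s=0.05, max_delay_s=60.0)
WATCH = RetryPolicy(max_retries=100, initial_delay_s=0.05, max_delay_s=60.0)


def retry(operation, policy, sleep):
    """Call operation until it succeeds or the policy's retries run out."""
    last_error = None
    # The first attempt runs immediately; each retry waits for its delay first.
    for delay in [0.0, *policy.delays()]:
        sleep(delay)
        try:
            return operation()
        except Exception as err:
            last_error = err
    raise last_error
```

A caller would pass {{time.sleep}} as {{sleep}} in production and a recording 
stub in tests, and would pick {{WATCH}} or {{TRANSACTIONAL}} depending on the 
operation.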




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
