Yun Tang created FLINK-36965:
--------------------------------

             Summary: Enable to allow re-create the pod watch with many retries on k8s cluster failure
                 Key: FLINK-36965
                 URL: https://issues.apache.org/jira/browse/FLINK-36965
             Project: Flink
          Issue Type: Improvement
          Components: Deployment / Kubernetes
    Affects Versions: 1.20.0
            Reporter: Yun Tang

FLINK-33728 introduced a backoff strategy for creating the pod watch. With it, we can set {{kubernetes.transactional-operation.max-retries}} to a very large value so that the job tolerates a long k8s cluster downtime.

However, two problems remain:

1. If we set {{kubernetes.transactional-operation.max-retries}} to {{100}}+ times, hoping that the JobMaster survives more than one hour of k8s cluster downtime without crashing, the same setting also makes {{FlinkKubeClient#checkAndUpdateConfigMap}} retry for much longer, which is unnecessary.
2. Moreover, creating the pod watch is not a transactional operation, so reusing the current config options {{kubernetes.transactional-operation.initial-retry-delay}} and {{kubernetes.transactional-operation.max-retry-delay}} for it is misleading.

Thus, I think we should introduce a new {{kubernetes.watch-operation.max-retries}} option, together with {{kubernetes.watch-operation.initial-retry-delay}} and {{kubernetes.watch-operation.max-retry-delay}}, to deprecate the previous two options.
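
To make the "more than one hour" figure concrete, here is a minimal sketch of how the total retry window grows with the retry count under a capped exponential backoff. It assumes the delay doubles after each attempt and uses an initial delay of 50 ms and a cap of 1 minute; these values, the doubling factor, and the names {{WatchRetryBudget}} and {{totalBackoff}} are illustrative assumptions, not taken from the Flink code base.

```java
import java.time.Duration;

public class WatchRetryBudget {

    /**
     * Sums the delays of a capped exponential backoff: the delay starts at
     * initialDelay, doubles after every retry, and never exceeds maxDelay.
     * (Doubling is an assumption made for this sketch.)
     */
    public static Duration totalBackoff(int maxRetries, Duration initialDelay, Duration maxDelay) {
        long capMs = maxDelay.toMillis();
        long delayMs = Math.min(initialDelay.toMillis(), capMs);
        long totalMs = 0;
        for (int i = 0; i < maxRetries; i++) {
            totalMs += delayMs;
            delayMs = Math.min(delayMs * 2, capMs);
        }
        return Duration.ofMillis(totalMs);
    }

    public static void main(String[] args) {
        // Illustrative values only; not confirmed Flink defaults.
        Duration initial = Duration.ofMillis(50);
        Duration max = Duration.ofMinutes(1);
        for (int retries : new int[] {15, 100}) {
            Duration total = totalBackoff(retries, initial, max);
            System.out.println(retries + " retries -> about " + total.toMinutes() + " min of backoff");
        }
    }
}
```

Under these assumptions, 100 retries add up to roughly 90 minutes of waiting. Because the retry count is shared, any other operation governed by the same option inherits that window too, which is the motivation for separate {{kubernetes.watch-operation.*}} options.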