[ https://issues.apache.org/jira/browse/FLINK-32883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomoyuki NAKAMURA updated FLINK-32883:
--------------------------------------
    Description: 
[https://docs.ververica.com/user_guide/application_operations/deployments/scaling.html#run-with-standby-taskmanager]
I would like the operator to support standby task managers, because on K8s pods 
are often evicted or deleted due to node failures or autoscaling.

With the current implementation it is not possible to set up a standby task 
manager, and a job cannot run until all of its task managers are up and running. 
If standby task managers were supported, a job could keep running without 
downtime even when one of its task managers is unexpectedly deleted.

[https://github.com/apache/flink-kubernetes-operator/blob/release-1.6.0/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/config/FlinkConfigBuilder.java#L370-L380]
If the task manager replica count is set, the job's parallelism setting is 
ignored. Standby task managers could be supported by deriving parallelism as 
replicas * task slots only when the job's parallelism is not set (i.e. 0), and 
by using the explicit value whenever parallelism is set.
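
A minimal sketch of the proposed check, assuming the spec and effectiveConfig 
fields and the option names used in the linked FlinkConfigBuilder (the exact 
surrounding code may differ between operator versions):

{code:java}
// Imports assumed: org.apache.flink.configuration.CoreOptions,
// org.apache.flink.configuration.TaskManagerOptions.
// Proposed ordering: an explicit job parallelism always wins; replicas * slots
// is only a fallback when parallelism is left unset (0).
if (spec.getJob().getParallelism() > 0) {
    // Parallelism was set explicitly: keep it, even if taskManager.replicas
    // provisions more slots than the job needs. The surplus slots become
    // standby capacity.
    effectiveConfig.set(
            CoreOptions.DEFAULT_PARALLELISM, spec.getJob().getParallelism());
} else if (spec.getTaskManager() != null
        && spec.getTaskManager().getReplicas() != null
        && spec.getTaskManager().getReplicas() > 0) {
    // Parallelism not set: derive it from replicas * task slots, as today.
    effectiveConfig.set(
            CoreOptions.DEFAULT_PARALLELISM,
            spec.getTaskManager().getReplicas()
                    * effectiveConfig.get(TaskManagerOptions.NUM_TASK_SLOTS));
}
{code}

For example, with taskManager.replicas: 3, taskmanager.numberOfTaskSlots: 2 and 
job parallelism 4, the job would use 4 of the 6 available slots, leaving two 
slots (one task manager's worth) free as standby capacity.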

If this change looks good, I will send a PR on GitHub.



> Support for standby task managers
> ---------------------------------
>
>                 Key: FLINK-32883
>                 URL: https://issues.apache.org/jira/browse/FLINK-32883
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.6.0
>            Reporter: Tomoyuki NAKAMURA
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
