[jira] [Commented] (FLINK-32700) Support job drain for Savepoint upgrade mode jobs in Flink Operator

Talat Uyarer (Jira) Fri, 28 Jul 2023 15:41:04 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-32700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17748759#comment-17748759
 ]


Talat Uyarer commented on FLINK-32700:
--------------------------------------

[~gyfora] Currently there is an issue with how the Operator handles deletion. 
We want to drain the job because when we call
{code:java}
kubectl delete flinkdeployment{code}
the Operator deletes the job immediately, for both stateless and stateful 
jobs, so we lose in-flight data. I believe that by default the Operator should 
delete jobs by emitting the max watermark. However, emitting the max watermark 
also has an issue: if a sink is stuck, we cannot delete the job, and we end up 
in a deadlock. 

It does not matter which FlinkDeployment upgrade mode we use, but if we use 
last-state/savepoint mode we should drain in-flight data to prevent 
unnecessary data duplication. We do not silently cancel the jobs. We actually 
wait until the savepoint/checkpoint timeout when users delete their 
FlinkDeployment. In the current situation the Operator does not even wait for 
the timeout; it deletes immediately. 
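
A minimal sketch of the drain-with-timeout behavior discussed here, assuming 
hypothetical names (DrainOnDelete, JobClient, stopWithDrain, cancel) that are 
not part of the Operator's or Flink's API. It only illustrates the control 
flow: try to drain, and fall back to a plain cancel if the drain fails or 
times out, so that a stuck sink cannot block deletion forever.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DrainOnDelete {

    /** Hypothetical abstraction over a running job; not an Operator or Flink API. */
    public interface JobClient {
        /** Assumed to emit the max watermark and complete once the job has stopped. */
        CompletableFuture<Void> stopWithDrain();

        /** Assumed to cancel the job immediately, without draining or a savepoint. */
        void cancel();
    }

    public enum Outcome { DRAINED, CANCELLED }

    /** Drains the job if possible; cancels it if draining fails or exceeds the timeout. */
    public static Outcome deleteJob(JobClient job, Duration drainTimeout) {
        CompletableFuture<Void> drain = job.stopWithDrain();
        try {
            // Bounded wait: a stuck sink must not block deletion indefinitely.
            drain.get(drainTimeout.toMillis(), TimeUnit.MILLISECONDS);
            return Outcome.DRAINED;
        } catch (TimeoutException | ExecutionException e) {
            // Drain timed out or failed: give up on draining and cancel.
            drain.cancel(true);
            job.cancel();
            return Outcome.CANCELLED;
        } catch (InterruptedException e) {
            // Preserve the interrupt flag, then fall back to cancel as well.
            Thread.currentThread().interrupt();
            drain.cancel(true);
            job.cancel();
            return Outcome.CANCELLED;
        }
    }
}
```

In a real deployment, stopWithDrain would presumably map to Flink's 
stop-with-savepoint with drain enabled, and the timeout would come from 
configuration rather than a method argument. 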

We would like to follow your suggestion, but please keep in mind that we have 
20K+ stateful/stateless jobs that are triggered programmatically. There is no 
way for us to change their deployments manually. 

 

cc [~mmangal] 

> Support job drain for Savepoint upgrade mode jobs in Flink Operator
> -------------------------------------------------------------------
>
>                 Key: FLINK-32700
>                 URL: https://issues.apache.org/jira/browse/FLINK-32700
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.5.0
>            Reporter: Manan Mangal
>            Assignee: Manan Mangal
>            Priority: Major
>
> During cancel job with savepoint upgrade mode, jobs can be allowed to drain 
> by advancing the watermark to the end, before they are stopped, so that the 
> in-flight data is not lost. 
> If the job fails to drain and hits timeout or any other error, it can be 
> cancelled without taking a savepoint.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
