[jira] [Commented] (FLINK-32700) Support job drain for Savepoint upgrade mode jobs in Flink Operator

Talat Uyarer (Jira) Sun, 30 Jul 2023 02:09:07 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-32700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17748908#comment-17748908
 ]


Talat Uyarer commented on FLINK-32700:
--------------------------------------

Our customers are OK with losing data if the job cannot be recovered. The stop 
is triggered by the customer, and most of the time they want to roll out new 
code or change a setting. If the job cannot be drained, we don't want to wait 
a day to apply a fix, because there is no way to recover if the sink is down. 
The reason we want to drain the job is to reduce data duplication as much as 
possible. We trigger the drain so that the job stops reading from Kafka; if the 
savepoint succeeds, we commit the offsets, and when the job is started again it 
resumes exactly where it left off. We don't see many cancel-job actions in our 
current production. 

For Google Dataflow we have implemented auto-cancel for jobs: without any human 
interaction, we cancel a job after a certain timeout, usually less than 1 
hour. We want to do something similar for Flink. 
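
To make the intended behaviour concrete, here is a rough sketch of the 
"drain first, cancel on timeout" flow described above. It is only an 
illustration: the function name, the three callbacks, and the result strings 
are hypothetical and are not part of the Flink Kubernetes Operator or of any 
Flink API.

```python
import time
from typing import Callable


def stop_with_drain_or_cancel(
    trigger_drain: Callable[[], None],    # hypothetical: starts stop-with-savepoint with drain
    savepoint_done: Callable[[], bool],   # hypothetical: True once the savepoint has completed
    cancel_job: Callable[[], None],       # hypothetical: cancels the job without a savepoint
    timeout_s: float,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try to drain the job; fall back to a plain cancel if the timeout expires."""
    trigger_drain()
    deadline = clock() + timeout_s
    while clock() < deadline:
        if savepoint_done():
            # Savepoint succeeded, so the job can later resume where it left off.
            return "DRAINED"
        sleep(poll_s)
    # Drain did not finish in time (for example, the sink is down): give up and cancel.
    cancel_job()
    return "CANCELLED"
```

The clock and sleep functions are passed in only so that the timeout path can 
be exercised without actually waiting.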

We are OK with handling the savepoint cancel issue under a different ticket. If 
we merge only the drain part, this fix will introduce a stuck-job issue, but 
that is no different from current master. 

 

Have a good vacation. I will also be off next week :) 

[~mmangal] Could you update your MR according to [~gyfora]'s suggestion?

> Support job drain for Savepoint upgrade mode jobs in Flink Operator
> -------------------------------------------------------------------
>
>                 Key: FLINK-32700
>                 URL: https://issues.apache.org/jira/browse/FLINK-32700
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.5.0
>            Reporter: Manan Mangal
>            Assignee: Manan Mangal
>            Priority: Major
>
> During cancel job with savepoint upgrade mode, jobs can be allowed to drain 
> by advancing the watermark to the end, before they are stopped, so that the 
> in-flight data is not lost. 
> If the job fails to drain and hits timeout or any other error, it can be 
> cancelled without taking a savepoint.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
