[ https://issues.apache.org/jira/browse/FLINK-32700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17748908#comment-17748908 ]
Talat Uyarer commented on FLINK-32700:
--------------------------------------

Our customers are OK with losing data if the job cannot be recovered. A job stop is triggered by the customer, and most of the time they want to roll out new code or change a setting. If the job cannot be drained, we don't want to wait a day to apply a fix, because there is no way to recover if the sink is down.

The reason we want to drain the job is to reduce data duplication as much as possible. We trigger a drain so that the job stops reading from Kafka; if the savepoint is successful we commit the offsets, and when we start the job again it resumes exactly where it left off.

We don't see many cancel-job actions in our current production. For Google Dataflow we implemented auto-cancel for jobs: without any human interaction, we cancel the job after a certain timeout, usually less than 1 hour. We want to do something similar for Flink.

We are OK with handling the savepoint cancel issue under a different ticket. If we merge only the drain part, this fix will introduce a stuck-job issue, but that is no different from current master.

Have a good vacation. I will also be off next week :)

[~mmangal] Could you update your MR according to [~gyfora]'s suggestion?

> Support job drain for Savepoint upgrade mode jobs in Flink Operator
> -------------------------------------------------------------------
>
>                 Key: FLINK-32700
>                 URL: https://issues.apache.org/jira/browse/FLINK-32700
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.5.0
>            Reporter: Manan Mangal
>            Assignee: Manan Mangal
>            Priority: Major
>
> During cancel job with savepoint upgrade mode, jobs can be allowed to drain
> by advancing the watermark to the end, before they are stopped, so that the
> in-flight data is not lost.
> If the job fails to drain and hits timeout or any other error, it can be
> cancelled without taking a savepoint.
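
For illustration only, the drain-then-cancel behaviour discussed in this comment and in the issue description might be sketched roughly as follows. This is not the operator's implementation and does not use any Flink or Kubernetes Operator API: `JobControl`, `stopWithDrain`, `cancel`, `Outcome` and `DrainOrCancel` are hypothetical names introduced here, and the timeout value is an assumption left to the caller.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class DrainOrCancel {

    /** Hypothetical stand-in for whatever client controls the job; not a Flink API. */
    interface JobControl {
        /** Requests stop-with-drain; completes with the savepoint location on success. */
        CompletableFuture<String> stopWithDrain();

        /** Cancels the job without taking a savepoint. */
        void cancel();
    }

    enum Outcome { DRAINED, CANCELLED }

    /**
     * Tries to drain the job. If the drain fails or does not finish within
     * drainTimeout, falls back to cancelling without a savepoint.
     */
    static Outcome stop(JobControl job, Duration drainTimeout) {
        CompletableFuture<String> drain = job.stopWithDrain();
        try {
            // Wait for the drain and savepoint, but never longer than the timeout.
            drain.get(drainTimeout.toMillis(), TimeUnit.MILLISECONDS);
            return Outcome.DRAINED;
        } catch (TimeoutException | ExecutionException e) {
            // Drain timed out or failed (for example, the sink is down):
            // stop waiting and cancel, accepting possible data loss.
            drain.cancel(true);
            job.cancel();
            return Outcome.CANCELLED;
        } catch (InterruptedException e) {
            // Preserve the interrupt flag and surface the interruption to the caller.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for drain", e);
        }
    }
}
```

A caller would pass its own `JobControl` and a timeout, for example `DrainOrCancel.stop(job, Duration.ofMinutes(30))`. The sketch covers only the timeout fallback; handling a savepoint that is cancelled midway is deliberately out of scope here, as in the comment above.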
--
This message was sent by Atlassian Jira (v8.20.10#820010)