[jira] [Created] (FLINK-12869) Add yarn acls capability to flink containers
Nicolas Fraison created FLINK-12869: --- Summary: Add yarn acls capability to flink containers Key: FLINK-12869 URL: https://issues.apache.org/jira/browse/FLINK-12869 Project: Flink Issue Type: New Feature Components: Deployment / YARN Reporter: Nicolas Fraison YARN provides an application ACLs mechanism that can grant specific rights (viewing logs through the ResourceManager/job history server, killing the application) to users other than the one running the job. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
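As an illustration of the mechanism the ticket refers to, application ACLs are typically attached to a container launch context through the standard Hadoop API. The sketch below is only an assumption-laden example (placeholder user/group names, not Flink's implementation):

{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.ApplicationAccessType;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

// Minimal sketch (not Flink's code): attach view/modify ACLs to a container
// launch context. ACL values use the usual "user1,user2 group1,group2" syntax;
// the names below are placeholders.
public final class YarnAclSketch {
    public static void applyAcls(ContainerLaunchContext launchContext) {
        Map<ApplicationAccessType, String> acls = new HashMap<>();
        acls.put(ApplicationAccessType.VIEW_APP, "viewer1,viewer2 viewgroup");
        acls.put(ApplicationAccessType.MODIFY_APP, "admin1 admingroup");
        launchContext.setApplicationACLs(acls);
    }
}
{code}

For such ACLs to be enforced, yarn.acl.enable must be set to true on the Hadoop cluster.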
[jira] [Commented] (FLINK-35489) Metaspace size can be too little after autotuning change memory setting
[ https://issues.apache.org/jira/browse/FLINK-35489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853647#comment-17853647 ] Nicolas Fraison commented on FLINK-35489: - Thks [~mxm] for the feedback. I've updated the [PR|https://github.com/apache/flink-kubernetes-operator/pull/833/files] to not limit the metaspace size. > Metaspace size can be too little after autotuning change memory setting > --- > > Key: FLINK-35489 > URL: https://issues.apache.org/jira/browse/FLINK-35489 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Affects Versions: 1.8.0 >Reporter: Nicolas Fraison >Priority: Major > Labels: pull-request-available > > We have enable the autotuning feature on one of our flink job with below > config > {code:java} > # Autoscaler configuration > job.autoscaler.enabled: "true" > job.autoscaler.stabilization.interval: 1m > job.autoscaler.metrics.window: 10m > job.autoscaler.target.utilization: "0.8" > job.autoscaler.target.utilization.boundary: "0.1" > job.autoscaler.restart.time: 2m > job.autoscaler.catch-up.duration: 10m > job.autoscaler.memory.tuning.enabled: true > job.autoscaler.memory.tuning.overhead: 0.5 > job.autoscaler.memory.tuning.maximize-managed-memory: true{code} > During a scale down the autotuning decided to give all the memory to to JVM > (having heap being scale by 2) settting taskmanager.memory.managed.size to 0b. > Here is the config that was compute by the autotuning for a TM running on a > 4GB pod: > {code:java} > taskmanager.memory.network.max: 4063232b > taskmanager.memory.network.min: 4063232b > taskmanager.memory.jvm-overhead.max: 433791712b > taskmanager.memory.task.heap.size: 3699934605b > taskmanager.memory.framework.off-heap.size: 134217728b > taskmanager.memory.jvm-metaspace.size: 22960020b > taskmanager.memory.framework.heap.size: "0 bytes" > taskmanager.memory.flink.size: 3838215565b > taskmanager.memory.managed.size: 0b {code} > This has lead to some issue starting the TM because we are relying on some > javaagent performing some memory allocation outside of the JVM (rely on some > C bindings). > Tuning the overhead or disabling the scale-down-compensation.enabled could > have helped for that particular event but this can leads to other issue as it > could leads to too little HEAP size being computed. > It would be interesting to be able to set a min memory.managed.size to be > taken in account by the autotuning. > What do you think about this? Do you think that some other specific config > should have been applied to avoid this issue? > > Edit see this comment that leads to the metaspace issue: > https://issues.apache.org/jira/browse/FLINK-35489?focusedCommentId=17850694&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17850694 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-34131) Checkpoint check window should take in account checkpoint job configuration
Nicolas Fraison created FLINK-34131: --- Summary: Checkpoint check window should take in account checkpoint job configuration Key: FLINK-34131 URL: https://issues.apache.org/jira/browse/FLINK-34131 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Nicolas Fraison When the checkpoint progress check (kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) is enabled to assess cluster health, the operator detects whether a checkpoint has been performed during kubernetes.operator.cluster.health-check.checkpoint-progress.window. As indicated in the doc, this window must be bigger than the checkpointing interval. But this is a manual configuration, which can lead to misconfiguration and an unwanted restart of the flink cluster if the checkpointing interval is bigger than the window. The operator should check that the config is sane before relying on this check. If it is not set correctly, it should not execute the check (return true in [evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50]) and log a WARN message. Flink jobs also have other checkpointing parameters that should be taken into account for this window, namely execution.checkpointing.timeout and execution.checkpointing.tolerable-failed-checkpoints. The idea would be to check that kubernetes.operator.cluster.health-check.checkpoint-progress.window >= (execution.checkpointing.interval + execution.checkpointing.timeout) * execution.checkpointing.tolerable-failed-checkpoints -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-34131) Checkpoint check window should take in account checkpoint job configuration
[ https://issues.apache.org/jira/browse/FLINK-34131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Fraison updated FLINK-34131: Priority: Minor (was: Major) > Checkpoint check window should take in account checkpoint job configuration > --- > > Key: FLINK-34131 > URL: https://issues.apache.org/jira/browse/FLINK-34131 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Nicolas Fraison >Priority: Minor > > When enabling checkpoint progress check > (kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) to > define cluster health the operator rely detect if a checkpoint has been > performed during the > kubernetes.operator.cluster.health-check.checkpoint-progress.window > As indicated in the doc it must be bigger to checkpointing interval. > But this is a manual configuration which can leads to misconfiguration and > unwanted restart of the flink cluster if the checkpointing interval is bigger > than the window one. > The operator must check that the config is healthy before to rely on this > check. If it is not well set it should not execute the check (return true on > [evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50]) > and log a WARN message. > Also flink jobs have other checkpointing parameters that should be taken in > account for this window configuration which are > execution.checkpointing.timeout and > execution.checkpointing.tolerable-failed-checkpoints > The idea would be to check that > kubernetes.operator.cluster.health-check.checkpoint-progress.window is at >= > to (execution.checkpointing.interval + execution.checkpointing.timeout) * > execution.checkpointing.tolerable-failed-checkpoints -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-34131) Checkpoint check window should take in account checkpoint job configuration
[ https://issues.apache.org/jira/browse/FLINK-34131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Fraison updated FLINK-34131: Description: When enabling checkpoint progress check (kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) to define cluster health the operator rely detect if a checkpoint has been performed during the kubernetes.operator.cluster.health-check.checkpoint-progress.window As indicated in the doc it must be bigger to checkpointing interval. But this is a manual configuration which can leads to misconfiguration and unwanted restart of the flink cluster if the checkpointing interval is bigger than the window one. The operator must check that the config is healthy before to rely on this check. If it is not well set it should not execute the check (return true on [evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50]) and log a WARN message. Also flink jobs have other checkpointing parameters that should be taken in account for this window configuration which are execution.checkpointing.timeout and execution.checkpointing.tolerable-failed-checkpoints The idea would be to check that kubernetes.operator.cluster.health-check.checkpoint-progress.window >= max(execution.checkpointing.interval, execution.checkpointing.timeout * execution.checkpointing.tolerable-failed-checkpoints) was: When enabling checkpoint progress check (kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) to define cluster health the operator rely detect if a checkpoint has been performed during the kubernetes.operator.cluster.health-check.checkpoint-progress.window As indicated in the doc it must be bigger to checkpointing interval. But this is a manual configuration which can leads to misconfiguration and unwanted restart of the flink cluster if the checkpointing interval is bigger than the window one. The operator must check that the config is healthy before to rely on this check. If it is not well set it should not execute the check (return true on [evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50]) and log a WARN message. Also flink jobs have other checkpointing parameters that should be taken in account for this window configuration which are execution.checkpointing.timeout and execution.checkpointing.tolerable-failed-checkpoints The idea would be to check that kubernetes.operator.cluster.health-check.checkpoint-progress.window is at >= to (execution.checkpointing.interval + execution.checkpointing.timeout) * execution.checkpointing.tolerable-failed-checkpoints > Checkpoint check window should take in account checkpoint job configuration > --- > > Key: FLINK-34131 > URL: https://issues.apache.org/jira/browse/FLINK-34131 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Nicolas Fraison >Priority: Minor > > When enabling checkpoint progress check > (kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) to > define cluster health the operator rely detect if a checkpoint has been > performed during the > kubernetes.operator.cluster.health-check.checkpoint-progress.window > As indicated in the doc it must be bigger to checkpointing interval. 
> But this is a manual configuration which can leads to misconfiguration and > unwanted restart of the flink cluster if the checkpointing interval is bigger > than the window one. > The operator must check that the config is healthy before to rely on this > check. If it is not well set it should not execute the check (return true on > [evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50]) > and log a WARN message. > Also flink jobs have other checkpointing parameters that should be taken in > account for this window configuration which are > execution.checkpointing.timeout and > execution.checkpointing.tolerable-failed-checkpoints > The idea would be to check that > kubernetes.operator.cluster.health-check.checkpoint-progress.window >= > max(execution.checkpointing.interval, execution.checkpointing.timeout * > execution.checkpointing.tolerable-failed-checkpoints) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-34131) Checkpoint check window should take in account checkpoint job configuration
[ https://issues.apache.org/jira/browse/FLINK-34131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Fraison updated FLINK-34131: Description: When enabling checkpoint progress check (kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) to define cluster health the operator rely detect if a checkpoint has been performed during the kubernetes.operator.cluster.health-check.checkpoint-progress.window As indicated in the doc it must be bigger to checkpointing interval. But this is a manual configuration which can leads to misconfiguration and unwanted restart of the flink cluster if the checkpointing interval is bigger than the window one. The operator must check that the config is healthy before to rely on this check. If it is not well set it should not execute the check (return true on [evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50]) and log a WARN message. Also flink jobs have other checkpointing parameters that should be taken in account for this window configuration which are execution.checkpointing.timeout and execution.checkpointing.tolerable-failed-checkpoints The idea would be to check that kubernetes.operator.cluster.health-check.checkpoint-progress.window >= max(execution.checkpointing.interval * execution.checkpointing.tolerable-failed-checkpoints, execution.checkpointing.timeout * execution.checkpointing.tolerable-failed-checkpoints) was: When enabling checkpoint progress check (kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) to define cluster health the operator rely detect if a checkpoint has been performed during the kubernetes.operator.cluster.health-check.checkpoint-progress.window As indicated in the doc it must be bigger to checkpointing interval. But this is a manual configuration which can leads to misconfiguration and unwanted restart of the flink cluster if the checkpointing interval is bigger than the window one. The operator must check that the config is healthy before to rely on this check. If it is not well set it should not execute the check (return true on [evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50]) and log a WARN message. 
Also flink jobs have other checkpointing parameters that should be taken in account for this window configuration which are execution.checkpointing.timeout and execution.checkpointing.tolerable-failed-checkpoints The idea would be to check that kubernetes.operator.cluster.health-check.checkpoint-progress.window >= max(execution.checkpointing.interval, execution.checkpointing.timeout * execution.checkpointing.tolerable-failed-checkpoints) > Checkpoint check window should take in account checkpoint job configuration > --- > > Key: FLINK-34131 > URL: https://issues.apache.org/jira/browse/FLINK-34131 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Nicolas Fraison >Priority: Minor > > When enabling checkpoint progress check > (kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) to > define cluster health the operator rely detect if a checkpoint has been > performed during the > kubernetes.operator.cluster.health-check.checkpoint-progress.window > As indicated in the doc it must be bigger to checkpointing interval. > But this is a manual configuration which can leads to misconfiguration and > unwanted restart of the flink cluster if the checkpointing interval is bigger > than the window one. > The operator must check that the config is healthy before to rely on this > check. If it is not well set it should not execute the check (return true on > [evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50]) > and log a WARN message. > Also flink jobs have other checkpointing parameters that should be taken in > account for this window configuration which are > execution.checkpointing.timeout and > execution.checkpointing.tolerable-failed-checkpoints > The idea would be to check that > kubernetes.operator.cluster.health-check.checkpoint-progress.window >= > max(execution.checkpointing.interval * > execution.checkpointing.tolerable-failed-checkpoints, > execution.checkpointing.timeout * > execution.checkpointing.tolerable-failed-checkpoints) -- This message was sent by Atlassian Jira (v8.20.10#820010)
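A rough sketch of the proposed validation, using the final formula above; class and parameter names are placeholders, not the operator's actual code:

{code:java}
import java.time.Duration;

/** Sketch of the proposed sanity check; names are placeholders, not the operator's code. */
final class CheckpointWindowValidationSketch {
    static boolean isWindowValid(Duration window,
                                 Duration checkpointInterval,
                                 Duration checkpointTimeout,
                                 int tolerableFailedCheckpoints) {
        Duration byInterval = checkpointInterval.multipliedBy(tolerableFailedCheckpoints);
        Duration byTimeout = checkpointTimeout.multipliedBy(tolerableFailedCheckpoints);
        Duration minWindow = byInterval.compareTo(byTimeout) >= 0 ? byInterval : byTimeout;
        // If misconfigured, the health check should be skipped (treated as healthy)
        // and a WARN message logged instead of restarting the cluster.
        return window.compareTo(minWindow) >= 0;
    }
}
{code}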
[jira] [Created] (FLINK-33797) Flink application FAILED log in info instead of warn
Nicolas Fraison created FLINK-33797: --- Summary: Flink application FAILED log in info instead of warn Key: FLINK-33797 URL: https://issues.apache.org/jira/browse/FLINK-33797 Project: Flink Issue Type: Bug Reporter: Nicolas Fraison When the flink application ends in a cancelled or failed state, the message is logged at INFO level while it should be WARN: https://github.com/apache/flink/blob/548e4b5188bb3f092206182d779a909756408660/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L174 -- This message was sent by Atlassian Jira (v8.20.10#820010)
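A paraphrased sketch of the proposed change; the surrounding code and message are illustrative and not copied from ApplicationDispatcherBootstrap:

{code:java}
import org.apache.flink.runtime.clusterframework.ApplicationStatus;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Paraphrased sketch of the proposed log-level change, not the actual Flink code. */
final class TerminalStateLoggingSketch {
    private static final Logger LOG = LoggerFactory.getLogger(TerminalStateLoggingSketch.class);

    static void logTerminalState(ApplicationStatus status, String jobId) {
        if (status == ApplicationStatus.CANCELED || status == ApplicationStatus.FAILED) {
            // Non-successful terminal states should stand out in the logs.
            LOG.warn("Application {} finished with state {}", jobId, status);
        } else {
            LOG.info("Application {} finished with state {}", jobId, status);
        }
    }
}
{code}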
[jira] [Created] (FLINK-35489) Add capability to set min taskmanager.memory.managed.size when enabling autotuning
Nicolas Fraison created FLINK-35489: --- Summary: Add capability to set min taskmanager.memory.managed.size when enabling autotuning Key: FLINK-35489 URL: https://issues.apache.org/jira/browse/FLINK-35489 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Affects Versions: 1.8.0 Reporter: Nicolas Fraison We have enabled the autotuning feature on one of our flink jobs with the config below {code:java} # Autoscaler configuration job.autoscaler.enabled: "true" job.autoscaler.stabilization.interval: 1m job.autoscaler.metrics.window: 10m job.autoscaler.target.utilization: "0.8" job.autoscaler.target.utilization.boundary: "0.1" job.autoscaler.restart.time: 2m job.autoscaler.catch-up.duration: 10m job.autoscaler.memory.tuning.enabled: true job.autoscaler.memory.tuning.overhead: 0.5 job.autoscaler.memory.tuning.maximize-managed-memory: true{code} During a scale down, the autotuning decided to give all the memory to the JVM (scaling the heap by 2), setting taskmanager.memory.managed.size to 0b. Here is the config that was computed by the autotuning for a TM running on a 4GB pod: {code:java} taskmanager.memory.network.max: 4063232b taskmanager.memory.network.min: 4063232b taskmanager.memory.jvm-overhead.max: 433791712b taskmanager.memory.task.heap.size: 3699934605b taskmanager.memory.framework.off-heap.size: 134217728b taskmanager.memory.jvm-metaspace.size: 22960020b taskmanager.memory.framework.heap.size: "0 bytes" taskmanager.memory.flink.size: 3838215565b taskmanager.memory.managed.size: 0b {code} This led to issues starting the TM because we rely on a javaagent that performs memory allocation outside of the JVM (through C bindings). Tuning the overhead or disabling scale-down-compensation.enabled could have helped for that particular event, but it can lead to other issues as it could result in too small a heap size being computed. It would be interesting to be able to set a minimum memory.managed.size to be taken into account by the autotuning. What do you think about this? Do you think some other specific config should have been applied to avoid this issue? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-35489) Add capability to set min taskmanager.memory.managed.size when enabling autotuning
[ https://issues.apache.org/jira/browse/FLINK-35489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850694#comment-17850694 ] Nicolas Fraison commented on FLINK-35489: - Thanks [~fanrui] for the feedback, it helped me realise that my analysis was wrong. The issue we are facing is the JVM crashing after the [autotuning|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autotuning/] changed some memory config: {code:java} Starting kubernetes-taskmanager as a console application on host flink-kafka-job-apache-right-taskmanager-1-1. Exception in thread "main" *** java.lang.instrument ASSERTION FAILED ***: "result" with message agent load/premain call failed at src/java.instrument/share/native/libinstrument/JPLISAgent.c line: 422 FATAL ERROR in native method: processing of -javaagent failed, processJavaStart failed Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x78dee4] jni_FatalError+0x70 V [libjvm.so+0x88df00] JvmtiExport::post_vm_initialized()+0x240 V [libjvm.so+0xc353fc] Threads::create_vm(JavaVMInitArgs*, bool*)+0x7ac V [libjvm.so+0x79c05c] JNI_CreateJavaVM+0x7c C [libjli.so+0x3b2c] JavaMain+0x7c C [libjli.so+0x7fdc] ThreadJavaMain+0xc C [libpthread.so.0+0x7624] start_thread+0x184 {code} Seeing this big increase of heap (from 1.5GB to more than 3GB) and the fact that memory.managed.size was shrunk to 0b made me think it was linked to missing off-heap memory. But you are right that jvm-overhead already reserves some memory for off-heap usage (and we indeed have around 400 MB with that config). So looking back at the new config I've identified the issue, which is the jvm-metaspace having been shrunk to 22MB while it was previously set to 256MB. I've done a test increasing this parameter and the TM is now able to start. For the metaspace size computation I can see the autotuning computing METASPACE_MEMORY_USED=1.41521584E8, which seems to be an appropriate metaspace sizing. But due to the memBudget management it ends up assigning only 22MB to the metaspace ([the remaining memory is first allocated to the heap, then what is left to the metaspace, and finally to managed memory|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/tuning/MemoryTuning.java#L130]) > Add capability to set min taskmanager.memory.managed.size when enabling > autotuning > -- > > Key: FLINK-35489 > URL: https://issues.apache.org/jira/browse/FLINK-35489 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Affects Versions: 1.8.0 >Reporter: Nicolas Fraison >Priority: Major > > We have enable the autotuning feature on one of our flink job with below > config > {code:java} > # Autoscaler configuration > job.autoscaler.enabled: "true" > job.autoscaler.stabilization.interval: 1m > job.autoscaler.metrics.window: 10m > job.autoscaler.target.utilization: "0.8" > job.autoscaler.target.utilization.boundary: "0.1" > job.autoscaler.restart.time: 2m > job.autoscaler.catch-up.duration: 10m > job.autoscaler.memory.tuning.enabled: true > job.autoscaler.memory.tuning.overhead: 0.5 > job.autoscaler.memory.tuning.maximize-managed-memory: true{code} > During a scale down the autotuning decided to give all the memory to to JVM > (having heap being scale by 2) settting taskmanager.memory.managed.size to 0b.
> Here is the config that was compute by the autotuning for a TM running on a > 4GB pod: > {code:java} > taskmanager.memory.network.max: 4063232b > taskmanager.memory.network.min: 4063232b > taskmanager.memory.jvm-overhead.max: 433791712b > taskmanager.memory.task.heap.size: 3699934605b > taskmanager.memory.framework.off-heap.size: 134217728b > taskmanager.memory.jvm-metaspace.size: 22960020b > taskmanager.memory.framework.heap.size: "0 bytes" > taskmanager.memory.flink.size: 3838215565b > taskmanager.memory.managed.size: 0b {code} > This has lead to some issue starting the TM because we are relying on some > javaagent performing some memory allocation outside of the JVM (rely on some > C bindings). > Tuning the overhead or disabling the scale-down-compensation.enabled could > have helped for that particular event but this can leads to other issue as it > could leads to too little HEAP size being computed. > It would be interesting to be able to set a min memory.managed.size to be > taken in account by the autotuning. > What do you think about this? Do you think that some other specific config > should have been applied to avoid this issue? -- This message was sent by Atlassian Jira (v8.20.10#820010)
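To illustrate the allocation order mentioned in the comment above, here is a simplified sketch (plain byte counts, placeholder names, not the actual MemoryTuning code) of how serving the heap first can leave only a sliver for the metaspace:

{code:java}
/**
 * Simplified illustration (not the actual MemoryTuning code) of a sequential
 * memory budget: the heap is served first, the metaspace only gets what is
 * left, and managed memory receives the final remainder (possibly 0b).
 */
final class HeapFirstBudgetSketch {
    static long[] allocate(long remainingBudgetBytes, long heapTargetBytes, long metaspaceTargetBytes) {
        long heap = Math.min(heapTargetBytes, remainingBudgetBytes);
        remainingBudgetBytes -= heap;
        // If the heap target is close to the whole budget, only a sliver is left here.
        long metaspace = Math.min(metaspaceTargetBytes, remainingBudgetBytes);
        remainingBudgetBytes -= metaspace;
        long managed = remainingBudgetBytes;
        return new long[] {heap, metaspace, managed};
    }
}
{code}

With numbers in the same ballpark as the comment above (a heap target close to the full remaining budget and a measured metaspace of roughly 140MB), the metaspace ends up with only the few MB left over after the heap is served.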
[jira] [Updated] (FLINK-35489) Metaspace size can be too little after autotuning change memory setting
[ https://issues.apache.org/jira/browse/FLINK-35489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Fraison updated FLINK-35489: Summary: Metaspace size can be too little after autotuning change memory setting (was: Add capability to set min taskmanager.memory.managed.size when enabling autotuning) > Metaspace size can be too little after autotuning change memory setting > --- > > Key: FLINK-35489 > URL: https://issues.apache.org/jira/browse/FLINK-35489 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Affects Versions: 1.8.0 >Reporter: Nicolas Fraison >Priority: Major > > We have enable the autotuning feature on one of our flink job with below > config > {code:java} > # Autoscaler configuration > job.autoscaler.enabled: "true" > job.autoscaler.stabilization.interval: 1m > job.autoscaler.metrics.window: 10m > job.autoscaler.target.utilization: "0.8" > job.autoscaler.target.utilization.boundary: "0.1" > job.autoscaler.restart.time: 2m > job.autoscaler.catch-up.duration: 10m > job.autoscaler.memory.tuning.enabled: true > job.autoscaler.memory.tuning.overhead: 0.5 > job.autoscaler.memory.tuning.maximize-managed-memory: true{code} > During a scale down the autotuning decided to give all the memory to to JVM > (having heap being scale by 2) settting taskmanager.memory.managed.size to 0b. > Here is the config that was compute by the autotuning for a TM running on a > 4GB pod: > {code:java} > taskmanager.memory.network.max: 4063232b > taskmanager.memory.network.min: 4063232b > taskmanager.memory.jvm-overhead.max: 433791712b > taskmanager.memory.task.heap.size: 3699934605b > taskmanager.memory.framework.off-heap.size: 134217728b > taskmanager.memory.jvm-metaspace.size: 22960020b > taskmanager.memory.framework.heap.size: "0 bytes" > taskmanager.memory.flink.size: 3838215565b > taskmanager.memory.managed.size: 0b {code} > This has lead to some issue starting the TM because we are relying on some > javaagent performing some memory allocation outside of the JVM (rely on some > C bindings). > Tuning the overhead or disabling the scale-down-compensation.enabled could > have helped for that particular event but this can leads to other issue as it > could leads to too little HEAP size being computed. > It would be interesting to be able to set a min memory.managed.size to be > taken in account by the autotuning. > What do you think about this? Do you think that some other specific config > should have been applied to avoid this issue? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-35489) Metaspace size can be too little after autotuning change memory setting
[ https://issues.apache.org/jira/browse/FLINK-35489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Fraison updated FLINK-35489: Description: We have enable the autotuning feature on one of our flink job with below config {code:java} # Autoscaler configuration job.autoscaler.enabled: "true" job.autoscaler.stabilization.interval: 1m job.autoscaler.metrics.window: 10m job.autoscaler.target.utilization: "0.8" job.autoscaler.target.utilization.boundary: "0.1" job.autoscaler.restart.time: 2m job.autoscaler.catch-up.duration: 10m job.autoscaler.memory.tuning.enabled: true job.autoscaler.memory.tuning.overhead: 0.5 job.autoscaler.memory.tuning.maximize-managed-memory: true{code} During a scale down the autotuning decided to give all the memory to to JVM (having heap being scale by 2) settting taskmanager.memory.managed.size to 0b. Here is the config that was compute by the autotuning for a TM running on a 4GB pod: {code:java} taskmanager.memory.network.max: 4063232b taskmanager.memory.network.min: 4063232b taskmanager.memory.jvm-overhead.max: 433791712b taskmanager.memory.task.heap.size: 3699934605b taskmanager.memory.framework.off-heap.size: 134217728b taskmanager.memory.jvm-metaspace.size: 22960020b taskmanager.memory.framework.heap.size: "0 bytes" taskmanager.memory.flink.size: 3838215565b taskmanager.memory.managed.size: 0b {code} This has lead to some issue starting the TM because we are relying on some javaagent performing some memory allocation outside of the JVM (rely on some C bindings). Tuning the overhead or disabling the scale-down-compensation.enabled could have helped for that particular event but this can leads to other issue as it could leads to too little HEAP size being computed. It would be interesting to be able to set a min memory.managed.size to be taken in account by the autotuning. What do you think about this? Do you think that some other specific config should have been applied to avoid this issue? Edit was: We have enable the autotuning feature on one of our flink job with below config {code:java} # Autoscaler configuration job.autoscaler.enabled: "true" job.autoscaler.stabilization.interval: 1m job.autoscaler.metrics.window: 10m job.autoscaler.target.utilization: "0.8" job.autoscaler.target.utilization.boundary: "0.1" job.autoscaler.restart.time: 2m job.autoscaler.catch-up.duration: 10m job.autoscaler.memory.tuning.enabled: true job.autoscaler.memory.tuning.overhead: 0.5 job.autoscaler.memory.tuning.maximize-managed-memory: true{code} During a scale down the autotuning decided to give all the memory to to JVM (having heap being scale by 2) settting taskmanager.memory.managed.size to 0b. Here is the config that was compute by the autotuning for a TM running on a 4GB pod: {code:java} taskmanager.memory.network.max: 4063232b taskmanager.memory.network.min: 4063232b taskmanager.memory.jvm-overhead.max: 433791712b taskmanager.memory.task.heap.size: 3699934605b taskmanager.memory.framework.off-heap.size: 134217728b taskmanager.memory.jvm-metaspace.size: 22960020b taskmanager.memory.framework.heap.size: "0 bytes" taskmanager.memory.flink.size: 3838215565b taskmanager.memory.managed.size: 0b {code} This has lead to some issue starting the TM because we are relying on some javaagent performing some memory allocation outside of the JVM (rely on some C bindings). 
Tuning the overhead or disabling the scale-down-compensation.enabled could have helped for that particular event but this can leads to other issue as it could leads to too little HEAP size being computed. It would be interesting to be able to set a min memory.managed.size to be taken in account by the autotuning. What do you think about this? Do you think that some other specific config should have been applied to avoid this issue? > Metaspace size can be too little after autotuning change memory setting > --- > > Key: FLINK-35489 > URL: https://issues.apache.org/jira/browse/FLINK-35489 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Affects Versions: 1.8.0 >Reporter: Nicolas Fraison >Priority: Major > > We have enable the autotuning feature on one of our flink job with below > config > {code:java} > # Autoscaler configuration > job.autoscaler.enabled: "true" > job.autoscaler.stabilization.interval: 1m > job.autoscaler.metrics.window: 10m > job.autoscaler.target.utilization: "0.8" > job.autoscaler.target.utilization.boundary: "0.1" > job.autoscaler.restart.time: 2m > job.autoscaler.catch-up.duration: 10m > job.autoscaler.memory.tuning.enabled: true > job.autoscaler.memory.tuning.overhead: 0.5 > job.autoscaler.memory.tuning.maximize-managed-memory: true{code} > During a scale
[jira] [Updated] (FLINK-35489) Metaspace size can be too little after autotuning change memory setting
[ https://issues.apache.org/jira/browse/FLINK-35489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Fraison updated FLINK-35489: Description: We have enable the autotuning feature on one of our flink job with below config {code:java} # Autoscaler configuration job.autoscaler.enabled: "true" job.autoscaler.stabilization.interval: 1m job.autoscaler.metrics.window: 10m job.autoscaler.target.utilization: "0.8" job.autoscaler.target.utilization.boundary: "0.1" job.autoscaler.restart.time: 2m job.autoscaler.catch-up.duration: 10m job.autoscaler.memory.tuning.enabled: true job.autoscaler.memory.tuning.overhead: 0.5 job.autoscaler.memory.tuning.maximize-managed-memory: true{code} During a scale down the autotuning decided to give all the memory to to JVM (having heap being scale by 2) settting taskmanager.memory.managed.size to 0b. Here is the config that was compute by the autotuning for a TM running on a 4GB pod: {code:java} taskmanager.memory.network.max: 4063232b taskmanager.memory.network.min: 4063232b taskmanager.memory.jvm-overhead.max: 433791712b taskmanager.memory.task.heap.size: 3699934605b taskmanager.memory.framework.off-heap.size: 134217728b taskmanager.memory.jvm-metaspace.size: 22960020b taskmanager.memory.framework.heap.size: "0 bytes" taskmanager.memory.flink.size: 3838215565b taskmanager.memory.managed.size: 0b {code} This has lead to some issue starting the TM because we are relying on some javaagent performing some memory allocation outside of the JVM (rely on some C bindings). Tuning the overhead or disabling the scale-down-compensation.enabled could have helped for that particular event but this can leads to other issue as it could leads to too little HEAP size being computed. It would be interesting to be able to set a min memory.managed.size to be taken in account by the autotuning. What do you think about this? Do you think that some other specific config should have been applied to avoid this issue? Edit see this comment that leads to the metaspace issue: https://issues.apache.org/jira/browse/FLINK-35489?focusedCommentId=17850694&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17850694 was: We have enable the autotuning feature on one of our flink job with below config {code:java} # Autoscaler configuration job.autoscaler.enabled: "true" job.autoscaler.stabilization.interval: 1m job.autoscaler.metrics.window: 10m job.autoscaler.target.utilization: "0.8" job.autoscaler.target.utilization.boundary: "0.1" job.autoscaler.restart.time: 2m job.autoscaler.catch-up.duration: 10m job.autoscaler.memory.tuning.enabled: true job.autoscaler.memory.tuning.overhead: 0.5 job.autoscaler.memory.tuning.maximize-managed-memory: true{code} During a scale down the autotuning decided to give all the memory to to JVM (having heap being scale by 2) settting taskmanager.memory.managed.size to 0b. 
Here is the config that was compute by the autotuning for a TM running on a 4GB pod: {code:java} taskmanager.memory.network.max: 4063232b taskmanager.memory.network.min: 4063232b taskmanager.memory.jvm-overhead.max: 433791712b taskmanager.memory.task.heap.size: 3699934605b taskmanager.memory.framework.off-heap.size: 134217728b taskmanager.memory.jvm-metaspace.size: 22960020b taskmanager.memory.framework.heap.size: "0 bytes" taskmanager.memory.flink.size: 3838215565b taskmanager.memory.managed.size: 0b {code} This has lead to some issue starting the TM because we are relying on some javaagent performing some memory allocation outside of the JVM (rely on some C bindings). Tuning the overhead or disabling the scale-down-compensation.enabled could have helped for that particular event but this can leads to other issue as it could leads to too little HEAP size being computed. It would be interesting to be able to set a min memory.managed.size to be taken in account by the autotuning. What do you think about this? Do you think that some other specific config should have been applied to avoid this issue? Edit > Metaspace size can be too little after autotuning change memory setting > --- > > Key: FLINK-35489 > URL: https://issues.apache.org/jira/browse/FLINK-35489 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Affects Versions: 1.8.0 >Reporter: Nicolas Fraison >Priority: Major > > We have enable the autotuning feature on one of our flink job with below > config > {code:java} > # Autoscaler configuration > job.autoscaler.enabled: "true" > job.autoscaler.stabilization.interval: 1m > job.autoscaler.metrics.window: 10m > job.autoscaler.target.utilization: "0.8" > job.autoscaler.target.utilization.boundary: "0.1" > job.autoscaler.res
[jira] [Comment Edited] (FLINK-35489) Metaspace size can be too little after autotuning change memory setting
[ https://issues.apache.org/jira/browse/FLINK-35489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850694#comment-17850694 ] Nicolas Fraison edited comment on FLINK-35489 at 5/30/24 12:07 PM: --- Thks [~fanrui] for the feedback it help me realise that my analysis was wrong. The issue we are facing is the JVM crashing after the [autotuning|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autotuning/] change some memory config: {code:java} Starting kubernetes-taskmanager as a console application on host flink-kafka-job-apache-right-taskmanager-1-1. Exception in thread "main" *** java.lang.instrument ASSERTION FAILED ***: "result" with message agent load/premain call failed at src/java.instrument/share/native/libinstrument/JPLISAgent.c line: 422 FATAL ERROR in native method: processing of -javaagent failed, processJavaStart failed Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x78dee4] jni_FatalError+0x70 V [libjvm.so+0x88df00] JvmtiExport::post_vm_initialized()+0x240 V [libjvm.so+0xc353fc] Threads::create_vm(JavaVMInitArgs*, bool*)+0x7ac V [libjvm.so+0x79c05c] JNI_CreateJavaVM+0x7c C [libjli.so+0x3b2c] JavaMain+0x7c C [libjli.so+0x7fdc] ThreadJavaMain+0xc C [libpthread.so.0+0x7624] start_thread+0x184 {code} Seeing this big increase of HEAP (from 1.5 to more than 3GB and the fact that the memory.managed.size was shrink to 0b make me thing that it was linked to missing off heap. But you are right that jvm-overhead already reserved some memory for the off heap (and we indeed have around 400 MB with that config) So looking back to the new config I've identified the issue which is on the jvm-metaspace having been shrink to 22MB while it was set at 256MB. I've done a test increasing this parameter and the TM is now able to start. For the meta space computation size I can see the autotuning computing METASPACE_MEMORY_USED=1.41521584E8 which seems to be appropriate metaspace sizing. But due to the the memBudget management it ends up setting only 22MB to the metaspace ([first allocate remaining memory to the heap and then this new remaining to metaspace and finally to managed memory|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/tuning/MemoryTuning.java#L130]) was (Author: JIRAUSER299678): Thks [~fanrui] for the feedback it help me realise that my analysis was wrong. The issue we are facing ifs the JVM crashing after the [autotuning|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autotuning/] change some memory config: {code:java} Starting kubernetes-taskmanager as a console application on host flink-kafka-job-apache-right-taskmanager-1-1. 
Exception in thread "main" *** java.lang.instrument ASSERTION FAILED ***: "result" with message agent load/premain call failed at src/java.instrument/share/native/libinstrument/JPLISAgent.c line: 422 FATAL ERROR in native method: processing of -javaagent failed, processJavaStart failed Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x78dee4] jni_FatalError+0x70 V [libjvm.so+0x88df00] JvmtiExport::post_vm_initialized()+0x240 V [libjvm.so+0xc353fc] Threads::create_vm(JavaVMInitArgs*, bool*)+0x7ac V [libjvm.so+0x79c05c] JNI_CreateJavaVM+0x7c C [libjli.so+0x3b2c] JavaMain+0x7c C [libjli.so+0x7fdc] ThreadJavaMain+0xc C [libpthread.so.0+0x7624] start_thread+0x184 {code} Seeing this big increase of HEAP (from 1.5 to more than 3GB and the fact that the memory.managed.size was shrink to 0b make me thing that it was linked to missing off heap. But you are right that jvm-overhead already reserved some memory for the off heap (and we indeed have around 400 MB with that config) So looking back to the new config I've identified the issue which is on the jvm-metaspace having been shrink to 22MB while it was set at 256MB. I've done a test increasing this parameter and the TM is now able to start. For the meta space computation size I can see the autotuning computing METASPACE_MEMORY_USED=1.41521584E8 which seems to be appropriate metaspace sizing. But due to the the memBudget management it ends up setting only 22MB to the metaspace ([first allocate remaining memory to the heap and then this new remaining to metaspace and finally to managed memory|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/tuning/MemoryTuning.java#L130]) > Metaspace size can be too little after autotuning change memory setting > --- > > Key: FLINK-35489 > URL: h
[jira] [Commented] (FLINK-35489) Metaspace size can be too little after autotuning change memory setting
[ https://issues.apache.org/jira/browse/FLINK-35489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850953#comment-17850953 ] Nicolas Fraison commented on FLINK-35489: - [~mxm] what do you think about providing memory to the METASPACE_MEMORY before the HEAP_MEMORY (switching those lines [https://github.com/apache/flink-kubernetes-operator/blob/main/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/tuning/MemoryTuning.java#L128-L131)] And also ensuring that the METASPACE_MEMORY computed will never be bigger than the one assigned by default (from config taskmanager.memory.jvm-metaspace.size) Looks to me that this space should not grow with some load change and the default one must be sufficiently big to have the TaskManager running fine. The autotuning should only scale down this memory space depending on the usage, no? > Metaspace size can be too little after autotuning change memory setting > --- > > Key: FLINK-35489 > URL: https://issues.apache.org/jira/browse/FLINK-35489 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Affects Versions: 1.8.0 >Reporter: Nicolas Fraison >Priority: Major > > We have enable the autotuning feature on one of our flink job with below > config > {code:java} > # Autoscaler configuration > job.autoscaler.enabled: "true" > job.autoscaler.stabilization.interval: 1m > job.autoscaler.metrics.window: 10m > job.autoscaler.target.utilization: "0.8" > job.autoscaler.target.utilization.boundary: "0.1" > job.autoscaler.restart.time: 2m > job.autoscaler.catch-up.duration: 10m > job.autoscaler.memory.tuning.enabled: true > job.autoscaler.memory.tuning.overhead: 0.5 > job.autoscaler.memory.tuning.maximize-managed-memory: true{code} > During a scale down the autotuning decided to give all the memory to to JVM > (having heap being scale by 2) settting taskmanager.memory.managed.size to 0b. > Here is the config that was compute by the autotuning for a TM running on a > 4GB pod: > {code:java} > taskmanager.memory.network.max: 4063232b > taskmanager.memory.network.min: 4063232b > taskmanager.memory.jvm-overhead.max: 433791712b > taskmanager.memory.task.heap.size: 3699934605b > taskmanager.memory.framework.off-heap.size: 134217728b > taskmanager.memory.jvm-metaspace.size: 22960020b > taskmanager.memory.framework.heap.size: "0 bytes" > taskmanager.memory.flink.size: 3838215565b > taskmanager.memory.managed.size: 0b {code} > This has lead to some issue starting the TM because we are relying on some > javaagent performing some memory allocation outside of the JVM (rely on some > C bindings). > Tuning the overhead or disabling the scale-down-compensation.enabled could > have helped for that particular event but this can leads to other issue as it > could leads to too little HEAP size being computed. > It would be interesting to be able to set a min memory.managed.size to be > taken in account by the autotuning. > What do you think about this? Do you think that some other specific config > should have been applied to avoid this issue? > > Edit see this comment that leads to the metaspace issue: > https://issues.apache.org/jira/browse/FLINK-35489?focusedCommentId=17850694&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17850694 -- This message was sent by Atlassian Jira (v8.20.10#820010)
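A rough sketch of the reordering proposed in that comment, with placeholder names and no claim to match the actual MemoryTuning implementation: the metaspace is served before the heap and never grows beyond the configured default.

{code:java}
/**
 * Sketch of the proposal above (placeholder names, not the MemoryTuning code):
 * serve the metaspace before the heap and cap it at the user-configured
 * taskmanager.memory.jvm-metaspace.size, so it can only shrink, never grow.
 */
final class MetaspaceFirstBudgetSketch {
    static long[] allocate(long remainingBudgetBytes,
                           long measuredMetaspaceBytes,
                           long configuredMetaspaceBytes,
                           long heapTargetBytes) {
        long metaspace = Math.min(Math.min(measuredMetaspaceBytes, configuredMetaspaceBytes),
                remainingBudgetBytes);
        remainingBudgetBytes -= metaspace;
        long heap = Math.min(heapTargetBytes, remainingBudgetBytes);
        remainingBudgetBytes -= heap;
        long managed = remainingBudgetBytes; // whatever is left over
        return new long[] {metaspace, heap, managed};
    }
}
{code}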
[jira] [Created] (FLINK-33125) Upgrade JOSDK to 4.4.4
Nicolas Fraison created FLINK-33125: --- Summary: Upgrade JOSDK to 4.4.4 Key: FLINK-33125 URL: https://issues.apache.org/jira/browse/FLINK-33125 Project: Flink Issue Type: Bug Components: Kubernetes Operator Reporter: Nicolas Fraison Fix For: kubernetes-operator-1.7.0 JOSDK [4.4.4|https://github.com/operator-framework/java-operator-sdk/releases/tag/v4.4.4] contains a fix for a leader election issue we face in our environment. More information on the [issue|https://github.com/operator-framework/java-operator-sdk/issues/2056] we faced is available there. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33222) Operator rollback app when it should not
Nicolas Fraison created FLINK-33222: --- Summary: Operator rollback app when it should not Key: FLINK-33222 URL: https://issues.apache.org/jira/browse/FLINK-33222 Project: Flink Issue Type: Bug Components: Kubernetes Operator Environment: Flink operator 1.6 - Flink 1.17.1 Reporter: Nicolas Fraison The operator can decide to roll back when an update of the job spec only touches savepointTriggerNonce or initialSavepointPath, if the app has been deployed for more than KubernetesOperatorConfigOptions.DEPLOYMENT_READINESS_TIMEOUT. This is due to the objectmeta generation being [updated|https://github.com/apache/flink-kubernetes-operator/blob/release-1.6/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L169] when changing those fields, which leads to the lastReconcileSpec not being aligned with the stableReconcileSpec even though those fields are ignored when computing the upgrade diff. Looking at the main branch, we should still face the same issue, as the same [update|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L169] is performed at the end of the reconcile loop. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33222) Operator rollback app when it should not
[ https://issues.apache.org/jira/browse/FLINK-33222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773606#comment-17773606 ] Nicolas Fraison commented on FLINK-33222: - So I was wrong: the release 1.7-snapshot is not affected by this bug, thanks to the [https://github.com/apache/flink-kubernetes-operator/pull/681] patch. Indeed, when deploying the app with an {{initialSavepointPath}}:
* lastReconciledSpec gets the generation update from N to N+1 while the stable spec generation stays at N, but no rollback is detected because the [update|https://github.com/apache/flink-kubernetes-operator/pull/681/files#diff-29ea38a50cac5b4432dd0969bc3e2177e29a5507f8c7bb01b80f605a8740de41R169] is done after the [rollback|https://github.com/apache/flink-kubernetes-operator/pull/681/files#diff-29ea38a50cac5b4432dd0969bc3e2177e29a5507f8c7bb01b80f605a8740de41R146] check, so the deployment is considered DEPLOYED
* then, on the second reconcile loop, the stable spec generation is also updated from N to N+1 (in [patchAndCacheStatus|https://github.com/apache/flink-kubernetes-operator/blob/release-1.6/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java#L135]) and the deployment is considered STABLE

But this looks quite brittle to me, as just changing the position of shouldRollBack or ReconciliationUtils.updateReconciliationMetadata could lead to that bad behaviour again. I'm wondering if we could take the generation field into account in [isLastReconciledSpecStable|https://github.com/apache/flink-kubernetes-operator/blob/release-1.6/flink-kubernetes-operator-api/src/main/java/org/apache/flink/kubernetes/operator/api/status/ReconciliationStatus.java#L91]
> Operator rollback app when it should not > > > Key: FLINK-33222 > URL: https://issues.apache.org/jira/browse/FLINK-33222 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator > Environment: Flink operator 1.6 - Flink 1.17.1 >Reporter: Nicolas Fraison >Priority: Major > > The operator can decide to rollback when an update of the job spec is > performed on > savepointTriggerNonce or initialSavepointPath if the app has been deployed > since more than KubernetesOperatorConfigOptions.DEPLOYMENT_READINESS_TIMEOUT. > > This is due to the objectmeta generation being > [updated|https://github.com/apache/flink-kubernetes-operator/blob/release-1.6/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L169] > when changing those spec and leading to the lastReconcileSpec not being > aligned with the stableReconcileSpec while those spec are well ignored when > checking for upgrade diff > > Looking at the main branch we should still face the same issue as the same > [update|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L169] > is performed at the end of the reconcile loop -- This message was sent by Atlassian Jira (v8.20.10#820010)
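A rough sketch of that last suggestion, under the assumption that the reconciliation status tracks the observed generation alongside the serialized specs; the field names below are placeholders, not the actual ReconciliationStatus API:

{code:java}
import java.util.Objects;

/**
 * Sketch of the suggestion above: consider the spec stable only when both the
 * serialized spec and the recorded generation match the stable ones. All field
 * names are placeholders, not the actual ReconciliationStatus API.
 */
final class ReconciliationStatusSketch {
    String lastReconciledSpec;
    Long lastReconciledGeneration;
    String lastStableSpec;
    Long lastStableGeneration;

    boolean isLastReconciledSpecStable() {
        return Objects.equals(lastReconciledSpec, lastStableSpec)
                && Objects.equals(lastReconciledGeneration, lastStableGeneration);
    }
}
{code}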
[jira] [Created] (FLINK-32551) Provide the possibility to have a savepoint taken by the operator when deleting a flinkdeployment
Nicolas Fraison created FLINK-32551: --- Summary: Provide the possibility to have a savepoint taken by the operator when deleting a flinkdeployment Key: FLINK-32551 URL: https://issues.apache.org/jira/browse/FLINK-32551 Project: Flink Issue Type: Improvement Reporter: Nicolas Fraison Currently, if a flinkdeployment is deleted, all the HA metadata is removed and no checkpoint is taken. It would be great (for example, in case of a fat-finger deletion) to be able to configure the deployment so that a savepoint is taken when the deployment is deleted. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-32551) Provide the possibility to have a savepoint taken by the operator when deleting a flinkdeployment
[ https://issues.apache.org/jira/browse/FLINK-32551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Fraison updated FLINK-32551: Component/s: Kubernetes Operator > Provide the possibility to have a savepoint taken by the operator when > deleting a flinkdeployment > - > > Key: FLINK-32551 > URL: https://issues.apache.org/jira/browse/FLINK-32551 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Nicolas Fraison >Priority: Major > > Currently if a flinkdeployment is deleted all the HA metadata is removed and > no checkpoint is taken. > It would be great (for ex. in case of fat finger) to be able to configure > deployment in order to take a savepoint if the deployment is deleted -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-32551) Provide the possibility to take a savepoint when deleting a flinkdeployment
[ https://issues.apache.org/jira/browse/FLINK-32551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Fraison updated FLINK-32551: Summary: Provide the possibility to take a savepoint when deleting a flinkdeployment (was: Provide the possibility to have a savepoint taken by the operator when deleting a flinkdeployment) > Provide the possibility to take a savepoint when deleting a flinkdeployment > --- > > Key: FLINK-32551 > URL: https://issues.apache.org/jira/browse/FLINK-32551 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Nicolas Fraison >Priority: Major > > Currently if a flinkdeployment is deleted all the HA metadata is removed and > no checkpoint is taken. > It would be great (for ex. in case of fat finger) to be able to configure > deployment in order to take a savepoint if the deployment is deleted -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-32551) Provide the possibility to take a savepoint when deleting a flinkdeployment
[ https://issues.apache.org/jira/browse/FLINK-32551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Fraison updated FLINK-32551: Description: Currently if a flinkdeployment is deleted all the HA metadata is removed and no savepoint is taken. It would be great (for ex. in case of fat finger) to be able to configure deployment in order to take a savepoint if the deployment is deleted was: Currently if a flinkdeployment is deleted all the HA metadata is removed and no checkpoint is taken. It would be great (for ex. in case of fat finger) to be able to configure deployment in order to take a savepoint if the deployment is deleted > Provide the possibility to take a savepoint when deleting a flinkdeployment > --- > > Key: FLINK-32551 > URL: https://issues.apache.org/jira/browse/FLINK-32551 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Nicolas Fraison >Priority: Major > > Currently if a flinkdeployment is deleted all the HA metadata is removed and > no savepoint is taken. > It would be great (for ex. in case of fat finger) to be able to configure > deployment in order to take a savepoint if the deployment is deleted -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32551) Provide the possibility to take a savepoint when deleting a flinkdeployment
[ https://issues.apache.org/jira/browse/FLINK-32551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17740556#comment-17740556 ] Nicolas Fraison commented on FLINK-32551: - It is probably not perfect, but if the user selects this option it probably means they really want it, or want to manage it themselves if taking the savepoint before deleting the app runs into issues. So I would fail the deletion if the savepoint fails, and retry. > Provide the possibility to take a savepoint when deleting a flinkdeployment > --- > > Key: FLINK-32551 > URL: https://issues.apache.org/jira/browse/FLINK-32551 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Nicolas Fraison >Priority: Major > > Currently if a flinkdeployment is deleted all the HA metadata is removed and > no savepoint is taken. > It would be great (for ex. in case of fat finger) to be able to configure > deployment in order to take a savepoint if the deployment is deleted -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32551) Provide the possibility to take a savepoint when deleting a flinkdeployment
[ https://issues.apache.org/jira/browse/FLINK-32551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17740561#comment-17740561 ] Nicolas Fraison commented on FLINK-32551: - Yes, in case of accidental or intentional deletion we could rely on those actions. Still, I would prefer not to require manual actions from the user in such cases, or to have users search for the last checkpoint after an accidental deletion (instead they could just look at a log line indicating the path of the savepoint taken during job cancellation). It also looks safer to me. > Provide the possibility to take a savepoint when deleting a flinkdeployment > --- > > Key: FLINK-32551 > URL: https://issues.apache.org/jira/browse/FLINK-32551 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Nicolas Fraison >Priority: Major > > Currently if a flinkdeployment is deleted all the HA metadata is removed and > no savepoint is taken. > It would be great (for ex. in case of fat finger) to be able to configure > deployment in order to take a savepoint if the deployment is deleted -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32551) Provide the possibility to take a savepoint when deleting a flinkdeployment
[ https://issues.apache.org/jira/browse/FLINK-32551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17740564#comment-17740564 ] Nicolas Fraison commented on FLINK-32551: - Yes > Provide the possibility to take a savepoint when deleting a flinkdeployment > --- > > Key: FLINK-32551 > URL: https://issues.apache.org/jira/browse/FLINK-32551 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Nicolas Fraison >Priority: Major > > Currently if a flinkdeployment is deleted all the HA metadata is removed and > no savepoint is taken. > It would be great (for ex. in case of fat finger) to be able to configure > deployment in order to take a savepoint if the deployment is deleted -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32012) Operator failed to rollback due to missing HA metadata
[ https://issues.apache.org/jira/browse/FLINK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723156#comment-17723156 ] Nicolas Fraison commented on FLINK-32012: - I'd like to have your feeling on the first approach I'd like to take to merge upgrade/rollback.
Currently, to do a rollback we:
* First check if the spec changed
* If not, check if we should roll back
* If yes, initiate the rollback, where we set the status to ROLLING_BACK and wait for the next reconcile loop
* The next reconcile loop does the same checks
* As it should roll back and the state is already up to date, it rolls back relying on the last stable spec but does not update the spec
The idea would be to really roll back the spec, so that both mechanisms rely on reconcileSpecChange (sketched after this message):
* Still keep the check if the spec changed
* Then check if we should roll back
* When initiating the rollback, set the status to ROLLING_BACK, restore the last stable spec and wait for the next reconcile loop
* The next reconcile loop will check if the spec changed
* It will be the case, so it will run reconcileSpecChange, but with status ROLLING_BACK, so it can then apply the rollback specificities
It would also simplify the operations that need to reapply the working spec, like [resubmitJob|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractJobReconciler.java#L309].
Still, I'm wondering if we will then really be able to simplify things, as from your [comment|https://github.com/apache/flink-kubernetes-operator/pull/590#discussion_r1189847280] the rollback doesn't seem to be aligned with the upgrade. For example, for savepoint upgrade mode:
* upgrade
** delete HA metadata
** restore from savepoint
* rollback
** keep HA metadata if it exists
** restore from it if it exists
** restore from the savepoint if it does not exist and the JM pod never started
> Operator failed to rollback due to missing HA metadata > -- > > Key: FLINK-32012 > URL: https://issues.apache.org/jira/browse/FLINK-32012 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.4.0 >Reporter: Nicolas Fraison >Priority: Major > Labels: pull-request-available > > The operator has well detected that the job was failing and initiate the > rollback but this rollback has failed due to `Rollback is not possible due to > missing HA metadata` > We are relying on saevpoint upgrade mode and zookeeper HA. 
> The operator is performing a set of action to also delete this HA data in > savepoint upgrade mode: > * [flink-kubernetes-operator/AbstractFlinkService.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L346] > : Suspend job with savepoint and deleteClusterDeployment > * [flink-kubernetes-operator/StandaloneFlinkService.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L158] > : Remove JM + TM deployment and delete HA data > * [flink-kubernetes-operator/AbstractFlinkService.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L1008] > : Wait cluster shutdown and delete zookeeper HA data > * [flink-kubernetes-operator/FlinkUtils.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L155] > : Remove all child znode > Then when running rollback the operator is looking for HA data even if we > rely on sevepoint upgrade mode: > * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L164] > Perform reconcile of rollback if it should rollback > * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L387] > Rollback failed as HA data is not available > * [flink-kubernetes-operator/FlinkUtils.jav
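A hedged sketch of the reconcile flow proposed in the comment above (the types below are illustrative stand-ins, not the operator's actual classes): initiating a rollback restores the last stable spec, so that the next reconcile loop handles it through the same reconcileSpecChange path as an upgrade, with the ROLLING_BACK state carrying the rollback-specific behaviour.
{code:java}
// Illustrative sketch only; not the operator's actual reconciler classes.
public class RollbackViaSpecChangeSketch {

    enum State { STABLE, ROLLING_BACK }

    static class Resource {
        String spec;               // desired spec (replaced by lastStableSpec on rollback)
        String lastReconciledSpec; // what is currently deployed
        String lastStableSpec;     // last spec that was marked stable
        State state = State.STABLE;
    }

    void reconcile(Resource r) {
        if (!r.spec.equals(r.lastReconciledSpec)) {
            // Upgrades and rollbacks both go through the same path; the
            // ROLLING_BACK state tells reconcileSpecChange which one it is.
            reconcileSpecChange(r, r.state == State.ROLLING_BACK);
            return;
        }
        if (shouldRollback(r)) {
            // Initiate the rollback: restore the last stable spec so that the
            // next reconcile loop simply sees a spec change.
            r.state = State.ROLLING_BACK;
            r.spec = r.lastStableSpec;
        }
    }

    void reconcileSpecChange(Resource r, boolean rollingBack) {
        // Deploy r.spec; apply rollback specificities (e.g. which state to
        // restore from) when rollingBack is true.
        r.lastReconciledSpec = r.spec;
    }

    boolean shouldRollback(Resource r) {
        // Placeholder for the operator's health / stability checks.
        return false;
    }
}
{code}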
[jira] [Commented] (FLINK-32012) Operator failed to rollback due to missing HA metadata
[ https://issues.apache.org/jira/browse/FLINK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17725413#comment-17725413 ] Nicolas Fraison commented on FLINK-32012: - [~gyfora], I started an implementation which does not switch the upgradeMode to last-state, as [restoreJob|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractJobReconciler.java#L143] for the LAST_STATE upgrade mode enforces the requirement for HA metadata, which is not what we want when relying on SAVEPOINT. Also, when restoring the last stable spec there is a case where the UpgradeMode is set to STATELESS in that spec even though the chosen mode is SAVEPOINT ([updateStatusBeforeFirstDeployment|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L193]). To avoid restoring this bad state, which would lead to the rollback not taking the savepoint into account, I enforce the upgrade mode of the restored spec to be the one currently set on the job. But I'm wondering why we decided not to persist the actually used upgrade mode in the last stable spec for the first deployment? Here is the diff of my [WIP|https://github.com/apache/flink-kubernetes-operator/compare/main...ashangit:flink-kubernetes-operator:nfraison/FLINK-32012?expand=1] if the approach is not clear (not meant for review...) > Operator failed to rollback due to missing HA metadata > -- > > Key: FLINK-32012 > URL: https://issues.apache.org/jira/browse/FLINK-32012 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.4.0 >Reporter: Nicolas Fraison >Priority: Major > Labels: pull-request-available > > The operator has well detected that the job was failing and initiate the > rollback but this rollback has failed due to `Rollback is not possible due to > missing HA metadata` > We are relying on saevpoint upgrade mode and zookeeper HA. 
> The operator is performing a set of action to also delete this HA data in > savepoint upgrade mode: > * [flink-kubernetes-operator/AbstractFlinkService.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L346] > : Suspend job with savepoint and deleteClusterDeployment > * [flink-kubernetes-operator/StandaloneFlinkService.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L158] > : Remove JM + TM deployment and delete HA data > * [flink-kubernetes-operator/AbstractFlinkService.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L1008] > : Wait cluster shutdown and delete zookeeper HA data > * [flink-kubernetes-operator/FlinkUtils.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L155] > : Remove all child znode > Then when running rollback the operator is looking for HA data even if we > rely on sevepoint upgrade mode: > * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L164] > Perform reconcile of rollback if it should rollback > * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L387] > Rollback failed as HA data is not available > * [flink-kubernetes-operator/FlinkUtils.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L220] > Check if some child znodes are available > For both step the pattern looks to be the same for kubernetes HA so it > doesn't loo
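A hedged sketch of the upgrade-mode override described in the comment above (illustrative types only, not the operator's API): before rolling back to the last stable spec, its upgradeMode is forced to the mode currently configured on the job, so that a STATELESS value recorded at first deployment does not make the rollback ignore the savepoint.
{code:java}
// Illustrative sketch only; not the operator's actual spec classes.
public class RestoredSpecUpgradeModeSketch {

    enum UpgradeMode { STATELESS, SAVEPOINT, LAST_STATE }

    static class JobSpec {
        UpgradeMode upgradeMode;
        JobSpec(UpgradeMode upgradeMode) { this.upgradeMode = upgradeMode; }
    }

    /** Prepare the spec used for the rollback. */
    static JobSpec prepareRollbackSpec(JobSpec lastStableSpec, JobSpec currentSpec) {
        // The persisted "last stable" spec may still say STATELESS if it was
        // recorded before the first deployment; trust the mode the job
        // currently runs with instead.
        lastStableSpec.upgradeMode = currentSpec.upgradeMode;
        return lastStableSpec;
    }

    public static void main(String[] args) {
        JobSpec stable = new JobSpec(UpgradeMode.STATELESS);
        JobSpec current = new JobSpec(UpgradeMode.SAVEPOINT);
        System.out.println(prepareRollbackSpec(stable, current).upgradeMode); // SAVEPOINT
    }
}
{code}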
[jira] [Commented] (FLINK-32012) Operator failed to rollback due to missing HA metadata
[ https://issues.apache.org/jira/browse/FLINK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17725450#comment-17725450 ] Nicolas Fraison commented on FLINK-32012: - But the whole point of this request was the fact that when JobManager failed to start the HA metadata was not available during RollBack while the savepoint taken for the upgrade is available. So relying on SAVEPOINT for rollback will ensure that Flink is aware of the availability of a savepoint. >From my understanding of >[tryRestoreExecutionGraphFromSavepoint|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultExecutionGraphFactory.java#L198] > Flink will rely on the saveoint if no HA metadata exist otherwise it will >load the checkpoint > Operator failed to rollback due to missing HA metadata > -- > > Key: FLINK-32012 > URL: https://issues.apache.org/jira/browse/FLINK-32012 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.4.0 >Reporter: Nicolas Fraison >Priority: Major > Labels: pull-request-available > > The operator has well detected that the job was failing and initiate the > rollback but this rollback has failed due to `Rollback is not possible due to > missing HA metadata` > We are relying on saevpoint upgrade mode and zookeeper HA. > The operator is performing a set of action to also delete this HA data in > savepoint upgrade mode: > * [flink-kubernetes-operator/AbstractFlinkService.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L346] > : Suspend job with savepoint and deleteClusterDeployment > * [flink-kubernetes-operator/StandaloneFlinkService.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L158] > : Remove JM + TM deployment and delete HA data > * [flink-kubernetes-operator/AbstractFlinkService.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L1008] > : Wait cluster shutdown and delete zookeeper HA data > * [flink-kubernetes-operator/FlinkUtils.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L155] > : Remove all child znode > Then when running rollback the operator is looking for HA data even if we > rely on sevepoint upgrade mode: > * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L164] > Perform reconcile of rollback if it should rollback > * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · > 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L387] > Rollback failed as HA data is not available > * [flink-kubernetes-operator/FlinkUtils.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L220] > Check if some child znodes are available > For both step the pattern looks to be the same for kubernetes HA so it > doesn't looks to be linked to a bug with zookeeper. > > From https://issues.apache.org/jira/browse/FLINK-30305 it looks to be > expected that the HA data has been deleted (as it is also performed by flink > when relying on savepoint upgrade mode). > Still the use case seems to differ from > https://issues.apache.org/jira/browse/FLINK-30305 as the operator is aware of > the failure and treat a specific rollback event. > So I'm wondering why we enforce such a check when performing rollback if we > rely on savepoint upgrade mode. Would it be fine to not rely on the HA data > and rollback from the last savepoint (the one we used in the deployment step)? -- This message was sent by Atlassian Jira (v8.20.10#820010)
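To make the restore ordering referenced above concrete, here is a rough, hedged illustration (not the actual Flink code in DefaultExecutionGraphFactory): a retained checkpoint found through the HA-backed checkpoint store takes precedence, and the savepoint from the restore settings is only used when the store is empty.
{code:java}
import java.util.List;

// Rough illustration of the restore precedence; not the actual Flink code.
public class RestoreOrderingSketch {

    static String pickRestorePath(List<String> checkpointsInHaStore, String savepointFromSettings) {
        if (!checkpointsInHaStore.isEmpty()) {
            // HA metadata present: the latest retained checkpoint wins.
            return checkpointsInHaStore.get(checkpointsInHaStore.size() - 1);
        }
        // No HA metadata: fall back to the savepoint given in the restore settings.
        return savepointFromSettings;
    }

    public static void main(String[] args) {
        System.out.println(pickRestorePath(List.of(), "s3://bucket/savepoints/savepoint-1"));
        System.out.println(pickRestorePath(List.of("chk-18", "chk-19"), "s3://bucket/savepoints/savepoint-1"));
    }
}
{code}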
[jira] [Comment Edited] (FLINK-32012) Operator failed to rollback due to missing HA metadata
[ https://issues.apache.org/jira/browse/FLINK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17725450#comment-17725450 ] Nicolas Fraison edited comment on FLINK-32012 at 5/23/23 3:05 PM: -- But the whole point of this request was the fact that when JobManager failed to start the HA metadata was not available during RollBack while the savepoint taken for the upgrade is available. So relying on SAVEPOINT for rollback will ensure that Flink is aware of the availability of a savepoint. >From my understanding of >[tryRestoreExecutionGraphFromSavepoint|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultExecutionGraphFactory.java#L198] > Flink will rely on the saveoint if no HA metadata exist otherwise it will >load the checkpoint Forgot to mention it but indeed when calling [cancelJob|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L353%5D] it fallback to LAST_STATE in order to ensure no savepoint is created and HA metadata is not deleted was (Author: JIRAUSER299678): But the whole point of this request was the fact that when JobManager failed to start the HA metadata was not available during RollBack while the savepoint taken for the upgrade is available. So relying on SAVEPOINT for rollback will ensure that Flink is aware of the availability of a savepoint. >From my understanding of >[tryRestoreExecutionGraphFromSavepoint|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultExecutionGraphFactory.java#L198] > Flink will rely on the saveoint if no HA metadata exist otherwise it will >load the checkpoint > Operator failed to rollback due to missing HA metadata > -- > > Key: FLINK-32012 > URL: https://issues.apache.org/jira/browse/FLINK-32012 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.4.0 >Reporter: Nicolas Fraison >Priority: Major > Labels: pull-request-available > > The operator has well detected that the job was failing and initiate the > rollback but this rollback has failed due to `Rollback is not possible due to > missing HA metadata` > We are relying on saevpoint upgrade mode and zookeeper HA. 
> The operator is performing a set of action to also delete this HA data in > savepoint upgrade mode: > * [flink-kubernetes-operator/AbstractFlinkService.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L346] > : Suspend job with savepoint and deleteClusterDeployment > * [flink-kubernetes-operator/StandaloneFlinkService.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L158] > : Remove JM + TM deployment and delete HA data > * [flink-kubernetes-operator/AbstractFlinkService.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L1008] > : Wait cluster shutdown and delete zookeeper HA data > * [flink-kubernetes-operator/FlinkUtils.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L155] > : Remove all child znode > Then when running rollback the operator is looking for HA data even if we > rely on sevepoint upgrade mode: > * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L164] > Perform reconcile of rollback if it should rollback > * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L387] > Rollback failed as HA data is not available > * [flink-kubernetes-operator/FlinkUtils.java at main · > apache/flink-kubernet
[jira] [Created] (FLINK-32334) Operator failed to create taskmanager deployment because it already exist
Nicolas Fraison created FLINK-32334: --- Summary: Operator failed to create taskmanager deployment because it already exist Key: FLINK-32334 URL: https://issues.apache.org/jira/browse/FLINK-32334 Project: Flink Issue Type: Bug Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.5.0 Reporter: Nicolas Fraison During a job upgrade the operator has failed to start the new job because it has failed to create the taskmanager deployment: {code:java} Jun 12 19:45:28.115 >>> Status | Error | UPGRADING | {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.client.deployment.ClusterDeploymentException: Could not create Kubernetes cluster \"flink-metering\".","throwableList":[{"type":"org.apache.flink.client.deployment.ClusterDeploymentException","message":"Could not create Kubernetes cluster \"flink-metering\"."},{"type":"org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException","message":"Failure executing: POST at: https://10.129.144.1/apis/apps/v1/namespaces/metering/deployments. Message: object is being deleted: deployments.apps \"flink-metering-taskmanager\" already exists. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=apps, kind=deployments, name=flink-metering-taskmanager, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=object is being deleted: deployments.apps \"flink-metering-taskmanager\" already exists, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=AlreadyExists, status=Failure, additionalProperties={})."}]} {code} As indicated in the error log this is due to taskmanger deployment still existing while it is under deletion. Looking at the [source code|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L150] we are well relying on FOREGROUND policy by default. Still it seems that the REST API call to delete only wait until the resource has been modified and the {{deletionTimestamp}} has been added to the metadata: [ensure delete returns only when the delete operation is fully finished - Issue #3246 - fabric8io/kubernetes-client|https://github.com/fabric8io/kubernetes-client/issues/3246#issuecomment-874019899] So we could face this issue if the k8s cluster is slow to "really" delete the deployment -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32334) Operator failed to create taskmanager deployment because it already exist
[ https://issues.apache.org/jira/browse/FLINK-32334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732431#comment-17732431 ] Nicolas Fraison commented on FLINK-32334: - Adding a check that object has been well deleted should be enough for this issue {code:java} kubernetesClient .apps() .deployments() .inNamespace(namespace) .withName(StandaloneKubernetesUtils.getTaskManagerDeploymentName(clusterId)) .waitUntilCondition(Objects::isNull, 30, TimeUnit.SECONDS); {code} > Operator failed to create taskmanager deployment because it already exist > - > > Key: FLINK-32334 > URL: https://issues.apache.org/jira/browse/FLINK-32334 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.5.0 >Reporter: Nicolas Fraison >Priority: Major > > During a job upgrade the operator has failed to start the new job because it > has failed to create the taskmanager deployment: > > {code:java} > Jun 12 19:45:28.115 >>> Status | Error | UPGRADING | > {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.client.deployment.ClusterDeploymentException: > Could not create Kubernetes cluster > \"flink-metering\".","throwableList":[{"type":"org.apache.flink.client.deployment.ClusterDeploymentException","message":"Could > not create Kubernetes cluster > \"flink-metering\"."},{"type":"org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException","message":"Failure > executing: POST at: > https://10.129.144.1/apis/apps/v1/namespaces/metering/deployments. Message: > object is being deleted: deployments.apps \"flink-metering-taskmanager\" > already exists. Received status: Status(apiVersion=v1, code=409, > details=StatusDetails(causes=[], group=apps, kind=deployments, > name=flink-metering-taskmanager, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=object is being deleted: > deployments.apps \"flink-metering-taskmanager\" already exists, > metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), > reason=AlreadyExists, status=Failure, additionalProperties={})."}]} {code} > As indicated in the error log this is due to taskmanger deployment still > existing while it is under deletion. > Looking at the [source > code|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L150] > we are well relying on FOREGROUND policy by default. > Still it seems that the REST API call to delete only wait until the resource > has been modified and the {{deletionTimestamp}} has been added to the > metadata: [ensure delete returns only when the delete operation is fully > finished - Issue #3246 - > fabric8io/kubernetes-client|https://github.com/fabric8io/kubernetes-client/issues/3246#issuecomment-874019899] > So we could face this issue if the k8s cluster is slow to "really" delete the > deployment > -- This message was sent by Atlassian Jira (v8.20.10#820010)
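For reference, a more self-contained version of the snippet above, with the imports it needs. The TaskManager deployment name is derived here with a simple assumption (clusterId + "-taskmanager") instead of the operator's StandaloneKubernetesUtils helper, and the unshaded fabric8 client is used for readability.
{code:java}
import java.util.Objects;
import java.util.concurrent.TimeUnit;

import io.fabric8.kubernetes.client.KubernetesClient;

// Expanded version of the waitUntilCondition suggestion: block until the
// TaskManager Deployment object is actually gone (the predicate receives null
// once the resource no longer exists) or the timeout expires.
public class WaitForTaskManagerDeletion {

    static void waitUntilTaskManagerDeleted(
            KubernetesClient kubernetesClient, String namespace, String clusterId) {
        // Assumed naming convention; the operator derives this via
        // StandaloneKubernetesUtils.getTaskManagerDeploymentName(clusterId).
        String tmDeploymentName = clusterId + "-taskmanager";
        kubernetesClient
                .apps()
                .deployments()
                .inNamespace(namespace)
                .withName(tmDeploymentName)
                .waitUntilCondition(Objects::isNull, 30, TimeUnit.SECONDS);
    }
}
{code}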
[jira] [Commented] (FLINK-32334) Operator failed to create taskmanager deployment because it already exist
[ https://issues.apache.org/jira/browse/FLINK-32334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732475#comment-17732475 ] Nicolas Fraison commented on FLINK-32334: - I can see in the log that we have wait once and then "found" the cluster to be shutdown {code:java} Jun 12 19:45:25.133 for cluster shutdown... Jun 12 19:45:27.192 shutdown completed. {code} I think the issue is due to the fact that we only look for JM pods not for TM one I will see to add a check for the TM in it (and only for standalone clusters) > Operator failed to create taskmanager deployment because it already exist > - > > Key: FLINK-32334 > URL: https://issues.apache.org/jira/browse/FLINK-32334 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.5.0 >Reporter: Nicolas Fraison >Priority: Major > > During a job upgrade the operator has failed to start the new job because it > has failed to create the taskmanager deployment: > > {code:java} > Jun 12 19:45:28.115 >>> Status | Error | UPGRADING | > {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.client.deployment.ClusterDeploymentException: > Could not create Kubernetes cluster > \"flink-metering\".","throwableList":[{"type":"org.apache.flink.client.deployment.ClusterDeploymentException","message":"Could > not create Kubernetes cluster > \"flink-metering\"."},{"type":"org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException","message":"Failure > executing: POST at: > https://10.129.144.1/apis/apps/v1/namespaces/metering/deployments. Message: > object is being deleted: deployments.apps \"flink-metering-taskmanager\" > already exists. Received status: Status(apiVersion=v1, code=409, > details=StatusDetails(causes=[], group=apps, kind=deployments, > name=flink-metering-taskmanager, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=object is being deleted: > deployments.apps \"flink-metering-taskmanager\" already exists, > metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), > reason=AlreadyExists, status=Failure, additionalProperties={})."}]} {code} > As indicated in the error log this is due to taskmanger deployment still > existing while it is under deletion. > Looking at the [source > code|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L150] > we are well relying on FOREGROUND policy by default. > Still it seems that the REST API call to delete only wait until the resource > has been modified and the {{deletionTimestamp}} has been added to the > metadata: [ensure delete returns only when the delete operation is fully > finished - Issue #3246 - > fabric8io/kubernetes-client|https://github.com/fabric8io/kubernetes-client/issues/3246#issuecomment-874019899] > So we could face this issue if the k8s cluster is slow to "really" delete the > deployment > -- This message was sent by Atlassian Jira (v8.20.10#820010)
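A hedged sketch of what extending the shutdown wait to standalone TaskManager deployments could look like (illustrative only, not the operator's actual waitForClusterShutdown; the deployment names are assumptions): the loop only returns once both the JobManager and the TaskManager deployments are gone.
{code:java}
import java.time.Duration;
import java.time.Instant;

import io.fabric8.kubernetes.client.KubernetesClient;

// Illustrative sketch only; deployment names are assumptions.
public class StandaloneShutdownWaitSketch {

    static boolean waitForShutdown(
            KubernetesClient client, String namespace, String clusterId, Duration timeout) {
        String jmDeployment = clusterId;                  // assumed JM deployment name
        String tmDeployment = clusterId + "-taskmanager"; // assumed TM deployment name
        Instant deadline = Instant.now().plus(timeout);
        while (Instant.now().isBefore(deadline)) {
            boolean jmGone =
                    client.apps().deployments().inNamespace(namespace).withName(jmDeployment).get() == null;
            boolean tmGone =
                    client.apps().deployments().inNamespace(namespace).withName(tmDeployment).get() == null;
            if (jmGone && tmGone) {
                return true; // both deployments fully removed
            }
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // timed out waiting for the shutdown
    }
}
{code}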
[jira] [Commented] (FLINK-30609) Add ephemeral storage to CRD
[ https://issues.apache.org/jira/browse/FLINK-30609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17708023#comment-17708023 ] Nicolas Fraison commented on FLINK-30609: - We are looking into using this feature and have some time to implement it. If it is fine with you, I can take it over. > Add ephemeral storage to CRD > > > Key: FLINK-30609 > URL: https://issues.apache.org/jira/browse/FLINK-30609 > Project: Flink > Issue Type: New Feature > Components: Kubernetes Operator >Reporter: Matyas Orhidi >Assignee: Zhenqiu Huang >Priority: Major > Labels: starter > Fix For: kubernetes-operator-1.5.0 > > > We should consider adding ephemeral storage to the existing [resource > specification|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/reference/#resource] in > CRD, next to {{cpu}} and {{memory}} > https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#setting-requests-and-limits-for-local-ephemeral-storage -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-31812) SavePoint from /jars/:jarid:/run api on body is not anymore emptyToNull if empty
Nicolas Fraison created FLINK-31812: --- Summary: SavePoint from /jars/:jarid:/run api on body is not anymore emptyToNull if empty Key: FLINK-31812 URL: https://issues.apache.org/jira/browse/FLINK-31812 Project: Flink Issue Type: Bug Affects Versions: 1.17.0 Reporter: Nicolas Fraison Since https://issues.apache.org/jira/browse/FLINK-29543 the savepointPath from the body is not anymore transform to null if empty: https://github.com/apache/flink/pull/21012/files#diff-c6d9a43d970eb07642a87e4bf9ec6a9dc7d363b1b5b557ed76f73d8de1cc5a54R145 This leads to issue running a flink job in release 1.17 with lyft operator which set savePoint in body to "": [https://github.com/lyft/flinkk8soperator/blob/master/pkg/controller/flinkapplication/flink_state_machine.go#L721] Issue faced by the job as the savepointPath is set to "": {code:java} org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster. 3 at org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:97) 4 at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) 5 at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) 6 at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) 7 at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1705) 8 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 9 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 10 at java.base/java.lang.Thread.run(Thread.java:829) 11Caused by: java.util.concurrent.CompletionException: java.lang.IllegalArgumentException: empty checkpoint pointer 12 at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) 13 at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) 14 at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702) 15 ... 
3 more 16Caused by: java.lang.IllegalArgumentException: empty checkpoint pointer 17 at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) 18 at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorageAccess.resolveCheckpointPointer(AbstractFsCheckpointStorageAccess.java:240) 19 at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorageAccess.resolveCheckpoint(AbstractFsCheckpointStorageAccess.java:136) 20 at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1824) 21 at org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.tryRestoreExecutionGraphFromSavepoint(DefaultExecutionGraphFactory.java:223) 22 at org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:198) 23 at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:365) 24 at org.apache.flink.runtime.scheduler.SchedulerBase.(SchedulerBase.java:210) 25 at org.apache.flink.runtime.scheduler.DefaultScheduler.(DefaultScheduler.java:136) 26 at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:152) 27 at org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:119) 28 at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:371) 29 at org.apache.flink.runtime.jobmaster.JobMaster.(JobMaster.java:348) 30 at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:123) 31 at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:95) 32 at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112) 33 at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) 34 ... 3 more 35 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
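A hedged illustration of the missing guard (not the actual handler code changed by FLINK-29543): an empty or blank savepointPath in the request body should be treated as "no savepoint" before building the restore settings, which avoids the "empty checkpoint pointer" failure above. SavepointRestoreSettings is Flink's own class; the surrounding method name is made up for this example.
{code:java}
import org.apache.flink.runtime.jobgraph.SavepointRestoreSettings;

// Illustrative guard only; the handler method name is made up for this example.
public class SavepointPathSanitizer {

    static SavepointRestoreSettings fromRequestBody(String savepointPath, boolean allowNonRestoredState) {
        if (savepointPath == null || savepointPath.trim().isEmpty()) {
            // Clients such as the lyft operator send "", which must not reach
            // resolveCheckpointPointer as a path.
            return SavepointRestoreSettings.none();
        }
        return SavepointRestoreSettings.forPath(savepointPath, allowNonRestoredState);
    }

    public static void main(String[] args) {
        System.out.println(fromRequestBody("", false));                          // no restore
        System.out.println(fromRequestBody("s3://bucket/savepoint-abc", false)); // restore from path
    }
}
{code}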
[jira] [Updated] (FLINK-31812) SavePoint from /jars/:jarid:/run api on body is not anymore set to null if empty
[ https://issues.apache.org/jira/browse/FLINK-31812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Fraison updated FLINK-31812: Summary: SavePoint from /jars/:jarid:/run api on body is not anymore set to null if empty (was: SavePoint from /jars/:jarid:/run api on body is not anymore emptyToNull if empty) > SavePoint from /jars/:jarid:/run api on body is not anymore set to null if > empty > > > Key: FLINK-31812 > URL: https://issues.apache.org/jira/browse/FLINK-31812 > Project: Flink > Issue Type: Bug >Affects Versions: 1.17.0 >Reporter: Nicolas Fraison >Priority: Minor > > Since https://issues.apache.org/jira/browse/FLINK-29543 the > savepointPath from the body is not anymore transform to null if empty: > https://github.com/apache/flink/pull/21012/files#diff-c6d9a43d970eb07642a87e4bf9ec6a9dc7d363b1b5b557ed76f73d8de1cc5a54R145 > > This leads to issue running a flink job in release 1.17 with lyft operator > which set savePoint in body to "": > [https://github.com/lyft/flinkk8soperator/blob/master/pkg/controller/flinkapplication/flink_state_machine.go#L721] > > Issue faced by the job as the savepointPath is set to "": > {code:java} > org.apache.flink.runtime.client.JobInitializationException: Could not start > the JobMaster. > 3 at > org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:97) > 4 at > java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) > 5 at > java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) > 6 at > java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) > 7 at > java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1705) > 8 at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > 9 at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > 10at java.base/java.lang.Thread.run(Thread.java:829) > 11Caused by: java.util.concurrent.CompletionException: > java.lang.IllegalArgumentException: empty checkpoint pointer > 12at > java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) > 13at > java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) > 14at > java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702) > 15... 
3 more > 16Caused by: java.lang.IllegalArgumentException: empty checkpoint pointer > 17at > org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) > 18at > org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorageAccess.resolveCheckpointPointer(AbstractFsCheckpointStorageAccess.java:240) > 19at > org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorageAccess.resolveCheckpoint(AbstractFsCheckpointStorageAccess.java:136) > 20at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1824) > 21at > org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.tryRestoreExecutionGraphFromSavepoint(DefaultExecutionGraphFactory.java:223) > 22at > org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:198) > 23at > org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:365) > 24at > org.apache.flink.runtime.scheduler.SchedulerBase.(SchedulerBase.java:210) > 25at > org.apache.flink.runtime.scheduler.DefaultScheduler.(DefaultScheduler.java:136) > 26at > org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:152) > 27at > org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:119) > 28at > org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:371) > 29at > org.apache.flink.runtime.jobmaster.JobMaster.(JobMaster.java:348) > 30at > org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:123) > 31at > org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:95) > 32at > org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112) > 33at > java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) > 34...
[jira] [Updated] (FLINK-31812) SavePoint from /jars/:jarid:/run api on body is not anymore set to null if empty
[ https://issues.apache.org/jira/browse/FLINK-31812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Fraison updated FLINK-31812: Description: Since https://issues.apache.org/jira/browse/FLINK-29543 the savepointPath from the body is not anymore transform to null if empty: [https://github.com/apache/flink/pull/21012/files#diff-c6d9a43d970eb07642a87e4bf9ec6a9dc7d363b1b5b557ed76f73d8de1cc5a54R145] This leads to issue running a flink job in release 1.17 with lyft operator which set savePoint in body to empty string: [https://github.com/lyft/flinkk8soperator/blob/master/pkg/controller/flinkapplication/flink_state_machine.go#L721] Issue faced by the job as the savepointPath is setto empty string: {code:java} org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster. 3 at org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:97) 4 at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) 5 at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) 6 at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) 7 at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1705) 8 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 9 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 10 at java.base/java.lang.Thread.run(Thread.java:829) 11Caused by: java.util.concurrent.CompletionException: java.lang.IllegalArgumentException: empty checkpoint pointer 12 at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) 13 at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) 14 at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702) 15 ... 
3 more 16Caused by: java.lang.IllegalArgumentException: empty checkpoint pointer 17 at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) 18 at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorageAccess.resolveCheckpointPointer(AbstractFsCheckpointStorageAccess.java:240) 19 at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorageAccess.resolveCheckpoint(AbstractFsCheckpointStorageAccess.java:136) 20 at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1824) 21 at org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.tryRestoreExecutionGraphFromSavepoint(DefaultExecutionGraphFactory.java:223) 22 at org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:198) 23 at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:365) 24 at org.apache.flink.runtime.scheduler.SchedulerBase.(SchedulerBase.java:210) 25 at org.apache.flink.runtime.scheduler.DefaultScheduler.(DefaultScheduler.java:136) 26 at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:152) 27 at org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:119) 28 at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:371) 29 at org.apache.flink.runtime.jobmaster.JobMaster.(JobMaster.java:348) 30 at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:123) 31 at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:95) 32 at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112) 33 at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) 34 ... 3 more 35 {code} was: Since https://issues.apache.org/jira/browse/FLINK-29543 the savepointPath from the body is not anymore transform to null if empty: https://github.com/apache/flink/pull/21012/files#diff-c6d9a43d970eb07642a87e4bf9ec6a9dc7d363b1b5b557ed76f73d8de1cc5a54R145 This leads to issue running a flink job in release 1.17 with lyft operator which set savePoint in body to "": [https://github.com/lyft/flinkk8soperator/blob/master/pkg/controller/flinkapplication/flink_state_machine.go#L721] Issue faced by the job as the savepointPath is set to "": {code:java} org.apache.flink.runtime.client.JobInitializationException: Could not start the Jo
[jira] [Created] (FLINK-32012) Operator failed to rollback due to missing HA metadata
Nicolas Fraison created FLINK-32012: --- Summary: Operator failed to rollback due to missing HA metadata Key: FLINK-32012 URL: https://issues.apache.org/jira/browse/FLINK-32012 Project: Flink Issue Type: Bug Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.4.0 Reporter: Nicolas Fraison The operator has well detected that the job was failing and initiate the rollback but this rollback has failed due to `Rollback is not possible due to missing HA metadata` We are relying on saevpoint upgrade mode and zookeeper HA. The operator is performing a set of action to also delete this HA data in savepoint upgrade mode: * [flink-kubernetes-operator/AbstractFlinkService.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L346] : Suspend job with savepoint and deleteClusterDeployment * [flink-kubernetes-operator/StandaloneFlinkService.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L158] : Remove JM + TM deployment and delete HA data * [flink-kubernetes-operator/AbstractFlinkService.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L1008] : Wait cluster shutdown and delete zookeeper HA data * [flink-kubernetes-operator/FlinkUtils.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L155] : Remove all child znode Then when running rollback the operator is looking for HA data even if we rely on sevepoint upgrade mode: * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L164] Perform reconcile of rollback if it should rollback * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L387] Rollback failed as HA data is not available * [flink-kubernetes-operator/FlinkUtils.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L220] Check if some child znodes are available For both step the pattern looks to be the same for kubernetes HA so it doesn't looks to be linked to a bug with zookeeper. >From https://issues.apache.org/jira/browse/FLINK-30305 it looks to be expected >that the HA data has been deleted (as it is also performed by flink when >relying on savepoint upgrade mode) So I'm wondering why we enforce such a check when performing rollback if we rely on savepoint upgrade mode. 
Would it be fine to not rely on the HA data and rollback from the last savepoint (the one we used in the deployment step)? -- This message was sent by Atlassian Jira (v8.20.10#820010)
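A hedged sketch of the relaxation proposed above (illustrative only, not the operator's actual rollback precondition): missing HA metadata would only block the rollback when there is nothing else to restore from; with SAVEPOINT upgrade mode, the savepoint taken during the upgrade is accepted as a fallback.
{code:java}
// Illustrative sketch only; not the operator's actual rollback check.
public class RollbackPreconditionSketch {

    enum UpgradeMode { STATELESS, SAVEPOINT, LAST_STATE }

    static boolean canRollback(
            UpgradeMode upgradeMode, boolean haMetadataAvailable, String upgradeSavepointPath) {
        if (haMetadataAvailable) {
            return true; // restore from the retained checkpoint via HA metadata
        }
        // No HA metadata: still acceptable for SAVEPOINT mode if the savepoint
        // taken during the upgrade is known.
        return upgradeMode == UpgradeMode.SAVEPOINT && upgradeSavepointPath != null;
    }

    public static void main(String[] args) {
        System.out.println(canRollback(UpgradeMode.SAVEPOINT, false, "s3://bucket/savepoint-1")); // true
        System.out.println(canRollback(UpgradeMode.LAST_STATE, false, null));                     // false
    }
}
{code}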
[jira] [Updated] (FLINK-32012) Operator failed to rollback due to missing HA metadata
[ https://issues.apache.org/jira/browse/FLINK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Fraison updated FLINK-32012: Description: The operator has well detected that the job was failing and initiate the rollback but this rollback has failed due to `Rollback is not possible due to missing HA metadata` We are relying on saevpoint upgrade mode and zookeeper HA. The operator is performing a set of action to also delete this HA data in savepoint upgrade mode: * [flink-kubernetes-operator/AbstractFlinkService.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L346] : Suspend job with savepoint and deleteClusterDeployment * [flink-kubernetes-operator/StandaloneFlinkService.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L158] : Remove JM + TM deployment and delete HA data * [flink-kubernetes-operator/AbstractFlinkService.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L1008] : Wait cluster shutdown and delete zookeeper HA data * [flink-kubernetes-operator/FlinkUtils.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L155] : Remove all child znode Then when running rollback the operator is looking for HA data even if we rely on sevepoint upgrade mode: * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L164] Perform reconcile of rollback if it should rollback * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L387] Rollback failed as HA data is not available * [flink-kubernetes-operator/FlinkUtils.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L220] Check if some child znodes are available For both step the pattern looks to be the same for kubernetes HA so it doesn't looks to be linked to a bug with zookeeper. >From https://issues.apache.org/jira/browse/FLINK-30305 it looks to be expected >that the HA data has been deleted (as it is also performed by flink when >relying on savepoint upgrade mode). Still the use case seems to differ from https://issues.apache.org/jira/browse/FLINK-30305 as the operator is aware of the failure and treat a specific rollback event. So I'm wondering why we enforce such a check when performing rollback if we rely on savepoint upgrade mode. 
Would it be fine to not rely on the HA data and rollback from the last savepoint (the one we used in the deployment step)? was: The operator has well detected that the job was failing and initiate the rollback but this rollback has failed due to `Rollback is not possible due to missing HA metadata` We are relying on saevpoint upgrade mode and zookeeper HA. The operator is performing a set of action to also delete this HA data in savepoint upgrade mode: * [flink-kubernetes-operator/AbstractFlinkService.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L346] : Suspend job with savepoint and deleteClusterDeployment * [flink-kubernetes-operator/StandaloneFlinkService.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L158] : Remove JM + TM deployment and delete HA data * [flink-kubernetes-operator/AbstractFlinkService.java at main · apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L1008]
[jira] [Commented] (FLINK-32012) Operator failed to rollback due to missing HA metadata
[ https://issues.apache.org/jira/browse/FLINK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719825#comment-17719825 ] Nicolas Fraison commented on FLINK-32012: - [~gyfora] do you think that we can implement this specific case for rollback or do you foresee some potential issues with this? > Operator failed to rollback due to missing HA metadata > -- > > Key: FLINK-32012 > URL: https://issues.apache.org/jira/browse/FLINK-32012 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.4.0 >Reporter: Nicolas Fraison >Priority: Major > > The operator has well detected that the job was failing and initiate the > rollback but this rollback has failed due to `Rollback is not possible due to > missing HA metadata` > We are relying on saevpoint upgrade mode and zookeeper HA. > The operator is performing a set of action to also delete this HA data in > savepoint upgrade mode: > * [flink-kubernetes-operator/AbstractFlinkService.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L346] > : Suspend job with savepoint and deleteClusterDeployment > * [flink-kubernetes-operator/StandaloneFlinkService.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L158] > : Remove JM + TM deployment and delete HA data > * [flink-kubernetes-operator/AbstractFlinkService.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L1008] > : Wait cluster shutdown and delete zookeeper HA data > * [flink-kubernetes-operator/FlinkUtils.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L155] > : Remove all child znode > Then when running rollback the operator is looking for HA data even if we > rely on sevepoint upgrade mode: > * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L164] > Perform reconcile of rollback if it should rollback > * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L387] > Rollback failed as HA data is not available > * [flink-kubernetes-operator/FlinkUtils.java at main · > apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L220] > Check if some child znodes are available > For both step the pattern looks to be the same for kubernetes HA so it > doesn't looks to be linked to a bug with zookeeper. 
> > From https://issues.apache.org/jira/browse/FLINK-30305 it looks to be > expected that the HA data has been deleted (this deletion is also performed by flink > itself when relying on savepoint upgrade mode). > Still the use case seems to differ from > https://issues.apache.org/jira/browse/FLINK-30305 as the operator is aware of > the failure and treats a specific rollback event. > So I'm wondering why we enforce such a check when performing a rollback if we > rely on savepoint upgrade mode. Would it be fine to not rely on the HA data > and roll back from the last savepoint (the one used in the deployment step)? -- This message was sent by Atlassian Jira (v8.20.10#820010)
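For illustration, here is a minimal sketch of the relaxation proposed above, assuming a simplified rollback precondition: when the resource uses savepoint upgrade mode and a last savepoint is recorded, allow the rollback even if the HA metadata has already been cleaned up. The class, enum and method names are hypothetical and do not reflect the operator's actual code.
{code:java}
// Hypothetical sketch of the proposal above: with savepoint upgrade mode,
// fall back to the last recorded savepoint instead of requiring HA metadata
// before rolling back. Names are illustrative, not the operator's actual API.
public final class RollbackPolicy {

    public enum UpgradeMode { SAVEPOINT, LAST_STATE, STATELESS }

    public boolean canRollback(UpgradeMode upgradeMode,
                               boolean haMetadataAvailable,
                               String lastSavepointPath) {
        if (haMetadataAvailable) {
            // last-state style rollback remains possible
            return true;
        }
        // Proposal: in savepoint upgrade mode, the savepoint taken during the
        // failed upgrade is sufficient to roll back safely without HA metadata.
        return upgradeMode == UpgradeMode.SAVEPOINT && lastSavepointPath != null;
    }
}
{code}
With such a relaxation the operator would roll back from the savepoint taken during the failed upgrade instead of aborting with the missing-HA-metadata error.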
[jira] [Created] (FLINK-32890) Flink app rolled back with old savepoints (3 hours back in time) while some checkpoints have been taken in between
Nicolas Fraison created FLINK-32890: --- Summary: Flink app rolled back with old savepoints (3 hours back in time) while some checkpoints have been taken in between Key: FLINK-32890 URL: https://issues.apache.org/jira/browse/FLINK-32890 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Nicolas Fraison Here are all the details about the issue (based on an app/scenario reproducing the issue): * Deployed a new release of a flink app with a new operator * Flink Operator set the app as stable * After some time the app failed and stayed in failed state (due to some issue with our kafka clusters) * Sadly the team in charge of this flink app decided to roll back the app as they thought the failure was linked to this new deployment * Operator detects: {{Job is not running but HA metadata is available for last state restore, ready for upgrade, Deleting JobManager deployment while preserving HA metadata.}} -> relies on last-state (as we do not disable fallback), no savepoint taken * Flink starts the JM and deploys the app. It correctly finds the 3 checkpoints * {{Using '/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico' as Zookeeper namespace.}} * {{Initializing job 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).}} * {{Recovering checkpoints from ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}} * {{Found 3 checkpoints in ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}} * {{Restoring job 6b24a364c1905e924a69f3dbff0d26a6 from Checkpoint 19 @ 1692268003920 for 6b24a364c1905e924a69f3dbff0d26a6 located at s3p://.../flink-kafka-job-apache-nico/checkpoints/6b24a364c1905e924a69f3dbff0d26a6/chk-19.}} * Job failed because of the missing operator
{code:java}
Job 6b24a364c1905e924a69f3dbff0d26a6 reached terminal state FAILED.
org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster.
Caused by: java.util.concurrent.CompletionException: java.lang.IllegalStateException: There is no operator for the state f298e8715b4d85e6f965b60e1c848cbe
{code}
* {{Job 6b24a364c1905e924a69f3dbff0d26a6 has been registered for cleanup in the JobResultStore after reaching a terminal state.}} * {{Clean up the high availability data for job 6b24a364c1905e924a69f3dbff0d26a6.}} * {{Removed job graph 6b24a364c1905e924a69f3dbff0d26a6 from ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobgraphs'}.}} * The JobManager restarts and tries to resubmit the job, but the job had already reached a terminal state so the application finishes * {{Received JobGraph submission 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).}} * {{Ignoring JobGraph submission 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6) because the job already reached a globally-terminal state (i.e. FAILED, CANCELED, FINISHED) in a previous execution.}} * {{Application completed SUCCESSFULLY}} * Finally the operator rolls back the deployment and still indicates that {{Job is not running but HA metadata is available for last state restore, ready for upgrade}} * But the job metadata is no longer there (cleaned up previously)
{code:java}
(CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints
Path /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints doesn't exist
(CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs
(CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico
jobgraphs
jobs
leader
{code}
* The rolled-back app from the flink operator finally takes the last provided savepoint, as no metadata/checkpoints are available. But this last savepoint is an old one, since during the upgrade the operator decided to rely on last-state -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-32890) Flink app rolled back with old savepoints (3 hours back in time) while some checkpoints have been taken in between
[ https://issues.apache.org/jira/browse/FLINK-32890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Fraison updated FLINK-32890: Description: Here are all details about the issue (based on an app/scenario reproducing the issue): * Deployed new release of a flink app with a new operator * Flink Operator set the app as stable * After some time the app failed and stay in failed state (due to some issue with our kafka clusters) * Sadly the team in charge of this flink app decided to rollback the app as they were thinking it was linked to this new deployment * Operator detect: {{Job is not running but HA metadata is available for last state restore, ready for upgrade, Deleting JobManager deployment while preserving HA metadata.}} -> rely on last-state (as we do not disable fallback), no savepoint taken * Flink start JM and deployment of the app. It well find the 3 checkpoints * {{Using '/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico' as Zookeeper namespace.}} * {{Initializing job 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).}} * {{Recovering checkpoints from ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}} * {{Found 3 checkpoints in ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}} * {{{}Restoring job 6b24a364c1905e924a69f3dbff0d26a6 from Checkpoint 19 @ 1692268003920 for 6b24a364c1905e924a69f3dbff0d26a6 located at }}\{{{}s3p://.../flink-kafka-job-apache-nico/checkpoints/6b24a364c1905e924a69f3dbff0d26a6/chk-19{}}}{{{}.{}}} * Job failed because of the missing operator {code:java} Job 6b24a364c1905e924a69f3dbff0d26a6 reached terminal state FAILED. org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster. Caused by: java.util.concurrent.CompletionException: java.lang.IllegalStateException: There is no operator for the state f298e8715b4d85e6f965b60e1c848cbe * Job 6b24a364c1905e924a69f3dbff0d26a6 has been registered for cleanup in the JobResultStore after reaching a terminal state.{code} * {{Clean up the high availability data for job 6b24a364c1905e924a69f3dbff0d26a6.}} * {{Removed job graph 6b24a364c1905e924a69f3dbff0d26a6 from ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobgraphs'}.}} * JobManager restart and try to resubmit the job but the job was already submitted so finished * {{Received JobGraph submission 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).}} * {{Ignoring JobGraph submission 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6) because the job already reached a globally-terminal state (i.e. 
FAILED, CANCELED, FINISHED) in a previous execution.}} * {{Application completed SUCCESSFULLY}} * Finally the operator rollback the deployment and still indicate that {{Job is not running but HA metadata is available for last state restore, ready for upgrade}} * But the job metadata are not anymore there (clean previously) {code:java} (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints Path /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints doesn't exist (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico jobgraphs jobs leader {code} The rolled back app from flink operator finally take the last provided savepoint as no metadata/checkpoints are available. But this last savepoint is an old one as during the upgrade the operator decided to rely on last-state was: Here are all details about the issue (based on an app/scenario reproducing the issue): * Deployed new release of a flink app with a new operator * Flink Operator set the app as stable * After some time the app failed and stay in failed state (due to some issue with our kafka clusters) * Sadly the team in charge of this flink app decided to rollback the app as they were thinking it was linked to this new deployment * Operator detect: {{Job is not running but HA metadata is available for last state restore, ready for upgrade, Deleting JobManager deployment while preserving HA metadata.}} -> rely on last-state (as we do not disable fallback), no savepoint taken * Flink start JM and deployment of the app. It well find the 3 checkpoints * {{Using '/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico' as Zookeeper namespace.}} * {{Initializing job 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).
[jira] [Updated] (FLINK-32890) Flink app rolled back with old savepoints (3 hours back in time) while some checkpoints have been taken in between
[ https://issues.apache.org/jira/browse/FLINK-32890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Fraison updated FLINK-32890: Issue Type: Bug (was: Improvement) > Flink app rolled back with old savepoints (3 hours back in time) while some > checkpoints have been taken in between > -- > > Key: FLINK-32890 > URL: https://issues.apache.org/jira/browse/FLINK-32890 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Reporter: Nicolas Fraison >Priority: Major > > Here are all details about the issue (based on an app/scenario reproducing > the issue): * Deployed new release of a flink app with a new operator > * Flink Operator set the app as stable > * After some time the app failed and stay in failed state (due to some issue > with our kafka clusters) > * Sadly the team in charge of this flink app decided to rollback the app as > they were thinking it was linked to this new deployment > * Operator detect: {{Job is not running but HA metadata is available for > last state restore, ready for upgrade, Deleting JobManager deployment while > preserving HA metadata.}} -> rely on last-state (as we do not disable > fallback), no savepoint taken > * Flink start JM and deployment of the app. It well find the 3 checkpoints > * {{Using '/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico' as > Zookeeper namespace.}} > * {{Initializing job 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).}} > * {{Recovering checkpoints from > ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}} > * {{Found 3 checkpoints in > ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}} > * {{{}Restoring job 6b24a364c1905e924a69f3dbff0d26a6 from Checkpoint 19 @ > 1692268003920 for 6b24a364c1905e924a69f3dbff0d26a6 located at > }}\{{{}s3p://.../flink-kafka-job-apache-nico/checkpoints/6b24a364c1905e924a69f3dbff0d26a6/chk-19{}}}{{{}.{}}} > * Job failed because of the missing operator > {code:java} > Job 6b24a364c1905e924a69f3dbff0d26a6 reached terminal state FAILED. > org.apache.flink.runtime.client.JobInitializationException: Could not start > the JobMaster. > Caused by: java.util.concurrent.CompletionException: > java.lang.IllegalStateException: There is no operator for the state > f298e8715b4d85e6f965b60e1c848cbe * Job 6b24a364c1905e924a69f3dbff0d26a6 has > been registered for cleanup in the JobResultStore after reaching a terminal > state.{code} > * {{Clean up the high availability data for job > 6b24a364c1905e924a69f3dbff0d26a6.}} > * {{Removed job graph 6b24a364c1905e924a69f3dbff0d26a6 from > ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobgraphs'}.}} > * JobManager restart and try to resubmit the job but the job was already > submitted so finished > * {{Received JobGraph submission 'flink-kafka-job' > (6b24a364c1905e924a69f3dbff0d26a6).}} > * {{Ignoring JobGraph submission 'flink-kafka-job' > (6b24a364c1905e924a69f3dbff0d26a6) because the job already reached a > globally-terminal state (i.e. 
FAILED, CANCELED, FINISHED) in a previous > execution.}} > * {{Application completed SUCCESSFULLY}} > * Finally the operator rollback the deployment and still indicate that {{Job > is not running but HA metadata is available for last state restore, ready for > upgrade}} > * But the job metadata are not anymore there (clean previously) > > {code:java} > (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls > /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints > Path > /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints > doesn't exist > (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls > /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs > (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls > /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico > jobgraphs > jobs > leader > {code} > > The rolled back app from flink operator finally take the last provided > savepoint as no metadata/checkpoints are available. But this last savepoint > is an old one as during the upgrade the operator decided to rely on last-state -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-32890) Flink app rolled back with old savepoints (3 hours back in time) while some checkpoints have been taken in between
[ https://issues.apache.org/jira/browse/FLINK-32890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Fraison updated FLINK-32890: Description: Here are all details about the issue: * Deployed new release of a flink app with a new operator * Flink Operator set the app as stable * After some time the app failed and stay in failed state (due to some issue with our kafka clusters) * Sadly the team in charge of this flink app decided to rollback the app as they were thinking it was linked to this new deployment * Operator detect: {{Job is not running but HA metadata is available for last state restore, ready for upgrade, Deleting JobManager deployment while preserving HA metadata.}} -> rely on last-state (as we do not disable fallback), no savepoint taken * Flink start JM and deployment of the app. It well find the 3 checkpoints * {{Using '/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico' as Zookeeper namespace.}} * {{Initializing job 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).}} * {{Recovering checkpoints from ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}} * {{Found 3 checkpoints in ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}} * {{{}Restoring job 6b24a364c1905e924a69f3dbff0d26a6 from Checkpoint 19 @ 1692268003920 for 6b24a364c1905e924a69f3dbff0d26a6 located at }}\{{{}s3p://.../flink-kafka-job-apache-nico/checkpoints/6b24a364c1905e924a69f3dbff0d26a6/chk-19{}}}{{{}.{}}} * Job failed because of the missing operator {code:java} Job 6b24a364c1905e924a69f3dbff0d26a6 reached terminal state FAILED. org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster. Caused by: java.util.concurrent.CompletionException: java.lang.IllegalStateException: There is no operator for the state f298e8715b4d85e6f965b60e1c848cbe * Job 6b24a364c1905e924a69f3dbff0d26a6 has been registered for cleanup in the JobResultStore after reaching a terminal state.{code} * {{Clean up the high availability data for job 6b24a364c1905e924a69f3dbff0d26a6.}} * {{Removed job graph 6b24a364c1905e924a69f3dbff0d26a6 from ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobgraphs'}.}} * JobManager restart and try to resubmit the job but the job was already submitted so finished * {{Received JobGraph submission 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).}} * {{Ignoring JobGraph submission 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6) because the job already reached a globally-terminal state (i.e. 
FAILED, CANCELED, FINISHED) in a previous execution.}} * {{Application completed SUCCESSFULLY}} * Finally the operator rollback the deployment and still indicate that {{Job is not running but HA metadata is available for last state restore, ready for upgrade}} * But the job metadata are not anymore there (clean previously) {code:java} (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints Path /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints doesn't exist (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico jobgraphs jobs leader {code} The rolled back app from flink operator finally take the last provided savepoint as no metadata/checkpoints are available. But this last savepoint is an old one as during the upgrade the operator decided to rely on last-state was: Here are all details about the issue (based on an app/scenario reproducing the issue): * Deployed new release of a flink app with a new operator * Flink Operator set the app as stable * After some time the app failed and stay in failed state (due to some issue with our kafka clusters) * Sadly the team in charge of this flink app decided to rollback the app as they were thinking it was linked to this new deployment * Operator detect: {{Job is not running but HA metadata is available for last state restore, ready for upgrade, Deleting JobManager deployment while preserving HA metadata.}} -> rely on last-state (as we do not disable fallback), no savepoint taken * Flink start JM and deployment of the app. It well find the 3 checkpoints * {{Using '/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico' as Zookeeper namespace.}} * {{Initializing job 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).}} * {{Recovering checkpoints from ZooKeeperSta
[jira] [Updated] (FLINK-32890) Flink app rolled back with old savepoints (3 hours back in time) while some checkpoints have been taken in between
[ https://issues.apache.org/jira/browse/FLINK-32890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Fraison updated FLINK-32890: Description: Here are all details about the issue: * Deployed new release of a flink app with a new operator * Flink Operator set the app as stable * After some time the app failed and stay in failed state (due to some issue with our kafka clusters) * Finally decided to rollback the new release just in case it could be the root cause of the issue on kafka * Operator detect: {{Job is not running but HA metadata is available for last state restore, ready for upgrade, Deleting JobManager deployment while preserving HA metadata.}} -> rely on last-state (as we do not disable fallback), no savepoint taken * Flink start JM and deployment of the app. It well find the 3 checkpoints * {{Using '/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico' as Zookeeper namespace.}} * {{Initializing job 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).}} * {{Recovering checkpoints from ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}} * {{Found 3 checkpoints in ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}} * {{{}Restoring job 6b24a364c1905e924a69f3dbff0d26a6 from Checkpoint 19 @ 1692268003920 for 6b24a364c1905e924a69f3dbff0d26a6 located at }}\{{{}s3p://.../flink-kafka-job-apache-nico/checkpoints/6b24a364c1905e924a69f3dbff0d26a6/chk-19{}}}{{{}.{}}} * Job failed because of the missing operator {code:java} Job 6b24a364c1905e924a69f3dbff0d26a6 reached terminal state FAILED. org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster. Caused by: java.util.concurrent.CompletionException: java.lang.IllegalStateException: There is no operator for the state f298e8715b4d85e6f965b60e1c848cbe * Job 6b24a364c1905e924a69f3dbff0d26a6 has been registered for cleanup in the JobResultStore after reaching a terminal state.{code} * {{Clean up the high availability data for job 6b24a364c1905e924a69f3dbff0d26a6.}} * {{Removed job graph 6b24a364c1905e924a69f3dbff0d26a6 from ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobgraphs'}.}} * JobManager restart and try to resubmit the job but the job was already submitted so finished * {{Received JobGraph submission 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).}} * {{Ignoring JobGraph submission 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6) because the job already reached a globally-terminal state (i.e. 
FAILED, CANCELED, FINISHED) in a previous execution.}} * {{Application completed SUCCESSFULLY}} * Finally the operator rollback the deployment and still indicate that {{Job is not running but HA metadata is available for last state restore, ready for upgrade}} * But the job metadata are not anymore there (clean previously) {code:java} (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints Path /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints doesn't exist (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico jobgraphs jobs leader {code} The rolled back app from flink operator finally take the last provided savepoint as no metadata/checkpoints are available. But this last savepoint is an old one as during the upgrade the operator decided to rely on last-state (The old savepoint taken is a scheduled one) was: Here are all details about the issue: * Deployed new release of a flink app with a new operator * Flink Operator set the app as stable * After some time the app failed and stay in failed state (due to some issue with our kafka clusters) * Sadly the team in charge of this flink app decided to rollback the app as they were thinking it was linked to this new deployment * Operator detect: {{Job is not running but HA metadata is available for last state restore, ready for upgrade, Deleting JobManager deployment while preserving HA metadata.}} -> rely on last-state (as we do not disable fallback), no savepoint taken * Flink start JM and deployment of the app. It well find the 3 checkpoints * {{Using '/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico' as Zookeeper namespace.}} * {{Initializing job 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).}} * {{Recovering checkpoints from ZooKeeperStateHandleStore\{namespace='f
[jira] [Commented] (FLINK-32890) Flink app rolled back with old savepoints (3 hours back in time) while some checkpoints have been taken in between
[ https://issues.apache.org/jira/browse/FLINK-32890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755866#comment-17755866 ] Nicolas Fraison commented on FLINK-32890: - The HA metadata check for zookeeper is not fine. It should have check for availability of entries in the /jobs znode. With such a test the rollback would have not being done requiring manual intervention to ensure last taken checkpoint is used to restore the flink app > Flink app rolled back with old savepoints (3 hours back in time) while some > checkpoints have been taken in between > -- > > Key: FLINK-32890 > URL: https://issues.apache.org/jira/browse/FLINK-32890 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Reporter: Nicolas Fraison >Priority: Major > > Here are all details about the issue: > * Deployed new release of a flink app with a new operator > * Flink Operator set the app as stable > * After some time the app failed and stay in failed state (due to some issue > with our kafka clusters) > * Finally decided to rollback the new release just in case it could be the > root cause of the issue on kafka > * Operator detect: {{Job is not running but HA metadata is available for > last state restore, ready for upgrade, Deleting JobManager deployment while > preserving HA metadata.}} -> rely on last-state (as we do not disable > fallback), no savepoint taken > * Flink start JM and deployment of the app. It well find the 3 checkpoints > * {{Using '/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico' as > Zookeeper namespace.}} > * {{Initializing job 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).}} > * {{Recovering checkpoints from > ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}} > * {{Found 3 checkpoints in > ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}} > * {{{}Restoring job 6b24a364c1905e924a69f3dbff0d26a6 from Checkpoint 19 @ > 1692268003920 for 6b24a364c1905e924a69f3dbff0d26a6 located at > }}\{{{}s3p://.../flink-kafka-job-apache-nico/checkpoints/6b24a364c1905e924a69f3dbff0d26a6/chk-19{}}}{{{}.{}}} > * Job failed because of the missing operator > {code:java} > Job 6b24a364c1905e924a69f3dbff0d26a6 reached terminal state FAILED. > org.apache.flink.runtime.client.JobInitializationException: Could not start > the JobMaster. > Caused by: java.util.concurrent.CompletionException: > java.lang.IllegalStateException: There is no operator for the state > f298e8715b4d85e6f965b60e1c848cbe * Job 6b24a364c1905e924a69f3dbff0d26a6 has > been registered for cleanup in the JobResultStore after reaching a terminal > state.{code} > * {{Clean up the high availability data for job > 6b24a364c1905e924a69f3dbff0d26a6.}} > * {{Removed job graph 6b24a364c1905e924a69f3dbff0d26a6 from > ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobgraphs'}.}} > * JobManager restart and try to resubmit the job but the job was already > submitted so finished > * {{Received JobGraph submission 'flink-kafka-job' > (6b24a364c1905e924a69f3dbff0d26a6).}} > * {{Ignoring JobGraph submission 'flink-kafka-job' > (6b24a364c1905e924a69f3dbff0d26a6) because the job already reached a > globally-terminal state (i.e. 
FAILED, CANCELED, FINISHED) in a previous > execution.}} > * {{Application completed SUCCESSFULLY}} > * Finally the operator rollback the deployment and still indicate that {{Job > is not running but HA metadata is available for last state restore, ready for > upgrade}} > * But the job metadata are not anymore there (clean previously) > > {code:java} > (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls > /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints > Path > /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints > doesn't exist > (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls > /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs > (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls > /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico > jobgraphs > jobs > leader > {code} > > The rolled back app from flink operator finally take the last provided > savepoint as no metadata/checkpoints are available. But this last savepoint > is an old one as during the upgrade the operator decided to rely on > last-state (The old savepoint taken is a scheduled one)
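As a sketch of the stricter HA metadata check suggested in the comment above (verifying that entries actually exist under the /jobs znode rather than only that the HA root is present), something along these lines could be used. It assumes an already-configured Curator client and is illustrative only, not the operator's actual implementation.
{code:java}
// Sketch of the stricter check: require at least one job entry under
// <ha-root>/jobs before treating zookeeper HA metadata as usable for a
// last-state rollback. Assumes a configured CuratorFramework client.
import org.apache.curator.framework.CuratorFramework;

public final class ZkHaMetadataCheck {

    public static boolean hasUsableJobMetadata(CuratorFramework client, String haRootPath)
            throws Exception {
        String jobsPath = haRootPath + "/jobs";
        if (client.checkExists().forPath(jobsPath) == null) {
            return false;
        }
        // An empty /jobs znode means checkpoints were already cleaned up, so a
        // "last-state" rollback would silently fall back to an old savepoint.
        return !client.getChildren().forPath(jobsPath).isEmpty();
    }
}
{code}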
[jira] [Created] (FLINK-32973) Documentation for kubernetes.operator.job.savepoint-on-deletion indicate that it can be set on per resource level while it can't
Nicolas Fraison created FLINK-32973: --- Summary: Documentation for kubernetes.operator.job.savepoint-on-deletion indicate that it can be set on per resource level while it can't Key: FLINK-32973 URL: https://issues.apache.org/jira/browse/FLINK-32973 Project: Flink Issue Type: Bug Components: Kubernetes Operator Reporter: Nicolas Fraison Documentation for kubernetes.operator.job.savepoint-on-deletion indicates that it can be set at the [per resource level|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/#resourceuser-configuration] while it can't. The config is only retrieved from the operator configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32973) Documentation for kubernetes.operator.job.savepoint-on-deletion indicate that it can be set on per resource level while it can't
[ https://issues.apache.org/jira/browse/FLINK-32973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759506#comment-17759506 ] Nicolas Fraison commented on FLINK-32973: - I think that we should support the per-resource level configuration and not only update the documentation. Will provide a patch for this > Documentation for kubernetes.operator.job.savepoint-on-deletion indicate that > it can be set on per resource level while it can't > > > Key: FLINK-32973 > URL: https://issues.apache.org/jira/browse/FLINK-32973 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Reporter: Nicolas Fraison >Priority: Minor > > Documentation for kubernetes.operator.job.savepoint-on-deletion indicate that > it can be set on [per resource > level|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/#resourceuser-configuration] > while it can't. > The config is only retrieved from operator configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
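A minimal sketch of what the proposed per-resource support could look like: read the option from the resource's flinkConfiguration first and only fall back to the operator-level default when it is absent. The class and helper names here are hypothetical; only the option key comes from the issue.
{code:java}
// Hypothetical per-resource lookup for kubernetes.operator.job.savepoint-on-deletion.
// Not the actual operator code; the real patch would go through the operator's
// configuration classes rather than a raw Map.
import java.util.Map;

public final class SavepointOnDeletionResolver {

    private static final String KEY = "kubernetes.operator.job.savepoint-on-deletion";

    public static boolean savepointOnDeletion(Map<String, String> resourceFlinkConfig,
                                              boolean operatorDefault) {
        // Prefer the value set on the FlinkDeployment/FlinkSessionJob itself.
        String perResource = resourceFlinkConfig.get(KEY);
        return perResource != null ? Boolean.parseBoolean(perResource) : operatorDefault;
    }
}
{code}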
[jira] [Commented] (FLINK-29199) Support blue-green deployment type
[ https://issues.apache.org/jira/browse/FLINK-29199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759856#comment-17759856 ] Nicolas Fraison commented on FLINK-29199: - We are looking into supporting this deployment mode for our Flink users: * Is there any plan to support Blue Green deployment within the flink operator? * Are some of you already performing such deployments relying on an external custom solution? If yes, I would be happy to discuss it if possible > Support blue-green deployment type > -- > > Key: FLINK-29199 > URL: https://issues.apache.org/jira/browse/FLINK-29199 > Project: Flink > Issue Type: New Feature > Components: Kubernetes Operator > Environment: Kubernetes >Reporter: Oleg Vorobev >Priority: Minor > > Are there any plans to support blue-green deployment/rollout mode similar to > *BlueGreen* in the > [flinkk8soperator|https://github.com/lyft/flinkk8soperator] to avoid downtime > while updating? > The idea is to run a new version in parallel with an old one and remove the > old one only after the stability condition of the new one is satisfied (like > in > [rollbacks|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.1/docs/custom-resource/job-management/#application-upgrade-rollbacks-experimental]). > For stateful apps with {*}upgradeMode: savepoint{*}, this means: not > cancelling an old job after creating a savepoint -> starting new job from > that savepoint -> waiting for it to become running/one successful > checkpoint/timeout or something else -> cancelling and removing old job. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-29199) Support blue-green deployment type
[ https://issues.apache.org/jira/browse/FLINK-29199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760361#comment-17760361 ] Nicolas Fraison commented on FLINK-29199: - This is indeed a concern we have, as we rely on kafka data sources. Currently we are thinking of relying on 2 different kafka consumer groups (one blue and one green) for those 2 jobs. I need to validate that resuming the green deployment from the blue checkpoint with a different consumer group can be done. As for the double data output, we can tolerate some duplicates for those specific jobs > Support blue-green deployment type > -- > > Key: FLINK-29199 > URL: https://issues.apache.org/jira/browse/FLINK-29199 > Project: Flink > Issue Type: New Feature > Components: Kubernetes Operator > Environment: Kubernetes >Reporter: Oleg Vorobev >Priority: Minor > > Are there any plans to support blue-green deployment/rollout mode similar to > *BlueGreen* in the > [flinkk8soperator|https://github.com/lyft/flinkk8soperator] to avoid downtime > while updating? > The idea is to run a new version in parallel with an old one and remove the > old one only after the stability condition of the new one is satisfied (like > in > [rollbacks|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.1/docs/custom-resource/job-management/#application-upgrade-rollbacks-experimental]). > For stateful apps with {*}upgradeMode: savepoint{*}, this means: not > cancelling an old job after creating a savepoint -> starting new job from > that savepoint -> waiting for it to become running/one successful > checkpoint/timeout or something else -> cancelling and removing old job. -- This message was sent by Atlassian Jira (v8.20.10#820010)
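A small sketch of the blue/green consumer-group idea mentioned above, deriving group.id from a deployment "color" parameter. The parameter wiring and naming are assumptions; the builder calls are the standard Flink Kafka connector API.
{code:java}
// Sketch: give the blue and green jobs distinct Kafka consumer groups by
// deriving group.id from a deployment "color" passed in at submission time.
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;

public final class BlueGreenKafkaSource {

    public static KafkaSource<String> build(String bootstrapServers, String topic, String color) {
        return KafkaSource.<String>builder()
                .setBootstrapServers(bootstrapServers)
                .setTopics(topic)
                // e.g. flink-kafka-job-blue / flink-kafka-job-green
                .setGroupId("flink-kafka-job-" + color)
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();
    }
}
{code}
Note that when a job restores from a checkpoint or savepoint, the Kafka connector takes its start offsets from the restored state rather than from the consumer group, so switching group.id should mainly affect offset committing and lag reporting.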
[jira] [Commented] (FLINK-12869) Add yarn acls capability to flink containers
[ https://issues.apache.org/jira/browse/FLINK-12869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876774#comment-16876774 ] Nicolas Fraison commented on FLINK-12869: - [~azagrebin] thanks for the feedback. So my proposal is to add 2 new parameters (yarn.view.acls and yarn.admin.acls) and to call setApplicationACLs based on those 2 parameters for all spawned yarn containers. For this I will add a setAclsFor method to the flink-yarn Utils class. > Add yarn acls capability to flink containers > > > Key: FLINK-12869 > URL: https://issues.apache.org/jira/browse/FLINK-12869 > Project: Flink > Issue Type: New Feature > Components: Deployment / YARN >Reporter: Nicolas Fraison >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Yarn provide application acls mechanism to be able to provide specific rights > to other users than the one running the job (view logs through the > resourcemanager/job history, kill the application) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
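A hedged sketch of what the proposed setAclsFor helper could look like, using the standard YARN ApplicationAccessType/ContainerLaunchContext API. Passing the ACLs as plain strings is a simplification; the actual patch would read yarn.view.acls and yarn.admin.acls from Flink's configuration.
{code:java}
// Sketch of the proposed helper: apply view/admin ACLs to a container launch
// context so every spawned YARN container carries the application ACLs.
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ApplicationAccessType;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

public final class YarnAclUtils {

    public static void setAclsFor(ContainerLaunchContext containerContext,
                                  String viewAcls,
                                  String adminAcls) {
        Map<ApplicationAccessType, String> acls = new HashMap<>();
        if (viewAcls != null && !viewAcls.isEmpty()) {
            acls.put(ApplicationAccessType.VIEW_APP, viewAcls);
        }
        if (adminAcls != null && !adminAcls.isEmpty()) {
            acls.put(ApplicationAccessType.MODIFY_APP, adminAcls);
        }
        if (!acls.isEmpty()) {
            containerContext.setApplicationACLs(acls);
        }
    }
}
{code}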
[jira] [Commented] (FLINK-37504) Handle TLS Certificate Renewal
[ https://issues.apache.org/jira/browse/FLINK-37504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17937301#comment-17937301 ] Nicolas Fraison commented on FLINK-37504: - Hi, thanks for the feedback. I will look into creating a FLIP for this. For the implementation, I've seen that the Apache Flink Kubernetes Operator already manages certificate reload, so I've based the implementation on the one from that project: [https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-webhook/src/main/java/org/apache/flink/kubernetes/operator/admission/FlinkOperatorWebhook.java#L154] The requirement for certificate reload is not linked to kubernetes usage but to a zero-trust network policy applied in my company. The created certificates have a one-day validity and are renewed every 12 hours, which is why we need this feature to avoid restarting the job every day. > Handle TLS Certificate Renewal > -- > > Key: FLINK-37504 > URL: https://issues.apache.org/jira/browse/FLINK-37504 > Project: Flink > Issue Type: Improvement >Reporter: Nicolas Fraison >Priority: Minor > Labels: pull-request-available > > Flink does not reload certificates if the underlying truststore and keystore are > updated. > We aim at using 1-day validity certificates, which currently means having to > restart our jobs every day. > In order to avoid this we will need to add a feature to reload TLS > certificates when the underlying truststore and keystore are updated -- This message was sent by Atlassian Jira (v8.20.10#820010)
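For illustration, a minimal sketch of the reload idea discussed above: watch the keystore/truststore files and rebuild an SSLContext whenever they change, loosely modeled on the operator-webhook approach linked in the comment. It assumes both stores are PKCS12 files in the same mounted directory and ignores how the refreshed context is propagated to already-open endpoints; it is not Flink's actual implementation.
{code:java}
// Minimal sketch: rebuild the SSLContext when the keystore/truststore files
// change so renewed certificates are picked up without restarting the job.
// Paths, password handling and error handling are deliberately simplified.
import java.io.InputStream;
import java.nio.file.*;
import java.security.KeyStore;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

public final class ReloadingSslContext {

    private volatile SSLContext current;

    public ReloadingSslContext(Path keystore, Path truststore, char[] password) throws Exception {
        current = load(keystore, truststore, password);
        Thread watcher = new Thread(() -> watch(keystore, truststore, password));
        watcher.setDaemon(true);
        watcher.start();
    }

    public SSLContext get() {
        return current;
    }

    private void watch(Path keystore, Path truststore, char[] password) {
        // Assumes keystore and truststore live in the same mounted directory.
        try (WatchService ws = keystore.getFileSystem().newWatchService()) {
            keystore.getParent().register(ws, StandardWatchEventKinds.ENTRY_MODIFY,
                    StandardWatchEventKinds.ENTRY_CREATE);
            while (true) {
                WatchKey key = ws.take();
                key.pollEvents(); // drain events; reload on any change in the directory
                current = load(keystore, truststore, password);
                key.reset();
            }
        } catch (Exception e) {
            Thread.currentThread().interrupt();
        }
    }

    private static SSLContext load(Path keystore, Path truststore, char[] password) throws Exception {
        KeyStore ks = KeyStore.getInstance("PKCS12");
        try (InputStream in = Files.newInputStream(keystore)) {
            ks.load(in, password);
        }
        KeyStore ts = KeyStore.getInstance("PKCS12");
        try (InputStream in = Files.newInputStream(truststore)) {
            ts.load(in, password);
        }
        KeyManagerFactory kmf = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
        kmf.init(ks, password);
        TrustManagerFactory tmf = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(ts);
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);
        return ctx;
    }
}
{code}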
[jira] [Created] (FLINK-37504) Handle TLS Certificate Renewal
Nicolas Fraison created FLINK-37504: --- Summary: Handle TLS Certificate Renewal Key: FLINK-37504 URL: https://issues.apache.org/jira/browse/FLINK-37504 Project: Flink Issue Type: Improvement Reporter: Nicolas Fraison Flink does not reload certificates if the underlying truststore and keystore are updated. We aim at using 1-day validity certificates, which currently means having to restart our jobs every day. In order to avoid this we will need to add a feature to reload TLS certificates when the underlying truststore and keystore are updated -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-37504) Handle TLS Certificate Renewal
[ https://issues.apache.org/jira/browse/FLINK-37504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17945662#comment-17945662 ] Nicolas Fraison commented on FLINK-37504: - Created this FLIP: https://cwiki.apache.org/confluence/display/FLINK/FLIP-523%3A+Handle+TLS+Certificate+Renewal > Handle TLS Certificate Renewal > -- > > Key: FLINK-37504 > URL: https://issues.apache.org/jira/browse/FLINK-37504 > Project: Flink > Issue Type: Improvement >Reporter: Nicolas Fraison >Priority: Minor > Labels: pull-request-available > > Flink does not reload certificates if the underlying truststore and keystore are > updated. > We aim at using 1-day validity certificates, which currently means having to > restart our jobs every day. > In order to avoid this we will need to add a feature to reload TLS > certificates when the underlying truststore and keystore are updated -- This message was sent by Atlassian Jira (v8.20.10#820010)