[jira] [Created] (FLINK-12869) Add yarn acls capability to flink containers

2019-06-17 Thread Nicolas Fraison (JIRA)
Nicolas Fraison created FLINK-12869:
---

 Summary: Add yarn acls capability to flink containers
 Key: FLINK-12869
 URL: https://issues.apache.org/jira/browse/FLINK-12869
 Project: Flink
  Issue Type: New Feature
  Components: Deployment / YARN
Reporter: Nicolas Fraison


YARN provides an application ACLs mechanism that grants specific rights to users 
other than the one running the job (view logs through the ResourceManager/job 
history server, kill the application).
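
For context, YARN exposes these ACLs on the container launch context. Below is a minimal 
sketch of how view/modify ACLs could be attached when building a container context; the 
helper and its wiring into Flink's YARN deployment are illustrative assumptions, not the 
actual implementation:
{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.ApplicationAccessType;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

public class YarnAclSketch {

    // Hypothetical helper: attach application ACLs to a container launch context.
    // viewUsers/modifyUsers follow the usual "user1,user2 group1,group2" YARN syntax.
    public static ContainerLaunchContext withAcls(String viewUsers, String modifyUsers) {
        Map<ApplicationAccessType, String> acls = new HashMap<>();
        acls.put(ApplicationAccessType.VIEW_APP, viewUsers);     // e.g. view logs via RM / job history
        acls.put(ApplicationAccessType.MODIFY_APP, modifyUsers); // e.g. kill the application
        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
        ctx.setApplicationACLs(acls);
        return ctx;
    }
}
{code}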



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-35489) Metaspace size can be too little after autotuning change memory setting

2024-06-10 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853647#comment-17853647
 ] 

Nicolas Fraison commented on FLINK-35489:
-

Thanks [~mxm] for the feedback. I've updated the 
[PR|https://github.com/apache/flink-kubernetes-operator/pull/833/files] so that 
it no longer limits the metaspace size.

> Metaspace size can be too little after autotuning change memory setting
> ---
>
> Key: FLINK-35489
> URL: https://issues.apache.org/jira/browse/FLINK-35489
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Affects Versions: 1.8.0
>Reporter: Nicolas Fraison
>Priority: Major
>  Labels: pull-request-available
>
> We have enabled the autotuning feature on one of our Flink jobs with the 
> config below:
> {code:java}
> # Autoscaler configuration
> job.autoscaler.enabled: "true"
> job.autoscaler.stabilization.interval: 1m
> job.autoscaler.metrics.window: 10m
> job.autoscaler.target.utilization: "0.8"
> job.autoscaler.target.utilization.boundary: "0.1"
> job.autoscaler.restart.time: 2m
> job.autoscaler.catch-up.duration: 10m
> job.autoscaler.memory.tuning.enabled: true
> job.autoscaler.memory.tuning.overhead: 0.5
> job.autoscaler.memory.tuning.maximize-managed-memory: true{code}
> During a scale down, the autotuning decided to give all the memory to the JVM 
> (the heap being scaled by 2), setting taskmanager.memory.managed.size to 0b.
> Here is the config computed by the autotuning for a TM running on a 4GB pod:
> {code:java}
> taskmanager.memory.network.max: 4063232b
> taskmanager.memory.network.min: 4063232b
> taskmanager.memory.jvm-overhead.max: 433791712b
> taskmanager.memory.task.heap.size: 3699934605b
> taskmanager.memory.framework.off-heap.size: 134217728b
> taskmanager.memory.jvm-metaspace.size: 22960020b
> taskmanager.memory.framework.heap.size: "0 bytes"
> taskmanager.memory.flink.size: 3838215565b
> taskmanager.memory.managed.size: 0b {code}
> This led to issues starting the TM because we rely on a javaagent that 
> performs memory allocation outside of the JVM (through C bindings).
> Tuning the overhead or disabling scale-down-compensation.enabled could have 
> helped for that particular event, but this can lead to other issues as it could 
> result in too little heap being computed.
> It would be interesting to be able to set a minimum memory.managed.size to be 
> taken into account by the autotuning.
> What do you think about this? Do you think some other specific config should 
> have been applied to avoid this issue?
>  
> Edit: see this comment, which identifies the metaspace issue: 
> https://issues.apache.org/jira/browse/FLINK-35489?focusedCommentId=17850694&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17850694



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34131) Checkpoint check window should take in account checkpoint job configuration

2024-01-17 Thread Nicolas Fraison (Jira)
Nicolas Fraison created FLINK-34131:
---

 Summary: Checkpoint check window should take in account checkpoint 
job configuration
 Key: FLINK-34131
 URL: https://issues.apache.org/jira/browse/FLINK-34131
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Nicolas Fraison


When enabling the checkpoint progress check 
(kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) to define 
cluster health, the operator relies on detecting whether a checkpoint has been 
performed during 
kubernetes.operator.cluster.health-check.checkpoint-progress.window.

As indicated in the doc, it must be bigger than the checkpointing interval.

But this is a manual configuration, which can lead to misconfiguration and an 
unwanted restart of the Flink cluster if the checkpointing interval is bigger 
than the window.

The operator must check that the config is consistent before relying on this 
check. If it is not well set, it should not execute the check (return true from 
[evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50])
 and log a WARN message.

Flink jobs also have other checkpointing parameters that should be taken into 
account for this window configuration: execution.checkpointing.timeout and 
execution.checkpointing.tolerable-failed-checkpoints.

The idea would be to check that 
kubernetes.operator.cluster.health-check.checkpoint-progress.window >= 
(execution.checkpointing.interval + execution.checkpointing.timeout) * 
execution.checkpointing.tolerable-failed-checkpoints
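
As an illustration, the consistency check could look roughly like the sketch below 
(the class and method names are hypothetical, not the operator's actual API):
{code:java}
import java.time.Duration;

public final class CheckpointWindowSanityCheck {

    /**
     * Hypothetical sketch: the health-check window is only usable if it is long
     * enough to observe at least one successful checkpoint given the job config.
     */
    static boolean isWindowConsistent(
            Duration progressWindow,          // ...checkpoint-progress.window
            Duration checkpointInterval,      // execution.checkpointing.interval
            Duration checkpointTimeout,       // execution.checkpointing.timeout
            int tolerableFailedCheckpoints) { // execution.checkpointing.tolerable-failed-checkpoints
        Duration required = checkpointInterval
                .plus(checkpointTimeout)
                .multipliedBy(Math.max(1, tolerableFailedCheckpoints));
        return progressWindow.compareTo(required) >= 0;
    }
}
{code}
If the predicate is false, the operator would skip the evaluation (returning true from 
evaluateCheckpoints) and log a WARN instead of restarting the cluster.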



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34131) Checkpoint check window should take in account checkpoint job configuration

2024-01-18 Thread Nicolas Fraison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Fraison updated FLINK-34131:

Priority: Minor  (was: Major)

> Checkpoint check window should take in account checkpoint job configuration
> ---
>
> Key: FLINK-34131
> URL: https://issues.apache.org/jira/browse/FLINK-34131
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Nicolas Fraison
>Priority: Minor
>
> When enabling the checkpoint progress check 
> (kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) to 
> define cluster health, the operator relies on detecting whether a checkpoint 
> has been performed during 
> kubernetes.operator.cluster.health-check.checkpoint-progress.window.
> As indicated in the doc, it must be bigger than the checkpointing interval.
> But this is a manual configuration, which can lead to misconfiguration and an 
> unwanted restart of the Flink cluster if the checkpointing interval is bigger 
> than the window.
> The operator must check that the config is consistent before relying on this 
> check. If it is not well set, it should not execute the check (return true from 
> [evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50])
>  and log a WARN message.
> Flink jobs also have other checkpointing parameters that should be taken into 
> account for this window configuration: 
> execution.checkpointing.timeout and 
> execution.checkpointing.tolerable-failed-checkpoints.
> The idea would be to check that 
> kubernetes.operator.cluster.health-check.checkpoint-progress.window >= 
> (execution.checkpointing.interval + execution.checkpointing.timeout) * 
> execution.checkpointing.tolerable-failed-checkpoints



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34131) Checkpoint check window should take in account checkpoint job configuration

2024-01-18 Thread Nicolas Fraison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Fraison updated FLINK-34131:

Description: 
When enabling the checkpoint progress check 
(kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) to define 
cluster health, the operator relies on detecting whether a checkpoint has been 
performed during 
kubernetes.operator.cluster.health-check.checkpoint-progress.window.

As indicated in the doc, it must be bigger than the checkpointing interval.

But this is a manual configuration, which can lead to misconfiguration and an 
unwanted restart of the Flink cluster if the checkpointing interval is bigger 
than the window.

The operator must check that the config is consistent before relying on this 
check. If it is not well set, it should not execute the check (return true from 
[evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50])
 and log a WARN message.

Flink jobs also have other checkpointing parameters that should be taken into 
account for this window configuration: execution.checkpointing.timeout and 
execution.checkpointing.tolerable-failed-checkpoints.

The idea would be to check that 
kubernetes.operator.cluster.health-check.checkpoint-progress.window >= 
max(execution.checkpointing.interval, execution.checkpointing.timeout * 
execution.checkpointing.tolerable-failed-checkpoints)

  was:
When enabling the checkpoint progress check 
(kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) to define 
cluster health, the operator relies on detecting whether a checkpoint has been 
performed during 
kubernetes.operator.cluster.health-check.checkpoint-progress.window.

As indicated in the doc, it must be bigger than the checkpointing interval.

But this is a manual configuration, which can lead to misconfiguration and an 
unwanted restart of the Flink cluster if the checkpointing interval is bigger 
than the window.

The operator must check that the config is consistent before relying on this 
check. If it is not well set, it should not execute the check (return true from 
[evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50])
 and log a WARN message.

Flink jobs also have other checkpointing parameters that should be taken into 
account for this window configuration: execution.checkpointing.timeout and 
execution.checkpointing.tolerable-failed-checkpoints.

The idea would be to check that 
kubernetes.operator.cluster.health-check.checkpoint-progress.window >= 
(execution.checkpointing.interval + execution.checkpointing.timeout) * 
execution.checkpointing.tolerable-failed-checkpoints


> Checkpoint check window should take in account checkpoint job configuration
> ---
>
> Key: FLINK-34131
> URL: https://issues.apache.org/jira/browse/FLINK-34131
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Nicolas Fraison
>Priority: Minor
>
> When enabling the checkpoint progress check 
> (kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) to 
> define cluster health, the operator relies on detecting whether a checkpoint 
> has been performed during 
> kubernetes.operator.cluster.health-check.checkpoint-progress.window.
> As indicated in the doc, it must be bigger than the checkpointing interval.
> But this is a manual configuration, which can lead to misconfiguration and an 
> unwanted restart of the Flink cluster if the checkpointing interval is bigger 
> than the window.
> The operator must check that the config is consistent before relying on this 
> check. If it is not well set, it should not execute the check (return true from 
> [evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50])
>  and log a WARN message.
> Flink jobs also have other checkpointing parameters that should be taken into 
> account for this window configuration: 
> execution.checkpointing.timeout and 
> execution.checkpointing.tolerable-failed-checkpoints.
> The idea would be to check that 
> kubernetes.operator.cluster.health-check.checkpoint-progress.window >= 
> max(execution.checkpointing.interval, execution.checkpointing.timeout * 
> execution.checkpointing.tolerable-failed-checkpoints)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34131) Checkpoint check window should take in account checkpoint job configuration

2024-01-18 Thread Nicolas Fraison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Fraison updated FLINK-34131:

Description: 
When enabling the checkpoint progress check 
(kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) to define 
cluster health, the operator relies on detecting whether a checkpoint has been 
performed during 
kubernetes.operator.cluster.health-check.checkpoint-progress.window.

As indicated in the doc, it must be bigger than the checkpointing interval.

But this is a manual configuration, which can lead to misconfiguration and an 
unwanted restart of the Flink cluster if the checkpointing interval is bigger 
than the window.

The operator must check that the config is consistent before relying on this 
check. If it is not well set, it should not execute the check (return true from 
[evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50])
 and log a WARN message.

Flink jobs also have other checkpointing parameters that should be taken into 
account for this window configuration: execution.checkpointing.timeout and 
execution.checkpointing.tolerable-failed-checkpoints.

The idea would be to check that 
kubernetes.operator.cluster.health-check.checkpoint-progress.window >= 
max(execution.checkpointing.interval * 
execution.checkpointing.tolerable-failed-checkpoints, 
execution.checkpointing.timeout * 
execution.checkpointing.tolerable-failed-checkpoints)

  was:
When enabling the checkpoint progress check 
(kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) to define 
cluster health, the operator relies on detecting whether a checkpoint has been 
performed during 
kubernetes.operator.cluster.health-check.checkpoint-progress.window.

As indicated in the doc, it must be bigger than the checkpointing interval.

But this is a manual configuration, which can lead to misconfiguration and an 
unwanted restart of the Flink cluster if the checkpointing interval is bigger 
than the window.

The operator must check that the config is consistent before relying on this 
check. If it is not well set, it should not execute the check (return true from 
[evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50])
 and log a WARN message.

Flink jobs also have other checkpointing parameters that should be taken into 
account for this window configuration: execution.checkpointing.timeout and 
execution.checkpointing.tolerable-failed-checkpoints.

The idea would be to check that 
kubernetes.operator.cluster.health-check.checkpoint-progress.window >= 
max(execution.checkpointing.interval, execution.checkpointing.timeout * 
execution.checkpointing.tolerable-failed-checkpoints)


> Checkpoint check window should take in account checkpoint job configuration
> ---
>
> Key: FLINK-34131
> URL: https://issues.apache.org/jira/browse/FLINK-34131
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Nicolas Fraison
>Priority: Minor
>
> When enabling the checkpoint progress check 
> (kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) to 
> define cluster health, the operator relies on detecting whether a checkpoint 
> has been performed during 
> kubernetes.operator.cluster.health-check.checkpoint-progress.window.
> As indicated in the doc, it must be bigger than the checkpointing interval.
> But this is a manual configuration, which can lead to misconfiguration and an 
> unwanted restart of the Flink cluster if the checkpointing interval is bigger 
> than the window.
> The operator must check that the config is consistent before relying on this 
> check. If it is not well set, it should not execute the check (return true from 
> [evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50])
>  and log a WARN message.
> Flink jobs also have other checkpointing parameters that should be taken into 
> account for this window configuration: 
> execution.checkpointing.timeout and 
> execution.checkpointing.tolerable-failed-checkpoints.
> The idea would be to check that 
> kubernetes.operator.cluster.health-check.checkpoint-progress.window >= 
> max(execution.checkpointing.interval * 
> execution.checkpointing.tolerable-failed-checkpoints, 
> execution.checkpointing.timeout * 
> execution.checkpointing.tolerable-failed-checkpoints)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (FLINK-33797) Flink application FAILED log in info instead of warn

2023-12-11 Thread Nicolas Fraison (Jira)
Nicolas Fraison created FLINK-33797:
---

 Summary: Flink application FAILED log in info instead of warn
 Key: FLINK-33797
 URL: https://issues.apache.org/jira/browse/FLINK-33797
 Project: Flink
  Issue Type: Bug
Reporter: Nicolas Fraison


When the Flink application ends in a cancelled or failed state, the termination 
is logged at INFO level while it should be WARN:  
https://github.com/apache/flink/blob/548e4b5188bb3f092206182d779a909756408660/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L174
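
A minimal sketch of the kind of change being proposed, using SLF4J logging (a local 
enum stands in for Flink's ApplicationStatus; this is not a patch against the actual 
ApplicationDispatcherBootstrap code):
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class ApplicationCompletionLogging {

    private static final Logger LOG =
            LoggerFactory.getLogger(ApplicationCompletionLogging.class);

    enum ApplicationStatus { SUCCEEDED, FAILED, CANCELED }

    // Terminal FAILED/CANCELED states would be logged at WARN instead of INFO.
    static void logCompletion(String jobId, ApplicationStatus status) {
        if (status == ApplicationStatus.FAILED || status == ApplicationStatus.CANCELED) {
            LOG.warn("Application {} finished with state {}.", jobId, status);
        } else {
            LOG.info("Application {} finished with state {}.", jobId, status);
        }
    }
}
{code}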



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35489) Add capability to set min taskmanager.memory.managed.size when enabling autotuning

2024-05-30 Thread Nicolas Fraison (Jira)
Nicolas Fraison created FLINK-35489:
---

 Summary: Add capability to set min taskmanager.memory.managed.size 
when enabling autotuning
 Key: FLINK-35489
 URL: https://issues.apache.org/jira/browse/FLINK-35489
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Affects Versions: 1.8.0
Reporter: Nicolas Fraison


We have enabled the autotuning feature on one of our Flink jobs with the config below:
{code:java}
# Autoscaler configuration
job.autoscaler.enabled: "true"
job.autoscaler.stabilization.interval: 1m
job.autoscaler.metrics.window: 10m
job.autoscaler.target.utilization: "0.8"
job.autoscaler.target.utilization.boundary: "0.1"
job.autoscaler.restart.time: 2m
job.autoscaler.catch-up.duration: 10m
job.autoscaler.memory.tuning.enabled: true
job.autoscaler.memory.tuning.overhead: 0.5
job.autoscaler.memory.tuning.maximize-managed-memory: true{code}
During a scale down, the autotuning decided to give all the memory to the JVM 
(the heap being scaled by 2), setting taskmanager.memory.managed.size to 0b.
Here is the config computed by the autotuning for a TM running on a 4GB pod:
{code:java}
taskmanager.memory.network.max: 4063232b
taskmanager.memory.network.min: 4063232b
taskmanager.memory.jvm-overhead.max: 433791712b
taskmanager.memory.task.heap.size: 3699934605b
taskmanager.memory.framework.off-heap.size: 134217728b
taskmanager.memory.jvm-metaspace.size: 22960020b
taskmanager.memory.framework.heap.size: "0 bytes"
taskmanager.memory.flink.size: 3838215565b
taskmanager.memory.managed.size: 0b {code}
This led to issues starting the TM because we rely on a javaagent that performs 
memory allocation outside of the JVM (through C bindings).

Tuning the overhead or disabling scale-down-compensation.enabled could have 
helped for that particular event, but this can lead to other issues as it could 
result in too little heap being computed.

It would be interesting to be able to set a minimum memory.managed.size to be 
taken into account by the autotuning.
What do you think about this? Do you think some other specific config should 
have been applied to avoid this issue?
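
If such a knob existed, the usage could look like the snippet below (the option name is 
purely hypothetical and not part of the operator today; it only illustrates the idea of a 
floor for managed memory during tuning):
{code:java}
# Hypothetical option: lower bound the memory tuning would keep for managed memory
job.autoscaler.memory.tuning.enabled: true
job.autoscaler.memory.tuning.managed.min-size: 512mb
{code}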



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-35489) Add capability to set min taskmanager.memory.managed.size when enabling autotuning

2024-05-30 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850694#comment-17850694
 ] 

Nicolas Fraison commented on FLINK-35489:
-

Thanks [~fanrui] for the feedback, it helped me realise that my analysis was wrong.

The issue we are facing is the JVM crashing after the 
[autotuning|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autotuning/]
 changes some memory config:

{code:java}
Starting kubernetes-taskmanager as a console application on host 
flink-kafka-job-apache-right-taskmanager-1-1.
Exception in thread "main" *** java.lang.instrument ASSERTION FAILED ***: 
"result" with message agent load/premain call failed at 
src/java.instrument/share/native/libinstrument/JPLISAgent.c line: 422
FATAL ERROR in native method: processing of -javaagent failed, processJavaStart 
failed
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, 
Vv=VM code, C=native code)
V  [libjvm.so+0x78dee4]  jni_FatalError+0x70
V  [libjvm.so+0x88df00]  JvmtiExport::post_vm_initialized()+0x240
V  [libjvm.so+0xc353fc]  Threads::create_vm(JavaVMInitArgs*, bool*)+0x7ac
V  [libjvm.so+0x79c05c]  JNI_CreateJavaVM+0x7c
C  [libjli.so+0x3b2c]  JavaMain+0x7c
C  [libjli.so+0x7fdc]  ThreadJavaMain+0xc
C  [libpthread.so.0+0x7624]  start_thread+0x184 {code}
Seeing this big increase of heap (from 1.5GB to more than 3GB) and the fact that 
memory.managed.size was shrunk to 0b made me think it was linked to missing 
off-heap memory.

But you are right that jvm-overhead already reserves some memory for off-heap 
usage (and we indeed have around 400 MB with that config).

So looking back at the new config, I've identified the issue: the jvm-metaspace 
has been shrunk to 22MB while it was previously set to 256MB.
I've done a test increasing this parameter and the TM is now able to start.

For the metaspace size computation, I can see the autotuning computing 
METASPACE_MEMORY_USED=1.41521584E8, which seems to be an appropriate metaspace 
sizing.

But due to the memBudget management it ends up assigning only 22MB to the 
metaspace ([it first allocates the remaining memory to the heap, then the new 
remainder to metaspace, and finally to managed 
memory|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/tuning/MemoryTuning.java#L130])
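
In other words, the budget is handed out sequentially, so a heap ask close to the whole 
budget leaves almost nothing for metaspace. A simplified sketch of that ordering 
(illustrative numbers and class names, not the operator's actual MemoryTuning code):
{code:java}
public class MemoryBudgetSketch {

    // Simplified model of a sequential memory budget (in bytes): each call grants
    // min(ask, remaining) and shrinks the budget, so later consumers can be starved.
    private long remainingBytes;

    MemoryBudgetSketch(long totalBytes) {
        this.remainingBytes = totalBytes;
    }

    long budget(long askBytes) {
        long granted = Math.min(askBytes, remainingBytes);
        remainingBytes -= granted;
        return granted;
    }

    public static void main(String[] args) {
        // Illustrative numbers only, not the exact values from this ticket.
        MemoryBudgetSketch b = new MemoryBudgetSketch(4L << 30);  // ~4 GiB budget
        long heap = b.budget((4L << 30) - (22L << 20));           // heap is served first
        long metaspace = b.budget(141L << 20);                    // metaspace needs ~141 MiB...
        long managed = b.budget(Long.MAX_VALUE);                  // ...managed memory gets what is left
        System.out.printf("heap=%d metaspace=%d managed=%d%n", heap, metaspace, managed);
    }
}
{code}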

 

> Add capability to set min taskmanager.memory.managed.size when enabling 
> autotuning
> --
>
> Key: FLINK-35489
> URL: https://issues.apache.org/jira/browse/FLINK-35489
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Affects Versions: 1.8.0
>Reporter: Nicolas Fraison
>Priority: Major
>
> We have enabled the autotuning feature on one of our Flink jobs with the 
> config below:
> {code:java}
> # Autoscaler configuration
> job.autoscaler.enabled: "true"
> job.autoscaler.stabilization.interval: 1m
> job.autoscaler.metrics.window: 10m
> job.autoscaler.target.utilization: "0.8"
> job.autoscaler.target.utilization.boundary: "0.1"
> job.autoscaler.restart.time: 2m
> job.autoscaler.catch-up.duration: 10m
> job.autoscaler.memory.tuning.enabled: true
> job.autoscaler.memory.tuning.overhead: 0.5
> job.autoscaler.memory.tuning.maximize-managed-memory: true{code}
> During a scale down, the autotuning decided to give all the memory to the JVM 
> (the heap being scaled by 2), setting taskmanager.memory.managed.size to 0b.
> Here is the config computed by the autotuning for a TM running on a 4GB pod:
> {code:java}
> taskmanager.memory.network.max: 4063232b
> taskmanager.memory.network.min: 4063232b
> taskmanager.memory.jvm-overhead.max: 433791712b
> taskmanager.memory.task.heap.size: 3699934605b
> taskmanager.memory.framework.off-heap.size: 134217728b
> taskmanager.memory.jvm-metaspace.size: 22960020b
> taskmanager.memory.framework.heap.size: "0 bytes"
> taskmanager.memory.flink.size: 3838215565b
> taskmanager.memory.managed.size: 0b {code}
> This led to issues starting the TM because we rely on a javaagent that 
> performs memory allocation outside of the JVM (through C bindings).
> Tuning the overhead or disabling scale-down-compensation.enabled could have 
> helped for that particular event, but this can lead to other issues as it could 
> result in too little heap being computed.
> It would be interesting to be able to set a minimum memory.managed.size to be 
> taken into account by the autotuning.
> What do you think about this? Do you think some other specific config should 
> have been applied to avoid this issue?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-35489) Metaspace size can be too little after autotuning change memory setting

2024-05-30 Thread Nicolas Fraison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Fraison updated FLINK-35489:

Summary: Metaspace size can be too little after autotuning change memory 
setting  (was: Add capability to set min taskmanager.memory.managed.size when 
enabling autotuning)

> Metaspace size can be too little after autotuning change memory setting
> ---
>
> Key: FLINK-35489
> URL: https://issues.apache.org/jira/browse/FLINK-35489
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Affects Versions: 1.8.0
>Reporter: Nicolas Fraison
>Priority: Major
>
> We have enabled the autotuning feature on one of our Flink jobs with the 
> config below:
> {code:java}
> # Autoscaler configuration
> job.autoscaler.enabled: "true"
> job.autoscaler.stabilization.interval: 1m
> job.autoscaler.metrics.window: 10m
> job.autoscaler.target.utilization: "0.8"
> job.autoscaler.target.utilization.boundary: "0.1"
> job.autoscaler.restart.time: 2m
> job.autoscaler.catch-up.duration: 10m
> job.autoscaler.memory.tuning.enabled: true
> job.autoscaler.memory.tuning.overhead: 0.5
> job.autoscaler.memory.tuning.maximize-managed-memory: true{code}
> During a scale down, the autotuning decided to give all the memory to the JVM 
> (the heap being scaled by 2), setting taskmanager.memory.managed.size to 0b.
> Here is the config computed by the autotuning for a TM running on a 4GB pod:
> {code:java}
> taskmanager.memory.network.max: 4063232b
> taskmanager.memory.network.min: 4063232b
> taskmanager.memory.jvm-overhead.max: 433791712b
> taskmanager.memory.task.heap.size: 3699934605b
> taskmanager.memory.framework.off-heap.size: 134217728b
> taskmanager.memory.jvm-metaspace.size: 22960020b
> taskmanager.memory.framework.heap.size: "0 bytes"
> taskmanager.memory.flink.size: 3838215565b
> taskmanager.memory.managed.size: 0b {code}
> This led to issues starting the TM because we rely on a javaagent that 
> performs memory allocation outside of the JVM (through C bindings).
> Tuning the overhead or disabling scale-down-compensation.enabled could have 
> helped for that particular event, but this can lead to other issues as it could 
> result in too little heap being computed.
> It would be interesting to be able to set a minimum memory.managed.size to be 
> taken into account by the autotuning.
> What do you think about this? Do you think some other specific config should 
> have been applied to avoid this issue?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-35489) Metaspace size can be too little after autotuning change memory setting

2024-05-30 Thread Nicolas Fraison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Fraison updated FLINK-35489:

Description: 
We have enabled the autotuning feature on one of our Flink jobs with the config below:
{code:java}
# Autoscaler configuration
job.autoscaler.enabled: "true"
job.autoscaler.stabilization.interval: 1m
job.autoscaler.metrics.window: 10m
job.autoscaler.target.utilization: "0.8"
job.autoscaler.target.utilization.boundary: "0.1"
job.autoscaler.restart.time: 2m
job.autoscaler.catch-up.duration: 10m
job.autoscaler.memory.tuning.enabled: true
job.autoscaler.memory.tuning.overhead: 0.5
job.autoscaler.memory.tuning.maximize-managed-memory: true{code}
During a scale down, the autotuning decided to give all the memory to the JVM 
(the heap being scaled by 2), setting taskmanager.memory.managed.size to 0b.
Here is the config computed by the autotuning for a TM running on a 4GB pod:
{code:java}
taskmanager.memory.network.max: 4063232b
taskmanager.memory.network.min: 4063232b
taskmanager.memory.jvm-overhead.max: 433791712b
taskmanager.memory.task.heap.size: 3699934605b
taskmanager.memory.framework.off-heap.size: 134217728b
taskmanager.memory.jvm-metaspace.size: 22960020b
taskmanager.memory.framework.heap.size: "0 bytes"
taskmanager.memory.flink.size: 3838215565b
taskmanager.memory.managed.size: 0b {code}
This led to issues starting the TM because we rely on a javaagent that performs 
memory allocation outside of the JVM (through C bindings).

Tuning the overhead or disabling scale-down-compensation.enabled could have 
helped for that particular event, but this can lead to other issues as it could 
result in too little heap being computed.

It would be interesting to be able to set a minimum memory.managed.size to be 
taken into account by the autotuning.
What do you think about this? Do you think some other specific config should 
have been applied to avoid this issue?

 

Edit

  was:
We have enabled the autotuning feature on one of our Flink jobs with the config below:
{code:java}
# Autoscaler configuration
job.autoscaler.enabled: "true"
job.autoscaler.stabilization.interval: 1m
job.autoscaler.metrics.window: 10m
job.autoscaler.target.utilization: "0.8"
job.autoscaler.target.utilization.boundary: "0.1"
job.autoscaler.restart.time: 2m
job.autoscaler.catch-up.duration: 10m
job.autoscaler.memory.tuning.enabled: true
job.autoscaler.memory.tuning.overhead: 0.5
job.autoscaler.memory.tuning.maximize-managed-memory: true{code}
During a scale down, the autotuning decided to give all the memory to the JVM 
(the heap being scaled by 2), setting taskmanager.memory.managed.size to 0b.
Here is the config computed by the autotuning for a TM running on a 4GB pod:
{code:java}
taskmanager.memory.network.max: 4063232b
taskmanager.memory.network.min: 4063232b
taskmanager.memory.jvm-overhead.max: 433791712b
taskmanager.memory.task.heap.size: 3699934605b
taskmanager.memory.framework.off-heap.size: 134217728b
taskmanager.memory.jvm-metaspace.size: 22960020b
taskmanager.memory.framework.heap.size: "0 bytes"
taskmanager.memory.flink.size: 3838215565b
taskmanager.memory.managed.size: 0b {code}
This led to issues starting the TM because we rely on a javaagent that performs 
memory allocation outside of the JVM (through C bindings).

Tuning the overhead or disabling scale-down-compensation.enabled could have 
helped for that particular event, but this can lead to other issues as it could 
result in too little heap being computed.

It would be interesting to be able to set a minimum memory.managed.size to be 
taken into account by the autotuning.
What do you think about this? Do you think some other specific config should 
have been applied to avoid this issue?


> Metaspace size can be too little after autotuning change memory setting
> ---
>
> Key: FLINK-35489
> URL: https://issues.apache.org/jira/browse/FLINK-35489
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Affects Versions: 1.8.0
>Reporter: Nicolas Fraison
>Priority: Major
>
> We have enabled the autotuning feature on one of our Flink jobs with the 
> config below:
> {code:java}
> # Autoscaler configuration
> job.autoscaler.enabled: "true"
> job.autoscaler.stabilization.interval: 1m
> job.autoscaler.metrics.window: 10m
> job.autoscaler.target.utilization: "0.8"
> job.autoscaler.target.utilization.boundary: "0.1"
> job.autoscaler.restart.time: 2m
> job.autoscaler.catch-up.duration: 10m
> job.autoscaler.memory.tuning.enabled: true
> job.autoscaler.memory.tuning.overhead: 0.5
> job.autoscaler.memory.tuning.maximize-managed-memory: true{code}
> During a scale

[jira] [Updated] (FLINK-35489) Metaspace size can be too little after autotuning change memory setting

2024-05-30 Thread Nicolas Fraison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Fraison updated FLINK-35489:

Description: 
We have enabled the autotuning feature on one of our Flink jobs with the config below:
{code:java}
# Autoscaler configuration
job.autoscaler.enabled: "true"
job.autoscaler.stabilization.interval: 1m
job.autoscaler.metrics.window: 10m
job.autoscaler.target.utilization: "0.8"
job.autoscaler.target.utilization.boundary: "0.1"
job.autoscaler.restart.time: 2m
job.autoscaler.catch-up.duration: 10m
job.autoscaler.memory.tuning.enabled: true
job.autoscaler.memory.tuning.overhead: 0.5
job.autoscaler.memory.tuning.maximize-managed-memory: true{code}
During a scale down, the autotuning decided to give all the memory to the JVM 
(the heap being scaled by 2), setting taskmanager.memory.managed.size to 0b.
Here is the config computed by the autotuning for a TM running on a 4GB pod:
{code:java}
taskmanager.memory.network.max: 4063232b
taskmanager.memory.network.min: 4063232b
taskmanager.memory.jvm-overhead.max: 433791712b
taskmanager.memory.task.heap.size: 3699934605b
taskmanager.memory.framework.off-heap.size: 134217728b
taskmanager.memory.jvm-metaspace.size: 22960020b
taskmanager.memory.framework.heap.size: "0 bytes"
taskmanager.memory.flink.size: 3838215565b
taskmanager.memory.managed.size: 0b {code}
This led to issues starting the TM because we rely on a javaagent that performs 
memory allocation outside of the JVM (through C bindings).

Tuning the overhead or disabling scale-down-compensation.enabled could have 
helped for that particular event, but this can lead to other issues as it could 
result in too little heap being computed.

It would be interesting to be able to set a minimum memory.managed.size to be 
taken into account by the autotuning.
What do you think about this? Do you think some other specific config should 
have been applied to avoid this issue?

 

Edit: see this comment, which identifies the metaspace issue: 
https://issues.apache.org/jira/browse/FLINK-35489?focusedCommentId=17850694&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17850694

  was:
We have enabled the autotuning feature on one of our Flink jobs with the config below:
{code:java}
# Autoscaler configuration
job.autoscaler.enabled: "true"
job.autoscaler.stabilization.interval: 1m
job.autoscaler.metrics.window: 10m
job.autoscaler.target.utilization: "0.8"
job.autoscaler.target.utilization.boundary: "0.1"
job.autoscaler.restart.time: 2m
job.autoscaler.catch-up.duration: 10m
job.autoscaler.memory.tuning.enabled: true
job.autoscaler.memory.tuning.overhead: 0.5
job.autoscaler.memory.tuning.maximize-managed-memory: true{code}
During a scale down, the autotuning decided to give all the memory to the JVM 
(the heap being scaled by 2), setting taskmanager.memory.managed.size to 0b.
Here is the config computed by the autotuning for a TM running on a 4GB pod:
{code:java}
taskmanager.memory.network.max: 4063232b
taskmanager.memory.network.min: 4063232b
taskmanager.memory.jvm-overhead.max: 433791712b
taskmanager.memory.task.heap.size: 3699934605b
taskmanager.memory.framework.off-heap.size: 134217728b
taskmanager.memory.jvm-metaspace.size: 22960020b
taskmanager.memory.framework.heap.size: "0 bytes"
taskmanager.memory.flink.size: 3838215565b
taskmanager.memory.managed.size: 0b {code}
This led to issues starting the TM because we rely on a javaagent that performs 
memory allocation outside of the JVM (through C bindings).

Tuning the overhead or disabling scale-down-compensation.enabled could have 
helped for that particular event, but this can lead to other issues as it could 
result in too little heap being computed.

It would be interesting to be able to set a minimum memory.managed.size to be 
taken into account by the autotuning.
What do you think about this? Do you think some other specific config should 
have been applied to avoid this issue?

 

Edit


> Metaspace size can be too little after autotuning change memory setting
> ---
>
> Key: FLINK-35489
> URL: https://issues.apache.org/jira/browse/FLINK-35489
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Affects Versions: 1.8.0
>Reporter: Nicolas Fraison
>Priority: Major
>
> We have enabled the autotuning feature on one of our Flink jobs with the 
> config below:
> {code:java}
> # Autoscaler configuration
> job.autoscaler.enabled: "true"
> job.autoscaler.stabilization.interval: 1m
> job.autoscaler.metrics.window: 10m
> job.autoscaler.target.utilization: "0.8"
> job.autoscaler.target.utilization.boundary: "0.1"
> job.autoscaler.res

[jira] [Comment Edited] (FLINK-35489) Metaspace size can be too little after autotuning change memory setting

2024-05-30 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850694#comment-17850694
 ] 

Nicolas Fraison edited comment on FLINK-35489 at 5/30/24 12:07 PM:
---

Thanks [~fanrui] for the feedback, it helped me realise that my analysis was wrong.

The issue we are facing is the JVM crashing after the 
[autotuning|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autotuning/]
 changes some memory config:
{code:java}
Starting kubernetes-taskmanager as a console application on host 
flink-kafka-job-apache-right-taskmanager-1-1.
Exception in thread "main" *** java.lang.instrument ASSERTION FAILED ***: 
"result" with message agent load/premain call failed at 
src/java.instrument/share/native/libinstrument/JPLISAgent.c line: 422
FATAL ERROR in native method: processing of -javaagent failed, processJavaStart 
failed
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, 
Vv=VM code, C=native code)
V  [libjvm.so+0x78dee4]  jni_FatalError+0x70
V  [libjvm.so+0x88df00]  JvmtiExport::post_vm_initialized()+0x240
V  [libjvm.so+0xc353fc]  Threads::create_vm(JavaVMInitArgs*, bool*)+0x7ac
V  [libjvm.so+0x79c05c]  JNI_CreateJavaVM+0x7c
C  [libjli.so+0x3b2c]  JavaMain+0x7c
C  [libjli.so+0x7fdc]  ThreadJavaMain+0xc
C  [libpthread.so.0+0x7624]  start_thread+0x184 {code}
Seeing this big increase of heap (from 1.5GB to more than 3GB) and the fact that 
memory.managed.size was shrunk to 0b made me think it was linked to missing 
off-heap memory.

But you are right that jvm-overhead already reserves some memory for off-heap 
usage (and we indeed have around 400 MB with that config).

So looking back at the new config, I've identified the issue: the jvm-metaspace 
has been shrunk to 22MB while it was previously set to 256MB.
I've done a test increasing this parameter and the TM is now able to start.

For the metaspace size computation, I can see the autotuning computing 
METASPACE_MEMORY_USED=1.41521584E8, which seems to be an appropriate metaspace 
sizing.

But due to the memBudget management it ends up assigning only 22MB to the 
metaspace ([it first allocates the remaining memory to the heap, then the new 
remainder to metaspace, and finally to managed 
memory|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/tuning/MemoryTuning.java#L130])

 


was (Author: JIRAUSER299678):
Thanks [~fanrui] for the feedback, it helped me realise that my analysis was wrong.

The issue we are facing is the JVM crashing after the 
[autotuning|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autotuning/]
 changes some memory config:

{code:java}
Starting kubernetes-taskmanager as a console application on host 
flink-kafka-job-apache-right-taskmanager-1-1.
Exception in thread "main" *** java.lang.instrument ASSERTION FAILED ***: 
"result" with message agent load/premain call failed at 
src/java.instrument/share/native/libinstrument/JPLISAgent.c line: 422
FATAL ERROR in native method: processing of -javaagent failed, processJavaStart 
failed
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, 
Vv=VM code, C=native code)
V  [libjvm.so+0x78dee4]  jni_FatalError+0x70
V  [libjvm.so+0x88df00]  JvmtiExport::post_vm_initialized()+0x240
V  [libjvm.so+0xc353fc]  Threads::create_vm(JavaVMInitArgs*, bool*)+0x7ac
V  [libjvm.so+0x79c05c]  JNI_CreateJavaVM+0x7c
C  [libjli.so+0x3b2c]  JavaMain+0x7c
C  [libjli.so+0x7fdc]  ThreadJavaMain+0xc
C  [libpthread.so.0+0x7624]  start_thread+0x184 {code}
Seeing this big increase of heap (from 1.5GB to more than 3GB) and the fact that 
memory.managed.size was shrunk to 0b made me think it was linked to missing 
off-heap memory.

But you are right that jvm-overhead already reserves some memory for off-heap 
usage (and we indeed have around 400 MB with that config).

So looking back at the new config, I've identified the issue: the jvm-metaspace 
has been shrunk to 22MB while it was previously set to 256MB.
I've done a test increasing this parameter and the TM is now able to start.

For the metaspace size computation, I can see the autotuning computing 
METASPACE_MEMORY_USED=1.41521584E8, which seems to be an appropriate metaspace 
sizing.

But due to the memBudget management it ends up assigning only 22MB to the 
metaspace ([it first allocates the remaining memory to the heap, then the new 
remainder to metaspace, and finally to managed 
memory|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/tuning/MemoryTuning.java#L130])

 

> Metaspace size can be too little after autotuning change memory setting
> ---
>
> Key: FLINK-35489
> URL: h

[jira] [Commented] (FLINK-35489) Metaspace size can be too little after autotuning change memory setting

2024-05-31 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850953#comment-17850953
 ] 

Nicolas Fraison commented on FLINK-35489:
-

[~mxm] what do you think about providing memory to METASPACE_MEMORY before 
HEAP_MEMORY (switching those lines: 
[https://github.com/apache/flink-kubernetes-operator/blob/main/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/tuning/MemoryTuning.java#L128-L131]), 
and also ensuring that the computed METASPACE_MEMORY will never be bigger than 
the one assigned by default (from the config taskmanager.memory.jvm-metaspace.size)?

It looks to me like this space should not grow with load changes, and the 
default must be sufficiently big for the TaskManager to run fine. The autotuning 
should only scale this memory space down depending on usage, no?
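
A rough sketch of that idea (hypothetical names, not the actual MemoryTuning code): 
reserve metaspace first, capped at the statically configured 
taskmanager.memory.jvm-metaspace.size, before handing the remaining budget to the heap:
{code:java}
import java.util.function.LongUnaryOperator;

public final class MetaspaceFirstSketch {

    /**
     * Hypothetical helper: reserve metaspace from the budget before the heap,
     * never asking for more than the statically configured metaspace size.
     *
     * @param budget        grants min(ask, remaining) from the overall memory budget
     * @param measuredUsage metaspace usage measured from the TM metrics
     * @param configuredMax the configured taskmanager.memory.jvm-metaspace.size
     */
    static long reserveMetaspace(LongUnaryOperator budget, long measuredUsage, long configuredMax) {
        long ask = Math.min(measuredUsage, configuredMax); // only ever scale metaspace down
        return budget.applyAsLong(ask);
    }
}
{code}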

> Metaspace size can be too little after autotuning change memory setting
> ---
>
> Key: FLINK-35489
> URL: https://issues.apache.org/jira/browse/FLINK-35489
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Affects Versions: 1.8.0
>Reporter: Nicolas Fraison
>Priority: Major
>
> We have enabled the autotuning feature on one of our Flink jobs with the 
> config below:
> {code:java}
> # Autoscaler configuration
> job.autoscaler.enabled: "true"
> job.autoscaler.stabilization.interval: 1m
> job.autoscaler.metrics.window: 10m
> job.autoscaler.target.utilization: "0.8"
> job.autoscaler.target.utilization.boundary: "0.1"
> job.autoscaler.restart.time: 2m
> job.autoscaler.catch-up.duration: 10m
> job.autoscaler.memory.tuning.enabled: true
> job.autoscaler.memory.tuning.overhead: 0.5
> job.autoscaler.memory.tuning.maximize-managed-memory: true{code}
> During a scale down, the autotuning decided to give all the memory to the JVM 
> (the heap being scaled by 2), setting taskmanager.memory.managed.size to 0b.
> Here is the config computed by the autotuning for a TM running on a 4GB pod:
> {code:java}
> taskmanager.memory.network.max: 4063232b
> taskmanager.memory.network.min: 4063232b
> taskmanager.memory.jvm-overhead.max: 433791712b
> taskmanager.memory.task.heap.size: 3699934605b
> taskmanager.memory.framework.off-heap.size: 134217728b
> taskmanager.memory.jvm-metaspace.size: 22960020b
> taskmanager.memory.framework.heap.size: "0 bytes"
> taskmanager.memory.flink.size: 3838215565b
> taskmanager.memory.managed.size: 0b {code}
> This led to issues starting the TM because we rely on a javaagent that 
> performs memory allocation outside of the JVM (through C bindings).
> Tuning the overhead or disabling scale-down-compensation.enabled could have 
> helped for that particular event, but this can lead to other issues as it could 
> result in too little heap being computed.
> It would be interesting to be able to set a minimum memory.managed.size to be 
> taken into account by the autotuning.
> What do you think about this? Do you think some other specific config should 
> have been applied to avoid this issue?
>  
> Edit: see this comment, which identifies the metaspace issue: 
> https://issues.apache.org/jira/browse/FLINK-35489?focusedCommentId=17850694&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17850694



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33125) Upgrade JOSDK to 4.4.4

2023-09-20 Thread Nicolas Fraison (Jira)
Nicolas Fraison created FLINK-33125:
---

 Summary: Upgrade JOSDK to 4.4.4
 Key: FLINK-33125
 URL: https://issues.apache.org/jira/browse/FLINK-33125
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Reporter: Nicolas Fraison
 Fix For: kubernetes-operator-1.7.0


JOSDK 
[4.4.4|https://github.com/operator-framework/java-operator-sdk/releases/tag/v4.4.4]
 contains a fix for a leader election issue we face in our environment.

Here is more information on the 
[issue|https://github.com/operator-framework/java-operator-sdk/issues/2056] we 
faced.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33222) Operator rollback app when it should not

2023-10-09 Thread Nicolas Fraison (Jira)
Nicolas Fraison created FLINK-33222:
---

 Summary: Operator rollback app when it should not
 Key: FLINK-33222
 URL: https://issues.apache.org/jira/browse/FLINK-33222
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
 Environment: Flink operator 1.6 - Flink 1.17.1
Reporter: Nicolas Fraison


The operator can decide to roll back when an update of the job spec only touches 
savepointTriggerNonce or initialSavepointPath, if the app has been deployed for 
more than KubernetesOperatorConfigOptions.DEPLOYMENT_READINESS_TIMEOUT.
 
This is due to the objectmeta generation being 
[updated|https://github.com/apache/flink-kubernetes-operator/blob/release-1.6/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L169]
 when changing those fields, leading to the lastReconcileSpec not being aligned 
with the stableReconcileSpec, even though those fields are ignored when checking 
the upgrade diff.
 
Looking at the main branch, we should still face the same issue, as the same 
[update|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L169]
 is performed at the end of the reconcile loop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33222) Operator rollback app when it should not

2023-10-10 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773606#comment-17773606
 ] 

Nicolas Fraison commented on FLINK-33222:
-

So I was wrong: the 1.7-snapshot release is not affected by this bug, thanks to 
the [https://github.com/apache/flink-kubernetes-operator/pull/681] patch.

Indeed, deploying the app with an {{initialSavepointPath}}:
 * lastReconciledSpec gets the generation update from N to N+1 while the stable 
spec generation stays at N. But no rollback is detected, as the 
[update|https://github.com/apache/flink-kubernetes-operator/pull/681/files#diff-29ea38a50cac5b4432dd0969bc3e2177e29a5507f8c7bb01b80f605a8740de41R169]
 is done after the 
[rollback|https://github.com/apache/flink-kubernetes-operator/pull/681/files#diff-29ea38a50cac5b4432dd0969bc3e2177e29a5507f8c7bb01b80f605a8740de41R146]
 check, and the deployment is considered DEPLOYED

 * then on the second reconcile loop the stable spec generation is also updated 
from N to N+1 (in 
[patchAndCacheStatus|https://github.com/apache/flink-kubernetes-operator/blob/release-1.6/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java#L135]) 
and the deployment is considered STABLE

But this looks quite brittle to me, as just changing the position of 
shouldRollBack or ReconciliationUtils.updateReconciliationMetadata could lead to 
that bad behaviour again.

I'm wondering if we could take the generation field into account in 
[isLastReconciledSpecStable|https://github.com/apache/flink-kubernetes-operator/blob/release-1.6/flink-kubernetes-operator-api/src/main/java/org/apache/flink/kubernetes/operator/api/status/ReconciliationStatus.java#L91]

> Operator rollback app when it should not
> 
>
> Key: FLINK-33222
> URL: https://issues.apache.org/jira/browse/FLINK-33222
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
> Environment: Flink operator 1.6 - Flink 1.17.1
>Reporter: Nicolas Fraison
>Priority: Major
>
> The operator can decide to roll back when an update of the job spec only 
> touches savepointTriggerNonce or initialSavepointPath, if the app has been 
> deployed for more than 
> KubernetesOperatorConfigOptions.DEPLOYMENT_READINESS_TIMEOUT.
>  
> This is due to the objectmeta generation being 
> [updated|https://github.com/apache/flink-kubernetes-operator/blob/release-1.6/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L169]
>  when changing those fields, leading to the lastReconcileSpec not being aligned 
> with the stableReconcileSpec, even though those fields are ignored when 
> checking the upgrade diff.
>  
> Looking at the main branch, we should still face the same issue, as the same 
> [update|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L169]
>  is performed at the end of the reconcile loop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32551) Provide the possibility to have a savepoint taken by the operator when deleting a flinkdeployment

2023-07-06 Thread Nicolas Fraison (Jira)
Nicolas Fraison created FLINK-32551:
---

 Summary: Provide the possibility to have a savepoint taken by the 
operator when deleting a flinkdeployment
 Key: FLINK-32551
 URL: https://issues.apache.org/jira/browse/FLINK-32551
 Project: Flink
  Issue Type: Improvement
Reporter: Nicolas Fraison


Currently, if a FlinkDeployment is deleted, all the HA metadata is removed and no 
checkpoint is taken.

It would be great (for example in case of a fat-finger deletion) to be able to 
configure the deployment to take a savepoint when it is deleted.
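
If such an option existed, the usage could look like the snippet below (the option name 
is purely hypothetical; it is only meant to illustrate the requested behaviour):
{code:java}
# Hypothetical FlinkDeployment configuration: take a savepoint before the
# operator tears down the job when the resource is deleted
kubernetes.operator.savepoint.on-deletion: "true"
{code}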



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32551) Provide the possibility to have a savepoint taken by the operator when deleting a flinkdeployment

2023-07-06 Thread Nicolas Fraison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Fraison updated FLINK-32551:

Component/s: Kubernetes Operator

> Provide the possibility to have a savepoint taken by the operator when 
> deleting a flinkdeployment
> -
>
> Key: FLINK-32551
> URL: https://issues.apache.org/jira/browse/FLINK-32551
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Nicolas Fraison
>Priority: Major
>
> Currently, if a FlinkDeployment is deleted, all the HA metadata is removed and 
> no checkpoint is taken.
> It would be great (for example in case of a fat-finger deletion) to be able to 
> configure the deployment to take a savepoint when it is deleted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32551) Provide the possibility to take a savepoint when deleting a flinkdeployment

2023-07-06 Thread Nicolas Fraison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Fraison updated FLINK-32551:

Summary: Provide the possibility to take a savepoint when deleting a 
flinkdeployment  (was: Provide the possibility to have a savepoint taken by the 
operator when deleting a flinkdeployment)

> Provide the possibility to take a savepoint when deleting a flinkdeployment
> ---
>
> Key: FLINK-32551
> URL: https://issues.apache.org/jira/browse/FLINK-32551
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Nicolas Fraison
>Priority: Major
>
> Currently, if a FlinkDeployment is deleted, all the HA metadata is removed and 
> no checkpoint is taken.
> It would be great (for example in case of a fat-finger deletion) to be able to 
> configure the deployment to take a savepoint when it is deleted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32551) Provide the possibility to take a savepoint when deleting a flinkdeployment

2023-07-06 Thread Nicolas Fraison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Fraison updated FLINK-32551:

Description: 
Currently, if a FlinkDeployment is deleted, all the HA metadata is removed and no 
savepoint is taken.

It would be great (for example in case of a fat-finger deletion) to be able to 
configure the deployment to take a savepoint when it is deleted.

  was:
Currently, if a FlinkDeployment is deleted, all the HA metadata is removed and no 
checkpoint is taken.

It would be great (for example in case of a fat-finger deletion) to be able to 
configure the deployment to take a savepoint when it is deleted.


> Provide the possibility to take a savepoint when deleting a flinkdeployment
> ---
>
> Key: FLINK-32551
> URL: https://issues.apache.org/jira/browse/FLINK-32551
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Nicolas Fraison
>Priority: Major
>
> Currently, if a FlinkDeployment is deleted, all the HA metadata is removed and 
> no savepoint is taken.
> It would be great (for example in case of a fat-finger deletion) to be able to 
> configure the deployment to take a savepoint when it is deleted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32551) Provide the possibility to take a savepoint when deleting a flinkdeployment

2023-07-06 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17740556#comment-17740556
 ] 

Nicolas Fraison commented on FLINK-32551:
-

It is probably not perfect, but if the user selects this option it probably means 
they really want it, or want to handle it themselves if taking the savepoint 
before deleting the app runs into issues.

So I would fail the deletion if the savepoint fails, and retry.

> Provide the possibility to take a savepoint when deleting a flinkdeployment
> ---
>
> Key: FLINK-32551
> URL: https://issues.apache.org/jira/browse/FLINK-32551
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Nicolas Fraison
>Priority: Major
>
> Currently, if a flinkdeployment is deleted, all the HA metadata is removed and 
> no savepoint is taken.
> It would be great (for example, in case of a fat-finger deletion) to be able to 
> configure the deployment so that a savepoint is taken if the deployment is deleted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32551) Provide the possibility to take a savepoint when deleting a flinkdeployment

2023-07-06 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17740561#comment-17740561
 ] 

Nicolas Fraison commented on FLINK-32551:
-

Yes, in case of accidental or intentional deletion we could rely on those actions.

Still, I would prefer not to require manual actions from users in that situation, or 
to have users search for the last checkpoint in case of an accidental deletion (they 
would just look at a log line indicating the path of the savepoint taken during job 
cancellation).

It also looks safer to me.

> Provide the possibility to take a savepoint when deleting a flinkdeployment
> ---
>
> Key: FLINK-32551
> URL: https://issues.apache.org/jira/browse/FLINK-32551
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Nicolas Fraison
>Priority: Major
>
> Currently, if a flinkdeployment is deleted, all the HA metadata is removed and 
> no savepoint is taken.
> It would be great (for example, in case of a fat-finger deletion) to be able to 
> configure the deployment so that a savepoint is taken if the deployment is deleted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32551) Provide the possibility to take a savepoint when deleting a flinkdeployment

2023-07-06 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17740564#comment-17740564
 ] 

Nicolas Fraison commented on FLINK-32551:
-

Yes

> Provide the possibility to take a savepoint when deleting a flinkdeployment
> ---
>
> Key: FLINK-32551
> URL: https://issues.apache.org/jira/browse/FLINK-32551
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Nicolas Fraison
>Priority: Major
>
> Currently, if a flinkdeployment is deleted, all the HA metadata is removed and 
> no savepoint is taken.
> It would be great (for example, in case of a fat-finger deletion) to be able to 
> configure the deployment so that a savepoint is taken if the deployment is deleted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32012) Operator failed to rollback due to missing HA metadata

2023-05-16 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723156#comment-17723156
 ] 

Nicolas Fraison commented on FLINK-32012:
-

I'd like to get your thoughts on the first approach I'd like to take to merge 
upgrade/rollback.

Currently, to do a rollback we:
 * First check if the spec changed
 * If not, check if we should roll back
 * If yes, initiate the rollback: set the status to ROLLING_BACK and wait for the 
next reconcile loop
 * The next reconcile loop does the same checks
 * As it should roll back and the state is already up to date, it rolls back relying 
on the last stable spec but doesn't update the spec

The idea would be to actually roll back the spec so that both mechanisms rely on 
reconcileSpecChange (see the sketch below):
 * Still keep the check if the spec changed
 * Then check if we should roll back
 * When initiating the rollback, set the status to ROLLING_BACK, restore the last 
stable spec and wait for the next reconcile loop
 * The next reconcile loop will check if the spec changed
 * It will be the case, so it will run reconcileSpecChange but with status 
ROLLING_BACK and can then apply the rollback specificities
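A very rough, self-contained sketch of the control flow I have in mind (all types below are toy stand-ins defined locally, not the operator's real classes):
{code:java}
import java.util.Objects;

// Toy model of the proposed flow: a rollback restores the last stable spec and then goes
// through the regular spec-change reconciliation, keyed off a ROLLING_BACK state.
class RollbackFlowSketch {
    enum State { DEPLOYED, ROLLING_BACK }

    static class Resource {
        String spec;               // desired spec
        String lastReconciledSpec; // spec currently deployed
        String lastStableSpec;     // spec of the last known-good deployment
        State state = State.DEPLOYED;
    }

    static void reconcile(Resource r, boolean shouldRollBack) {
        boolean specChanged = !Objects.equals(r.spec, r.lastReconciledSpec);
        if (!specChanged && shouldRollBack && r.state != State.ROLLING_BACK) {
            // Initiate rollback: mark the resource, restore the last stable spec and
            // wait for the next reconcile loop.
            r.state = State.ROLLING_BACK;
            r.spec = r.lastStableSpec;
            return;
        }
        if (specChanged) {
            // Upgrades and rollbacks both land here; rollback specificities can be
            // applied based on the ROLLING_BACK state.
            reconcileSpecChange(r);
        }
    }

    static void reconcileSpecChange(Resource r) {
        r.lastReconciledSpec = r.spec;
        if (r.state == State.ROLLING_BACK) {
            r.state = State.DEPLOYED; // rollback finished
        } else {
            r.lastStableSpec = r.spec; // assume the upgrade becomes the new stable spec
        }
    }
}
{code}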

This would also simplify operations that need to reapply the working spec, like 
[resubmitJob|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractJobReconciler.java#L309].

Still, I'm wondering if we will then really be able to simplify things, as from 
your 
[comment|https://github.com/apache/flink-kubernetes-operator/pull/590#discussion_r1189847280]
 the rollback doesn't seem to be aligned with the upgrade?

For example, for savepoint upgrade mode:
 * upgrade
 ** delete HA metadata
 ** restore from the savepoint
 * rollback
 ** keep the HA metadata if it exists
 ** restore from it if it exists
 ** restore from the savepoint if it doesn't exist and the JM pod never started

> Operator failed to rollback due to missing HA metadata
> --
>
> Key: FLINK-32012
> URL: https://issues.apache.org/jira/browse/FLINK-32012
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.4.0
>Reporter: Nicolas Fraison
>Priority: Major
>  Labels: pull-request-available
>
> The operator correctly detected that the job was failing and initiated the 
> rollback, but the rollback failed due to `Rollback is not possible due to 
> missing HA metadata`.
> We are relying on savepoint upgrade mode and ZooKeeper HA.
> The operator performs a set of actions that also delete this HA data in 
> savepoint upgrade mode:
>  * [flink-kubernetes-operator/AbstractFlinkService.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L346]
>  : Suspend job with savepoint and deleteClusterDeployment
>  * [flink-kubernetes-operator/StandaloneFlinkService.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L158]
>  : Remove JM + TM deployment and delete HA data
>  * [flink-kubernetes-operator/AbstractFlinkService.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L1008]
>  : Wait cluster shutdown and delete zookeeper HA data
>  * [flink-kubernetes-operator/FlinkUtils.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L155]
>  : Remove all child znode
> Then when running the rollback the operator looks for HA data even if we 
> rely on savepoint upgrade mode:
>  * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L164]
>  Perform reconcile of rollback if it should rollback
>  * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L387]
>  Rollback failed as HA data is not available
>  * [flink-kubernetes-operator/FlinkUtils.jav

[jira] [Commented] (FLINK-32012) Operator failed to rollback due to missing HA metadata

2023-05-23 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17725413#comment-17725413
 ] 

Nicolas Fraison commented on FLINK-32012:
-

[~gyfora], I started an implementation which does not modify the upgradeMode to 
last-state, as 
[restoreJob|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractJobReconciler.java#L143%5D]
 for the LAST_STATE upgrade mode enforces the requirement for HA metadata, which 
is not what we want when relying on SAVEPOINT.

Also, when restoring the last stable spec there is a case where the UpgradeMode is 
set to STATELESS in that spec even if the chosen mode is SAVEPOINT 
([updateStatusBeforeFirstDeployment|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L193])

To avoid restoring this bad state, which would lead to the rollback not taking the 
savepoint into account, I enforce the upgrade mode of the restored spec to be the 
one currently set on the job (see the sketch below).

But I'm wondering why we decided not to persist the actually used upgrade 
mode in the last stable spec for the first deployment?

Here is the diff of my 
[WIP|https://github.com/apache/flink-kubernetes-operator/compare/main...ashangit:flink-kubernetes-operator:nfraison/FLINK-32012?expand=1]
 if the approach is not clear (not meant for review...)
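A minimal, self-contained sketch of that override (toy types with hypothetical names, not the operator's actual classes):
{code:java}
// Sketch: when restoring the last stable spec for a rollback, force its upgradeMode to
// the one currently configured on the job, so a STATELESS value persisted at the first
// deployment cannot make the rollback ignore the savepoint.
class UpgradeModeOverrideSketch {
    enum UpgradeMode { STATELESS, SAVEPOINT, LAST_STATE }

    static class JobSpec {
        UpgradeMode upgradeMode;
        JobSpec(UpgradeMode upgradeMode) { this.upgradeMode = upgradeMode; }
    }

    static JobSpec restoreForRollback(JobSpec lastStableSpec, UpgradeMode currentMode) {
        // Copy the stable spec, then enforce the currently configured upgrade mode.
        JobSpec restored = new JobSpec(lastStableSpec.upgradeMode);
        restored.upgradeMode = currentMode;
        return restored;
    }
}
{code}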

> Operator failed to rollback due to missing HA metadata
> --
>
> Key: FLINK-32012
> URL: https://issues.apache.org/jira/browse/FLINK-32012
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.4.0
>Reporter: Nicolas Fraison
>Priority: Major
>  Labels: pull-request-available
>
> The operator correctly detected that the job was failing and initiated the 
> rollback, but the rollback failed due to `Rollback is not possible due to 
> missing HA metadata`.
> We are relying on savepoint upgrade mode and ZooKeeper HA.
> The operator performs a set of actions that also delete this HA data in 
> savepoint upgrade mode:
>  * [flink-kubernetes-operator/AbstractFlinkService.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L346]
>  : Suspend job with savepoint and deleteClusterDeployment
>  * [flink-kubernetes-operator/StandaloneFlinkService.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L158]
>  : Remove JM + TM deployment and delete HA data
>  * [flink-kubernetes-operator/AbstractFlinkService.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L1008]
>  : Wait cluster shutdown and delete zookeeper HA data
>  * [flink-kubernetes-operator/FlinkUtils.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L155]
>  : Remove all child znode
> Then when running the rollback the operator looks for HA data even if we 
> rely on savepoint upgrade mode:
>  * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L164]
>  Perform reconcile of rollback if it should rollback
>  * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L387]
>  Rollback failed as HA data is not available
>  * [flink-kubernetes-operator/FlinkUtils.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L220]
>  Check if some child znodes are available
> For both steps the pattern looks to be the same for Kubernetes HA, so it 
> doesn't look to be linked to a bug with ZooKeeper.

[jira] [Commented] (FLINK-32012) Operator failed to rollback due to missing HA metadata

2023-05-23 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17725450#comment-17725450
 ] 

Nicolas Fraison commented on FLINK-32012:
-

But the whole point of this request was that when the JobManager failed to 
start, the HA metadata was not available during the rollback while the savepoint 
taken for the upgrade was available.

So relying on SAVEPOINT for the rollback will ensure that Flink is aware of the 
availability of a savepoint.

From my understanding of 
[tryRestoreExecutionGraphFromSavepoint|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultExecutionGraphFactory.java#L198]
 Flink will rely on the savepoint if no HA metadata exists, otherwise it will 
load the checkpoint.

 

> Operator failed to rollback due to missing HA metadata
> --
>
> Key: FLINK-32012
> URL: https://issues.apache.org/jira/browse/FLINK-32012
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.4.0
>Reporter: Nicolas Fraison
>Priority: Major
>  Labels: pull-request-available
>
> The operator correctly detected that the job was failing and initiated the 
> rollback, but the rollback failed due to `Rollback is not possible due to 
> missing HA metadata`.
> We are relying on savepoint upgrade mode and ZooKeeper HA.
> The operator performs a set of actions that also delete this HA data in 
> savepoint upgrade mode:
>  * [flink-kubernetes-operator/AbstractFlinkService.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L346]
>  : Suspend job with savepoint and deleteClusterDeployment
>  * [flink-kubernetes-operator/StandaloneFlinkService.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L158]
>  : Remove JM + TM deployment and delete HA data
>  * [flink-kubernetes-operator/AbstractFlinkService.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L1008]
>  : Wait cluster shutdown and delete zookeeper HA data
>  * [flink-kubernetes-operator/FlinkUtils.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L155]
>  : Remove all child znode
> Then when running the rollback the operator looks for HA data even if we 
> rely on savepoint upgrade mode:
>  * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L164]
>  Perform reconcile of rollback if it should rollback
>  * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L387]
>  Rollback failed as HA data is not available
>  * [flink-kubernetes-operator/FlinkUtils.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L220]
>  Check if some child znodes are available
> For both steps the pattern looks to be the same for Kubernetes HA, so it 
> doesn't look to be linked to a bug with ZooKeeper.
>  
> From https://issues.apache.org/jira/browse/FLINK-30305 it looks to be 
> expected that the HA data has been deleted (as it is also performed by flink 
> when relying on savepoint upgrade mode).
> Still the use case seems to differ from 
> https://issues.apache.org/jira/browse/FLINK-30305 as the operator is aware of 
> the failure and treats a specific rollback event.
> So I'm wondering why we enforce such a check when performing rollback if we 
> rely on savepoint upgrade mode. Would it be fine to not rely on the HA data 
> and rollback from the last savepoint (the one we used in the deployment step)?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-32012) Operator failed to rollback due to missing HA metadata

2023-05-23 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17725450#comment-17725450
 ] 

Nicolas Fraison edited comment on FLINK-32012 at 5/23/23 3:05 PM:
--

But the whole point of this request was that when the JobManager failed to 
start, the HA metadata was not available during the rollback while the savepoint 
taken for the upgrade was available.

So relying on SAVEPOINT for the rollback will ensure that Flink is aware of the 
availability of a savepoint.

From my understanding of 
[tryRestoreExecutionGraphFromSavepoint|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultExecutionGraphFactory.java#L198]
 Flink will rely on the savepoint if no HA metadata exists, otherwise it will 
load the checkpoint.

 

Forgot to mention it, but indeed when calling 
[cancelJob|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L353%5D]
 it falls back to LAST_STATE in order to ensure no savepoint is created and the HA 
metadata is not deleted.

 


was (Author: JIRAUSER299678):
But the whole point of this request was that when the JobManager failed to 
start, the HA metadata was not available during the rollback while the savepoint 
taken for the upgrade was available.

So relying on SAVEPOINT for the rollback will ensure that Flink is aware of the 
availability of a savepoint.

From my understanding of 
[tryRestoreExecutionGraphFromSavepoint|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultExecutionGraphFactory.java#L198]
 Flink will rely on the savepoint if no HA metadata exists, otherwise it will 
load the checkpoint.

 

> Operator failed to rollback due to missing HA metadata
> --
>
> Key: FLINK-32012
> URL: https://issues.apache.org/jira/browse/FLINK-32012
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.4.0
>Reporter: Nicolas Fraison
>Priority: Major
>  Labels: pull-request-available
>
> The operator correctly detected that the job was failing and initiated the 
> rollback, but the rollback failed due to `Rollback is not possible due to 
> missing HA metadata`.
> We are relying on savepoint upgrade mode and ZooKeeper HA.
> The operator performs a set of actions that also delete this HA data in 
> savepoint upgrade mode:
>  * [flink-kubernetes-operator/AbstractFlinkService.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L346]
>  : Suspend job with savepoint and deleteClusterDeployment
>  * [flink-kubernetes-operator/StandaloneFlinkService.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L158]
>  : Remove JM + TM deployment and delete HA data
>  * [flink-kubernetes-operator/AbstractFlinkService.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L1008]
>  : Wait cluster shutdown and delete zookeeper HA data
>  * [flink-kubernetes-operator/FlinkUtils.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L155]
>  : Remove all child znode
> Then when running the rollback the operator looks for HA data even if we 
> rely on savepoint upgrade mode:
>  * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L164]
>  Perform reconcile of rollback if it should rollback
>  * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L387]
>  Rollback failed as HA data is not available
>  * [flink-kubernetes-operator/FlinkUtils.java at main · 
> apache/flink-kubernet

[jira] [Created] (FLINK-32334) Operator failed to create taskmanager deployment because it already exist

2023-06-14 Thread Nicolas Fraison (Jira)
Nicolas Fraison created FLINK-32334:
---

 Summary: Operator failed to create taskmanager deployment because 
it already exist
 Key: FLINK-32334
 URL: https://issues.apache.org/jira/browse/FLINK-32334
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.5.0
Reporter: Nicolas Fraison


During a job upgrade the operator failed to start the new job because it 
could not create the taskmanager deployment:

 
{code:java}
Jun 12 19:45:28.115 >>> Status | Error | UPGRADING | 
{"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.client.deployment.ClusterDeploymentException:
 Could not create Kubernetes cluster 
\"flink-metering\".","throwableList":[{"type":"org.apache.flink.client.deployment.ClusterDeploymentException","message":"Could
 not create Kubernetes cluster 
\"flink-metering\"."},{"type":"org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException","message":"Failure
 executing: POST at: 
https://10.129.144.1/apis/apps/v1/namespaces/metering/deployments. Message: 
object is being deleted: deployments.apps \"flink-metering-taskmanager\" 
already exists. Received status: Status(apiVersion=v1, code=409, 
details=StatusDetails(causes=[], group=apps, kind=deployments, 
name=flink-metering-taskmanager, retryAfterSeconds=null, uid=null, 
additionalProperties={}), kind=Status, message=object is being deleted: 
deployments.apps \"flink-metering-taskmanager\" already exists, 
metadata=ListMeta(_continue=null, remainingItemCount=null, 
resourceVersion=null, selfLink=null, additionalProperties={}), 
reason=AlreadyExists, status=Failure, additionalProperties={})."}]} {code}
As indicated in the error log, this is due to the taskmanager deployment still 
existing while it is being deleted.

Looking at the [source 
code|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L150]
 we are indeed relying on the FOREGROUND deletion policy by default.

Still, it seems that the REST API delete call only waits until the resource 
has been modified and the {{deletionTimestamp}} has been added to the metadata: 
[ensure delete returns only when the delete operation is fully finished -  
Issue #3246 -  
fabric8io/kubernetes-client|https://github.com/fabric8io/kubernetes-client/issues/3246#issuecomment-874019899]

So we could face this issue if the k8s cluster is slow to "really" delete the 
deployment

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32334) Operator failed to create taskmanager deployment because it already exist

2023-06-14 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732431#comment-17732431
 ] 

Nicolas Fraison commented on FLINK-32334:
-

Adding a check that the object has actually been deleted should be enough for this issue:
{code:java}
// Block (up to 30s) until the TaskManager Deployment object is actually gone,
// not just marked with a deletionTimestamp.
kubernetesClient
    .apps()
    .deployments()
    .inNamespace(namespace)
    .withName(StandaloneKubernetesUtils.getTaskManagerDeploymentName(clusterId))
    .waitUntilCondition(Objects::isNull, 30, TimeUnit.SECONDS); {code}

> Operator failed to create taskmanager deployment because it already exist
> -
>
> Key: FLINK-32334
> URL: https://issues.apache.org/jira/browse/FLINK-32334
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.5.0
>Reporter: Nicolas Fraison
>Priority: Major
>
> During a job upgrade the operator failed to start the new job because it 
> could not create the taskmanager deployment:
>  
> {code:java}
> Jun 12 19:45:28.115 >>> Status | Error | UPGRADING | 
> {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.client.deployment.ClusterDeploymentException:
>  Could not create Kubernetes cluster 
> \"flink-metering\".","throwableList":[{"type":"org.apache.flink.client.deployment.ClusterDeploymentException","message":"Could
>  not create Kubernetes cluster 
> \"flink-metering\"."},{"type":"org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException","message":"Failure
>  executing: POST at: 
> https://10.129.144.1/apis/apps/v1/namespaces/metering/deployments. Message: 
> object is being deleted: deployments.apps \"flink-metering-taskmanager\" 
> already exists. Received status: Status(apiVersion=v1, code=409, 
> details=StatusDetails(causes=[], group=apps, kind=deployments, 
> name=flink-metering-taskmanager, retryAfterSeconds=null, uid=null, 
> additionalProperties={}), kind=Status, message=object is being deleted: 
> deployments.apps \"flink-metering-taskmanager\" already exists, 
> metadata=ListMeta(_continue=null, remainingItemCount=null, 
> resourceVersion=null, selfLink=null, additionalProperties={}), 
> reason=AlreadyExists, status=Failure, additionalProperties={})."}]} {code}
> As indicated in the error log, this is due to the taskmanager deployment still 
> existing while it is being deleted.
> Looking at the [source 
> code|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L150]
>  we are indeed relying on the FOREGROUND deletion policy by default.
> Still, it seems that the REST API delete call only waits until the resource 
> has been modified and the {{deletionTimestamp}} has been added to the 
> metadata: [ensure delete returns only when the delete operation is fully 
> finished -  Issue #3246 -  
> fabric8io/kubernetes-client|https://github.com/fabric8io/kubernetes-client/issues/3246#issuecomment-874019899]
> So we could face this issue if the k8s cluster is slow to "really" delete the 
> deployment
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32334) Operator failed to create taskmanager deployment because it already exist

2023-06-14 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732475#comment-17732475
 ] 

Nicolas Fraison commented on FLINK-32334:
-

I can see in the log that we waited once and then "found" the cluster to be 
shut down:
{code:java}
Jun 12 19:45:25.133 for cluster shutdown...
Jun 12 19:45:27.192 shutdown completed. {code}
I think the issue is due to the fact that we only look for JM pods, not for TM 
ones.

I will look into adding a check for the TM as well (and only for standalone 
clusters); see the sketch below.
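A rough sketch of what that extra wait could look like (the label values and the helper itself are assumptions for illustration, not the operator's actual code):
{code:java}
import io.fabric8.kubernetes.client.KubernetesClient;

import java.time.Duration;
import java.time.Instant;

class TaskManagerShutdownWaitSketch {
    // Poll until no TaskManager pods remain for the given standalone cluster.
    // The "app"/"component" label selectors here are assumptions, not the real ones.
    static void waitForTaskManagerPodsGone(
            KubernetesClient client, String namespace, String clusterId, Duration timeout)
            throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (Instant.now().isBefore(deadline)) {
            boolean anyLeft =
                    !client.pods()
                            .inNamespace(namespace)
                            .withLabel("app", clusterId)
                            .withLabel("component", "taskmanager")
                            .list()
                            .getItems()
                            .isEmpty();
            if (!anyLeft) {
                return; // all TM pods are gone, safe to recreate the deployment
            }
            Thread.sleep(1000L);
        }
        throw new IllegalStateException("Timed out waiting for TaskManager pods to be deleted");
    }
}
{code}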

> Operator failed to create taskmanager deployment because it already exist
> -
>
> Key: FLINK-32334
> URL: https://issues.apache.org/jira/browse/FLINK-32334
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.5.0
>Reporter: Nicolas Fraison
>Priority: Major
>
> During a job upgrade the operator failed to start the new job because it 
> could not create the taskmanager deployment:
>  
> {code:java}
> Jun 12 19:45:28.115 >>> Status | Error | UPGRADING | 
> {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.client.deployment.ClusterDeploymentException:
>  Could not create Kubernetes cluster 
> \"flink-metering\".","throwableList":[{"type":"org.apache.flink.client.deployment.ClusterDeploymentException","message":"Could
>  not create Kubernetes cluster 
> \"flink-metering\"."},{"type":"org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException","message":"Failure
>  executing: POST at: 
> https://10.129.144.1/apis/apps/v1/namespaces/metering/deployments. Message: 
> object is being deleted: deployments.apps \"flink-metering-taskmanager\" 
> already exists. Received status: Status(apiVersion=v1, code=409, 
> details=StatusDetails(causes=[], group=apps, kind=deployments, 
> name=flink-metering-taskmanager, retryAfterSeconds=null, uid=null, 
> additionalProperties={}), kind=Status, message=object is being deleted: 
> deployments.apps \"flink-metering-taskmanager\" already exists, 
> metadata=ListMeta(_continue=null, remainingItemCount=null, 
> resourceVersion=null, selfLink=null, additionalProperties={}), 
> reason=AlreadyExists, status=Failure, additionalProperties={})."}]} {code}
> As indicated in the error log, this is due to the taskmanager deployment still 
> existing while it is being deleted.
> Looking at the [source 
> code|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L150]
>  we are indeed relying on the FOREGROUND deletion policy by default.
> Still, it seems that the REST API delete call only waits until the resource 
> has been modified and the {{deletionTimestamp}} has been added to the 
> metadata: [ensure delete returns only when the delete operation is fully 
> finished -  Issue #3246 -  
> fabric8io/kubernetes-client|https://github.com/fabric8io/kubernetes-client/issues/3246#issuecomment-874019899]
> So we could face this issue if the k8s cluster is slow to "really" delete the 
> deployment
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30609) Add ephemeral storage to CRD

2023-04-03 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17708023#comment-17708023
 ] 

Nicolas Fraison commented on FLINK-30609:
-

We are looking into using this feature and have some time to implement it.
If it is fine with you, I can take it over.

> Add ephemeral storage to CRD
> 
>
> Key: FLINK-30609
> URL: https://issues.apache.org/jira/browse/FLINK-30609
> Project: Flink
>  Issue Type: New Feature
>  Components: Kubernetes Operator
>Reporter: Matyas Orhidi
>Assignee: Zhenqiu Huang
>Priority: Major
>  Labels: starter
> Fix For: kubernetes-operator-1.5.0
>
>
> We should consider adding ephemeral storage to the existing [resource 
> specification 
> |https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/reference/#resource]in
>  CRD, next to {{cpu}} and {{memory}}
> https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#setting-requests-and-limits-for-local-ephemeral-storage



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-31812) SavePoint from /jars/:jarid:/run api on body is not anymore emptyToNull if empty

2023-04-14 Thread Nicolas Fraison (Jira)
Nicolas Fraison created FLINK-31812:
---

 Summary: SavePoint from /jars/:jarid:/run api on body is not 
anymore emptyToNull if empty
 Key: FLINK-31812
 URL: https://issues.apache.org/jira/browse/FLINK-31812
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.17.0
Reporter: Nicolas Fraison


Since https://issues.apache.org/jira/browse/FLINK-29543 the 
savepointPath from the body is no longer transformed to null if empty: 
https://github.com/apache/flink/pull/21012/files#diff-c6d9a43d970eb07642a87e4bf9ec6a9dc7d363b1b5b557ed76f73d8de1cc5a54R145
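For illustration, a minimal self-contained sketch of the normalisation this behaviour relied on (using Guava's Strings.emptyToNull; the class and method below are a toy example, not the actual Flink handler code):
{code:java}
import com.google.common.base.Strings;

public class EmptyToNullSketch {
    // Assumed old behaviour: an empty savepoint path from the request body is normalised
    // to null, so no "empty checkpoint pointer" restore is attempted.
    static String normalise(String savepointPathFromBody) {
        return Strings.emptyToNull(savepointPathFromBody); // "" -> null, other values unchanged
    }

    public static void main(String[] args) {
        System.out.println(normalise(""));                        // null
        System.out.println(normalise("s3://bucket/savepoint-1")); // s3://bucket/savepoint-1
    }
}
{code}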
 
This leads to issues running a Flink job on release 1.17 with the Lyft operator, 
which sets savePoint in the body to "": 
[https://github.com/lyft/flinkk8soperator/blob/master/pkg/controller/flinkapplication/flink_state_machine.go#L721]
 
Issue faced by the job as the savepointPath is set to "":
{code:java}
org.apache.flink.runtime.client.JobInitializationException: Could not start the 
JobMaster.
3   at 
org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:97)
4   at 
java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
5   at 
java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
6   at 
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
7   at 
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1705)
8   at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
9   at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
10  at java.base/java.lang.Thread.run(Thread.java:829)
11Caused by: java.util.concurrent.CompletionException: 
java.lang.IllegalArgumentException: empty checkpoint pointer
12  at 
java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
13  at 
java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
14  at 
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702)
15  ... 3 more
16Caused by: java.lang.IllegalArgumentException: empty checkpoint pointer
17  at 
org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138)
18  at 
org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorageAccess.resolveCheckpointPointer(AbstractFsCheckpointStorageAccess.java:240)
19  at 
org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorageAccess.resolveCheckpoint(AbstractFsCheckpointStorageAccess.java:136)
20  at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1824)
21  at 
org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.tryRestoreExecutionGraphFromSavepoint(DefaultExecutionGraphFactory.java:223)
22  at 
org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:198)
23  at 
org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:365)
24  at 
org.apache.flink.runtime.scheduler.SchedulerBase.(SchedulerBase.java:210)
25  at 
org.apache.flink.runtime.scheduler.DefaultScheduler.(DefaultScheduler.java:136)
26  at 
org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:152)
27  at 
org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:119)
28  at 
org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:371)
29  at 
org.apache.flink.runtime.jobmaster.JobMaster.(JobMaster.java:348)
30  at 
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:123)
31  at 
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:95)
32  at 
org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112)
33  at 
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
34  ... 3 more
35 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-31812) SavePoint from /jars/:jarid:/run api on body is not anymore set to null if empty

2023-04-14 Thread Nicolas Fraison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Fraison updated FLINK-31812:

Summary: SavePoint from /jars/:jarid:/run api on body is not anymore set to 
null if empty  (was: SavePoint from /jars/:jarid:/run api on body is not 
anymore emptyToNull if empty)

> SavePoint from /jars/:jarid:/run api on body is not anymore set to null if 
> empty
> 
>
> Key: FLINK-31812
> URL: https://issues.apache.org/jira/browse/FLINK-31812
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: Nicolas Fraison
>Priority: Minor
>
> Since https://issues.apache.org/jira/browse/FLINK-29543 the 
> savepointPath from the body is no longer transformed to null if empty: 
> https://github.com/apache/flink/pull/21012/files#diff-c6d9a43d970eb07642a87e4bf9ec6a9dc7d363b1b5b557ed76f73d8de1cc5a54R145
>  
> This leads to issues running a Flink job on release 1.17 with the Lyft operator, 
> which sets savePoint in the body to "": 
> [https://github.com/lyft/flinkk8soperator/blob/master/pkg/controller/flinkapplication/flink_state_machine.go#L721]
>  
> Issue faced by the job as the savepointPath is set to "":
> {code:java}
> org.apache.flink.runtime.client.JobInitializationException: Could not start 
> the JobMaster.
> 3 at 
> org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:97)
> 4 at 
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
> 5 at 
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
> 6 at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
> 7 at 
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1705)
> 8 at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> 9 at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> 10at java.base/java.lang.Thread.run(Thread.java:829)
> 11Caused by: java.util.concurrent.CompletionException: 
> java.lang.IllegalArgumentException: empty checkpoint pointer
> 12at 
> java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
> 13at 
> java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
> 14at 
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702)
> 15... 3 more
> 16Caused by: java.lang.IllegalArgumentException: empty checkpoint pointer
> 17at 
> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138)
> 18at 
> org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorageAccess.resolveCheckpointPointer(AbstractFsCheckpointStorageAccess.java:240)
> 19at 
> org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorageAccess.resolveCheckpoint(AbstractFsCheckpointStorageAccess.java:136)
> 20at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1824)
> 21at 
> org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.tryRestoreExecutionGraphFromSavepoint(DefaultExecutionGraphFactory.java:223)
> 22at 
> org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:198)
> 23at 
> org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:365)
> 24at 
> org.apache.flink.runtime.scheduler.SchedulerBase.(SchedulerBase.java:210)
> 25at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.(DefaultScheduler.java:136)
> 26at 
> org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:152)
> 27at 
> org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:119)
> 28at 
> org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:371)
> 29at 
> org.apache.flink.runtime.jobmaster.JobMaster.(JobMaster.java:348)
> 30at 
> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:123)
> 31at 
> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:95)
> 32at 
> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112)
> 33at 
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
> 34...

[jira] [Updated] (FLINK-31812) SavePoint from /jars/:jarid:/run api on body is not anymore set to null if empty

2023-04-14 Thread Nicolas Fraison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Fraison updated FLINK-31812:

Description: 
Since https://issues.apache.org/jira/browse/FLINK-29543 the 
savepointPath from the body is no longer transformed to null if empty: 
[https://github.com/apache/flink/pull/21012/files#diff-c6d9a43d970eb07642a87e4bf9ec6a9dc7d363b1b5b557ed76f73d8de1cc5a54R145]
 
This leads to issues running a Flink job on release 1.17 with the Lyft operator, 
which sets savePoint in the body to an empty string: 
[https://github.com/lyft/flinkk8soperator/blob/master/pkg/controller/flinkapplication/flink_state_machine.go#L721]
 
Issue faced by the job as the savepointPath is set to an empty string:
{code:java}
org.apache.flink.runtime.client.JobInitializationException: Could not start the 
JobMaster.
3   at 
org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:97)
4   at 
java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
5   at 
java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
6   at 
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
7   at 
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1705)
8   at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
9   at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
10  at java.base/java.lang.Thread.run(Thread.java:829)
11Caused by: java.util.concurrent.CompletionException: 
java.lang.IllegalArgumentException: empty checkpoint pointer
12  at 
java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
13  at 
java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
14  at 
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702)
15  ... 3 more
16Caused by: java.lang.IllegalArgumentException: empty checkpoint pointer
17  at 
org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138)
18  at 
org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorageAccess.resolveCheckpointPointer(AbstractFsCheckpointStorageAccess.java:240)
19  at 
org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorageAccess.resolveCheckpoint(AbstractFsCheckpointStorageAccess.java:136)
20  at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1824)
21  at 
org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.tryRestoreExecutionGraphFromSavepoint(DefaultExecutionGraphFactory.java:223)
22  at 
org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:198)
23  at 
org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:365)
24  at 
org.apache.flink.runtime.scheduler.SchedulerBase.(SchedulerBase.java:210)
25  at 
org.apache.flink.runtime.scheduler.DefaultScheduler.(DefaultScheduler.java:136)
26  at 
org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:152)
27  at 
org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:119)
28  at 
org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:371)
29  at 
org.apache.flink.runtime.jobmaster.JobMaster.(JobMaster.java:348)
30  at 
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:123)
31  at 
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:95)
32  at 
org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112)
33  at 
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
34  ... 3 more
35 {code}

  was:
Since https://issues.apache.org/jira/browse/FLINK-29543 the 
savepointPath from the body is no longer transformed to null if empty: 
https://github.com/apache/flink/pull/21012/files#diff-c6d9a43d970eb07642a87e4bf9ec6a9dc7d363b1b5b557ed76f73d8de1cc5a54R145
 
This leads to issues running a Flink job on release 1.17 with the Lyft operator, 
which sets savePoint in the body to "": 
[https://github.com/lyft/flinkk8soperator/blob/master/pkg/controller/flinkapplication/flink_state_machine.go#L721]
 
Issue faced by the job as the savepointPath is set to "":
{code:java}
org.apache.flink.runtime.client.JobInitializationException: Could not start the 
Jo

[jira] [Created] (FLINK-32012) Operator failed to rollback due to missing HA metadata

2023-05-05 Thread Nicolas Fraison (Jira)
Nicolas Fraison created FLINK-32012:
---

 Summary: Operator failed to rollback due to missing HA metadata
 Key: FLINK-32012
 URL: https://issues.apache.org/jira/browse/FLINK-32012
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.4.0
Reporter: Nicolas Fraison


The operator correctly detected that the job was failing and initiated the 
rollback, but the rollback failed due to `Rollback is not possible due to 
missing HA metadata`.

We are relying on savepoint upgrade mode and ZooKeeper HA.

The operator performs a set of actions that also delete this HA data in 
savepoint upgrade mode:
 * [flink-kubernetes-operator/AbstractFlinkService.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L346]
 : Suspend job with savepoint and deleteClusterDeployment

 * [flink-kubernetes-operator/StandaloneFlinkService.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L158]
 : Remove JM + TM deployment and delete HA data

 * [flink-kubernetes-operator/AbstractFlinkService.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L1008]
 : Wait cluster shutdown and delete zookeeper HA data

 * [flink-kubernetes-operator/FlinkUtils.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L155]
 : Remove all child znode

Then when running the rollback the operator looks for HA data even if we rely 
on savepoint upgrade mode:
 * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L164]
 Perform reconcile of rollback if it should rollback

 * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L387]
 Rollback failed as HA data is not available

 * [flink-kubernetes-operator/FlinkUtils.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L220]
 Check if some child znodes are available

For both steps the pattern looks to be the same for Kubernetes HA, so it doesn't 
look to be linked to a bug with ZooKeeper.

 

From https://issues.apache.org/jira/browse/FLINK-30305 it looks to be expected 
that the HA data has been deleted (as it is also performed by Flink when 
relying on savepoint upgrade mode).

So I'm wondering why we enforce such a check when performing rollback if we 
rely on savepoint upgrade mode.

Would it be fine to not rely on the HA data and rollback from the last 
savepoint (the one we used in the deployment step)?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32012) Operator failed to rollback due to missing HA metadata

2023-05-05 Thread Nicolas Fraison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Fraison updated FLINK-32012:

Description: 
The operator correctly detected that the job was failing and initiated the 
rollback, but the rollback failed due to `Rollback is not possible due to 
missing HA metadata`.

We are relying on savepoint upgrade mode and ZooKeeper HA.

The operator performs a set of actions that also delete this HA data in 
savepoint upgrade mode:
 * [flink-kubernetes-operator/AbstractFlinkService.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L346]
 : Suspend job with savepoint and deleteClusterDeployment

 * [flink-kubernetes-operator/StandaloneFlinkService.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L158]
 : Remove JM + TM deployment and delete HA data

 * [flink-kubernetes-operator/AbstractFlinkService.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L1008]
 : Wait cluster shutdown and delete zookeeper HA data

 * [flink-kubernetes-operator/FlinkUtils.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L155]
 : Remove all child znode

Then when running the rollback the operator looks for HA data even if we rely 
on savepoint upgrade mode:
 * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L164]
 Perform reconcile of rollback if it should rollback

 * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L387]
 Rollback failed as HA data is not available

 * [flink-kubernetes-operator/FlinkUtils.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L220]
 Check if some child znodes are available

For both steps the pattern looks to be the same for Kubernetes HA, so it doesn't 
look to be linked to a bug with ZooKeeper.

 

From https://issues.apache.org/jira/browse/FLINK-30305 it looks to be expected 
that the HA data has been deleted (as it is also performed by Flink when 
relying on savepoint upgrade mode).

Still the use case seems to differ from 
https://issues.apache.org/jira/browse/FLINK-30305 as the operator is aware of 
the failure and treats a specific rollback event.

So I'm wondering why we enforce such a check when performing rollback if we 
rely on savepoint upgrade mode. Would it be fine to not rely on the HA data and 
rollback from the last savepoint (the one we used in the deployment step)?

  was:
The operator correctly detected that the job was failing and initiated the 
rollback, but the rollback failed due to `Rollback is not possible due to 
missing HA metadata`.

We are relying on savepoint upgrade mode and ZooKeeper HA.

The operator performs a set of actions that also delete this HA data in 
savepoint upgrade mode:
 * [flink-kubernetes-operator/AbstractFlinkService.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L346]
 : Suspend job with savepoint and deleteClusterDeployment

 * [flink-kubernetes-operator/StandaloneFlinkService.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L158]
 : Remove JM + TM deployment and delete HA data

 * [flink-kubernetes-operator/AbstractFlinkService.java at main · 
apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L1008]

[jira] [Commented] (FLINK-32012) Operator failed to rollback due to missing HA metadata

2023-05-05 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719825#comment-17719825
 ] 

Nicolas Fraison commented on FLINK-32012:
-

[~gyfora] do you think that we can implement this specific case for rollback or 
do you foresee some potential issues with this?

> Operator failed to rollback due to missing HA metadata
> --
>
> Key: FLINK-32012
> URL: https://issues.apache.org/jira/browse/FLINK-32012
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.4.0
>Reporter: Nicolas Fraison
>Priority: Major
>
> The operator has well detected that the job was failing and initiate the 
> rollback but this rollback has failed due to `Rollback is not possible due to 
> missing HA metadata`
> We are relying on saevpoint upgrade mode and zookeeper HA.
> The operator is performing a set of action to also delete this HA data in 
> savepoint upgrade mode:
>  * [flink-kubernetes-operator/AbstractFlinkService.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L346]
>  : Suspend job with savepoint and deleteClusterDeployment
>  * [flink-kubernetes-operator/StandaloneFlinkService.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L158]
>  : Remove JM + TM deployment and delete HA data
>  * [flink-kubernetes-operator/AbstractFlinkService.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L1008]
>  : Wait cluster shutdown and delete zookeeper HA data
>  * [flink-kubernetes-operator/FlinkUtils.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L155]
>  : Remove all child znode
> Then when running rollback the operator is looking for HA data even if we 
> rely on sevepoint upgrade mode:
>  * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L164]
>  Perform reconcile of rollback if it should rollback
>  * [flink-kubernetes-operator/AbstractFlinkResourceReconciler.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L387]
>  Rollback failed as HA data is not available
>  * [flink-kubernetes-operator/FlinkUtils.java at main · 
> apache/flink-kubernetes-operator|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/FlinkUtils.java#L220]
>  Check if some child znodes are available
> For both step the pattern looks to be the same for kubernetes HA so it 
> doesn't looks to be linked to a bug with zookeeper.
>  
> From https://issues.apache.org/jira/browse/FLINK-30305 it looks to be 
> expected that the HA data has been deleted (as it is also performed by flink 
> when relying on savepoint upgrade mode).
> Still the use case seems to differ from 
> https://issues.apache.org/jira/browse/FLINK-30305 as the operator is aware of 
> the failure and treat a specific rollback event.
> So I'm wondering why we enforce such a check when performing rollback if we 
> rely on savepoint upgrade mode. Would it be fine to not rely on the HA data 
> and rollback from the last savepoint (the one we used in the deployment step)?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32890) Flink app rolled back with old savepoints (3 hours back in time) while some checkpoints have been taken in between

2023-08-17 Thread Nicolas Fraison (Jira)
Nicolas Fraison created FLINK-32890:
---

 Summary: Flink app rolled back with old savepoints (3 hours back 
in time) while some checkpoints have been taken in between
 Key: FLINK-32890
 URL: https://issues.apache.org/jira/browse/FLINK-32890
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Nicolas Fraison


Here are all the details about the issue (based on an app/scenario reproducing the 
issue):
 * Deployed a new release of a flink app with a new operator
 * Flink Operator set the app as stable
 * After some time the app failed and stayed in a failed state (due to some issue 
with our kafka clusters)
 * Sadly the team in charge of this flink app decided to roll back the app as 
they thought the failure was linked to this new deployment
 * Operator detected: {{Job is not running but HA metadata is available for last 
state restore, ready for upgrade, Deleting JobManager deployment while 
preserving HA metadata.}}  -> rely on last-state (as we do not disable 
fallback), no savepoint taken
 * Flink started the JM and deployed the app. It correctly found the 3 checkpoints

 * {{Using '/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico' as 
Zookeeper namespace.}}
 * {{Initializing job 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).}}
 * {{Recovering checkpoints from 
ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}}
 * {{Found 3 checkpoints in 
ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}}
 * {{Restoring job 6b24a364c1905e924a69f3dbff0d26a6 from Checkpoint 19 @ 
1692268003920 for 6b24a364c1905e924a69f3dbff0d26a6 located at 
}}{{{}s3p://.../flink-kafka-job-apache-nico/checkpoints/6b24a364c1905e924a69f3dbff0d26a6/chk-19{}}}{{{}.{}}}

 * The job failed because of a missing operator for the restored state

Job 6b24a364c1905e924a69f3dbff0d26a6 reached terminal state FAILED.
org.apache.flink.runtime.client.JobInitializationException: Could not start the 
JobMaster.
Caused by: java.util.concurrent.CompletionException: 
java.lang.IllegalStateException: There is no operator for the state 
f298e8715b4d85e6f965b60e1c848cbe * {{Job 6b24a364c1905e924a69f3dbff0d26a6 has 
been registered for cleanup in the JobResultStore after reaching a terminal 
state.}}
 * {{Clean up the high availability data for job 
6b24a364c1905e924a69f3dbff0d26a6.}}
 * {{Removed job graph 6b24a364c1905e924a69f3dbff0d26a6 from 
ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobgraphs'}.}}

 * The JobManager restarted and tried to resubmit the job, but the job had 
already been submitted so it finished

 * {{Received JobGraph submission 'flink-kafka-job' 
(6b24a364c1905e924a69f3dbff0d26a6).}}
 * {{Ignoring JobGraph submission 'flink-kafka-job' 
(6b24a364c1905e924a69f3dbff0d26a6) because the job already reached a 
globally-terminal state (i.e. FAILED, CANCELED, FINISHED) in a previous 
execution.}}
 * {{Application completed SUCCESSFULLY}}

 * Finally the operator rolled back the deployment and still indicated that {{Job 
is not running but HA metadata is available for last state restore, ready for 
upgrade}}
 * But the job metadata is no longer there (cleaned previously)

(CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls 
/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints
Path 
/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints
 doesn't exist
(CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls 
/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs

(CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls 
/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico
jobgraphs
jobs
leader
 * The rolled back app from the flink operator finally took the last provided 
savepoint as no metadata/checkpoints were available. But this last savepoint is 
an old one as during the upgrade the operator decided to rely on last-state



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32890) Flink app rolled back with old savepoints (3 hours back in time) while some checkpoints have been taken in between

2023-08-17 Thread Nicolas Fraison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Fraison updated FLINK-32890:

Description: 
Here are all the details about the issue (based on an app/scenario reproducing the 
issue):
 * Deployed a new release of a flink app with a new operator
 * Flink Operator set the app as stable
 * After some time the app failed and stayed in a failed state (due to some issue 
with our kafka clusters)
 * Sadly the team in charge of this flink app decided to roll back the app as 
they thought the failure was linked to this new deployment
 * Operator detected: {{Job is not running but HA metadata is available for last 
state restore, ready for upgrade, Deleting JobManager deployment while 
preserving HA metadata.}}  -> rely on last-state (as we do not disable 
fallback), no savepoint taken
 * Flink started the JM and deployed the app. It correctly found the 3 checkpoints

 * {{Using '/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico' as 
Zookeeper namespace.}}
 * {{Initializing job 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).}}
 * {{Recovering checkpoints from 
ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}}
 * {{Found 3 checkpoints in 
ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}}
 * {{{}Restoring job 6b24a364c1905e924a69f3dbff0d26a6 from Checkpoint 19 @ 
1692268003920 for 6b24a364c1905e924a69f3dbff0d26a6 located at 
}}\{{{}s3p://.../flink-kafka-job-apache-nico/checkpoints/6b24a364c1905e924a69f3dbff0d26a6/chk-19{}}}{{{}.{}}}

 * The job failed because of a missing operator for the restored state

{code:java}
Job 6b24a364c1905e924a69f3dbff0d26a6 reached terminal state FAILED.
org.apache.flink.runtime.client.JobInitializationException: Could not start the 
JobMaster.
Caused by: java.util.concurrent.CompletionException: 
java.lang.IllegalStateException: There is no operator for the state 
f298e8715b4d85e6f965b60e1c848cbe * Job 6b24a364c1905e924a69f3dbff0d26a6 has 
been registered for cleanup in the JobResultStore after reaching a terminal 
state.{code}
 * {{Clean up the high availability data for job 
6b24a364c1905e924a69f3dbff0d26a6.}}
 * {{Removed job graph 6b24a364c1905e924a69f3dbff0d26a6 from 
ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobgraphs'}.}}

 * The JobManager restarted and tried to resubmit the job, but the job had 
already been submitted so it finished

 * {{Received JobGraph submission 'flink-kafka-job' 
(6b24a364c1905e924a69f3dbff0d26a6).}}
 * {{Ignoring JobGraph submission 'flink-kafka-job' 
(6b24a364c1905e924a69f3dbff0d26a6) because the job already reached a 
globally-terminal state (i.e. FAILED, CANCELED, FINISHED) in a previous 
execution.}}
 * {{Application completed SUCCESSFULLY}}

 * Finally the operator rolled back the deployment and still indicated that {{Job 
is not running but HA metadata is available for last state restore, ready for 
upgrade}}
 * But the job metadata is no longer there (cleaned previously)

 
{code:java}
(CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls 
/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints
Path 
/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints
 doesn't exist
(CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls 
/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs
(CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls 
/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico
jobgraphs
jobs
leader
{code}
 

The rolled back app from the flink operator finally took the last provided 
savepoint as no metadata/checkpoints were available. But this last savepoint is 
an old one as during the upgrade the operator decided to rely on last-state

  was:
Here are all details about the issue (based on an app/scenario reproducing the 
issue): * Deployed new release of a flink app with a new operator
 * Flink Operator set the app as stable
 * After some time the app failed and stay in failed state (due to some issue 
with our kafka clusters)
 * Sadly the team in charge of this flink app decided to rollback the app as 
they were thinking it was linked to this new deployment
 * Operator detect: {{Job is not running but HA metadata is available for last 
state restore, ready for upgrade, Deleting JobManager deployment while 
preserving HA metadata.}}  -> rely on last-state (as we do not disable 
fallback), no savepoint taken
 * Flink start JM and deployment of the app. It well find the 3 checkpoints

 * {{Using '/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico' as 
Zookeeper namespace.}}
 * {{Initializing job 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).

[jira] [Updated] (FLINK-32890) Flink app rolled back with old savepoints (3 hours back in time) while some checkpoints have been taken in between

2023-08-17 Thread Nicolas Fraison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Fraison updated FLINK-32890:

Issue Type: Bug  (was: Improvement)

> Flink app rolled back with old savepoints (3 hours back in time) while some 
> checkpoints have been taken in between
> --
>
> Key: FLINK-32890
> URL: https://issues.apache.org/jira/browse/FLINK-32890
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Reporter: Nicolas Fraison
>Priority: Major
>
> Here are all details about the issue (based on an app/scenario reproducing 
> the issue): * Deployed new release of a flink app with a new operator
>  * Flink Operator set the app as stable
>  * After some time the app failed and stay in failed state (due to some issue 
> with our kafka clusters)
>  * Sadly the team in charge of this flink app decided to rollback the app as 
> they were thinking it was linked to this new deployment
>  * Operator detect: {{Job is not running but HA metadata is available for 
> last state restore, ready for upgrade, Deleting JobManager deployment while 
> preserving HA metadata.}}  -> rely on last-state (as we do not disable 
> fallback), no savepoint taken
>  * Flink start JM and deployment of the app. It well find the 3 checkpoints
>  * {{Using '/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico' as 
> Zookeeper namespace.}}
>  * {{Initializing job 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).}}
>  * {{Recovering checkpoints from 
> ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}}
>  * {{Found 3 checkpoints in 
> ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}}
>  * {{{}Restoring job 6b24a364c1905e924a69f3dbff0d26a6 from Checkpoint 19 @ 
> 1692268003920 for 6b24a364c1905e924a69f3dbff0d26a6 located at 
> }}\{{{}s3p://.../flink-kafka-job-apache-nico/checkpoints/6b24a364c1905e924a69f3dbff0d26a6/chk-19{}}}{{{}.{}}}
>  * Job failed because of the missing operator
> {code:java}
> Job 6b24a364c1905e924a69f3dbff0d26a6 reached terminal state FAILED.
> org.apache.flink.runtime.client.JobInitializationException: Could not start 
> the JobMaster.
> Caused by: java.util.concurrent.CompletionException: 
> java.lang.IllegalStateException: There is no operator for the state 
> f298e8715b4d85e6f965b60e1c848cbe * Job 6b24a364c1905e924a69f3dbff0d26a6 has 
> been registered for cleanup in the JobResultStore after reaching a terminal 
> state.{code}
>  * {{Clean up the high availability data for job 
> 6b24a364c1905e924a69f3dbff0d26a6.}}
>  * {{Removed job graph 6b24a364c1905e924a69f3dbff0d26a6 from 
> ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobgraphs'}.}}
>  * JobManager restart and try to resubmit the job but the job was already 
> submitted so finished
>  * {{Received JobGraph submission 'flink-kafka-job' 
> (6b24a364c1905e924a69f3dbff0d26a6).}}
>  * {{Ignoring JobGraph submission 'flink-kafka-job' 
> (6b24a364c1905e924a69f3dbff0d26a6) because the job already reached a 
> globally-terminal state (i.e. FAILED, CANCELED, FINISHED) in a previous 
> execution.}}
>  * {{Application completed SUCCESSFULLY}}
>  * Finally the operator rollback the deployment and still indicate that {{Job 
> is not running but HA metadata is available for last state restore, ready for 
> upgrade}}
>  * But the job metadata are not anymore there (clean previously)
>  
> {code:java}
> (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls 
> /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints
> Path 
> /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints
>  doesn't exist
> (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls 
> /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs
> (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls 
> /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico
> jobgraphs
> jobs
> leader
> {code}
>  
> The rolled back app from flink operator finally take the last provided 
> savepoint as no metadata/checkpoints are available. But this last savepoint 
> is an old one as during the upgrade the operator decided to rely on last-state



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-32890) Flink app rolled back with old savepoints (3 hours back in time) while some checkpoints have been taken in between

2023-08-17 Thread Nicolas Fraison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Fraison updated FLINK-32890:

Description: 
Here are all the details about the issue:
 * Deployed a new release of a flink app with a new operator
 * Flink Operator set the app as stable
 * After some time the app failed and stayed in a failed state (due to some issue 
with our kafka clusters)
 * Sadly the team in charge of this flink app decided to roll back the app as 
they thought the failure was linked to this new deployment
 * Operator detected: {{Job is not running but HA metadata is available for last 
state restore, ready for upgrade, Deleting JobManager deployment while 
preserving HA metadata.}}  -> rely on last-state (as we do not disable 
fallback), no savepoint taken
 * Flink started the JM and deployed the app. It correctly found the 3 checkpoints

 * {{Using '/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico' as 
Zookeeper namespace.}}
 * {{Initializing job 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).}}
 * {{Recovering checkpoints from 
ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}}
 * {{Found 3 checkpoints in 
ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}}
 * {{{}Restoring job 6b24a364c1905e924a69f3dbff0d26a6 from Checkpoint 19 @ 
1692268003920 for 6b24a364c1905e924a69f3dbff0d26a6 located at 
}}\{{{}s3p://.../flink-kafka-job-apache-nico/checkpoints/6b24a364c1905e924a69f3dbff0d26a6/chk-19{}}}{{{}.{}}}

 * The job failed because of a missing operator for the restored state

{code:java}
Job 6b24a364c1905e924a69f3dbff0d26a6 reached terminal state FAILED.
org.apache.flink.runtime.client.JobInitializationException: Could not start the 
JobMaster.
Caused by: java.util.concurrent.CompletionException: 
java.lang.IllegalStateException: There is no operator for the state 
f298e8715b4d85e6f965b60e1c848cbe * Job 6b24a364c1905e924a69f3dbff0d26a6 has 
been registered for cleanup in the JobResultStore after reaching a terminal 
state.{code}
 * {{Clean up the high availability data for job 
6b24a364c1905e924a69f3dbff0d26a6.}}
 * {{Removed job graph 6b24a364c1905e924a69f3dbff0d26a6 from 
ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobgraphs'}.}}

 * The JobManager restarted and tried to resubmit the job, but the job had 
already been submitted so it finished

 * {{Received JobGraph submission 'flink-kafka-job' 
(6b24a364c1905e924a69f3dbff0d26a6).}}
 * {{Ignoring JobGraph submission 'flink-kafka-job' 
(6b24a364c1905e924a69f3dbff0d26a6) because the job already reached a 
globally-terminal state (i.e. FAILED, CANCELED, FINISHED) in a previous 
execution.}}
 * {{Application completed SUCCESSFULLY}}

 * Finally the operator rolled back the deployment and still indicated that {{Job 
is not running but HA metadata is available for last state restore, ready for 
upgrade}}
 * But the job metadata is no longer there (cleaned previously)

 
{code:java}
(CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls 
/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints
Path 
/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints
 doesn't exist
(CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls 
/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs
(CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls 
/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico
jobgraphs
jobs
leader
{code}
 

The rolled back app from the flink operator finally took the last provided 
savepoint as no metadata/checkpoints were available. But this last savepoint is 
an old one as during the upgrade the operator decided to rely on last-state

  was:
Here are all details about the issue (based on an app/scenario reproducing the 
issue): * Deployed new release of a flink app with a new operator
 * Flink Operator set the app as stable
 * After some time the app failed and stay in failed state (due to some issue 
with our kafka clusters)
 * Sadly the team in charge of this flink app decided to rollback the app as 
they were thinking it was linked to this new deployment
 * Operator detect: {{Job is not running but HA metadata is available for last 
state restore, ready for upgrade, Deleting JobManager deployment while 
preserving HA metadata.}}  -> rely on last-state (as we do not disable 
fallback), no savepoint taken
 * Flink start JM and deployment of the app. It well find the 3 checkpoints

 * {{Using '/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico' as 
Zookeeper namespace.}}
 * {{Initializing job 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).}}
 * {{Recovering checkpoints from 
ZooKeeperSta

[jira] [Updated] (FLINK-32890) Flink app rolled back with old savepoints (3 hours back in time) while some checkpoints have been taken in between

2023-08-17 Thread Nicolas Fraison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Fraison updated FLINK-32890:

Description: 
Here are all the details about the issue:
 * Deployed a new release of a flink app with a new operator
 * Flink Operator set the app as stable
 * After some time the app failed and stayed in a failed state (due to some issue 
with our kafka clusters)
 * Finally decided to roll back the new release just in case it could be the 
root cause of the issue on kafka
 * Operator detected: {{Job is not running but HA metadata is available for last 
state restore, ready for upgrade, Deleting JobManager deployment while 
preserving HA metadata.}}  -> rely on last-state (as we do not disable 
fallback), no savepoint taken
 * Flink started the JM and deployed the app. It correctly found the 3 checkpoints

 * {{Using '/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico' as 
Zookeeper namespace.}}
 * {{Initializing job 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).}}
 * {{Recovering checkpoints from 
ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}}
 * {{Found 3 checkpoints in 
ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}}
 * {{{}Restoring job 6b24a364c1905e924a69f3dbff0d26a6 from Checkpoint 19 @ 
1692268003920 for 6b24a364c1905e924a69f3dbff0d26a6 located at 
}}\{{{}s3p://.../flink-kafka-job-apache-nico/checkpoints/6b24a364c1905e924a69f3dbff0d26a6/chk-19{}}}{{{}.{}}}

 * The job failed because of a missing operator for the restored state

{code:java}
Job 6b24a364c1905e924a69f3dbff0d26a6 reached terminal state FAILED.
org.apache.flink.runtime.client.JobInitializationException: Could not start the 
JobMaster.
Caused by: java.util.concurrent.CompletionException: 
java.lang.IllegalStateException: There is no operator for the state 
f298e8715b4d85e6f965b60e1c848cbe * Job 6b24a364c1905e924a69f3dbff0d26a6 has 
been registered for cleanup in the JobResultStore after reaching a terminal 
state.{code}
 * {{Clean up the high availability data for job 
6b24a364c1905e924a69f3dbff0d26a6.}}
 * {{Removed job graph 6b24a364c1905e924a69f3dbff0d26a6 from 
ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobgraphs'}.}}

 * The JobManager restarted and tried to resubmit the job, but the job had 
already been submitted so it finished

 * {{Received JobGraph submission 'flink-kafka-job' 
(6b24a364c1905e924a69f3dbff0d26a6).}}
 * {{Ignoring JobGraph submission 'flink-kafka-job' 
(6b24a364c1905e924a69f3dbff0d26a6) because the job already reached a 
globally-terminal state (i.e. FAILED, CANCELED, FINISHED) in a previous 
execution.}}
 * {{Application completed SUCCESSFULLY}}

 * Finally the operator rolled back the deployment and still indicated that {{Job 
is not running but HA metadata is available for last state restore, ready for 
upgrade}}
 * But the job metadata is no longer there (cleaned previously)

 
{code:java}
(CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls 
/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints
Path 
/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints
 doesn't exist
(CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls 
/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs
(CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls 
/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico
jobgraphs
jobs
leader
{code}
 

The rolled back app from the flink operator finally took the last provided 
savepoint as no metadata/checkpoints were available. But this last savepoint is 
an old one as during the upgrade the operator decided to rely on last-state 
(the old savepoint taken is a scheduled one)

  was:
Here are all details about the issue:
 * Deployed new release of a flink app with a new operator
 * Flink Operator set the app as stable
 * After some time the app failed and stay in failed state (due to some issue 
with our kafka clusters)
 * Sadly the team in charge of this flink app decided to rollback the app as 
they were thinking it was linked to this new deployment
 * Operator detect: {{Job is not running but HA metadata is available for last 
state restore, ready for upgrade, Deleting JobManager deployment while 
preserving HA metadata.}}  -> rely on last-state (as we do not disable 
fallback), no savepoint taken
 * Flink start JM and deployment of the app. It well find the 3 checkpoints

 * {{Using '/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico' as 
Zookeeper namespace.}}
 * {{Initializing job 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).}}
 * {{Recovering checkpoints from 
ZooKeeperStateHandleStore\{namespace='f

[jira] [Commented] (FLINK-32890) Flink app rolled back with old savepoints (3 hours back in time) while some checkpoints have been taken in between

2023-08-18 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755866#comment-17755866
 ] 

Nicolas Fraison commented on FLINK-32890:
-

The HA metadata check for zookeeper is not right. It should have checked for the 
availability of entries in the /jobs znode.

With such a test the rollback would not have been performed, instead requiring 
manual intervention to ensure the last taken checkpoint is used to restore the 
flink app.
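
For illustration only, a minimal sketch (not the operator code) of what checking 
for recoverable entries under the /jobs znode could look like with Curator; the 
connect string and paths are placeholders taken from the scenario above:

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

import java.util.List;

public final class ZkHaMetadataCheck {

    /**
     * Returns true only if at least one job entry exists under the HA /jobs znode,
     * i.e. job state can actually be recovered, instead of only checking that the
     * parent HA znodes exist.
     */
    public static boolean hasRecoverableJobState(String connectString, String haRootPath)
            throws Exception {
        try (CuratorFramework client = CuratorFrameworkFactory.newClient(
                connectString, new ExponentialBackoffRetry(1000, 3))) {
            client.start();
            String jobsPath = haRootPath + "/jobs";
            if (client.checkExists().forPath(jobsPath) == null) {
                return false;
            }
            List<String> jobs = client.getChildren().forPath(jobsPath);
            return !jobs.isEmpty();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder values from the scenario above.
        System.out.println(hasRecoverableJobState(
                "zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181",
                "/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico"));
    }
}
{code}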

> Flink app rolled back with old savepoints (3 hours back in time) while some 
> checkpoints have been taken in between
> --
>
> Key: FLINK-32890
> URL: https://issues.apache.org/jira/browse/FLINK-32890
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Reporter: Nicolas Fraison
>Priority: Major
>
> Here are all details about the issue:
>  * Deployed new release of a flink app with a new operator
>  * Flink Operator set the app as stable
>  * After some time the app failed and stay in failed state (due to some issue 
> with our kafka clusters)
>  * Finally decided to rollback the new release just in case it could be the 
> root cause of the issue on kafka
>  * Operator detect: {{Job is not running but HA metadata is available for 
> last state restore, ready for upgrade, Deleting JobManager deployment while 
> preserving HA metadata.}}  -> rely on last-state (as we do not disable 
> fallback), no savepoint taken
>  * Flink start JM and deployment of the app. It well find the 3 checkpoints
>  * {{Using '/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico' as 
> Zookeeper namespace.}}
>  * {{Initializing job 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).}}
>  * {{Recovering checkpoints from 
> ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}}
>  * {{Found 3 checkpoints in 
> ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.}}
>  * {{{}Restoring job 6b24a364c1905e924a69f3dbff0d26a6 from Checkpoint 19 @ 
> 1692268003920 for 6b24a364c1905e924a69f3dbff0d26a6 located at 
> }}\{{{}s3p://.../flink-kafka-job-apache-nico/checkpoints/6b24a364c1905e924a69f3dbff0d26a6/chk-19{}}}{{{}.{}}}
>  * Job failed because of the missing operator
> {code:java}
> Job 6b24a364c1905e924a69f3dbff0d26a6 reached terminal state FAILED.
> org.apache.flink.runtime.client.JobInitializationException: Could not start 
> the JobMaster.
> Caused by: java.util.concurrent.CompletionException: 
> java.lang.IllegalStateException: There is no operator for the state 
> f298e8715b4d85e6f965b60e1c848cbe * Job 6b24a364c1905e924a69f3dbff0d26a6 has 
> been registered for cleanup in the JobResultStore after reaching a terminal 
> state.{code}
>  * {{Clean up the high availability data for job 
> 6b24a364c1905e924a69f3dbff0d26a6.}}
>  * {{Removed job graph 6b24a364c1905e924a69f3dbff0d26a6 from 
> ZooKeeperStateHandleStore\{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobgraphs'}.}}
>  * JobManager restart and try to resubmit the job but the job was already 
> submitted so finished
>  * {{Received JobGraph submission 'flink-kafka-job' 
> (6b24a364c1905e924a69f3dbff0d26a6).}}
>  * {{Ignoring JobGraph submission 'flink-kafka-job' 
> (6b24a364c1905e924a69f3dbff0d26a6) because the job already reached a 
> globally-terminal state (i.e. FAILED, CANCELED, FINISHED) in a previous 
> execution.}}
>  * {{Application completed SUCCESSFULLY}}
>  * Finally the operator rollback the deployment and still indicate that {{Job 
> is not running but HA metadata is available for last state restore, ready for 
> upgrade}}
>  * But the job metadata are not anymore there (clean previously)
>  
> {code:java}
> (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls 
> /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints
> Path 
> /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints
>  doesn't exist
> (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls 
> /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs
> (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls 
> /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico
> jobgraphs
> jobs
> leader
> {code}
>  
> The rolled back app from flink operator finally take the last provided 
> savepoint as no metadata/checkpoints are available. But this last savepoint 
> is an old one as during the upgrade the operator decided to rely on 
> last-state (The old savepoint taken is a scheduled one)




[jira] [Created] (FLINK-32973) Documentation for kubernetes.operator.job.savepoint-on-deletion indicate that it can be set on per resource level while it can't

2023-08-28 Thread Nicolas Fraison (Jira)
Nicolas Fraison created FLINK-32973:
---

 Summary: Documentation for 
kubernetes.operator.job.savepoint-on-deletion indicate that it can be set on 
per resource level while it can't
 Key: FLINK-32973
 URL: https://issues.apache.org/jira/browse/FLINK-32973
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Reporter: Nicolas Fraison


Documentation for kubernetes.operator.job.savepoint-on-deletion indicates that 
it can be set at the [per resource 
level|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/#resourceuser-configuration]
 while it can't.

The config is only retrieved from the operator configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-32973) Documentation for kubernetes.operator.job.savepoint-on-deletion indicate that it can be set on per resource level while it can't

2023-08-28 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-32973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759506#comment-17759506
 ] 

Nicolas Fraison commented on FLINK-32973:
-

I think that we should support the per-resource level configuration and not 
only update the documentation.

Will provide a patch for this
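
For illustration, a rough sketch of the intended per-resource resolution (the 
real option constant lives in the operator's KubernetesOperatorConfigOptions and 
the resolution flow in its config manager, so the names here are only 
indicative, not the actual patch):

{code:java}
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;
import org.apache.flink.configuration.Configuration;

import java.util.Map;

public final class SavepointOnDeletionConfig {

    // Mirrors the documented key; the actual ConfigOption is defined in the operator.
    static final ConfigOption<Boolean> SAVEPOINT_ON_DELETION =
            ConfigOptions.key("kubernetes.operator.job.savepoint-on-deletion")
                    .booleanType()
                    .defaultValue(false);

    /**
     * Per-resource resolution: start from the operator-level defaults and let the
     * resource's spec.flinkConfiguration override them, instead of reading the
     * operator configuration only.
     */
    public static boolean savepointOnDeletion(
            Configuration operatorDefaults, Map<String, String> resourceFlinkConfiguration) {
        Configuration effective = new Configuration(operatorDefaults);
        resourceFlinkConfiguration.forEach(effective::setString);
        return effective.get(SAVEPOINT_ON_DELETION);
    }
}
{code}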

> Documentation for kubernetes.operator.job.savepoint-on-deletion indicate that 
> it can be set on per resource level while it can't
> 
>
> Key: FLINK-32973
> URL: https://issues.apache.org/jira/browse/FLINK-32973
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Reporter: Nicolas Fraison
>Priority: Minor
>
> Documentation for kubernetes.operator.job.savepoint-on-deletion indicate that 
> it can be set on [per resource 
> level|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/#resourceuser-configuration]
>  while it can't.
> The config is only retrieved from operator configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-29199) Support blue-green deployment type

2023-08-29 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759856#comment-17759856
 ] 

Nicolas Fraison commented on FLINK-29199:
-

We are looking into supporting this deployment mode for our Flink users:
 * Is there any plan to support Blue/Green deployment within the flink operator?
 * Are some of you already performing such deployments relying on an external 
custom solution? If yes, I would be happy to discuss it if possible

> Support blue-green deployment type
> --
>
> Key: FLINK-29199
> URL: https://issues.apache.org/jira/browse/FLINK-29199
> Project: Flink
>  Issue Type: New Feature
>  Components: Kubernetes Operator
> Environment: Kubernetes
>Reporter: Oleg Vorobev
>Priority: Minor
>
> Are there any plans to support blue-green deployment/rollout mode similar to 
> *BlueGreen* in the 
> [flinkk8soperator|https://github.com/lyft/flinkk8soperator] to avoid downtime 
> while updating?
> The idea is to run a new version in parallel with an old one and remove the 
> old one only after the stability condition of the new one is satisfied (like 
> in 
> [rollbacks|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.1/docs/custom-resource/job-management/#application-upgrade-rollbacks-experimental]).
> For stateful apps with {*}upgradeMode: savepoint{*}, this means: not 
> cancelling an old job after creating a savepoint -> starting new job from 
> that savepoint -> waiting for it to become running/one successful 
> checkpoint/timeout or something else -> cancelling and removing old job.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-29199) Support blue-green deployment type

2023-08-30 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760361#comment-17760361
 ] 

Nicolas Fraison commented on FLINK-29199:
-

This is indeed a concern we have as we rely on kafka data sources.

Currently we are thinking of relying on 2 different kafka consumer groups (one 
blue and one green) for those 2 jobs, as sketched below. I need to validate that 
resuming the green deployment from a blue checkpoint with a different consumer 
group can be done.

For the double data output we can tolerate some duplicates for those specific 
jobs.
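
A minimal sketch of that idea (the topic, bootstrap servers and the way the 
color is passed are placeholders, not our actual job): the only difference 
between the blue and green deployments would be the consumer group id. My 
understanding is that on restore the Kafka source takes its offsets from the 
checkpoint state rather than from the group, with the group id mainly affecting 
committed-offset tracking, but that is exactly what needs to be validated.

{code:java}
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.kafka.clients.consumer.OffsetResetStrategy;

public class BlueGreenKafkaJob {

    public static void main(String[] args) throws Exception {
        // "blue" or "green", passed at deployment time (placeholder mechanism).
        final String color = args.length > 0 ? args[0] : "blue";

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")       // placeholder
                .setTopics("events")                     // placeholder
                .setGroupId("flink-kafka-job-" + color)  // distinct consumer group per color
                .setStartingOffsets(
                        OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-" + color).print();

        env.execute("flink-kafka-job-" + color);
    }
}
{code}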

> Support blue-green deployment type
> --
>
> Key: FLINK-29199
> URL: https://issues.apache.org/jira/browse/FLINK-29199
> Project: Flink
>  Issue Type: New Feature
>  Components: Kubernetes Operator
> Environment: Kubernetes
>Reporter: Oleg Vorobev
>Priority: Minor
>
> Are there any plans to support blue-green deployment/rollout mode similar to 
> *BlueGreen* in the 
> [flinkk8soperator|https://github.com/lyft/flinkk8soperator] to avoid downtime 
> while updating?
> The idea is to run a new version in parallel with an old one and remove the 
> old one only after the stability condition of the new one is satisfied (like 
> in 
> [rollbacks|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.1/docs/custom-resource/job-management/#application-upgrade-rollbacks-experimental]).
> For stateful apps with {*}upgradeMode: savepoint{*}, this means: not 
> cancelling an old job after creating a savepoint -> starting new job from 
> that savepoint -> waiting for it to become running/one successful 
> checkpoint/timeout or something else -> cancelling and removing old job.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-12869) Add yarn acls capability to flink containers

2019-07-02 Thread Nicolas Fraison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876774#comment-16876774
 ] 

Nicolas Fraison commented on FLINK-12869:
-

[~azagrebin] thanks for the feedback.
So my proposal is to add 2 new parameters (yarn.view.acls and yarn.admin.acls) 
and to call setApplicationACLs based on those 2 parameters for all spawned yarn 
containers. For this I will add a setAclsFor method in the flink-yarn Utils class.
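
For illustration, a minimal sketch of what such a helper could look like on top 
of the YARN API (this is not the final patch, just the shape of the proposal; 
empty values would mean leaving the default ACLs untouched):

{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationAccessType;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

import java.util.HashMap;
import java.util.Map;

public final class YarnAclUtils {

    /** Applies the configured yarn.view.acls / yarn.admin.acls to a container launch context. */
    public static void setAclsFor(
            ContainerLaunchContext launchContext, String viewAcls, String adminAcls) {
        Map<ApplicationAccessType, String> acls = new HashMap<>();
        if (viewAcls != null && !viewAcls.isEmpty()) {
            acls.put(ApplicationAccessType.VIEW_APP, viewAcls);
        }
        if (adminAcls != null && !adminAcls.isEmpty()) {
            acls.put(ApplicationAccessType.MODIFY_APP, adminAcls);
        }
        if (!acls.isEmpty()) {
            launchContext.setApplicationACLs(acls);
        }
    }
}
{code}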

> Add yarn acls capability to flink containers
> 
>
> Key: FLINK-12869
> URL: https://issues.apache.org/jira/browse/FLINK-12869
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / YARN
>Reporter: Nicolas Fraison
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Yarn provide application acls mechanism to be able to provide specific rights 
> to other users than the one running the job (view logs through the 
> resourcemanager/job history, kill the application)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-37504) Handle TLS Certificate Renewal

2025-03-21 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-37504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17937301#comment-17937301
 ] 

Nicolas Fraison commented on FLINK-37504:
-

Hi, thanks for the feedback.

I will look into creating a FLIP for this.

For the implementation I've seen that the Apache Flink Kubernetes Operator 
already manages certificate reload, so I've based the implementation on the one 
from that project: 
[https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-webhook/src/main/java/org/apache/flink/kubernetes/operator/admission/FlinkOperatorWebhook.java#L154]

The requirement for the certificate reload is not linked to the kubernetes 
usage but to a zero trust network policy applied in my company.

The created certificates have a one day validity and are renewed every 12 hours, 
which is why we need this feature to avoid restarting the job every day.
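
Just to illustrate the kind of mechanism needed (this is neither the operator's 
webhook code nor the proposed Flink change; paths, store type and the watch 
strategy are placeholders), a generic sketch of rebuilding an SSLContext when 
the keystore or truststore files change:

{code:java}
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

import java.io.InputStream;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.security.KeyStore;

public final class ReloadingSslContext {

    private volatile SSLContext current;

    public ReloadingSslContext(Path keystore, Path truststore, char[] password) throws Exception {
        current = load(keystore, truststore, password);

        // Watch the directory holding the stores and rebuild the context on any change event.
        WatchService watcher = FileSystems.getDefault().newWatchService();
        keystore.getParent().register(
                watcher, StandardWatchEventKinds.ENTRY_CREATE, StandardWatchEventKinds.ENTRY_MODIFY);

        Thread reloader = new Thread(() -> {
            try {
                while (true) {
                    WatchKey key = watcher.take();
                    key.pollEvents();
                    current = load(keystore, truststore, password);
                    key.reset();
                }
            } catch (Exception e) {
                Thread.currentThread().interrupt();
            }
        }, "tls-reloader");
        reloader.setDaemon(true);
        reloader.start();
    }

    /** The latest SSLContext; callers must fetch it per connection to pick up renewals. */
    public SSLContext get() {
        return current;
    }

    private static SSLContext load(Path keystorePath, Path truststorePath, char[] password)
            throws Exception {
        KeyStore keyStore = KeyStore.getInstance("PKCS12");
        try (InputStream in = Files.newInputStream(keystorePath)) {
            keyStore.load(in, password);
        }
        KeyStore trustStore = KeyStore.getInstance("PKCS12");
        try (InputStream in = Files.newInputStream(truststorePath)) {
            trustStore.load(in, password);
        }
        KeyManagerFactory kmf =
                KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
        kmf.init(keyStore, password);
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trustStore);
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);
        return context;
    }
}
{code}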

> Handle TLS Certificate Renewal
> --
>
> Key: FLINK-37504
> URL: https://issues.apache.org/jira/browse/FLINK-37504
> Project: Flink
>  Issue Type: Improvement
>Reporter: Nicolas Fraison
>Priority: Minor
>  Labels: pull-request-available
>
> Flink does not reload certificate if underlying truststore and keytstore are 
> updated.
> We aim at using 1 day validity certificate which currently means having to 
> restart our jobs every day.
> In order to avoid this we will need to add a feature to be able to reload TLS 
> certificate when underlying truststore and keytstore are updated



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-37504) Handle TLS Certificate Renewal

2025-03-18 Thread Nicolas Fraison (Jira)
Nicolas Fraison created FLINK-37504:
---

 Summary: Handle TLS Certificate Renewal
 Key: FLINK-37504
 URL: https://issues.apache.org/jira/browse/FLINK-37504
 Project: Flink
  Issue Type: Improvement
Reporter: Nicolas Fraison


Flink does not reload certificates if the underlying truststore and keystore are 
updated.

We aim at using 1 day validity certificates, which currently means having to 
restart our jobs every day.

In order to avoid this we will need to add a feature to reload TLS certificates 
when the underlying truststore and keystore are updated



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-37504) Handle TLS Certificate Renewal

2025-04-18 Thread Nicolas Fraison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-37504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17945662#comment-17945662
 ] 

Nicolas Fraison commented on FLINK-37504:
-

Created this FLIP: 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-523%3A+Handle+TLS+Certificate+Renewal

> Handle TLS Certificate Renewal
> --
>
> Key: FLINK-37504
> URL: https://issues.apache.org/jira/browse/FLINK-37504
> Project: Flink
>  Issue Type: Improvement
>Reporter: Nicolas Fraison
>Priority: Minor
>  Labels: pull-request-available
>
> Flink does not reload certificate if underlying truststore and keytstore are 
> updated.
> We aim at using 1 day validity certificate which currently means having to 
> restart our jobs every day.
> In order to avoid this we will need to add a feature to be able to reload TLS 
> certificate when underlying truststore and keytstore are updated



--
This message was sent by Atlassian Jira
(v8.20.10#820010)