[jira] [Created] (FLINK-35279) Support "last-state" upgrade mode for FlinkSessionJob

2024-04-30 Thread Alan Zhang (Jira)
Alan Zhang created FLINK-35279:
--

 Summary: Support "last-state" upgrade mode for FlinkSessionJob 
 Key: FLINK-35279
 URL: https://issues.apache.org/jira/browse/FLINK-35279
 Project: Flink
  Issue Type: New Feature
Reporter: Alan Zhang


The "last-state" upgrade mode is currently supported only for Flink application 
mode [1]. We should provide a similar user experience in session mode.

 

[1] [https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#stateful-and-stateless-application-upgrades]
{code:java}
Last state upgrade mode is currently only supported for FlinkDeployments. {code}
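
For illustration, the restriction quoted above could be sketched roughly as follows. This is a hypothetical Java sketch, not the operator's actual validation code: the class UpgradeModeCheck, its nested enums, and the validate method are assumed names. Supporting "last-state" for FlinkSessionJob would amount to lifting this kind of check for session jobs.

```java
import java.util.Optional;

// Hypothetical sketch; none of these names are taken from the operator code base.
final class UpgradeModeCheck {
    enum UpgradeMode { STATELESS, SAVEPOINT, LAST_STATE }

    enum ResourceKind { FLINK_DEPLOYMENT, FLINK_SESSION_JOB }

    /** Returns an error message if the mode is rejected, or empty if it is accepted. */
    static Optional<String> validate(ResourceKind kind, UpgradeMode mode) {
        if (mode == UpgradeMode.LAST_STATE && kind != ResourceKind.FLINK_DEPLOYMENT) {
            return Optional.of(
                "Last state upgrade mode is currently only supported for FlinkDeployments.");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Session job with last-state: rejected under the current restriction.
        System.out.println(validate(ResourceKind.FLINK_SESSION_JOB, UpgradeMode.LAST_STATE).isPresent());
        // Application-mode deployment with last-state: accepted.
        System.out.println(validate(ResourceKind.FLINK_DEPLOYMENT, UpgradeMode.LAST_STATE).isPresent());
    }
}
```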



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-35279) Support "last-state" upgrade mode for FlinkSessionJob

2024-04-30 Thread Alan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Zhang updated FLINK-35279:
---
Description: 
The "last-state" upgrade mode is currently supported only for Flink application 
mode [1]. We should provide a consistent, similar user experience in Flink 
session mode.

[1] 
[https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#stateful-and-stateless-application-upgrades]
{code:java}
Last state upgrade mode is currently only supported for FlinkDeployments. {code}



> Support "last-state" upgrade mode for FlinkSessionJob 
> --
>
> Key: FLINK-35279
> URL: https://issues.apache.org/jira/browse/FLINK-35279
> Project: Flink
>  Issue Type: New Feature
>Reporter: Alan Zhang
>Priority: Major
>
> The "last-state" upgrade mode is currently supported only for Flink 
> application mode [1]. We should provide a consistent, similar user 
> experience in Flink session mode.
> [1] 
> [https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#stateful-and-stateless-application-upgrades]
> {code:java}
> Last state upgrade mode is currently only supported for FlinkDeployments. 
> {code}





[jira] [Created] (FLINK-36599) Correct the name of the config which enables response codes group

2024-10-24 Thread Alan Zhang (Jira)
Alan Zhang created FLINK-36599:
--

 Summary: Correct the name of the config which enables response 
codes group
 Key: FLINK-36599
 URL: https://issues.apache.org/jira/browse/FLINK-36599
 Project: Flink
  Issue Type: Improvement
Reporter: Alan Zhang


The "Metrics and Logging" doc does not use the full name of the operator config 
option (that is, it omits the operator config prefix "kubernetes.operator."): 
[https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging]

 
{code:java}
It’s possible to publish additional metrics by Http response code received from 
API server by setting 
kubernetes.client.metrics.http.response.code.groups.enabled to true . {code}

This is confusing; we should make it consistent with the naming in the 
"Configuration" doc: 
[https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/#system-metrics-configuration]
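
As a small illustration of the naming difference, the sketch below builds the fully prefixed key from the short name quoted above. The prefix and the short key come from this description; the class OperatorConfigKeys and the method fullName are hypothetical names used only for this example.

```java
// Hypothetical helper; only the prefix and the short key come from the issue text.
final class OperatorConfigKeys {
    // Operator config prefix quoted in the issue description.
    static final String OPERATOR_PREFIX = "kubernetes.operator.";

    /** Returns the key with the operator prefix, adding it only when it is missing. */
    static String fullName(String key) {
        return key.startsWith(OPERATOR_PREFIX) ? key : OPERATOR_PREFIX + key;
    }

    public static void main(String[] args) {
        // Short name as it appears in the "Metrics and Logging" doc.
        System.out.println(fullName("kubernetes.client.metrics.http.response.code.groups.enabled"));
    }
}
```

Under this assumption, the doc would refer to the option by the prefixed name that fullName returns rather than by the short name.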

 





[jira] [Updated] (FLINK-36599) Correct the name of the config which enables response codes group

2024-10-24 Thread Alan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Zhang updated FLINK-36599:
---
Description: 
The "Metrics and Logging" doc does not use the full name of the operator config 
option (that is, it omits the operator config prefix "kubernetes.operator."): 
[https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging]
{code:java}
It’s possible to publish additional metrics by Http response code received from 
API server by setting 
kubernetes.client.metrics.http.response.code.groups.enabled to true . {code}
This is confusing; we should make it consistent with the naming in the 
"Configuration" doc: 
[https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/#system-metrics-configuration]

 


 


> Correct the name of the config which enables response codes group
> -
>
> Key: FLINK-36599
> URL: https://issues.apache.org/jira/browse/FLINK-36599
> Project: Flink
>  Issue Type: Improvement
>Reporter: Alan Zhang
>Priority: Minor
>
> The "Metrics and Logging" doc does not use the full name of the operator 
> config option (that is, it omits the operator config prefix 
> "kubernetes.operator."): 
> [https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging]
> {code:java}
> It’s possible to publish additional metrics by Http response code received 
> from API server by setting 
> kubernetes.client.metrics.http.response.code.groups.enabled to true . {code}
> This is confusing; we should make it consistent with the naming in the 
> "Configuration" doc: 
> [https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/#system-metrics-configuration]
>  





[jira] [Comment Edited] (FLINK-36673) Operator is not properly handling failed deployments without savepoints

2025-03-01 Thread Alan Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931738#comment-17931738
 ] 

Alan Zhang edited comment on FLINK-36673 at 3/2/25 1:41 AM:


> Checkpointing is enabled as a very last step when building the execution 
> graph. If the job fails before that (e.g. when registering a source or a 
> sink), the Flink runtime will return the "Checkpointing has not been enabled" 
> exception.

[~gyfora] I observed exactly the same problem that [~sap1ens] described here. 
I'm using operator 1.10 and Flink 1.16, and I think this explanation makes 
sense.

I have a Flink application that consumes data from a Kafka topic, but I used 
the wrong topic name, which caused the schema lookup to fail. 
{code:java}
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Error fetching avro schema for topic Samza-PageViewEvent1
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372)
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222)
    at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:98)
{code}
It seems the JM never reached the logic that enables checkpointing, because 
this error occurred before that step. I could see the Flink job marked as 
"failed" [1], and calling the JM REST APIs returned the exception 
"Checkpointing has not been enabled".

In this case, rolling back by updating the image to a last-known-good version 
doesn't work, because the SnapshotObserver blocks it by throwing the exception 
"ReconciliationException: Could not observe latest savepoint information" 
(full stacktrace is attached). Also, the job status of the FlinkDeployment in 
this case should be FAILED instead of RECONCILING [3].

[1]
!Screenshot 2025-02-28 at 4.15.26 PM.png|width=481,height=207!
[2]
!Screenshot 2025-02-28 at 8.51.37 PM.png|width=679,height=201!
[3]
!Screenshot 2025-02-28 at 8.55.36 PM.png|width=618,height=386!
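
To make the behaviour described above concrete, here is a minimal, hypothetical Java sketch of an observer that treats "Checkpointing has not been enabled" as "no checkpoint yet" instead of failing reconciliation. It is not the operator's actual code or the actual fix: the class CheckpointObserverSketch, the interface CheckpointClient, and the matching on the exception message are all assumptions made for illustration.

```java
import java.util.Optional;

// Hypothetical sketch; these names do not come from the operator code base.
final class CheckpointObserverSketch {
    /** Assumed stand-in for a REST call to the JobManager. */
    interface CheckpointClient {
        String latestCheckpointPath() throws Exception;
    }

    // Message quoted in the comment above.
    static final String NOT_ENABLED = "Checkpointing has not been enabled";

    /** Returns the latest checkpoint path, or empty when checkpointing was never enabled. */
    static Optional<String> observeLatestCheckpoint(CheckpointClient client) throws Exception {
        try {
            return Optional.ofNullable(client.latestCheckpointPath());
        } catch (Exception e) {
            String msg = e.getMessage();
            if (msg != null && msg.contains(NOT_ENABLED)) {
                // The job failed before checkpointing was set up: report "no checkpoint".
                return Optional.empty();
            }
            throw e; // Any other failure still surfaces to the caller.
        }
    }

    public static void main(String[] args) throws Exception {
        Optional<String> result = observeLatestCheckpoint(() -> {
            throw new IllegalStateException(NOT_ENABLED);
        });
        System.out.println(result.isPresent());
    }
}
```

With handling along these lines, a job that fails during startup would not block a later spec change such as a rollback, while other errors would still be reported.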



> Operator is not properly handling failed deployments without savepoints
> ---
>
> Key: FLINK-36673
> URL: https://issues.apache.org/jira/browse/FLINK-36673
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Reporter: Yaroslav Tkachenko
>Priority: Major
> Attachments: Screenshot 2025-02-28 at 4.15.26 PM.png, Screenshot 
> 2025-02-28 at 8.51.37 PM.png, Screenshot 2025-02-28 at 8.55.36 PM.png, 
> stacktrace.txt
>
>
> I noticed an issue after upgrading Flink Kubernetes Operator from 1.9 to 1.10.
> When I deploy a FlinkDeployment that fails during the startup, I get a 
> "ReconciliationException: Could not observe latest savepoint information" 
> (full stacktrace is attached). 
> I think the issue was introduced here: 
> [https://github.com/apache/flink-kubernetes-operator/pull/871]. 
> *AbstractFlinkServ

[jira] [Updated] (FLINK-36673) Operator is not properly handling failed deployments without savepoints

2025-03-01 Thread Alan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Zhang updated FLINK-36673:
---
Attachment: Screenshot 2025-02-28 at 8.51.37 PM.png

> Operator is not properly handling failed deployments without savepoints
> ---
>
> Key: FLINK-36673
> URL: https://issues.apache.org/jira/browse/FLINK-36673
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Reporter: Yaroslav Tkachenko
>Priority: Major
> Attachments: Screenshot 2025-02-28 at 4.15.26 PM.png, Screenshot 
> 2025-02-28 at 8.51.37 PM.png, stacktrace.txt
>
>
> I noticed an issue after upgrading Flink Kubernetes Operator from 1.9 to 1.10.
> When I deploy a FlinkDeployment that fails during the startup, I get a 
> "ReconciliationException: Could not observe latest savepoint information" 
> (full stacktrace is attached). 
> I think the issue was introduced here: 
> [https://github.com/apache/flink-kubernetes-operator/pull/871]. 
> *AbstractFlinkService.getLastCheckpoint* now throws a 
> *ReconciliationException* when a savepoint is not available, and 
> *SnapshotObserver.observeLatestCheckpoint* doesn't handle it properly. I 
> think having no savepoint is completely normal in some situations (e.g. a 
> brand new job). 





[jira] [Updated] (FLINK-36673) Operator is not properly handling failed deployments without savepoints

2025-03-01 Thread Alan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Zhang updated FLINK-36673:
---
Attachment: Screenshot 2025-02-28 at 4.15.26 PM.png



[jira] [Commented] (FLINK-36673) Operator is not properly handling failed deployments without savepoints

2025-03-01 Thread Alan Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931738#comment-17931738
 ] 

Alan Zhang commented on FLINK-36673:


> Checkpointing is enabled as a very last step when building the execution 
> graph. If the job fails before that (e.g. when registering a source or a 
> sink), the Flink runtime will return the "Checkpointing has not been enabled" 
> exception.

[~gyfora] I observed exactly the same problem that [~sap1ens] described here. 
I'm using operator 1.10, and I think this explanation makes sense.

I have a Flink application that consumes data from a Kafka topic, but I used 
the wrong topic name, which caused the schema lookup to fail. 

{code:java}
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Error fetching avro schema for topic Samza-PageViewEvent1
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372)
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222)
    at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:98)
{code}
It seems the JM never reached the logic that enables checkpointing, because 
this error occurred before that step. I could see the Flink job marked as 
"failed" [1], and calling the JM REST APIs returned the exception 
"Checkpointing has not been enabled".

In this case, rolling back by updating the image to a last-known-good version 
doesn't work, because the SnapshotObserver blocks it by throwing the exception 
"ReconciliationException: Could not observe latest savepoint information" 
(full stacktrace is attached). Also, the job status in this case should be 
FAILED instead of RECONCILING [3].

[1]
!Screenshot 2025-02-28 at 4.15.26 PM.png|width=481,height=207!
[2]
!Screenshot 2025-02-28 at 8.51.37 PM.png|width=679,height=201!
[3]
!Screenshot 2025-02-28 at 8.55.36 PM.png|width=618,height=386!



[jira] [Comment Edited] (FLINK-36673) Operator is not properly handling failed deployments without savepoints

2025-03-01 Thread Alan Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931738#comment-17931738
 ] 

Alan Zhang edited comment on FLINK-36673 at 3/2/25 1:40 AM:


> Checkpointing is enabled as a very last step when building the execution 
> graph. If the job fails before that (e.g. when registering a source or a 
> sink), the Flink runtime will return the "Checkpointing has not been enabled" 
> exception.

[~gyfora] I observed exactly the same problem that [~sap1ens] described here. 
I'm using operator 1.10 and Flink 1.16, and I think this explanation makes 
sense.

I have a Flink application that consumes data from a Kafka topic, but I used 
the wrong topic name, which caused the schema lookup to fail. 
{code:java}
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Error fetching avro schema for topic Samza-PageViewEvent1
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372)
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222)
    at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:98)
{code}
It seems the JM never reached the logic that enables checkpointing, because 
this error occurred before that step. I could see the Flink job marked as 
"failed" [1], and calling the JM REST APIs returned the exception 
"Checkpointing has not been enabled".

In this case, rolling back by updating the image to a last-known-good version 
doesn't work, because the SnapshotObserver blocks it by throwing the exception 
"ReconciliationException: Could not observe latest savepoint information" 
(full stacktrace is attached). Also, the job status in this case should be 
FAILED instead of RECONCILING [3].

[1]
!Screenshot 2025-02-28 at 4.15.26 PM.png|width=481,height=207!
[2]
!Screenshot 2025-02-28 at 8.51.37 PM.png|width=679,height=201!
[3]
!Screenshot 2025-02-28 at 8.55.36 PM.png|width=618,height=386!




[jira] [Updated] (FLINK-36673) Operator is not properly handling failed deployments without savepoints

2025-03-01 Thread Alan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Zhang updated FLINK-36673:
---
Attachment: Screenshot 2025-02-28 at 8.55.36 PM.png



[jira] [Commented] (FLINK-36673) Operator is not properly handling failed deployments without savepoints

2025-03-02 Thread Alan Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931839#comment-17931839
 ] 

Alan Zhang commented on FLINK-36673:


Thanks [~gyfora]. This fix seems to work for streaming jobs as well, even 
though it was intended for batch jobs. I want to test this patch. Has version 
1.11 been released? 

I wonder what the release cadence for this operator is; I didn't find related 
information in the operator docs. The latest release notes I found are for 
1.10, and I didn't find any for 1.11: 
https://flink.apache.org/2024/10/25/apache-flink-kubernetes-operator-1.10.0-release-announcement/
