[jira] [Created] (FLINK-35279) Support "last-state" upgrade mode for FlinkSessionJob
Alan Zhang created FLINK-35279:
--
Summary: Support "last-state" upgrade mode for FlinkSessionJob
Key: FLINK-35279
URL: https://issues.apache.org/jira/browse/FLINK-35279
Project: Flink
Issue Type: New Feature
Reporter: Alan Zhang

The "last-state" upgrade mode is currently supported only in Flink application mode [1]. We should provide a similar user experience in session mode.

[1] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#stateful-and-stateless-application-upgrades

{code:java}
Last state upgrade mode is currently only supported for FlinkDeployments.
{code}

--
This message was sent by Atlassian Jira (v8.20.10#820010)
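To make the restriction quoted in the issue concrete, here is a minimal sketch of a check that rejects "last-state" for anything other than a FlinkDeployment. The class, enums, and method below are hypothetical and written only for illustration; they are not the operator's actual validator. The three upgrade modes are the ones described in the operator docs as I understand them.

```java
import java.util.Optional;

/** Illustrative sketch only; these types are hypothetical, not the operator's actual classes. */
final class UpgradeModeValidationSketch {

    /** Upgrade modes as described in the operator docs (assumed names). */
    enum UpgradeMode { STATELESS, SAVEPOINT, LAST_STATE }

    /** The two resource kinds discussed in the issue. */
    enum ResourceKind { FLINK_DEPLOYMENT, FLINK_SESSION_JOB }

    /**
     * Returns an error message when the combination is rejected, or empty when it is accepted.
     * Models the restriction quoted in the issue: last-state is accepted only for FlinkDeployments.
     */
    static Optional<String> validate(ResourceKind kind, UpgradeMode mode) {
        if (mode == UpgradeMode.LAST_STATE && kind != ResourceKind.FLINK_DEPLOYMENT) {
            return Optional.of(
                    "Last state upgrade mode is currently only supported for FlinkDeployments.");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // A session job asking for last-state is rejected under the current restriction.
        System.out.println(validate(ResourceKind.FLINK_SESSION_JOB, UpgradeMode.LAST_STATE));
        // The same mode on a FlinkDeployment passes.
        System.out.println(validate(ResourceKind.FLINK_DEPLOYMENT, UpgradeMode.LAST_STATE));
    }
}
```

Supporting the feature requested here would amount to relaxing the first branch for session jobs, together with whatever state-recovery logic session mode needs, which this sketch does not attempt to show.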
[jira] [Updated] (FLINK-35279) Support "last-state" upgrade mode for FlinkSessionJob
[ https://issues.apache.org/jira/browse/FLINK-35279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Zhang updated FLINK-35279:
---
Description:
The "last-state" upgrade mode is currently supported only in Flink application mode [1]. We should provide a consistent / similar user experience in Flink session mode.

[1] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#stateful-and-stateless-application-upgrades

{code:java}
Last state upgrade mode is currently only supported for FlinkDeployments.
{code}

was:
The "last-state" upgrade mode is currently supported only in Flink application mode [1]. We should provide a similar user experience in session mode.

[1] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#stateful-and-stateless-application-upgrades

{code:java}
Last state upgrade mode is currently only supported for FlinkDeployments.
{code}

> Support "last-state" upgrade mode for FlinkSessionJob
> --
>
> Key: FLINK-35279
> URL: https://issues.apache.org/jira/browse/FLINK-35279
> Project: Flink
> Issue Type: New Feature
> Reporter: Alan Zhang
> Priority: Major
[jira] [Created] (FLINK-36599) Correct the name of the config which enables response codes group
Alan Zhang created FLINK-36599:
--
Summary: Correct the name of the config which enables response codes group
Key: FLINK-36599
URL: https://issues.apache.org/jira/browse/FLINK-36599
Project: Flink
Issue Type: Improvement
Reporter: Alan Zhang

The "Metrics and Logging" doc does not use the full name of the operator config, i.e. the name including the operator config prefix "kubernetes.operator.":
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging

{code:java}
It’s possible to publish additional metrics by Http response code received from API server by setting kubernetes.client.metrics.http.response.code.groups.enabled to true .
{code}

This is confusing. We should make it consistent with the naming in the "Configuration" doc:
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/#system-metrics-configuration
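For illustration, the naming difference can be sketched as a small helper that turns the short key quoted from the "Metrics and Logging" doc into the prefixed form. The prefix "kubernetes.operator." comes from the issue text; the class and method names are hypothetical and exist only for this example.

```java
/** Illustrative helper; the class and method names are hypothetical. */
final class OperatorConfigKeySketch {

    /** Operator config prefix named in the issue. */
    static final String OPERATOR_PREFIX = "kubernetes.operator.";

    /** Returns the fully qualified key, adding the prefix only when it is missing. */
    static String fullKey(String key) {
        return key.startsWith(OPERATOR_PREFIX) ? key : OPERATOR_PREFIX + key;
    }

    public static void main(String[] args) {
        // Short name as written in the "Metrics and Logging" doc.
        String shortKey = "kubernetes.client.metrics.http.response.code.groups.enabled";
        System.out.println(fullKey(shortKey));
    }
}
```

Note that the short key already begins with "kubernetes.", which is probably part of why the unprefixed form in the doc is easy to mistake for the complete name.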
[jira] [Updated] (FLINK-36599) Correct the name of the config which enables response codes group
[ https://issues.apache.org/jira/browse/FLINK-36599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Zhang updated FLINK-36599:
---
Description:
The "Metrics and Logging" doc does not use the full name of the operator config, i.e. the name including the operator config prefix "kubernetes.operator.":
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging

{code:java}
It’s possible to publish additional metrics by Http response code received from API server by setting kubernetes.client.metrics.http.response.code.groups.enabled to true . {code}

This is confusing. We should make it consistent with the naming in the "Configuration" doc:
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/#system-metrics-configuration

was:
The "Metrics and Logging" doc does not use the full name of the operator config, i.e. the name including the operator config prefix "kubernetes.operator.":
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging

{code:java}
It’s possible to publish additional metrics by Http response code received from API server by setting kubernetes.client.metrics.http.response.code.groups.enabled to true .
{code}

This is confusing. We should make it consistent with the naming in the "Configuration" doc:
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/#system-metrics-configuration

> Correct the name of the config which enables response codes group
> -
>
> Key: FLINK-36599
> URL: https://issues.apache.org/jira/browse/FLINK-36599
> Project: Flink
> Issue Type: Improvement
> Reporter: Alan Zhang
> Priority: Minor
[jira] [Comment Edited] (FLINK-36673) Operator is not properly handling failed deployments without savepoints
[ https://issues.apache.org/jira/browse/FLINK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931738#comment-17931738 ]

Alan Zhang edited comment on FLINK-36673 at 3/2/25 1:41 AM:

> Checkpointing is enabled as a very last step when building the execution graph. If the job fails before that (e.g. when registering a source or a sink), the Flink runtime will return the "Checkpointing has not been enabled" exception.

[~gyfora] I observed exactly the same problem that [~sap1ens] mentioned here, using operator 1.10 and Flink 1.16, and I think this explanation makes sense.

I have a Flink application that consumes data from a Kafka topic. I used a wrong topic name, which resulted in the schema not being found:

{code:java}
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Error fetching avro schema for topic Samza-PageViewEvent1
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372)
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222)
    at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:98)
{code}

It seems the JM never reached the logic that enables checkpointing, because this error occurred before that step. I could see the Flink job marked as "failed" [1], and the JM returns the exception "Checkpointing has not been enabled" when I call the JM REST APIs.

In this case, rolling back by updating the image to a last-known-good version doesn't work, because the SnapshotObserver blocks it by throwing "ReconciliationException: Could not observe latest savepoint information" (full stacktrace is attached). Also, the job status of the FlinkDeployment in this case should be FAILED instead of RECONCILING [3].

[1] !Screenshot 2025-02-28 at 4.15.26 PM.png|width=481,height=207!
[2] !Screenshot 2025-02-28 at 8.51.37 PM.png|width=679,height=201!
[3] !Screenshot 2025-02-28 at 8.55.36 PM.png|width=618,height=386!

was (Author: alnzng):

> Checkpointing is enabled as a very last step when building the execution graph. If the job fails before that (e.g. when registering a source or a sink), the Flink runtime will return the "Checkpointing has not been enabled" exception.

[~gyfora] I observed exactly the same problem that [~sap1ens] mentioned here, using operator 1.10 and Flink 1.16, and I think this explanation makes sense.

I have a Flink application that consumes data from a Kafka topic. I used a wrong topic name, which resulted in the schema not being found:

{code:java}
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Error fetching avro schema for topic Samza-PageViewEvent1
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372)
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222)
    at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:98)
{code}

It seems the JM never reached the logic that enables checkpointing, because this error occurred before that step. I could see the Flink job marked as "failed" [1], and the JM returns the exception "Checkpointing has not been enabled" when I call the JM REST APIs.

In this case, rolling back by updating the image to a last-known-good version doesn't work, because the SnapshotObserver blocks it by throwing "ReconciliationException: Could not observe latest savepoint information" (full stacktrace is attached). Also, the job status in this case should be FAILED instead of RECONCILING [3].

[1] !Screenshot 2025-02-28 at 4.15.26 PM.png|width=481,height=207!
[2] !Screenshot 2025-02-28 at 8.51.37 PM.png|width=679,height=201!
[3] !Screenshot 2025-02-28 at 8.55.36 PM.png|width=618,height=386!
> Operator is not properly handling failed deployments without savepoints > --- > > Key: FLINK-36673 > URL: https://issues.apache.org/jira/browse/FLINK-36673 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Reporter: Yaroslav Tkachenko >Priority: Major > Attachments: Screenshot 2025-02-28 at 4.15.26 PM.png, Screenshot > 2025-02-28 at 8.51.37 PM.png, Screenshot 2025-02-28 at 8.55.36 PM.png, > stacktrace.txt > > > I noticed an issue after upgrading Flink Kubernetes Operator from 1.9 to 1.10. > When I deploy a FlinkDeployment that fails during the startup, I get a > "ReconciliationException: Could not observe latest savepoint information" > (full stacktrace is attached). > I think the issue was introduced here: > [https://github.com/apache/flink-kubernetes-operator/pull/871.] > *AbstractFlinkServ
[jira] [Updated] (FLINK-36673) Operator is not properly handling failed deployments without savepoints
[ https://issues.apache.org/jira/browse/FLINK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Zhang updated FLINK-36673:
---
Attachment: Screenshot 2025-02-28 at 8.51.37 PM.png

> Operator is not properly handling failed deployments without savepoints
> ---
>
> Key: FLINK-36673
> URL: https://issues.apache.org/jira/browse/FLINK-36673
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Reporter: Yaroslav Tkachenko
> Priority: Major
> Attachments: Screenshot 2025-02-28 at 4.15.26 PM.png, Screenshot 2025-02-28 at 8.51.37 PM.png, stacktrace.txt
>
> I noticed an issue after upgrading Flink Kubernetes Operator from 1.9 to 1.10.
> When I deploy a FlinkDeployment that fails during the startup, I get a "ReconciliationException: Could not observe latest savepoint information" (full stacktrace is attached).
> I think the issue was introduced here: https://github.com/apache/flink-kubernetes-operator/pull/871
> *AbstractFlinkService.getLastCheckpoint* now throws a *ReconciliationException* when a savepoint is not available, and *SnapshotObserver.observeLatestCheckpoint* doesn't handle it properly. I think having no savepoint is completely normal in some situations (e.g. a brand new job).
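One way to picture the behavior the reporter is asking for is to treat "no checkpoint available" as an ordinary, empty result rather than as a reconciliation failure. The sketch below is only an illustration under that assumption: every type in it is a hypothetical stand-in, it does not reproduce AbstractFlinkService or SnapshotObserver, and it is not the fix that was actually merged. The checkpoint path is made up.

```java
import java.util.Optional;

/** Illustrative sketch; all types here are hypothetical stand-ins, not the operator's classes. */
final class MissingCheckpointSketch {

    /** Stand-in for the error returned when a job fails before checkpointing is set up. */
    static final class CheckpointingNotEnabledException extends RuntimeException {
        CheckpointingNotEnabledException() {
            super("Checkpointing has not been enabled");
        }
    }

    /** Stand-in for a client that asks the JobManager for the latest checkpoint path. */
    interface CheckpointClient {
        String latestCheckpointPath();
    }

    /** Treats "checkpointing not enabled" as "no checkpoint yet" instead of failing the observer. */
    static Optional<String> observeLatestCheckpoint(CheckpointClient client) {
        try {
            return Optional.ofNullable(client.latestCheckpointPath());
        } catch (CheckpointingNotEnabledException e) {
            // A brand new or early-failed job has nothing to report; this is not an error here.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        CheckpointClient failedEarly = () -> { throw new CheckpointingNotEnabledException(); };
        CheckpointClient healthy = () -> "s3://example-bucket/checkpoints/chk-42"; // made-up path
        System.out.println(observeLatestCheckpoint(failedEarly).isPresent());
        System.out.println(observeLatestCheckpoint(healthy).orElse("none"));
    }
}
```

With an empty result, a caller could continue reconciling (for example, allowing a rollback to a last-known-good spec) instead of being blocked by the exception.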
[jira] [Updated] (FLINK-36673) Operator is not properly handling failed deployments without savepoints
[ https://issues.apache.org/jira/browse/FLINK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Zhang updated FLINK-36673:
---
Attachment: Screenshot 2025-02-28 at 4.15.26 PM.png

> Operator is not properly handling failed deployments without savepoints
> ---
>
> Key: FLINK-36673
> URL: https://issues.apache.org/jira/browse/FLINK-36673
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Reporter: Yaroslav Tkachenko
> Priority: Major
> Attachments: Screenshot 2025-02-28 at 4.15.26 PM.png, Screenshot 2025-02-28 at 8.51.37 PM.png, stacktrace.txt
[jira] [Commented] (FLINK-36673) Operator is not properly handling failed deployments without savepoints
[ https://issues.apache.org/jira/browse/FLINK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931738#comment-17931738 ]

Alan Zhang commented on FLINK-36673:

> Checkpointing is enabled as a very last step when building the execution graph. If the job fails before that (e.g. when registering a source or a sink), the Flink runtime will return the "Checkpointing has not been enabled" exception.

[~gyfora] I observed exactly the same problem that [~sap1ens] mentioned here, using operator 1.10, and I think this explanation makes sense.

I have a Flink application that consumes data from a Kafka topic. I used a wrong topic name, which resulted in the schema not being found:

{code:java}
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Error fetching avro schema for topic Samza-PageViewEvent1
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372)
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222)
    at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:98)
{code}

It seems the JM never reached the logic that enables checkpointing, because this error occurred before that step. I could see the Flink job marked as "failed" [1], and the JM returns the exception "Checkpointing has not been enabled" when I call the JM REST APIs.

In this case, rolling back by updating the image to a last-known-good version doesn't work, because the SnapshotObserver blocks it by throwing "ReconciliationException: Could not observe latest savepoint information" (full stacktrace is attached). Also, the job status in this case should be FAILED instead of RECONCILING [3].

[1] !Screenshot 2025-02-28 at 4.15.26 PM.png|width=481,height=207!
[2] !Screenshot 2025-02-28 at 8.51.37 PM.png|width=679,height=201!
[3] !Screenshot 2025-02-28 at 8.55.36 PM.png|width=618,height=386!
> Operator is not properly handling failed deployments without savepoints
> ---
>
> Key: FLINK-36673
> URL: https://issues.apache.org/jira/browse/FLINK-36673
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Reporter: Yaroslav Tkachenko
> Priority: Major
> Attachments: Screenshot 2025-02-28 at 4.15.26 PM.png, Screenshot 2025-02-28 at 8.51.37 PM.png, Screenshot 2025-02-28 at 8.55.36 PM.png, stacktrace.txt
[jira] [Comment Edited] (FLINK-36673) Operator is not properly handling failed deployments without savepoints
[ https://issues.apache.org/jira/browse/FLINK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931738#comment-17931738 ]

Alan Zhang edited comment on FLINK-36673 at 3/2/25 1:40 AM:

> Checkpointing is enabled as a very last step when building the execution graph. If the job fails before that (e.g. when registering a source or a sink), the Flink runtime will return the "Checkpointing has not been enabled" exception.

[~gyfora] I observed exactly the same problem that [~sap1ens] mentioned here, using operator 1.10 and Flink 1.16, and I think this explanation makes sense.

I have a Flink application that consumes data from a Kafka topic. I used a wrong topic name, which resulted in the schema not being found:

{code:java}
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Error fetching avro schema for topic Samza-PageViewEvent1
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372)
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222)
    at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:98)
{code}

It seems the JM never reached the logic that enables checkpointing, because this error occurred before that step. I could see the Flink job marked as "failed" [1], and the JM returns the exception "Checkpointing has not been enabled" when I call the JM REST APIs.

In this case, rolling back by updating the image to a last-known-good version doesn't work, because the SnapshotObserver blocks it by throwing "ReconciliationException: Could not observe latest savepoint information" (full stacktrace is attached). Also, the job status in this case should be FAILED instead of RECONCILING [3].

[1] !Screenshot 2025-02-28 at 4.15.26 PM.png|width=481,height=207!
[2] !Screenshot 2025-02-28 at 8.51.37 PM.png|width=679,height=201!
[3] !Screenshot 2025-02-28 at 8.55.36 PM.png|width=618,height=386!

was (Author: alnzng):

> Checkpointing is enabled as a very last step when building the execution graph. If the job fails before that (e.g. when registering a source or a sink), the Flink runtime will return the "Checkpointing has not been enabled" exception.

[~gyfora] I observed exactly the same problem that [~sap1ens] mentioned here, using operator 1.10, and I think this explanation makes sense.

I have a Flink application that consumes data from a Kafka topic. I used a wrong topic name, which resulted in the schema not being found:

{code:java}
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Error fetching avro schema for topic Samza-PageViewEvent1
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372)
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222)
    at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:98)
{code}

It seems the JM never reached the logic that enables checkpointing, because this error occurred before that step. I could see the Flink job marked as "failed" [1], and the JM returns the exception "Checkpointing has not been enabled" when I call the JM REST APIs.

In this case, rolling back by updating the image to a last-known-good version doesn't work, because the SnapshotObserver blocks it by throwing "ReconciliationException: Could not observe latest savepoint information" (full stacktrace is attached). Also, the job status in this case should be FAILED instead of RECONCILING [3].

[1] !Screenshot 2025-02-28 at 4.15.26 PM.png|width=481,height=207!
[2] !Screenshot 2025-02-28 at 8.51.37 PM.png|width=679,height=201!
[3] !Screenshot 2025-02-28 at 8.55.36 PM.png|width=618,height=386!
> Operator is not properly handling failed deployments without savepoints > --- > > Key: FLINK-36673 > URL: https://issues.apache.org/jira/browse/FLINK-36673 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Reporter: Yaroslav Tkachenko >Priority: Major > Attachments: Screenshot 2025-02-28 at 4.15.26 PM.png, Screenshot > 2025-02-28 at 8.51.37 PM.png, Screenshot 2025-02-28 at 8.55.36 PM.png, > stacktrace.txt > > > I noticed an issue after upgrading Flink Kubernetes Operator from 1.9 to 1.10. > When I deploy a FlinkDeployment that fails during the startup, I get a > "ReconciliationException: Could not observe latest savepoint information" > (full stacktrace is attached). > I think the issue was introduced here: > [https://github.com/apache/flink-kubernetes-operator/pull/871.] > *AbstractFlinkService.getLastCheckpoint* now throws a >
[jira] [Updated] (FLINK-36673) Operator is not properly handling failed deployments without savepoints
[ https://issues.apache.org/jira/browse/FLINK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Zhang updated FLINK-36673:
---
Attachment: Screenshot 2025-02-28 at 8.55.36 PM.png

> Operator is not properly handling failed deployments without savepoints
> ---
>
> Key: FLINK-36673
> URL: https://issues.apache.org/jira/browse/FLINK-36673
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Reporter: Yaroslav Tkachenko
> Priority: Major
> Attachments: Screenshot 2025-02-28 at 4.15.26 PM.png, Screenshot 2025-02-28 at 8.51.37 PM.png, Screenshot 2025-02-28 at 8.55.36 PM.png, stacktrace.txt
[jira] [Commented] (FLINK-36673) Operator is not properly handling failed deployments without savepoints
[ https://issues.apache.org/jira/browse/FLINK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931839#comment-17931839 ]

Alan Zhang commented on FLINK-36673:

Thanks [~gyfora]. This fix seems to work for streaming jobs as well, even though it was intended for batch jobs. I want to test this patch. Has version 1.11 been released?

I also wonder what the release cadence is for this operator; I didn't find related information in the operator docs. The latest release notes I found are for 1.10, and I didn't find any for 1.11: https://flink.apache.org/2024/10/25/apache-flink-kubernetes-operator-1.10.0-release-announcement/

> Operator is not properly handling failed deployments without savepoints
> ---
>
> Key: FLINK-36673
> URL: https://issues.apache.org/jira/browse/FLINK-36673
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Reporter: Yaroslav Tkachenko
> Priority: Major
> Attachments: Screenshot 2025-02-28 at 4.15.26 PM.png, Screenshot 2025-02-28 at 8.51.37 PM.png, Screenshot 2025-02-28 at 8.55.36 PM.png, stacktrace.txt