[ https://issues.apache.org/jira/browse/FLINK-37730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17948811#comment-17948811 ]
Santwana Verma edited comment on FLINK-37730 at 5/2/25 6:45 AM:
----------------------------------------------------------------

I looked into this and here is my very early proposal:
 * Introduce an API method in RestClusterClient to get the exception history. This is a change in Flink itself (the flink-clients module), more specifically [here|https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java]
 * Use this client in the Flink Kubernetes Operator to get the exceptions at reconciliation time. For this, we would need a new method in the `FlinkService`, something like `getJobManagerExceptionHistory(Configuration conf, String jobId)`, which will call the API method. I think a good place for this is `ApplicationReconciler#reconcileOtherChanges`, after we have verified that no restart is needed.

FKO currently uses Flink version 1.20.1.


> Collect job exceptions as kubernetes events
> -------------------------------------------
>
>                 Key: FLINK-37730
>                 URL: https://issues.apache.org/jira/browse/FLINK-37730
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>            Reporter: Robert Metzger
>            Priority: Major
>
> In my understanding, the Flink Kubernetes Operator currently does not track
> the exception history of a job, as listed in the JobManager UI.
> Exposing the exception history in the CR is not feasible due to size concerns.
> Exposing the exception history as Kubernetes events seems to be a reasonable
> middle ground. Events have a default expiration of 1 hour on the Kubernetes
> API server.
> We could introduce a config parameter for the number of exceptions from the
> history to replicate into k8s events.
> Assume a Flink job has 5 exceptions and the user has configured the history
> size to be 4. FKO will regularly check whether exception events exist (matched
> by exception timestamp) for the last 4 exceptions. If not, those events will
> be created.
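To make the example in the description concrete, here is a minimal sketch of the check "which of the last N exceptions still lack an event". It is only an illustration under assumptions: `ExceptionEventSync`, `JobException` and `missingEvents` are hypothetical names, not existing operator or Flink classes, and fetching the exception history and listing the existing Kubernetes events are left out entirely.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the Flink Kubernetes Operator.
class ExceptionEventSync {

    /** Hypothetical, simplified view of one entry in the exception history. */
    static final class JobException {
        private final long timestampMillis;
        private final String message;

        JobException(long timestampMillis, String message) {
            this.timestampMillis = timestampMillis;
            this.message = message;
        }

        long timestampMillis() {
            return timestampMillis;
        }

        String message() {
            return message;
        }
    }

    /**
     * Returns those of the newest {@code historySize} exceptions that have no event yet,
     * oldest first. An event is matched to an exception by the exception timestamp.
     */
    static List<JobException> missingEvents(
            List<JobException> history, Set<Long> existingEventTimestamps, int historySize) {
        // Sort a copy newest first so the input list is left untouched.
        List<JobException> newestFirst = new ArrayList<>(history);
        newestFirst.sort(Comparator.comparingLong(JobException::timestampMillis).reversed());

        // Only the newest historySize entries are candidates for replication.
        int limit = Math.min(Math.max(historySize, 0), newestFirst.size());

        List<JobException> missing = new ArrayList<>();
        for (JobException e : newestFirst.subList(0, limit)) {
            if (!existingEventTimestamps.contains(e.timestampMillis())) {
                missing.add(e);
            }
        }
        // Oldest first, so events would be created in the order the exceptions occurred.
        missing.sort(Comparator.comparingLong(JobException::timestampMillis));
        return missing;
    }

    public static void main(String[] args) {
        // 5 exceptions in the history, configured history size 4,
        // events already exist for the exceptions at t=3000 and t=4000.
        List<JobException> history = List.of(
                new JobException(1000L, "first"),
                new JobException(2000L, "second"),
                new JobException(3000L, "third"),
                new JobException(4000L, "fourth"),
                new JobException(5000L, "fifth"));
        for (JobException e : missingEvents(history, Set.of(3000L, 4000L), 4)) {
            System.out.println(e.timestampMillis() + " " + e.message());
        }
    }
}
```

With the inputs in `main`, the exception at t=1000 is ignored because it falls outside the last 4, and only the exceptions at t=2000 and t=5000 would get new events. Matching on the timestamp alone follows the description above; a real implementation might need a stronger key if two exceptions can share a timestamp.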