[jira] [Comment Edited] (FLINK-37730) Collect job exceptions as kubernetes events

Santwana Verma (Jira) Thu, 01 May 2025 23:47:31 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-37730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17948811#comment-17948811
 ]


Santwana Verma edited comment on FLINK-37730 at 5/2/25 6:45 AM:
----------------------------------------------------------------

I looked into this and here is my very early proposal:
 * Introduce an API method in RestClusterClient to fetch the exception history 
from the JobManager. This is a change in Flink itself (the class lives in 
flink-clients), more specifically 
[here|https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java]
 * Use this client in the Flink Kubernetes Operator to fetch the exceptions at 
reconciliation time. For this, we would need a new method in 
`FlinkService`, something like `getJobManagerExceptionHistory(Configuration 
conf, String jobId)`, which calls the new API method. I think a good place for 
this call is `ApplicationReconciler#reconcileOtherChanges`, after we 
have verified that no restart is needed.
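
To make the shape of the two steps above concrete, here is a rough, self-contained sketch. It does not use the real Flink or operator classes: `ExceptionEntry` and `ExceptionHistoryClient` are hypothetical stand-ins for whatever the new RestClusterClient method would return and expose. The `Configuration` argument is left out, and `reconcileOtherChanges` is reduced to a static helper that only illustrates the ordering (restart check first, exception fetch second).

```java
import java.util.Collections;
import java.util.List;

final class ExceptionHistoryProposalSketch {

    /** Hypothetical, trimmed-down exception history entry (not Flink's real type). */
    static final class ExceptionEntry {
        final String exceptionName;
        final long timestampMillis;

        ExceptionEntry(String exceptionName, long timestampMillis) {
            this.exceptionName = exceptionName;
            this.timestampMillis = timestampMillis;
        }
    }

    /** Hypothetical stand-in for the API method proposed for RestClusterClient. */
    interface ExceptionHistoryClient {
        List<ExceptionEntry> getExceptionHistory(String jobId);
    }

    /**
     * Stand-in for the proposed FlinkService method. The Configuration
     * parameter from the proposal is omitted to keep the sketch self-contained.
     */
    static List<ExceptionEntry> getJobManagerExceptionHistory(
            ExceptionHistoryClient client, String jobId) {
        return client.getExceptionHistory(jobId);
    }

    /**
     * Illustrates the proposed placement only: the exception history is
     * fetched once we know that no restart is needed.
     */
    static List<ExceptionEntry> reconcileOtherChanges(
            boolean restartNeeded, ExceptionHistoryClient client, String jobId) {
        if (restartNeeded) {
            // A restart takes priority; skip the exception lookup in this pass.
            return Collections.emptyList();
        }
        return getJobManagerExceptionHistory(client, jobId);
    }
}
```

Keeping the client behind a small interface would also make the reconciler logic easy to test with a fake, without a running JobManager.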

FKO currently uses Flink version 1.20.1.


was (Author: sverma):
I looked into this and here is my very early proposal:
 * Introduce the API method in RestClusterClient to get the exceptions history 
in the flink-runtime, more specifically here
 * Use this client in the Flink Kubernetes Operator to get the exception at the 
time of reconciliation. For this, we would need a new method in the 
`FlinkService`, something like `getJobManagerExceptionHistory(Configuration 
conf, String jobId)`, which will call the API method. I think a good place for 
this reconciliation is `ApplicationReconciler#reconcileOtherChanges` after we 
have verified that no restart is needed.

FKO currently uses flink version 1.20.1

> Collect job exceptions as kubernetes events
> -------------------------------------------
>
>                 Key: FLINK-37730
>                 URL: https://issues.apache.org/jira/browse/FLINK-37730
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>            Reporter: Robert Metzger
>            Priority: Major
>
> As far as I understand, the Flink Kubernetes Operator currently does not 
> track the exception history of a job, which is listed in the JobManager UI.
> Exposing the exception history in the CR is not feasible due to size concerns.
> Exposing the exception history as Kubernetes events seems to be a reasonable 
> middle ground. Events have a default expiration of 1 hour on the Kubernetes 
> API server.
> We could introduce a config parameter for the number of exceptions from the 
> history to replicate into k8s events.
> For example, assume a Flink job has 5 exceptions and the user has configured 
> the history size to be 4. FKO will regularly check whether there are exception 
> events (based on the exception timestamp) for the last 4 exceptions. If not, 
> those events will be created.
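
The check described in the issue could be sketched roughly as below. This is an illustration only, not operator code: the class and method names are hypothetical, exceptions and events are reduced to their timestamps, and it assumes the timestamp alone is enough to match an exception to an existing event.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

final class ExceptionEventPlanner {

    /**
     * Returns the timestamps of the newest {@code historySize} exceptions
     * that do not yet have a matching event, newest first.
     */
    static List<Long> missingEventTimestamps(
            List<Long> exceptionTimestamps,
            Set<Long> existingEventTimestamps,
            int historySize) {
        // Sort a copy so the caller's list is left untouched.
        List<Long> newestFirst = new ArrayList<>(exceptionTimestamps);
        newestFirst.sort(Collections.reverseOrder());

        // Only the configured number of most recent exceptions is considered.
        int limit = Math.min(Math.max(historySize, 0), newestFirst.size());

        List<Long> missing = new ArrayList<>();
        for (Long timestamp : newestFirst.subList(0, limit)) {
            if (!existingEventTimestamps.contains(timestamp)) {
                missing.add(timestamp);
            }
        }
        return missing;
    }
}
```

With 5 exceptions and a history size of 4, the oldest exception is never considered. Of the remaining 4, only those without an event at the same timestamp are returned for creation.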



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
