Re: [PR] [FLINK-37730][Job Manager] Expose JM exception as K8s exceptions [flink-kubernetes-operator]

via GitHub Thu, 22 May 2025 01:42:13 -0700


gyfora commented on code in PR #978:
URL: https://github.com/apache/flink-kubernetes-operator/pull/978#discussion_r2101977625



##########
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/JobStatusObserver.java:
##########
@@ -95,6 +108,148 @@ public boolean observe(FlinkResourceContext<R> ctx) {
         return false;
     }
 
+    /**
+     * Observe the exceptions raised in the job manager and take appropriate action.
+     *
+     * @param ctx the context with which the operation is executed
+     */
+    protected void observeJobManagerExceptions(FlinkResourceContext<R> ctx) {
+        var resource = ctx.getResource();
+        var operatorConfig = ctx.getOperatorConfig();
+        var jobStatus = resource.getStatus().getJobStatus();
+
+        try {
+            var jobId = JobID.fromHexString(jobStatus.getJobId());
+            // TODO: Ideally the best way to restrict the number of events is to use the query param
+            // `maxExceptions`
+            //  but the JobExceptionsMessageParameters does not expose the parameters and nor does
+            // it have setters.
+            var history =
+                    ctx.getFlinkService().getJobExceptions(resource, jobId, ctx.getObserveConfig());
+
+            if (history == null || history.getExceptionHistory() == null) {
+                return;
+            }
+
+            var exceptionHistory = history.getExceptionHistory();
+            List<JobExceptionsInfoWithHistory.RootExceptionInfo> exceptions =
+                    exceptionHistory.getEntries();
+            if (exceptions == null || exceptions.isEmpty()) {
+                return;
+            }
+
+            if (exceptionHistory.isTruncated()) {
+                LOG.warn(
+                        "Job exception history is truncated for jobId '{}'. Some exceptions may be missing.",
+                        jobId);
+            }
+
+            String currentJobId = jobStatus.getJobId();
+            Instant lastRecorded = null; // first reconciliation
+
+            var cacheEntry = ctx.getExceptionCacheEntry();
+            // a cache entry is created should always be present. The timestamp for the first
+            // reconciliation would be
+            // when the job was created. This check is still necessary because even though there
+            // might be an entry,
+            // the jobId could have changed since the job was first created.
+            if (cacheEntry.getJobId().equals(currentJobId)) {
+                lastRecorded = Instant.ofEpochMilli(cacheEntry.getLastTimestamp());
+            }
+
+            Instant now = Instant.now();
+            int maxEvents = operatorConfig.getReportedExceptionEventsMaxCount();
+            int maxStackTraceLines = operatorConfig.getReportedExceptionEventsMaxStackTraceLength();
+
+            // Sort and reverse to prioritize the newest exceptions
+            var sortedExceptions = new ArrayList<>(exceptions);
+            sortedExceptions.sort(
+                    Comparator.comparingLong(
+                                    JobExceptionsInfoWithHistory.RootExceptionInfo::getTimestamp)
+                            .reversed());
+
+            int count = 0;
+            for (var exception : sortedExceptions) {
+                Instant exceptionTime = Instant.ofEpochMilli(exception.getTimestamp());
+                // Skip already recorded exceptions
+                if (lastRecorded != null && exceptionTime.isBefore(lastRecorded)) {
+                    continue;
+                }
+                emitJobManagerExceptionEvent(ctx, exception, exceptionTime, maxStackTraceLines);
+                if (++count >= maxEvents) {
+                    break;
+                }
+            }
+            ctx.getExceptionCacheEntry().setJobId(currentJobId);
+            ctx.getExceptionCacheEntry().setLastTimestamp(now.toEpochMilli());
+        } catch (Exception e) {
+            LOG.warn("Failed to fetch JobManager exception info.", e);
+        }
+    }
+
+    private void emitJobManagerExceptionEvent(
+            FlinkResourceContext<R> ctx,
+            JobExceptionsInfoWithHistory.RootExceptionInfo exception,
+            Instant exceptionTime,
+            int maxStackTraceLines) {
+
+        String exceptionName = exception.getExceptionName();
+        if (exceptionName == null || exceptionName.isBlank()) {
+            return;
+        }
+
+        Map<String, String> annotations = new HashMap<>();
+        annotations.put(
+                "event-time-readable",
+                DateTimeUtils.readable(exceptionTime, ZoneId.systemDefault()));
+        annotations.put("event-timestamp-millis", String.valueOf(exceptionTime.toEpochMilli()));

Review Comment:
   I think this should be a single field called `exception-timestamp`, and it 
should follow the Kubernetes timestamp formatting standard (`Instant.toString` 
should work, I guess).
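
   To illustrate the suggestion, here is a minimal sketch, assuming the 
annotation key `exception-timestamp` proposed above. The class and method names 
are hypothetical and not part of the PR. Kubernetes serializes timestamps as 
RFC 3339 strings, and `Instant.toString` produces an ISO-8601 UTC instant in a 
compatible form, so a single annotation could replace both 
`event-time-readable` and `event-timestamp-millis`:

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not code from the PR.
class ExceptionTimestampSketch {

    // Builds a single timestamp annotation from the exception's epoch millis.
    static Map<String, String> timestampAnnotation(long epochMillis) {
        Map<String, String> annotations = new HashMap<>();
        // Instant.toString() yields ISO-8601 in UTC, e.g. 1970-01-01T00:00:00Z.
        annotations.put("exception-timestamp", Instant.ofEpochMilli(epochMillis).toString());
        return annotations;
    }

    public static void main(String[] args) {
        System.out.println(timestampAnnotation(0L).get("exception-timestamp"));
    }
}
```

   This also avoids the dependency on `ZoneId.systemDefault()`, since the value 
is always rendered in UTC.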



##########
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/JobStatusObserver.java:
##########
@@ -95,6 +108,148 @@ public boolean observe(FlinkResourceContext<R> ctx) {
         return false;
     }
 
+    /**
+     * Observe the exceptions raised in the job manager and take appropriate action.
+     *
+     * @param ctx the context with which the operation is executed
+     */
+    protected void observeJobManagerExceptions(FlinkResourceContext<R> ctx) {
+        var resource = ctx.getResource();
+        var operatorConfig = ctx.getOperatorConfig();
+        var jobStatus = resource.getStatus().getJobStatus();
+
+        try {
+            var jobId = JobID.fromHexString(jobStatus.getJobId());
+            // TODO: Ideally the best way to restrict the number of events is to use the query param
+            // `maxExceptions`
+            //  but the JobExceptionsMessageParameters does not expose the parameters and nor does
+            // it have setters.
+            var history =
+                    ctx.getFlinkService().getJobExceptions(resource, jobId, ctx.getObserveConfig());
+
+            if (history == null || history.getExceptionHistory() == null) {
+                return;
+            }
+
+            var exceptionHistory = history.getExceptionHistory();
+            List<JobExceptionsInfoWithHistory.RootExceptionInfo> exceptions =
+                    exceptionHistory.getEntries();
+            if (exceptions == null || exceptions.isEmpty()) {
+                return;
+            }
+
+            if (exceptionHistory.isTruncated()) {
+                LOG.warn(
+                        "Job exception history is truncated for jobId '{}'. Some exceptions may be missing.",
+                        jobId);
+            }
+
+            String currentJobId = jobStatus.getJobId();
+            Instant lastRecorded = null; // first reconciliation
+
+            var cacheEntry = ctx.getExceptionCacheEntry();
+            // a cache entry is created should always be present. The timestamp for the first
+            // reconciliation would be
+            // when the job was created. This check is still necessary because even though there
+            // might be an entry,
+            // the jobId could have changed since the job was first created.
+            if (cacheEntry.getJobId().equals(currentJobId)) {
+                lastRecorded = Instant.ofEpochMilli(cacheEntry.getLastTimestamp());
+            }
+
+            Instant now = Instant.now();
+            int maxEvents = operatorConfig.getReportedExceptionEventsMaxCount();
+            int maxStackTraceLines = operatorConfig.getReportedExceptionEventsMaxStackTraceLength();
+
+            // Sort and reverse to prioritize the newest exceptions
+            var sortedExceptions = new ArrayList<>(exceptions);
+            sortedExceptions.sort(
+                    Comparator.comparingLong(
+                                    JobExceptionsInfoWithHistory.RootExceptionInfo::getTimestamp)
+                            .reversed());
+
+            int count = 0;
+            for (var exception : sortedExceptions) {
+                Instant exceptionTime = Instant.ofEpochMilli(exception.getTimestamp());
+                // Skip already recorded exceptions
+                if (lastRecorded != null && exceptionTime.isBefore(lastRecorded)) {
+                    continue;
+                }
+                emitJobManagerExceptionEvent(ctx, exception, exceptionTime, maxStackTraceLines);
+                if (++count >= maxEvents) {
+                    break;
+                }
+            }
+            ctx.getExceptionCacheEntry().setJobId(currentJobId);
+            ctx.getExceptionCacheEntry().setLastTimestamp(now.toEpochMilli());
+        } catch (Exception e) {
+            LOG.warn("Failed to fetch JobManager exception info.", e);
+        }
+    }
+
+    private void emitJobManagerExceptionEvent(
+            FlinkResourceContext<R> ctx,
+            JobExceptionsInfoWithHistory.RootExceptionInfo exception,
+            Instant exceptionTime,
+            int maxStackTraceLines) {
+
+        String exceptionName = exception.getExceptionName();
+        if (exceptionName == null || exceptionName.isBlank()) {
+            return;
+        }
+
+        Map<String, String> annotations = new HashMap<>();
+        annotations.put(
+                "event-time-readable",
+                DateTimeUtils.readable(exceptionTime, ZoneId.systemDefault()));
+        annotations.put("event-timestamp-millis", String.valueOf(exceptionTime.toEpochMilli()));
+
+        if (exception.getTaskName() != null) {
+            annotations.put("task-name", exception.getTaskName());
+        }
+        if (exception.getEndpoint() != null) {
+            annotations.put("endpoint", exception.getEndpoint());
+        }
+        if (exception.getTaskManagerId() != null) {
+            annotations.put("tm-id", exception.getTaskManagerId());
+        }
+
+        if (exception.getFailureLabels() != null) {
+            exception
+                    .getFailureLabels()
+                    .forEach((k, v) -> annotations.put("failure-label-" + k, v));
+        }
+
+        StringBuilder eventMessage = new StringBuilder(exceptionName);
+        String stacktrace = exception.getStacktrace();
+        if (stacktrace != null && !stacktrace.isBlank()) {
+            String[] lines = stacktrace.split("\n");
+            eventMessage.append("\n\nStacktrace (truncated):\n");
+            for (int i = 0; i < Math.min(maxStackTraceLines, lines.length); i++) {
+                eventMessage.append(lines[i]).append("\n");
+            }
+            if (lines.length > maxStackTraceLines) {
+                eventMessage
+                        .append("... (")
+                        .append(lines.length - maxStackTraceLines)
+                        .append(" more lines)");
+            }
+        }
+
+        String keyMessage =
+                exceptionName.length() > 128 ? exceptionName.substring(0, 128) : exceptionName;
+
+        eventRecorder.triggerEventOnceWithAnnotations(

Review Comment:
   I don't think the `eventRecorder.triggerEventOnceWithAnnotations` logic is a 
good fit here, because these can actually be distinct errors with the same 
message. 
   
   Keeping the message key is good, since it lets us simply bump the count, but 
the "if not exists" logic won't work.
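
   The difference between the two behaviours can be sketched as follows. This 
is a minimal in-memory illustration, not the operator's `EventRecorder`: the 
class `EventCountSketch`, its methods, and the map standing in for stored 
Kubernetes events are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for events stored in Kubernetes, keyed by message key.
class EventCountSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    // "If not exists" semantics: a later occurrence with the same key is dropped.
    boolean triggerOnce(String messageKey) {
        return counts.putIfAbsent(messageKey, 1) == null;
    }

    // Count-bumping semantics: every occurrence is recorded under the same key.
    int triggerOrBump(String messageKey) {
        return counts.merge(messageKey, 1, Integer::sum);
    }

    int count(String messageKey) {
        return counts.getOrDefault(messageKey, 0);
    }

    public static void main(String[] args) {
        EventCountSketch once = new EventCountSketch();
        once.triggerOnce("java.io.IOException");
        once.triggerOnce("java.io.IOException"); // second, distinct failure is lost
        System.out.println(once.count("java.io.IOException"));

        EventCountSketch bump = new EventCountSketch();
        bump.triggerOrBump("java.io.IOException");
        bump.triggerOrBump("java.io.IOException"); // second failure increments the count
        System.out.println(bump.count("java.io.IOException"));
    }
}
```

   With the first variant, two separate failures that share an exception name 
collapse into one event with a count of 1; with the second, the same key is 
reused but the count reflects both occurrences.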


