gyfora commented on code in PR #978:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/978#discussion_r2077560344


##########
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java:
##########
@@ -299,9 +303,92 @@ public boolean 
reconcileOtherChanges(FlinkResourceContext<FlinkDeployment> ctx)
             return true;
         }
 
+        // check for JobManager exceptions if the REST API server is still up.
+        if (!ReconciliationUtils.isJobInTerminalState(deployment.getStatus())) 
{
+            observeJobManagerExceptions(ctx, deployment, observeConfig);
+        }
+
         return cleanupTerminalJmAfterTtl(ctx.getFlinkService(), deployment, 
observeConfig);
     }
 
+    private void observeJobManagerExceptions(
+            FlinkResourceContext<FlinkDeployment> ctx,
+            FlinkDeployment deployment,
+            Configuration observeConfig) {
+        try {
+            var jobId = 
JobID.fromHexString(deployment.getStatus().getJobStatus().getJobId());
+            var history = ctx.getFlinkService().getJobExceptions(deployment, 
jobId, observeConfig);
+            if (history == null || history.getExceptionHistory() == null) {
+                return;
+            }
+            var exceptionHistory = history.getExceptionHistory();
+            var exceptions = exceptionHistory.getEntries();
+            if (exceptions.isEmpty()) {
+                LOG.info(String.format("No exceptions found in job exception 
history for jobId '%s'.", jobId));
+                return;
+            }
+            if (exceptionHistory.isTruncated()) {
+                LOG.warn(String.format("Job exception history is truncated for 
jobId '%s'. "
+                        + "Some exceptions are not shown.", jobId));
+            }
+            for (var exception : exceptions) {
+                emitJobManagerExceptionEvent(ctx, deployment, exception);

Review Comment:
   We should be careful here as the currently logic has a very high potential 
for introducing large load on the kubernetes api server.
   
   We need to add 2 safeguards here:
   1. Only trigger events for exceptions that happened since the last 
reconciliation. We can filter based on timestamp, cache the last sent event or 
some other mechanism but we cannot rely on the EventRecorder here as it will 
destroy the k8s api server
   
   2. In general have a limit on number of exception events triggered per 
reconciliation and simply drop the rest. Jobs can fail/generate exceptions very 
quickly. So I would not record more than let's say 5-10 errors per 
reconciliation this way (we can make this configurable but let's have a 
conservative starting limit here to avoid some bad surprises)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to