Hi Peter, For our online batch task, there is a scene where the failed Container reaches MAXIMUM_WORKERS_FAILURE_RATE but the client will not immediately exit (the probability of JM loss is greatly improved when thousands of Containers is to be started). It is found that the JM disconnection (the reason for JM loss is unknown) will cause the notifyAllocationFailure not to take effect.
After the introduction of FLINK-13184 <https://jira.apache.org/jira/browse/FLINK-13184> to start the container with multi-threaded, the JM disconnection situation has been alleviated. In order to stably implement the client immediate exit, we use the following code to determine whether call onFatalError when MaximumFailedTaskManagerExceedingException is occurd: @Override public void notifyAllocationFailure(JobID jobId, AllocationID allocationId, Exception cause) { validateRunsInMainThread(); JobManagerRegistration jobManagerRegistration = jobManagerRegistrations.get(jobId); if (jobManagerRegistration != null) { jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId, cause); } else { if (exitProcessOnJobManagerTimedout) { ResourceManagerException exception = new ResourceManagerException("Job Manager is lost, can not notify allocation failure."); onFatalError(exception); } } } Best regards, Anyang