Hi ZhenQiu && Rohrmann:

I have backported FLINK-10868 to flink-1.5. Most of my jobs (all batch
jobs) now exit immediately once the number of failed container requests
reaches the upper limit, but some jobs still do not exit immediately. From
the logs I can see that in these jobs the JobManager times out first, for
reasons that are still unknown. Code segment 1 runs after the JobManager
has timed out but before it has reconnected, so I suspect the JobManager
registration is no longer present at that point and the
notifyAllocationFailure() method in code segment 2 is therefore never
executed.


I'm wondering whether you have encountered similar problems and whether
there is a solution. To handle the jobs that cannot exit immediately, I am
currently considering calling the onFatalError() method to exit the
process right away when jobManagerRegistration == null (a sketch of this
change follows after code segment 2 below), but it is not yet clear to me
whether this blunt approach has any side effects.


Thanks,
Anyang


code segment 1 in ResourceManager.java:

private void cancelAllPendingSlotRequests(Exception cause) {
   slotManager.cancelAllPendingSlotRequests(cause);
}


code segment 2 in ResourceManager.java:

@Override
public void notifyAllocationFailure(JobID jobId, AllocationID allocationId, Exception cause) {
   validateRunsInMainThread();
   log.info("Slot request with allocation id {} for job {} failed.", allocationId, jobId, cause);

   JobManagerRegistration jobManagerRegistration = jobManagerRegistrations.get(jobId);
   if (jobManagerRegistration != null) {
      jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId, cause);
   }
}
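
A minimal sketch of the change I am considering, based on code segment 2
(the else branch, the FlinkException wording, and the assumption that
onFatalError() is safe to call from this point are my own, not committed
code):

// requires org.apache.flink.util.FlinkException
@Override
public void notifyAllocationFailure(JobID jobId, AllocationID allocationId, Exception cause) {
   validateRunsInMainThread();
   log.info("Slot request with allocation id {} for job {} failed.", allocationId, jobId, cause);

   JobManagerRegistration jobManagerRegistration = jobManagerRegistrations.get(jobId);
   if (jobManagerRegistration != null) {
      jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId, cause);
   } else {
      // The JobManager has timed out and not yet reconnected, so nobody can fail the job;
      // fail fast so the batch job's process exits instead of hanging.
      onFatalError(new FlinkException(
         "Could not notify JobManager " + jobId + " about the allocation failure.", cause));
   }
}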
