Suggestion 1 makes sense. For the quick termination I think we need to think a bit more about it to find a good solution also to support strict SLA requirements.
Cheers, Till On Wed, Sep 11, 2019 at 11:11 AM Anyang Hu <huanyang1...@gmail.com> wrote: > Hi Till, > > Some of our online batch tasks have strict SLA requirements, and they are > not allowed to be stuck for a long time. Therefore, we take a rude way to > make the job exit immediately. The way to wait for connection recovery is a > better solution. Maybe we need to add a timeout to wait for JM to restore > the connection? > > For suggestion 1, make interval configurable, given that we have done it, > and if we can, we hope to give back to the community. > > Best regards, > Anyang > > Till Rohrmann <trohrm...@apache.org> 于2019年9月9日周一 下午3:09写道: > >> Hi Anyang, >> >> I think we cannot take your proposal because this means that whenever we >> want to call notifyAllocationFailure when there is a connection problem >> between the RM and the JM, then we fail the whole cluster. This is >> something a robust and resilient system should not do because connection >> problems are expected and need to be handled gracefully. Instead if one >> deems the notifyAllocationFailure message to be very important, then one >> would need to keep it and tell the JM once it has connected back. >> >> Cheers, >> Till >> >> On Sun, Sep 8, 2019 at 11:26 AM Anyang Hu <huanyang1...@gmail.com> wrote: >> >>> Hi Peter, >>> >>> For our online batch task, there is a scene where the failed Container >>> reaches MAXIMUM_WORKERS_FAILURE_RATE but the client will not immediately >>> exit (the probability of JM loss is greatly improved when thousands of >>> Containers is to be started). It is found that the JM disconnection (the >>> reason for JM loss is unknown) will cause the notifyAllocationFailure not >>> to take effect. >>> >>> After the introduction of FLINK-13184 >>> <https://jira.apache.org/jira/browse/FLINK-13184> to start the >>> container with multi-threaded, the JM disconnection situation has been >>> alleviated. In order to stably implement the client immediate exit, we use >>> the following code to determine whether call onFatalError when >>> MaximumFailedTaskManagerExceedingException is occurd: >>> >>> @Override >>> public void notifyAllocationFailure(JobID jobId, AllocationID allocationId, >>> Exception cause) { >>> validateRunsInMainThread(); >>> >>> JobManagerRegistration jobManagerRegistration = >>> jobManagerRegistrations.get(jobId); >>> if (jobManagerRegistration != null) { >>> >>> jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId, >>> cause); >>> } else { >>> if (exitProcessOnJobManagerTimedout) { >>> ResourceManagerException exception = new >>> ResourceManagerException("Job Manager is lost, can not notify allocation >>> failure."); >>> onFatalError(exception); >>> } >>> } >>> } >>> >>> >>> Best regards, >>> >>> Anyang >>> >>>