Suggestion 1 makes sense. For the quick termination, I think we need to
think a bit more about it to find a good solution that also supports strict
SLA requirements.
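
To make the "keep the notification and wait for the JM, but only up to a
timeout" idea a bit more concrete, here is a rough sketch. It is not Flink
code: PendingFailureBuffer and all of its methods are hypothetical names,
job IDs and failures are plain strings, and the current time is passed in
explicitly so that the example stays self-contained. How this would hook
into the ResourceManager is left open.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/**
 * Hypothetical helper (not part of Flink): keeps allocation-failure
 * notifications while the JobManager is disconnected, and reports which
 * jobs have waited longer than a configurable timeout.
 */
final class PendingFailureBuffer {
    private final long timeoutMillis;
    // jobId -> notifications that could not be delivered yet
    private final Map<String, List<String>> pending = new HashMap<>();
    // jobId -> time at which the first notification was buffered
    private final Map<String, Long> firstBufferedAt = new HashMap<>();

    PendingFailureBuffer(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called instead of dropping the notification when no JM is registered. */
    void buffer(String jobId, String failure, long nowMillis) {
        pending.computeIfAbsent(jobId, k -> new ArrayList<>()).add(failure);
        firstBufferedAt.putIfAbsent(jobId, nowMillis);
    }

    /** Called once the JM has reconnected: delivers buffered failures in order. */
    int onReconnect(String jobId, Consumer<String> gateway) {
        List<String> failures = pending.remove(jobId);
        firstBufferedAt.remove(jobId);
        if (failures == null) {
            return 0;
        }
        failures.forEach(gateway);
        return failures.size();
    }

    /** Periodic check: jobs whose JM did not come back within the timeout. */
    List<String> expiredJobs(long nowMillis) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> entry : firstBufferedAt.entrySet()) {
            if (nowMillis - entry.getValue() >= timeoutMillis) {
                expired.add(entry.getKey());
            }
        }
        return expired;
    }
}
```

With something along these lines, a short connection problem would only
delay the notification, and only the jobs returned by expiredJobs would
need the fail-fast path, so jobs with strict SLAs could simply configure a
small timeout.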


On Wed, Sep 11, 2019 at 11:11 AM Anyang Hu <> wrote:

> Hi Till,
> Some of our online batch tasks have strict SLA requirements and must not
> be stuck for a long time, so we took a crude approach and made the job exit
> immediately. Waiting for the connection to recover is a better solution.
> Maybe we need to add a timeout for waiting for the JM to restore the
> connection?
> As for suggestion 1, making the interval configurable: we have already
> implemented it and, if possible, would like to contribute it back to the
> community.
> Best regards,
> Anyang
> On Mon, Sep 9, 2019 at 3:09 PM Till Rohrmann <> wrote:
>> Hi Anyang,
>> I think we cannot take your proposal, because it means that whenever we
>> want to call notifyAllocationFailure while there is a connection problem
>> between the RM and the JM, we fail the whole cluster. This is
>> something a robust and resilient system should not do, because connection
>> problems are expected and need to be handled gracefully. Instead, if one
>> deems the notifyAllocationFailure message to be very important, then one
>> would need to keep it and tell the JM once it has reconnected.
>> Cheers,
>> Till
>> On Sun, Sep 8, 2019 at 11:26 AM Anyang Hu <> wrote:
>>> Hi Peter,
>>> For our online batch tasks, there is a scenario where the failed
>>> containers reach MAXIMUM_WORKERS_FAILURE_RATE but the client does not exit
>>> immediately (the probability of JM loss increases greatly when thousands
>>> of containers are to be started). We found that a JM disconnection (the
>>> reason for the JM loss is unknown) causes notifyAllocationFailure not
>>> to take effect.
>>> After the introduction of FLINK-13184
>>> <> to start
>>> containers with multiple threads, the JM disconnection problem has been
>>> alleviated. To make the client exit immediately in a stable way, we use
>>> the following code to decide whether to call onFatalError when a
>>> MaximumFailedTaskManagerExceedingException occurs:
>>> @Override
>>> public void notifyAllocationFailure(JobID jobId, AllocationID allocationId, Exception cause) {
>>>    validateRunsInMainThread();
>>>    JobManagerRegistration jobManagerRegistration = jobManagerRegistrations.get(jobId);
>>>    if (jobManagerRegistration != null) {
>>>       jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId, cause);
>>>    } else {
>>>       if (exitProcessOnJobManagerTimedout) {
>>>          ResourceManagerException exception = new ResourceManagerException(
>>>             "Job Manager is lost, can not notify allocation failure.");
>>>          onFatalError(exception);
>>>       }
>>>    }
>>> }
>>> Best regards,
>>> Anyang
