Re: suggestion of FLINK-10868

Till Rohrmann Mon, 09 Sep 2019 00:09:37 -0700

Hi Anyang,

I don't think we can accept your proposal. It would mean that whenever we
need to call notifyAllocationFailure while the connection between the RM
and the JM is down, we fail the whole cluster. A robust and resilient
system should not do that, because connection problems are expected and
need to be handled gracefully. Instead, if one deems the
notifyAllocationFailure message to be very important, one would need to
keep it and tell the JM once it has reconnected.
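
For illustration, the alternative described above (keep the notification and
deliver it once the JM has reconnected) might be sketched roughly as follows.
This is a simplified, self-contained model and not Flink's ResourceManager
code: BufferingFailureNotifier, onJobManagerRegistered, onJobManagerLost and
pendingCount are hypothetical names, the JobManagerGateway interface here is
a stand-in, and plain Strings replace JobID and AllocationID.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical stand-in for the JM gateway; not Flink's real interface. */
interface JobManagerGateway {
    void notifyAllocationFailure(String allocationId, Exception cause);
}

/** Buffers allocation failures while no JM is registered and replays them on reconnect. */
class BufferingFailureNotifier {

    /** One allocation failure that could not be delivered yet. */
    private static final class Pending {
        final String allocationId;
        final Exception cause;

        Pending(String allocationId, Exception cause) {
            this.allocationId = allocationId;
            this.cause = cause;
        }
    }

    private final Map<String, JobManagerGateway> registrations = new HashMap<>();
    private final Map<String, Queue<Pending>> pending = new HashMap<>();

    /** Delivers the failure if the JM is connected, otherwise keeps it for later. */
    void notifyAllocationFailure(String jobId, String allocationId, Exception cause) {
        JobManagerGateway gateway = registrations.get(jobId);
        if (gateway != null) {
            gateway.notifyAllocationFailure(allocationId, cause);
        } else {
            // Assumption: an unbounded queue is acceptable for this sketch.
            pending.computeIfAbsent(jobId, k -> new ArrayDeque<>())
                   .add(new Pending(allocationId, cause));
        }
    }

    /** Called when a JM (re)connects: registers it and replays buffered failures in order. */
    void onJobManagerRegistered(String jobId, JobManagerGateway gateway) {
        registrations.put(jobId, gateway);
        Queue<Pending> queued = pending.remove(jobId);
        if (queued != null) {
            for (Pending p : queued) {
                gateway.notifyAllocationFailure(p.allocationId, p.cause);
            }
        }
    }

    /** Called when the connection to the JM is lost. */
    void onJobManagerLost(String jobId) {
        registrations.remove(jobId);
    }

    /** Number of failures currently waiting for the given job's JM. */
    int pendingCount(String jobId) {
        Queue<Pending> queued = pending.get(jobId);
        return queued == null ? 0 : queued.size();
    }
}
```

A real implementation would presumably also need to bound or expire the
buffered notifications for jobs whose JM never comes back; the sketch leaves
that out.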


Cheers,
Till

On Sun, Sep 8, 2019 at 11:26 AM Anyang Hu <huanyang1...@gmail.com> wrote:

> Hi Peter,
>
> In our online batch jobs, there is a scenario where the container failure
> rate reaches MAXIMUM_WORKERS_FAILURE_RATE but the client does not exit
> immediately (JM loss becomes much more likely when thousands of
> containers have to be started). We found that the JM disconnection (the
> reason for the JM loss is unknown) causes notifyAllocationFailure to have
> no effect.
>
> After FLINK-13184
> <https://jira.apache.org/jira/browse/FLINK-13184> introduced multi-threaded
> container startup, the JM disconnection problem has been alleviated. To
> make the client exit immediately in a reliable way, we use the following
> code to decide whether to call onFatalError when a
> MaximumFailedTaskManagerExceedingException occurs:
>
> @Override
> public void notifyAllocationFailure(JobID jobId, AllocationID allocationId, Exception cause) {
>    validateRunsInMainThread();
>
>    JobManagerRegistration jobManagerRegistration = jobManagerRegistrations.get(jobId);
>    if (jobManagerRegistration != null) {
>       jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId, cause);
>    } else {
>       if (exitProcessOnJobManagerTimedout) {
>          ResourceManagerException exception = new ResourceManagerException(
>             "Job Manager is lost, can not notify allocation failure.");
>          onFatalError(exception);
>       }
>    }
> }
>
>
> Best regards,
>
> Anyang
>
>
