Hi Garrett,

I agree: there seems to be an issue here, and increasing the timeout is not the right way to solve it. Are you running streaming or batch jobs, i.e., do some of the tasks finish much earlier than others?
I'm adding Till to this thread, who is very familiar with scheduling and process communication.

Best, Fabian

2018-06-19 0:03 GMT+02:00 Garrett Barton <garrett.bar...@gmail.com>:
> Hey all,
>
> My jobs that I am trying to write in Flink 1.5 are failing after a few
> minutes. I think it's because the idle task managers are shutting down,
> which seems to kill the client and the running job. The running job itself
> was still going on one of the other task managers. I get:
>
> org.apache.flink.client.program.ProgramInvocationException:
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
> Connection unexpectedly closed by remote task manager 'xxxx'. This might
> indicate that the remote task manager was lost.
>     at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:143)
>
> Now I happen to have the last part of the flow paralleled to 1 right now
> for debugging, so of the 4 task managers that are spun up, 3 of them hit the
> timeout period (currently set to 240000). I think as soon as the first one
> goes, the client throws up and the whole job dies as a result.
>
> Is this expected behavior, and if so, is there another way around it? Do I
> keep increasing the slotmanager.taskmanager-timeout to a really, really
> large number? I have verified that setting the timeout to 840000 lets the
> job complete without error.
>
> Thank you!
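For anyone following along: the setting Garrett refers to lives in `conf/flink-conf.yaml`, and the value is in milliseconds. A sketch of the workaround he describes (raising the idle-TaskManager timeout), not a recommended fix; as discussed above, the underlying behavior should be investigated instead:

```yaml
# flink-conf.yaml
# How long the ResourceManager keeps an idle TaskManager around before
# releasing it. Garrett's default-ish value of 240000 (4 min) caused the
# failure; 840000 (14 min) let the job finish. Milliseconds.
slotmanager.taskmanager-timeout: 840000
```

Note that this only papers over the problem: a released idle TaskManager should not take down a running job, so bumping the timeout just delays the shutdown rather than fixing the client/job failure.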