Hi Rainie,

there are two probable causes:
* Network instabilities
* A TaskManager died. In that case, dig into the TaskManager logs and look for errors right before that time.
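If it helps with the log digging, here is a minimal sketch of how one might filter a TaskManager log for WARN/ERROR lines shortly before the disconnect. It assumes the TaskManager log uses the same line layout as the JobManager excerpt in your mail (`yyyy-MM-dd HH:mm:ss,SSS LEVEL ...`); the function name `problems_before`, the 5-minute window, and the `org.example.Task` sample lines are made up for illustration and are not part of Flink.

```python
import re
from datetime import datetime, timedelta

# Assumed line layout, copied from the JobManager excerpt in this thread:
# "2021-02-20 13:20:14,110 WARN  some.logger.Name - message"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s+(\w+)\s+(.*)$"
)


def problems_before(lines, incident, window_minutes=5, levels=("WARN", "ERROR")):
    """Return (timestamp, level, rest) for lines at the given levels
    logged within `window_minutes` before `incident` (inclusive)."""
    start = incident - timedelta(minutes=window_minutes)
    hits = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # e.g. stack-trace continuation lines without a timestamp
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if start <= ts <= incident and match.group(2) in levels:
            hits.append((ts, match.group(2), match.group(3)))
    return hits


# Hypothetical sample lines; in practice, read them from the TaskManager log file.
SAMPLE = [
    "2021-02-20 13:10:00,000 INFO  org.example.Task - started",
    "2021-02-20 13:19:58,412 ERROR org.example.Task - something failed",
    "    at org.example.Task.run(Task.java:42)",
    "2021-02-20 13:20:14,110 WARN  akka.remote.ReliableDeliverySupervisor - Reason: [Disassociated]",
    "2021-02-20 13:25:00,000 ERROR org.example.Task - logged after the incident",
]

# Time of the "Connection reset by peer" warning in the JobManager log.
incident = datetime(2021, 2, 20, 13, 20, 14)
hits = problems_before(SAMPLE, incident)
for ts, level, rest in hits:
    print(ts, level, rest)
```

Lines without a leading timestamp are skipped, so you would still need to open the log at the reported times to read the full stack traces.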
In both cases, Flink should restart the job according to its restart strategy, if one is configured.

On Sat, Feb 20, 2021 at 10:07 PM Rainie Li <raini...@pinterest.com> wrote:

> Hello,
>
> I launched a job with a larger load on a Hadoop YARN cluster.
> The job finished after running for 5 hours. I didn't find any error in the
> JobManager log besides this connection exception.
>
> 2021-02-20 13:20:14,110 WARN akka.remote.transport.netty.NettyTransport
> - Remote connection to [/10.1.57.146:48368] failed with
> java.io.IOException: Connection reset by peer
> 2021-02-20 13:20:14,110 WARN akka.remote.ReliableDeliverySupervisor
> - Association with remote system [akka.tcp://flink-metrics@host:35241] has
> failed, address is now gated for [50] ms. Reason: [Disassociated]
> 2021-02-20 13:20:14,110 WARN akka.remote.ReliableDeliverySupervisor
> - Association with remote system [akka.tcp://flink@host:39493] has
> failed, address is now gated for [50] ms. Reason: [Disassociated]
> 2021-02-20 13:20:14,110 WARN akka.remote.ReliableDeliverySupervisor
> - Association with remote system [akka.tcp://flink-metrics@host:38481] has
> failed, address is now gated for [50] ms. Reason: [Disassociated]
>
> Any idea what caused the job to finish and how to resolve it?
> Any suggestions are appreciated.
>
> Thanks
> Best regards
> Rainie