I see, I will check the TaskManager log.
Thank you, Arvid.

Best regards
Rainie

On Wed, Feb 24, 2021 at 5:27 AM Arvid Heise <ar...@apache.org> wrote:

> Hi Rainie,
>
> there are two probable causes:
> * Network instabilities
> * The TaskManager died; in that case you can dig further into the
>   TaskManager logs for errors right before that time.
>
> In both cases, Flink should restart the job with the correct restart
> policies if configured.
>
> On Sat, Feb 20, 2021 at 10:07 PM Rainie Li <raini...@pinterest.com> wrote:
>
>> Hello,
>>
>> I launched a job with a larger load on a Hadoop YARN cluster.
>> The job finished after running for 5 hours. I didn't find any error in
>> the JobManager log besides this connection exception.
>>
>> 2021-02-20 13:20:14,110 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [/10.1.57.146:48368] failed with java.io.IOException: Connection reset by peer
>> 2021-02-20 13:20:14,110 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink-metrics@host:35241] has failed, address is now gated for [50] ms. Reason: [Disassociated]
>> 2021-02-20 13:20:14,110 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host:39493] has failed, address is now gated for [50] ms. Reason: [Disassociated]
>> 2021-02-20 13:20:14,110 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink-metrics@host:38481] has failed, address is now gated for [50] ms. Reason: [Disassociated]
>>
>> Any idea what caused the job to finish and how to resolve it?
>> Any suggestions are appreciated.
>>
>> Thanks
>> Best regards
>> Rainie
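
For the suggestion to look in the TaskManager logs for errors right before the disconnect, a small filter script may help. The following is only a sketch: it assumes the TaskManager log uses the same timestamp layout as the JobManager excerpt in this thread (`yyyy-MM-dd HH:mm:ss,SSS` followed by the level), and the function name `entries_before` and the five-minute window are hypothetical choices, not anything from Flink.

```python
from datetime import datetime, timedelta

# Assumed layout, taken from the excerpt in this thread: "2021-02-20 13:20:14,110 WARN ..."
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def entries_before(lines, cutoff, window=timedelta(minutes=5),
                   levels=("WARN", "ERROR")):
    """Return log lines at the given levels logged within `window` before `cutoff`."""
    hits = []
    for line in lines:
        parts = line.split(None, 3)  # date, time, level, rest of the line
        if len(parts) < 3:
            continue
        try:
            ts = datetime.strptime(parts[0] + " " + parts[1], TS_FORMAT)
        except ValueError:
            # No leading timestamp, e.g. a stack-trace continuation line.
            continue
        if parts[2] in levels and cutoff - window <= ts <= cutoff:
            hits.append(line.rstrip("\n"))
    return hits
```

It could be called as `entries_before(open("taskmanager.log"), datetime(2021, 2, 20, 13, 20, 14, 110000))`, where `taskmanager.log` is a placeholder for the actual log file path and the cutoff is the time of the first "Connection reset by peer" warning. Stack-trace lines are skipped in this sketch, so the matching entries still need to be read in context in the full log.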