Are you able to reproduce this scenario? Did you accidentally send a kill signal to the job manager process?
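
In case it helps with narrowing this down, here is a minimal sketch (not a Flink tool) that pulls every "RECEIVED SIGNAL" entry out of a job manager log so the timestamps can be compared with the time the kill command was run on the task manager. The function name `find_signals` is hypothetical, and the regular expression only assumes the log line format quoted further down in this thread.

```python
import re

# Assumption: log lines look like the one quoted in this thread, e.g.
# "2022-10-11 22:11:21,683 INFO ...ClusterEntrypoint [] - RECEIVED SIGNAL 15: SIGTERM. ..."
SIGNAL_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*?"
    r"RECEIVED\s+SIGNAL\s+(?P<num>\d+):\s+(?P<name>SIG[A-Z0-9]+)"
)


def find_signals(log_text):
    """Return (timestamp, signal number, signal name) for each signal entry."""
    return [
        (m.group("ts"), int(m.group("num")), m.group("name"))
        for m in SIGNAL_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    import sys

    # Usage: python find_signals.py <path-to-jobmanager-log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for ts, num, name in find_signals(fh.read()):
            print(f"{ts}  signal {num} ({name})")
```

In the log pasted below, the task manager connection drops at 22:11:02 and the job manager logs SIGTERM at 22:11:21, so it is worth checking what was run on that host in between.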

On Thu, 13 Oct 2022 at 4:02 PM, Puneet Duggal <puneetduggal1...@gmail.com> wrote:

> Hi,
>
> We use session deployment mode with HA setup. Currently we have 3 job
> managers and 3 task managers running on flink version 1.12.1. Please find
> attached the complete job manager logs.
>
> On 13-Oct-2022, at 7:28 AM, Xintong Song <tonysong...@gmail.com> wrote:
>
> I meant your jobmanager also received a SIGTERM signal, and you would need
> to figure out where it comes from.
>
> To be specific, this line of log:
>
>> 2022-10-11 22:11:21,683 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
>
> I believe this is from the jobmanager log, as `ClusterEntrypoint` is a
> class used by jobmanager only.
>
> Best,
> Xintong
>
> On Thu, Oct 13, 2022 at 9:06 AM yu'an huang <h.yuan...@gmail.com> wrote:
>
>> Hi,
>>
>> Which deployment mode do you use? What is the Flink version?
>> I think killing TaskManagers won't make the JobMananger restart. You can
>> provide the whole log as an attachment to investigate.
>>
>> On Wed, 12 Oct 2022 at 6:01 PM, Puneet Duggal <puneetduggal1...@gmail.com>
>> wrote:
>>
>>> Hi Xintong Song,
>>>
>>> Thanks for your immediate reply. Yes, I do restart task manager via kill
>>> command and then flink restart because I have seen cases where simple flink
>>> restart does not pickup the latest configuration. But what I am confused
>>> about is why killing the task manager process and then restarting it is
>>> causing the job manager to stop and restart.
>>>
>>> Regards,
>>> Puneet
>>>
>>> On 12-Oct-2022, at 7:33 AM, Xintong Song <tonysong...@gmail.com> wrote:
>>>
>>> The log shows that the jobmanager received a SIGTERM signal from
>>> external. Depending on how you deploy Flink, that could be a 'kill <PID>'
>>> command, or a kubernetes pod removal / eviction, etc. You may want to check
>>> where the signal came from.
>>>
>>> Best,
>>> Xintong
>>>
>>> On Wed, Oct 12, 2022 at 6:26 AM Puneet Duggal <
>>> puneetduggal1...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am facing an issue where when restarting task manager after adding
>>>> some configuration changes, even though task manager restarts successfully
>>>> with the updated configuration change, is causing the leader job manager to
>>>> restart as well. Pasting the leader job manager logs here
>>>>
>>>> 2022-10-11 22:11:02,207 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@<TM-IP>:35376] has failed, address is now gated for [50] ms. Reason: [Disassociated]
>>>> 2022-10-11 22:11:02,411 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /<TM-IP>:35376
>>>> 2022-10-11 22:11:02,413 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@<TM-IP>:35376] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@<TM-IP>:35376]] Caused by: [java.net.ConnectException: Connection refused: /<TM-IP>:35376]
>>>> 2022-10-11 22:11:02,682 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /<TM-IP>:35376
>>>> 2022-10-11 22:11:02,683 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@<TM-IP>:35376] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@<TM-IP>:35376]] Caused by: [java.net.ConnectException: Connection refused: /<TM-IP>:35376]
>>>> 2022-10-11 22:11:12,702 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /<TM-IP>:35376
>>>> 2022-10-11 22:11:12,703 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@<TM-IP>:35376] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@<TM-IP>:35376]] Caused by: [java.net.ConnectException: Connection refused: /<TM-IP>:35376]
>>>> 2022-10-11 22:11:21,683 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
>>>> 2022-10-11 22:11:21,687 INFO org.apache.flink.runtime.blob.BlobServer [] - Stopped BLOB server at 0.0.0.0:33887
>>>>
>>>> Regards,
>>>> Puneet