Thanks again, Peter for sharing your logs. I looked into the issue with the help of Chesnay. Essentially, it's FLINK-27354 [1] that is causing this issue. We couldn't come up with a reason why it should have popped up just now with 1.15. The bug itself is already present in 1.14. You can find more details on the investigation in FLINK-27354 [1] itself.
Best, Matthias [1] https://issues.apache.org/jira/browse/FLINK-27354 On Mon, Apr 25, 2022 at 2:00 PM Matthias Pohl <matth...@ververica.com> wrote: > Thanks Peter, we're looking into it... > > On Mon, Apr 25, 2022 at 11:54 AM Peter Schrott <pe...@bluerootlabs.io> > wrote: > >> Hi, >> >> sorry for the late reply. It took me quite some time to get the logs out >> of the system. I have attached them now. >> >> Its logs of 2 jobmanagers and 2 taskamangers. It can be seen on jm 1 that >> the job starts crashing and recovering a few times. This happens >> until 2022-04-20 12:12:14,607. After that the above described behavior can >> be seen. >> >> I hope this helps. >> >> Best, Peter >> >> On Fri, Apr 22, 2022 at 12:06 PM Matthias Pohl <matth...@ververica.com> >> wrote: >> >>> FYI: I created FLINK-27354 [1] to cover the issue of retrying to connect >>> to the RM while shutting down the JobMaster. >>> >>> This doesn't explain your issue though, Peter. It's still unclear why >>> the JobMaster is still around as stated in my previous email. >>> >>> Matthias >>> >>> [1] https://issues.apache.org/jira/browse/FLINK-27354 >>> >>> On Fri, Apr 22, 2022 at 11:54 AM Matthias Pohl <matth...@ververica.com> >>> wrote: >>> >>>> Just by looking through the code, it appears that these logs could be >>>> produced while stopping the job. The ResourceManager sends a confirmation >>>> of the JobMaster being disconnected at the end back to the JobMaster. If >>>> the JobMaster is still around to process the request, it would try to >>>> reconnect (I'd consider that a bug because the JobMaster is in shutdown >>>> mode already and wouldn't need to re-establish the connection). This method >>>> would have been swallowed otherwise if the JobMaster was already >>>> terminated. >>>> >>>> The only explanation I can come up with right now (without having any >>>> logs) is that stopping the JobMaster didn't finish for some reason. For >>>> that it would be helpful to look at the logs to see whether there is some >>>> other issue that causes the JobMaster to stop entirely. >>>> >>>> On Fri, Apr 22, 2022 at 10:14 AM Matthias Pohl <matth...@ververica.com> >>>> wrote: >>>> >>>>> ...if possible it would be good to get debug rather than only info >>>>> logs. Did you encounter anything odd in the TaskManager logs as well. >>>>> Sharing those might be of value as well. >>>>> >>>>> On Fri, Apr 22, 2022 at 8:57 AM Matthias Pohl <matth...@ververica.com> >>>>> wrote: >>>>> >>>>>> Hi Peter, >>>>>> thanks for sharing. That doesn't sound right. May you provide the >>>>>> entire jobmanager logs? >>>>>> >>>>>> Best, >>>>>> Matthias >>>>>> >>>>>> On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott <pe...@bluerootlabs.io> >>>>>> wrote: >>>>>> >>>>>>> Hi Flink-Users, >>>>>>> >>>>>>> I am not sure if this does something to my cluster or not. But since >>>>>>> updating to Flink 1.15 (atm rc4) I find the following logs: >>>>>>> >>>>>>> INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762 >>>>>>> @akka.tcp:// >>>>>>> fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 for job >>>>>>> 5566648d9b1aac6c1a1b78187fd56975. >>>>>>> >>>>>>> as many times as number of parallelisms (here 10 times). These logs >>>>>>> are triggered every 5 minutes. >>>>>>> >>>>>>> Then they are followed by: >>>>>>> >>>>>>> INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762 >>>>>>> @akka.tcp:// >>>>>>> fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 failed. >>>>>>> >>>>>>> also 10 log entries. >>>>>>> >>>>>>> I followed the lifetime of the job (5566648d9b1aac6c1a1b78187fd56975), >>>>>>> it was a long-running sql streaming job, started on Apr 13th on a >>>>>>> standalone cluster. After some recovery attempts it finally failed >>>>>>> (using >>>>>>> the failover strategy) on the 20th Apr (yesterday) for good. Then those >>>>>>> logs started to appear. Now there was no other job running on my cluster >>>>>>> anymore but the logs appeared every 5 minutes until I restarted this >>>>>>> jobmanager service. >>>>>>> >>>>>>> This job was just an example, it happens to other jobs too. >>>>>>> >>>>>>> It's just INFO logs but it does not look healthy either. >>>>>>> >>>>>>> Thanks & Best >>>>>>> Peter >>>>>>> >>>>>> -- Matthias Pohl | Engineer Follow us @VervericaData Ververica <https://www.ververica.com/> -- Join Flink Forward <https://flink-forward.org/> - The Apache Flink Conference Stream Processing | Event Driven | Real Time -- Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany -- Ververica GmbH Registered at Amtsgericht Charlottenburg: HRB 158244 B Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton Wehner