Re: Jobmanager trying to be registered for Zombie Job

Matthias Pohl Mon, 25 Apr 2022 08:48:15 -0700

Thanks again, Peter for sharing your logs. I looked into the issue with the
help of Chesnay. Essentially, it's FLINK-27354 [1] that is causing this
issue. We couldn't come up with a reason why it should have popped up just
now with 1.15. The bug itself is already present in 1.14. You can find more
details on the investigation in FLINK-27354 [1] itself.
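
For readers who want a concrete picture of the behavior discussed in this
thread (the JobMaster trying to reconnect to the ResourceManager while it is
already shutting down), here is a minimal toy model in Python. It is only a
sketch and not Flink's code: the class JobMasterModel, the guard_enabled
switch and the simulate helper are hypothetical names made up for
illustration, and the guard is just one possible way to picture a fix.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JobMasterModel:
    """Toy stand-in for a JobMaster; hypothetical, not Flink's actual class."""

    job_id: str
    shutting_down: bool = False
    # Hypothetical switch: True models a JobMaster that refuses to reconnect
    # once shutdown has started; False models the behavior seen in the logs.
    guard_enabled: bool = False
    registration_attempts: List[str] = field(default_factory=list)

    def stop(self) -> None:
        # Shutdown has begun, but the endpoint can still process messages.
        self.shutting_down = True

    def on_disconnected_from_resource_manager(self) -> None:
        # The ResourceManager confirms the disconnect back to the JobMaster.
        if self.guard_enabled and self.shutting_down:
            # No need to re-establish the connection while stopping.
            return
        # Without the guard, the JobMaster tries to register again.
        self.registration_attempts.append(self.job_id)


def simulate(guard_enabled: bool) -> int:
    """Return how many re-registration attempts happen after stop()."""
    jm = JobMasterModel(
        job_id="5566648d9b1aac6c1a1b78187fd56975",
        guard_enabled=guard_enabled,
    )
    jm.stop()
    jm.on_disconnected_from_resource_manager()
    return len(jm.registration_attempts)
```

Calling simulate(False) models a stopping JobMaster that still attempts to
register, while simulate(True) models one that ignores the confirmation once
shutdown has started.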


Best,
Matthias

[1] https://issues.apache.org/jira/browse/FLINK-27354

On Mon, Apr 25, 2022 at 2:00 PM Matthias Pohl <matth...@ververica.com>
wrote:

> Thanks Peter, we're looking into it...
>
> On Mon, Apr 25, 2022 at 11:54 AM Peter Schrott <pe...@bluerootlabs.io>
> wrote:
>
>> Hi,
>>
>> sorry for the late reply. It took me quite some time to get the logs out
>> of the system. I have attached them now.
>>
>> These are the logs of 2 jobmanagers and 2 taskmanagers. On jm 1 you can see
>> the job crashing and recovering a few times. This happens
>> until 2022-04-20 12:12:14,607. After that, the behavior described above can
>> be seen.
>>
>> I hope this helps.
>>
>> Best, Peter
>>
>> On Fri, Apr 22, 2022 at 12:06 PM Matthias Pohl <matth...@ververica.com>
>> wrote:
>>
>>> FYI: I created FLINK-27354 [1] to cover the issue of retrying to connect
>>> to the RM while shutting down the JobMaster.
>>>
>>> This doesn't explain your issue though, Peter. It's still unclear why
>>> the JobMaster is still around as stated in my previous email.
>>>
>>> Matthias
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-27354
>>>
>>> On Fri, Apr 22, 2022 at 11:54 AM Matthias Pohl <matth...@ververica.com>
>>> wrote:
>>>
>>>> Just by looking through the code, it appears that these logs could be
>>>> produced while stopping the job. At the end, the ResourceManager sends a
>>>> confirmation of the JobMaster being disconnected back to the JobMaster. If
>>>> the JobMaster is still around to process the request, it would try to
>>>> reconnect (I'd consider that a bug because the JobMaster is in shutdown
>>>> mode already and wouldn't need to re-establish the connection). Otherwise,
>>>> if the JobMaster had already terminated, this method call would have been
>>>> swallowed.
>>>>
>>>> The only explanation I can come up with right now (without having any
>>>> logs) is that stopping the JobMaster didn't finish for some reason. For
>>>> that it would be helpful to look at the logs to see whether there is some
>>>> other issue that keeps the JobMaster from stopping entirely.
>>>>
>>>> On Fri, Apr 22, 2022 at 10:14 AM Matthias Pohl <matth...@ververica.com>
>>>> wrote:
>>>>
>>>>> ...if possible it would be good to get debug rather than only info
>>>>> logs. Did you encounter anything odd in the TaskManager logs as well?
>>>>> Sharing those might be of value, too.
>>>>>
>>>>> On Fri, Apr 22, 2022 at 8:57 AM Matthias Pohl <matth...@ververica.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Peter,
>>>>>> thanks for sharing. That doesn't sound right. Could you provide the
>>>>>> entire jobmanager logs?
>>>>>>
>>>>>> Best,
>>>>>> Matthias
>>>>>>
>>>>>> On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott <pe...@bluerootlabs.io>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Flink-Users,
>>>>>>>
>>>>>>> I am not sure if this does something to my cluster or not. But since
>>>>>>> updating to Flink 1.15 (atm rc4) I find the following logs:
>>>>>>>
>>>>>>> INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762@akka.tcp://fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 for job 5566648d9b1aac6c1a1b78187fd56975.
>>>>>>>
>>>>>>> as many times as the job's parallelism (here 10 times). These logs
>>>>>>> are triggered every 5 minutes.
>>>>>>>
>>>>>>> Then they are followed by:
>>>>>>>
>>>>>>> INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762@akka.tcp://fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 failed.
>>>>>>>
>>>>>>> also 10 log entries.
>>>>>>>
>>>>>>> I followed the lifetime of the job (5566648d9b1aac6c1a1b78187fd56975),
>>>>>>> it was a long-running SQL streaming job, started on Apr 13th on a
>>>>>>> standalone cluster. After some recovery attempts it finally failed for
>>>>>>> good (using the failover strategy) on Apr 20th (yesterday). Then those
>>>>>>> logs started to appear. Now there was no other job running on my cluster
>>>>>>> anymore but the logs appeared every 5 minutes until I restarted this
>>>>>>> jobmanager service.
>>>>>>>
>>>>>>> This job was just an example, it happens to other jobs too.
>>>>>>>
>>>>>>> It's just INFO logs but it does not look healthy either.
>>>>>>>
>>>>>>> Thanks & Best
>>>>>>> Peter
>>>>>>>
>>>>>>

-- 

Matthias Pohl | Engineer

