Hi,


A couple of months ago we observed a scenario in our Flink deployment where the 
‘numRegisteredTaskManagers’ job manager metric reported 3 task managers even 
though only 2 were active at the time, because one of them had crashed. While 
that task manager was down, its own metrics, such as 
‘TaskManager.Status.JVM.CPU.Load’, were no longer reported. 
‘numRegisteredTaskManagers’ continued to report the incorrect value for 
approximately 10 hours. The Flink version in question was 1.8.1.



Unfortunately, we no longer have the job manager or task manager logs for this 
issue. However, we would like to ask some general questions:

Is there a scenario where a task manager could fail but the number of 
registered task managers metric reported by the job manager is not updated?

Are there any known issues/recent bug fixes in this area that could possibly be 
related to this issue? We have since upgraded to Flink 1.11.3 and would like to 
know if this is a bug that might have been fixed in this version or a later 
version.

Are there any recommendations for detecting this scenario through monitoring?
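
To make the last question concrete, the kind of check we have in mind is 
sketched below in Python. It is only an illustration under our own assumptions: 
the function names, the input shapes (a map from task manager ID to the 
timestamp of its last reported metric) and the 120-second staleness threshold 
are hypothetical and not part of Flink. How the values are fetched from the 
metrics backend is deliberately left out.

```python
from typing import Dict, List


def find_stale_task_managers(
    last_report_ts: Dict[str, float],
    now: float,
    max_age_seconds: float = 120.0,  # hypothetical threshold, tune per reporter interval
) -> List[str]:
    """Return the IDs of task managers whose last metric report is too old."""
    return sorted(
        tm_id
        for tm_id, ts in last_report_ts.items()
        if now - ts > max_age_seconds
    )


def registered_count_mismatch(
    num_registered: int,
    last_report_ts: Dict[str, float],
    now: float,
    max_age_seconds: float = 120.0,
) -> bool:
    """True if the registered count differs from the number of task managers
    that are still reporting metrics."""
    stale = find_stale_task_managers(last_report_ts, now, max_age_seconds)
    live = len(last_report_ts) - len(stale)
    return num_registered != live


# Example mirroring our incident: 3 registered, but tm-3 stopped reporting.
reports = {"tm-1": 1000.0, "tm-2": 1000.0, "tm-3": 100.0}
mismatch = registered_count_mismatch(3, reports, now=1010.0)
```

In this example ‘mismatch’ is true, because only 2 task managers have reported 
recently while the registered count is 3. We would be interested to hear whether 
a check along these lines is reasonable or whether there is a better signal.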



Thanks for your help.



Regards,

Conor
