Hi,
A couple of months ago we observed a scenario in our Flink deployment where the job manager metric ‘numRegisteredTaskManagers’ reported 3 task managers, even though only 2 were active at the time because one of them had crashed. While that task manager was down, its own metrics, such as ‘TaskManager.Status.JVM.CPU.Load’, were no longer reported. ‘numRegisteredTaskManagers’ continued to report the incorrect value for approximately 10 hours.

The Flink version in question was 1.8.1. Unfortunately, we no longer have the job manager or task manager logs for this issue. However, we would like to ask some general questions:

1. Is there a scenario where a task manager could fail but the number of registered task managers metric reported by the job manager is not updated?
2. Are there any known issues or recent bug fixes in this area that could be related to this issue? We have since upgraded to Flink 1.11.3 and would like to know if this is a bug that might have been fixed in this version or a later version.
3. Are there any recommendations for detecting this scenario through monitoring?

Thanks for your help.

Regards,
Conor