> Is there a scenario where a task manager could fail but the number of registered task managers metric reported by the job manager is not updated?

The "common" case would be one where you have configured a /really/ large heartbeat timeout (heartbeat.timeout), so that Flink does not notice that the TaskExecutor has in fact crashed.

> Are there any known issues/recent bug fixes in this area that could possibly be related to this issue? We have since upgraded to Flink 1.11.3 and would like to know if this is a bug that might have been fixed in this version or a later version.

I don't believe we have made any changes in this regard.

> Are there any recommendations for detecting this scenario through monitoring?

If it is indeed the "common" case, then no. If it is some other issue (say, the metric reporter simply reporting incorrect values), then you could periodically query the number of task managers through the REST API and compare it against what you expect:
https://ci.apache.org/projects/flink/flink-docs-release-1.11/monitoring/rest_api.html#taskmanagers
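A rough sketch of such a check, in Python. It assumes the JobManager's REST endpoint is reachable at http://localhost:8081 and that the /taskmanagers response is a JSON object with a top-level "taskmanagers" array. The function names and the expected count are made up for illustration. Note that this only helps if the reporter is at fault; with an overly large heartbeat timeout the REST API would presumably show the same stale count.

```python
import json
import urllib.request


def fetch_taskmanagers(base_url, timeout=10.0):
    """GET <base_url>/taskmanagers and return the decoded JSON payload."""
    url = base_url.rstrip("/") + "/taskmanagers"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def count_taskmanagers(payload):
    """Count the entries in the (assumed) top-level "taskmanagers" array."""
    return len(payload.get("taskmanagers", []))


def check_count(payload, expected):
    """Return a warning string if the count differs from `expected`, else None."""
    actual = count_taskmanagers(payload)
    if actual != expected:
        return "expected %d task managers, REST API reports %d" % (expected, actual)
    return None


# Example use (hypothetical address and expected count), e.g. from a cron job:
#   warning = check_count(fetch_taskmanagers("http://localhost:8081"), 3)
#   if warning:
#       print(warning)
```

Comparing this number with the numRegisteredTaskManagers value that arrives in your metrics system should tell you whether the reporter or the JobManager itself is off.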

On 27/08/2021 16:39, Conor McGovern wrote:

Hi,

A couple of months ago we observed a scenario in our Flink deployment where the ‘numRegisteredTaskManagers’ job manager metric reported the presence of 3 task managers, despite the fact that only 2 task managers were active at the time, because one of the task managers had crashed. We observed that, while the task manager was down, metrics such as ‘TaskManager.Status.JVM.CPU.Load’ were no longer reported for it. This situation, in which ‘numRegisteredTaskManagers’ reported an incorrect value, lasted for approximately 10 hours. The Flink version in question was 1.8.1.

Unfortunately, we are no longer in possession of job/task manager logs for this issue. However, we would like to ask some general questions:

Is there a scenario where a task manager could fail but the number of registered task managers metric reported by the job manager is not updated?

Are there any known issues/recent bug fixes in this area that could possibly be related to this issue? We have since upgraded to Flink 1.11.3 and would like to know if this is a bug that might have been fixed in this version or a later version.

Are there any recommendations for detecting this scenario through monitoring?

Thanks for your help.

Regards,

Conor

