Re: Job Manager metric 'numRegisteredTaskManagers' reporting wrong value

Chesnay Schepler Fri, 27 Aug 2021 08:21:37 -0700

> Is there a scenario where a task manager could fail but the number of registered task managers metric reported by the job manager is not updated?

The "common" case would be where you have configured a /really/ large heartbeat timeout, such that Flink does not notice that the TaskExecutor has in fact crashed.
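To check whether this case applies, one option is to look at the heartbeat timeout the cluster is actually running with. The following is only a minimal sketch, assuming the JobManager REST endpoint is reachable at a base URL such as http://localhost:8081 (a placeholder) and that `/jobmanager/config` returns a list of key/value entries. The helper names are made up for illustration. If `heartbeat.timeout` was never set explicitly, the key may simply be absent from the response, in which case the default applies.

```python
import json
from typing import Optional
from urllib.request import urlopen

HEARTBEAT_TIMEOUT_KEY = "heartbeat.timeout"


def find_heartbeat_timeout_ms(config_entries: list) -> Optional[int]:
    """Return the configured heartbeat timeout in milliseconds, or None.

    `config_entries` is assumed to be a list of {"key": ..., "value": ...}
    objects, as returned by the /jobmanager/config endpoint.
    """
    for entry in config_entries:
        if entry.get("key") == HEARTBEAT_TIMEOUT_KEY:
            try:
                return int(entry.get("value"))
            except (TypeError, ValueError):
                # Value is missing or not a plain integer; treat as unknown.
                return None
    return None


def fetch_heartbeat_timeout_ms(base_url: str, timeout_s: float = 5.0) -> Optional[int]:
    """Query the JobManager (base_url is a placeholder) for its heartbeat timeout."""
    with urlopen(f"{base_url}/jobmanager/config", timeout=timeout_s) as resp:
        return find_heartbeat_timeout_ms(json.load(resp))
```

A value of many hours here would be consistent with a crashed TaskExecutor remaining registered for a long time.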

> Are there any known issues/recent bug fixes in this area that could possibly be related to this issue? We have since upgraded to Flink 1.11.3 and would like to know if this is a bug that might have been fixed in this version or a later version.


I don't believe we have made any changes in this regard.

> Are there any recommendations for detecting this scenario through monitoring?

If it is indeed the "common" case, then no. If it's some other issue (say, the reporter just reporting incorrect values) then it would be possible to periodically query the number of task managers through the REST API.
https://ci.apache.org/projects/flink/flink-docs-release-1.11/monitoring/rest_api.html#taskmanagers
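Such a periodic check could look roughly like the sketch below. It assumes the REST endpoint is reachable at a placeholder base URL (for example http://localhost:8081) and that `/taskmanagers` returns a JSON object with a `taskmanagers` array. The function names and the way the metric value is passed in are illustrative only; how you obtain the reported metric depends on your metrics reporter.

```python
import json
from typing import Optional
from urllib.request import urlopen


def count_task_managers(payload: dict) -> int:
    """Count the entries in a /taskmanagers response body."""
    return len(payload.get("taskmanagers", []))


def fetch_task_manager_count(base_url: str, timeout_s: float = 5.0) -> int:
    """Ask the JobManager REST API (base_url is a placeholder) for its task managers."""
    with urlopen(f"{base_url}/taskmanagers", timeout=timeout_s) as resp:
        return count_task_managers(json.load(resp))


def describe_mismatch(metric_value: int, rest_value: int) -> Optional[str]:
    """Return a warning string if the reported metric and the REST count differ."""
    if metric_value == rest_value:
        return None
    return (
        f"numRegisteredTaskManagers reports {metric_value}, "
        f"but the REST API lists {rest_value} task managers"
    )
```

Note that this only helps if the reporter is at fault. With a very large heartbeat timeout, the REST API would presumably list the crashed task manager as well, since both values come from the same registration state.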


On 27/08/2021 16:39, Conor McGovern wrote:

Hi,
A couple of months ago we observed a scenario in our Flink deployment where the ‘numRegisteredTaskManagers’ job manager metric reported the presence of 3 task managers, despite the fact that only 2 task managers were active at the time, because one of the task managers had crashed. We observed that, while the task manager was down, metrics like the ‘TaskManager.Status.JVM.CPU.Load’ metric were no longer reported for the task manager that went down. This situation where ‘numRegisteredTaskManagers’ reported an incorrect value lasted for approx 10 hours. The Flink version in question was 1.8.1.
Unfortunately, we are no longer in possession of job/task manager logs for this issue. However, we would like to ask some general questions:
Is there a scenario where a task manager could fail but the number of registered task managers metric reported by the job manager is not updated?
Are there any known issues/recent bug fixes in this area that could possibly be related to this issue? We have since upgraded to Flink 1.11.3 and would like to know if this is a bug that might have been fixed in this version or a later version.
Are there any recommendations for detecting this scenario through monitoring?
Thanks for your help.

Regards,

Conor
