> Is there a scenario where a task manager could fail but the number of
> registered task managers metric reported by the job manager is not updated?
The "common" case would be where you have configured a /really/ large
heartbeat timeout (heartbeat.timeout), such that Flink does not notice
that the TaskExecutor has in fact crashed and keeps it registered until
the timeout expires.
> Are there any known issues/recent bug fixes in this area that could
> possibly be related to this issue? We have since upgraded to Flink
> 1.11.3 and would like to know if this is a bug that might have been
> fixed in this version or a later version.
I don't believe we have made any changes in this regard.
> Are there any recommendations for detecting this scenario through
> monitoring?
If it is indeed the "common" case, then no. If it's some other issue
(say, the metrics reporter simply reporting incorrect values), then you
could periodically query the number of task managers through the REST
API and compare it against the reported metric.
https://ci.apache.org/projects/flink/flink-docs-release-1.11/monitoring/rest_api.html#taskmanagers
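A minimal sketch of such a periodic check, assuming the REST endpoint is reachable at http://localhost:8081 and that the /taskmanagers response carries a "taskmanagers" array as described in the docs linked above. The function names, the expected count of 3, and the 60 second polling interval are illustrative choices, not anything Flink prescribes. Note that in the heartbeat-timeout case the REST API would presumably show the same stale count, so this only helps when the reporter itself is at fault.

```python
import json
import time
import urllib.request
from typing import Optional


def count_task_managers(payload: dict) -> int:
    """Count the entries in the 'taskmanagers' array of a parsed response."""
    return len(payload.get("taskmanagers", []))


def fetch_task_manager_count(base_url: str, timeout_s: float = 5.0) -> int:
    """Query GET <base_url>/taskmanagers and return the number of entries."""
    url = base_url.rstrip("/") + "/taskmanagers"
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return count_task_managers(json.load(resp))


def describe_mismatch(expected: int, actual: int) -> Optional[str]:
    """Return a warning message if the counts differ, otherwise None."""
    if actual == expected:
        return None
    return f"expected {expected} task managers, REST API reports {actual}"


if __name__ == "__main__":
    BASE_URL = "http://localhost:8081"  # assumption: adjust to your deployment
    EXPECTED = 3                        # assumption: size of your cluster
    while True:
        try:
            warning = describe_mismatch(EXPECTED, fetch_task_manager_count(BASE_URL))
        except OSError as exc:  # covers connection errors and timeouts
            warning = f"REST API unreachable: {exc}"
        if warning:
            print(warning)
        time.sleep(60)  # assumption: poll once a minute
```

In practice you would feed the warning into whatever alerting you already have, and compare the REST count against the value your metrics system shows for numRegisteredTaskManagers rather than against a hard-coded number.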
On 27/08/2021 16:39, Conor McGovern wrote:
Hi,
A couple of months ago we observed a scenario in our Flink deployment
where the ‘numRegisteredTaskManagers’ job manager metric reported the
presence of 3 task managers, despite the fact that only 2 task
managers were active at the time, because one of the task managers had
crashed. We observed that, while the task manager was down, metrics
like the ‘TaskManager.Status.JVM.CPU.Load’ metric were no longer
reported for the task manager that went down. This situation where
‘numRegisteredTaskManagers’ reported an incorrect value lasted for
approximately 10 hours. The Flink version in question was 1.8.1.
Unfortunately, we are no longer in possession of job/task manager logs
for this issue. However, we would like to ask some general questions:
Is there a scenario where a task manager could fail but the number of
registered task managers metric reported by the job manager is not
updated?
Are there any known issues/recent bug fixes in this area that could
possibly be related to this issue? We have since upgraded to Flink
1.11.3 and would like to know if this is a bug that might have been
fixed in this version or a later version.
Are there any recommendations for detecting this scenario through
monitoring?
Thanks for your help.
Regards,
Conor