Hi,

Thanks for the extra information. So, there seem to be two separate issues here. I'll go through them one by one.
I was also checking our Grafana and the metric we were using was "flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max", actually. "flink_taskmanager_job_task_operator_records_lag_max" seems to be new (with the attempt thingy).

After looking at the code changes in FLINK-8419, this unfortunately is an accidental "break" in the scope of the metric. In 1.4.0, the Kafka-shipped metrics were exposed under the "KafkaConsumer" metrics group. After FLINK-8419, this was changed, as you observed. In 1.5.0, however, I think the metrics are exposed under both patterns.

Now, about the fact that some subtasks are returning -Inf for 'records-lag-max': if I understood the metric semantics correctly, this metric represents the "max record lag across **partitions subscribed by a Kafka consumer client**". So, the only causes I could think of are that the subtask does not have any partitions assigned to it, or simply that there is a bug in the Kafka client when returning this value.

Could you verify that all subtasks have a partition assigned to them? That should be possible by just checking the job status in the Web UI and observing the numRecordsOut value for each source subtask.
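In case it helps with that check, here is a rough sketch of how the per-subtask values could be pulled from the Prometheus HTTP API. The server address, the post-FLINK-8419 metric name and the label names task_name / subtask_index are assumptions on my side, so please adjust them to your setup:

import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://localhost:9090"  # assumed Prometheus address; adjust to your setup
METRIC = "flink_taskmanager_job_task_operator_records_lag_max"

# Instant query against the Prometheus HTTP API (/api/v1/query).
params = urllib.parse.urlencode({"query": METRIC})
with urllib.request.urlopen(f"{PROMETHEUS}/api/v1/query?{params}") as resp:
    result = json.load(resp)["data"]["result"]

for series in result:
    labels = series["metric"]
    value = series["value"][1]  # Prometheus encodes sample values as strings, e.g. "-Inf"
    task = labels.get("task_name", "?")        # label names are assumptions here
    subtask = labels.get("subtask_index", "?")
    hint = "  <-- no lag reported; check partition assignment" if value == "-Inf" else ""
    print(f"task={task} subtask={subtask} records_lag_max={value}{hint}")

A subtask that shows -Inf there but a growing numRecordsOut in the Web UI would point more towards a client-side issue than towards a missing partition assignment.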
Cheers,
Gordon

On 13 June 2018 at 2:05:17 PM, Julio Biason (julio.bia...@azion.com) wrote:

Hi Gordon,

We have Kafka 0.10.1.1 running and use the flink-connector-kafka-0.10 driver.

There are a bunch of flink_taskmanager_job_task_operator_* metrics, including some about the committed offset for each partition. It seems I have 4 different records_lag_max series with different attempt_id values, though: 3 with -Inf and 1 with a value -- which means I'll need a bit more understanding of Prometheus to extract this properly.

I was also checking our Grafana and the metric we were using was "flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max", actually. "flink_taskmanager_job_task_operator_records_lag_max" seems to be new (with the attempt thingy).

On the "KafkaConsumer" front, it only has the "commited_offset" for each partition.

On Wed, Jun 13, 2018 at 5:41 AM, Tzu-Li (Gordon) Tai <tzuli...@apache.org> wrote:

Hi,

Which Kafka version are you using? AFAIK, the only recent changes to Kafka connector metrics in the 1.4.x series would be FLINK-8419 [1].

The 'records_lag_max' metric is a Kafka-shipped metric simply forwarded from the internally used Kafka client, so nothing should have been affected.

Do you see other metrics under the pattern of 'flink_taskmanager_job_task_operator_*'? All Kafka-shipped metrics should still follow this pattern. If not, could you find the 'records_lag_max' metric (or any other Kafka-shipped metrics [2]) under the user scope 'KafkaConsumer'?

The above should provide more insight into what may be wrong here.

- Gordon

[1] https://issues.apache.org/jira/browse/FLINK-8419
[2] https://docs.confluent.io/current/kafka/monitoring.html#fetch-metrics

On 12 June 2018 at 11:47:51 PM, Julio Biason (julio.bia...@azion.com) wrote:

Hey guys,

I just updated our Flink install from 1.4.0 to 1.4.2, but our Prometheus monitoring is not getting the current Kafka lag.

After updating to 1.4.2 and symlinking opt/flink-metrics-prometheus-1.4.2.jar into lib/, I got the metrics back on Prometheus, but the most important one, flink_taskmanager_job_task_operator_records_lag_max, is now returning -Inf.

Did I miss something?

--
Julio Biason, Software Engineer
AZION | Deliver. Accelerate. Protect.
Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554