Hey Gordon,

(Reviving this long thread)

I think I found part of the problem: it seems the metric only captures the lag from time to time and resets the value in between. I managed to replicate this by attaching a SQL sink (JDBCOutputFormat) connected to an external database -- something that took about 2 minutes to write 500 records.
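For reference, the sink side of the replication job looked more or less like this -- just a sketch, with the driver, URL, credentials, INSERT statement and the `records` stream name as placeholders (the real job builds a DataStream<Row> from the Kafka source before handing it to the sink):

    import org.apache.flink.api.java.io.jdbc.JDBCOutputFormat;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.types.Row;

    // Minimal sketch of the slow sink used to replicate the lag reset;
    // connection details and the INSERT statement are placeholders.
    JDBCOutputFormat slowSink = JDBCOutputFormat.buildJDBCOutputFormat()
            .setDrivername("org.postgresql.Driver")
            .setDBUrl("jdbc:postgresql://external-db:5432/lagtest")
            .setUsername("flink")
            .setPassword("secret")
            .setQuery("INSERT INTO lag_test (key, value) VALUES (?, ?)")
            .setBatchInterval(1)   // flush every record, so writes stay slow
            .finish();

    // `records` is the DataStream<Row> built upstream from the Kafka consumer.
    records.writeUsingOutputFormat(slowSink);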
I opened the ticket https://issues.apache.org/jira/browse/FLINK-9998 with a bit more information about this (because I completely forgot to open a ticket about this a month ago).

On Thu, Jun 14, 2018 at 11:31 AM, Julio Biason <julio.bia...@azion.com> wrote:

> Hey Gordon,
>
> The job restarted somewhere in the middle of the night (I haven't checked why yet) and now I have this weird status: the first TaskManager has only one valid lag, the second has 2 and the third has none.
>
> I dunno if I could see the partition in the logs, but all "numRecordsOut" are increasing over time (attached the screenshot of the graphs).
>
> On Thu, Jun 14, 2018 at 5:27 AM, Tzu-Li (Gordon) Tai <tzuli...@apache.org> wrote:
>
>> Hi,
>>
>> Thanks for the extra information. So, there seem to be 2 separate issues here. I'll go through them one by one.
>>
>> I was also checking our Grafana and the metric we were using was
>> "flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max",
>> actually. "flink_taskmanager_job_task_operator_records_lag_max" seems to
>> be new (with the attempt thingy).
>>
>> After looking at the code changes in FLINK-8419, this is unfortunately an accidental "break" in the scope of the metric.
>> In 1.4.0, the Kafka-shipped metrics were exposed under the "KafkaConsumer" metrics group. After FLINK-8419, this was changed, as you observed.
>> In 1.5.0, however, I think the metrics are exposed under both patterns.
>>
>> Now, about the fact that some subtasks are returning -Inf for 'records_lag_max':
>> If I understood the metric semantics correctly, this metric represents the max record lag across **partitions subscribed by a Kafka consumer client**.
>> So, the only possibilities I could think of for causing this are that the subtask does not have any partitions assigned to it, or that there is simply a bug in the Kafka client returning this value.
>>
>> Is it possible for you to verify that all subtasks have a partition assigned to them? That should be possible by just checking the job status in the Web UI and observing the numRecordsOut value for each source subtask.
>>
>> Cheers,
>> Gordon
>>
>> On 13 June 2018 at 2:05:17 PM, Julio Biason (julio.bia...@azion.com) wrote:
>>
>> Hi Gordon,
>>
>> We have Kafka 0.10.1.1 running and use the flink-connector-kafka-0.10 driver.
>>
>> There are a bunch of flink_taskmanager_job_task_operator_* metrics, including some about the committed offset for each partition. It seems I have 4 different records_lag_max with different attempt_id, though, 3 with -Inf and 1 with a value -- which means I'll need to understand Prometheus a bit better to extract this properly.
>>
>> I was also checking our Grafana and the metric we were using was
>> "flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max",
>> actually. "flink_taskmanager_job_task_operator_records_lag_max" seems to
>> be new (with the attempt thingy).
>>
>> On the "KafkaConsumer" front, it only has the "commited_offset" for each partition.
>>
>> On Wed, Jun 13, 2018 at 5:41 AM, Tzu-Li (Gordon) Tai <tzuli...@apache.org> wrote:
>>
>>> Hi,
>>>
>>> Which Kafka version are you using?
>>>
>>> AFAIK, the only recent changes to Kafka connector metrics in the 1.4.x series would be FLINK-8419 [1].
>>> The 'records_lag_max' metric is a Kafka-shipped metric simply forwarded from the internally used Kafka client, so nothing should have been affected.
>>>
>>> Do you see other metrics under the pattern of 'flink_taskmanager_job_task_operator_*'? All Kafka-shipped metrics should still follow this pattern.
>>> If not, could you find the 'records_lag_max' metric (or any other Kafka-shipped metrics [2]) under the user scope 'KafkaConsumer'?
>>>
>>> The above should provide more insight into what may be wrong here.
>>>
>>> - Gordon
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-8419
>>> [2] https://docs.confluent.io/current/kafka/monitoring.html#fetch-metrics
>>>
>>> On 12 June 2018 at 11:47:51 PM, Julio Biason (julio.bia...@azion.com) wrote:
>>>
>>> Hey guys,
>>>
>>> I just updated our Flink install from 1.4.0 to 1.4.2, but our Prometheus monitoring is not getting the current Kafka lag.
>>>
>>> After updating to 1.4.2 and symlinking opt/flink-metrics-prometheus-1.4.2.jar into lib/, I got the metrics back on Prometheus, but the most important one, flink_taskmanager_job_task_operator_records_lag_max, is now returning -Inf.
>>>
>>> Did I miss something?
>>>
>>> --
>>> *Julio Biason*, Software Engineer
>>> *AZION* | Deliver. Accelerate. Protect.
>>> Office: +55 51 3083 8101 | Mobile: +55 51 *99907 0554*
>>
>>
>> --
>> *Julio Biason*, Software Engineer
>> *AZION* | Deliver. Accelerate. Protect.
>> Office: +55 51 3083 8101 | Mobile: +55 51 *99907 0554*
>
>
> --
> *Julio Biason*, Software Engineer
> *AZION* | Deliver. Accelerate. Protect.
> Office: +55 51 3083 8101 | Mobile: +55 51 *99907 0554*

--
*Julio Biason*, Software Engineer
*AZION* | Deliver. Accelerate. Protect.
Office: +55 51 3083 8101 | Mobile: +55 51 *99907 0554*