Just to add some more info, here is the data I have on Prometheus (with some names redacted):
flink_taskmanager_job_task_operator_records_lag_max{host="002s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="0",task_attempt_id="fa104111e1f493bbec6f4b2ce44ec1da",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="0496ee3dc28c7cad5a512b4aefce67fa"} 83529 flink_taskmanager_job_task_operator_records_lag_max{host="002s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="3",task_attempt_id="4c5d8395882fb1ad26bdd6fd7f4e789c",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="0496ee3dc28c7cad5a512b4aefce67fa"} -Inf flink_taskmanager_job_task_operator_records_lag_max{host="002s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="6",task_attempt_id="f05e3171c446c19c7b928eeffe0fa52f",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="0496ee3dc28c7cad5a512b4aefce67fa"} -Inf flink_taskmanager_job_task_operator_records_lag_max{host="002s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="9",task_attempt_id="9533c5fa9fafadf4878ce90a08f83213",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="0496ee3dc28c7cad5a512b4aefce67fa"} 84096 flink_taskmanager_job_task_operator_records_lag_max{host="003s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="1",task_attempt_id="7ea45523850f5d08bab719418321e410",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="8f68b231855c663ae6b3c2362d39568a"} 83867 flink_taskmanager_job_task_operator_records_lag_max{host="003s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="10",task_attempt_id="cf6c2349ccf818f6870fdf0296be121b",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="8f68b231855c663ae6b3c2362d39568a"} 83829 flink_taskmanager_job_task_operator_records_lag_max{host="003s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="4",task_attempt_id="b39c5366b90d74e57d058b64e9e08e56",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="8f68b231855c663ae6b3c2362d39568a"} -Inf flink_taskmanager_job_task_operator_records_lag_max{host="003s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="7",task_attempt_id="db563e7e360227585cff9fa3d0035b0d",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="8f68b231855c663ae6b3c2362d39568a"} -Inf flink_taskmanager_job_task_operator_records_lag_max{host="004s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="11",task_attempt_id="4e9231f9187b0dffc728d8cd77cfef9e",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="9098e39a467aa6c255dcf2ec44544cb2"} 83730 flink_taskmanager_job_task_operator_records_lag_max{host="004s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="2",task_attempt_id="e920a1672b0a31c2d186e3f6fee38bed",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="9098e39a467aa6c255dcf2ec44544cb2"} -Inf flink_taskmanager_job_task_operator_records_lag_max{host="004s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="5",task_attempt_id="0e22fd213905d4da222e3651e7007106",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="9098e39a467aa6c255dcf2ec44544cb2"} 83472 flink_taskmanager_job_task_operator_records_lag_max{host="004s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="8",task_attempt_id="f36fe63b0688a821f5abf685551c47fa",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="9098e39a467aa6c255dcf2ec44544cb2"} 83842 What we have are 3 servers running with 4 
What we have are 3 servers running with 4 slots each. Our Kafka topic has 12 partitions and the job is running with a parallelism of 12. In this setup, I'd expect each slot to grab one partition and process it, giving me a different lag for each partition. But it seems some slots are stubbornly refusing to either grab a partition or update their status. From the perspective of someone who doesn't know the code, it doesn't seem to be related to TaskManagers sharing the same Kafka connection, since 004 is reporting lag for 3 partitions while 002 and 003 are reporting for just 2. And I was wrong about the attempt_id: it's not what's messing with my Prometheus query; it's some slots reporting -Inf for their partitions.

On Wed, Jun 13, 2018 at 9:05 AM, Julio Biason <julio.bia...@azion.com> wrote:

> Hi Gordon,
>
> We have Kafka 0.10.1.1 running and use the flink-connector-kafka-0.10 driver.
>
> There are a bunch of flink_taskmanager_job_task_operator_* metrics, including some about the committed offset for each partition. It seems I have 4 different records_lag_max with different attempt_id, though, 3 with -Inf and 1 with an actual value, which means I'll need to understand Prometheus a bit better to extract this properly.
>
> I was also checking our Grafana and the metric we were using was actually "flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max". "flink_taskmanager_job_task_operator_records_lag_max" seems to be new (with the attempt thingy).
>
> On the "KafkaConsumer" front, it only has the "commited_offset" for each partition.
>
> On Wed, Jun 13, 2018 at 5:41 AM, Tzu-Li (Gordon) Tai <tzuli...@apache.org> wrote:
>
>> Hi,
>>
>> Which Kafka version are you using?
>>
>> AFAIK, the only recent changes to Kafka connector metrics in the 1.4.x series would be FLINK-8419 [1].
>> The ‘records_lag_max’ metric is a Kafka-shipped metric simply forwarded from the internally used Kafka client, so nothing should have been affected.
>>
>> Do you see other metrics under the pattern of ‘flink_taskmanager_job_task_operator_*’? All Kafka-shipped metrics should still follow this pattern.
>> If not, could you find the ‘records_lag_max’ metric (or any other Kafka-shipped metrics [2]) under the user scope ‘KafkaConsumer’?
>>
>> The above should provide more insight into what may be wrong here.
>>
>> - Gordon
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-8419
>> [2] https://docs.confluent.io/current/kafka/monitoring.html#fetch-metrics
>>
>> On 12 June 2018 at 11:47:51 PM, Julio Biason (julio.bia...@azion.com) wrote:
>>
>> Hey guys,
>>
>> I just updated our Flink install from 1.4.0 to 1.4.2, but our Prometheus monitoring is not getting the current Kafka lag.
>>
>> After updating to 1.4.2 and symlinking opt/flink-metrics-prometheus-1.4.2.jar into lib/, I got the metrics back in Prometheus, but the most important one, flink_taskmanager_job_task_operator_records_lag_max, is now returning -Inf.
>>
>> Did I miss something?
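PS: on the naming question in the quoted thread (the old KafkaConsumer-scoped metric versus the new flink_taskmanager_job_task_operator_records_lag_max), a regex matcher on __name__ can cover both spellings while we figure out which one 1.4.2 actually exposes. Again, this is just a sketch, not a query we run in production:

    # Illustrative only: match either metric name, drop the -Inf reporters,
    # and take the worst lag per TaskManager host.
    max by (host) (
      {__name__=~"flink_taskmanager_job_task_operator(_KafkaConsumer)?_records_lag_max"} != -Inf
    )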
-- 
Julio Biason, Software Engineer
AZION | Deliver. Accelerate. Protect.
Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554