Just to add some more info, here is the data I have on Prometheus (with some names redacted):
flink_taskmanager_job_task_operator_records_lag_max{host="002s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="0",task_attempt_id="fa104111e1f493bbec6f4b2ce44ec1da",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="0496ee3dc28c7cad5a512b4aefce67fa"} 83529 flink_taskmanager_job_task_operator_records_lag_max{host="002s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="3",task_attempt_id="4c5d8395882fb1ad26bdd6fd7f4e789c",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="0496ee3dc28c7cad5a512b4aefce67fa"} -Inf flink_taskmanager_job_task_operator_records_lag_max{host="002s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="6",task_attempt_id="f05e3171c446c19c7b928eeffe0fa52f",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="0496ee3dc28c7cad5a512b4aefce67fa"} -Inf flink_taskmanager_job_task_operator_records_lag_max{host="002s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="9",task_attempt_id="9533c5fa9fafadf4878ce90a08f83213",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="0496ee3dc28c7cad5a512b4aefce67fa"} 84096 flink_taskmanager_job_task_operator_records_lag_max{host="003s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="1",task_attempt_id="7ea45523850f5d08bab719418321e410",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="8f68b231855c663ae6b3c2362d39568a"} 83867 flink_taskmanager_job_task_operator_records_lag_max{host="003s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="10",task_attempt_id="cf6c2349ccf818f6870fdf0296be121b",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="8f68b231855c663ae6b3c2362d39568a"} 83829 flink_taskmanager_job_task_operator_records_lag_max{host="003s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="4",task_attempt_id="b39c5366b90d74e57d058b64e9e08e56",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="8f68b231855c663ae6b3c2362d39568a"} -Inf flink_taskmanager_job_task_operator_records_lag_max{host="003s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="7",task_attempt_id="db563e7e360227585cff9fa3d0035b0d",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="8f68b231855c663ae6b3c2362d39568a"} -Inf flink_taskmanager_job_task_operator_records_lag_max{host="004s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="11",task_attempt_id="4e9231f9187b0dffc728d8cd77cfef9e",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="9098e39a467aa6c255dcf2ec44544cb2"} 83730 flink_taskmanager_job_task_operator_records_lag_max{host="004s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="2",task_attempt_id="e920a1672b0a31c2d186e3f6fee38bed",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="9098e39a467aa6c255dcf2ec44544cb2"} -Inf flink_taskmanager_job_task_operator_records_lag_max{host="004s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="5",task_attempt_id="0e22fd213905d4da222e3651e7007106",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="9098e39a467aa6c255dcf2ec44544cb2"} 83472 flink_taskmanager_job_task_operator_records_lag_max{host="004s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="8",task_attempt_id="f36fe63b0688a821f5abf685551c47fa",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="9098e39a467aa6c255dcf2ec44544cb2"} 83842 What we have are 3 servers running with 4 
What we have are 3 servers running with 4 slots each. Our Kafka topic has 12 partitions and the job is running with a parallelism of 12. In this setup, I'd expect each slot to grab one partition and process it, giving me a different lag for each partition. But it seems some slots are stubbornly refusing to either grab a partition or update their status. From the perspective of someone who doesn't know the code, it doesn't seem to be related to TaskManagers sharing the same Kafka connection, since 004 is reporting lag for 3 partitions while 002 and 003 are reporting for just 2. And I was wrong about the attempt_id: it's not what's messing with my Prometheus query; it's some slots reporting -Inf for their partitions.

On Wed, Jun 13, 2018 at 9:05 AM, Julio Biason <julio.bia...@azion.com> wrote:

> Hi Gordon,
>
> We have Kafka 0.10.1.1 running and use the flink-connector-kafka-0.10 driver.
>
> There are a bunch of flink_taskmanager_job_task_operator_* metrics, including some about the committed offset for each partition. It seems I have 4 different records_lag_max with different attempt_id, though, 3 with -Inf and 1 with an actual value, which means I'll need to understand Prometheus a bit better to extract this properly.
>
> I was also checking our Grafana and the metric we were using was actually "flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max". "flink_taskmanager_job_task_operator_records_lag_max" seems to be new (with the attempt thingy).
>
> On the "KafkaConsumer" front, it only has the "commited_offset" for each partition.
>
> On Wed, Jun 13, 2018 at 5:41 AM, Tzu-Li (Gordon) Tai <tzuli...@apache.org> wrote:
>
>> Hi,
>>
>> Which Kafka version are you using?
>>
>> AFAIK, the only recent changes to Kafka connector metrics in the 1.4.x series would be FLINK-8419 [1].
>> The ‘records_lag_max’ metric is a Kafka-shipped metric simply forwarded from the internally used Kafka client, so nothing should have been affected.
>>
>> Do you see other metrics under the pattern of ‘flink_taskmanager_job_task_operator_*’? All Kafka-shipped metrics should still follow this pattern.
>> If not, could you find the ‘records_lag_max’ metric (or any other Kafka-shipped metrics [2]) under the user scope ‘KafkaConsumer’?
>>
>> The above should provide more insight into what may be wrong here.
>>
>> - Gordon
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-8419
>> [2] https://docs.confluent.io/current/kafka/monitoring.html#fetch-metrics
>>
>> On 12 June 2018 at 11:47:51 PM, Julio Biason (julio.bia...@azion.com) wrote:
>>
>> Hey guys,
>>
>> I just updated our Flink install from 1.4.0 to 1.4.2, but our Prometheus monitoring is not getting the current Kafka lag.
>>
>> After updating to 1.4.2 and symlinking opt/flink-metrics-prometheus-1.4.2.jar into lib/, I got the metrics back in Prometheus, but the most important one, flink_taskmanager_job_task_operator_records_lag_max, is now returning -Inf.
>>
>> Did I miss something?
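PS: on the naming question in the quoted thread (the old KafkaConsumer-scoped metric versus the new flink_taskmanager_job_task_operator_records_lag_max), a regex matcher on __name__ can cover both spellings while we figure out which one 1.4.2 actually exposes. Again, this is just a sketch, not a query we run in production:

    # Illustrative only: match either metric name, drop the -Inf reporters,
    # and take the worst lag per TaskManager host.
    max by (host) (
      {__name__=~"flink_taskmanager_job_task_operator(_KafkaConsumer)?_records_lag_max"} != -Inf
    )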
-- 
Julio Biason, Software Engineer
AZION | Deliver. Accelerate. Protect.
Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554