Hi,

Thanks for the extra information. So, there seem to be two separate issues here.
I’ll go through them one by one.

I was also checking our Grafana and the metric we were using was 
"flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max", actually. 
"flink_taskmanager_job_task_operator_records_lag_max" seems to be new (with the 
attempt thingy).

After looking at the code changes in FLINK-8419, this unfortunately is an 
accidental “break” in the scope of the metric.
In 1.4.0, the Kafka-shipped metrics were exposed under the “KafkaConsumer” 
metrics group. After FLINK-8419, this was changed, as you observed.
In 1.5.0, however, I think the metrics are exposed under both patterns.
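
If you want the Grafana panel to keep working across both naming schemes, a query along these lines might do the trick (just a sketch over the two metric names as they appear in your Prometheus; when both exist, the left-hand side wins):

  # pre-FLINK-8419 name, falling back to the new one
  max(flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max)
    or
  max(flink_taskmanager_job_task_operator_records_lag_max)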

Now, regarding the fact that some subtasks are returning -Inf for ‘records-lag-max’:
If I understood the metric semantics correctly, this metric represents the "max 
record lag across **partitions subscribed by a Kafka consumer client**".
So, the only possibilities I can think of that would cause this are that either 
the subtask does not have any partitions assigned to it, or there is simply a bug 
in the Kafka client's reporting of this value.
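
If that is what is happening, you could at least keep the dashboard meaningful by dropping the -Inf series on the Prometheus side before aggregating. A rough sketch (the >= 0 filter discards any series currently at -Inf, and the outer max() also collapses the per-attempt duplicates you mentioned; if every series is -Inf the query returns no data instead of -Inf):

  # keep only series with a real (non-negative) lag value, then take the overall max
  max(flink_taskmanager_job_task_operator_records_lag_max >= 0)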

Could you verify that every subtask has at least one partition assigned to it? 
That should be possible by checking the job status in the Web UI and observing 
the numRecordsOut value for each source subtask.
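
As a cross-check on the Prometheus side, something like the following sketch should show the per-subtask output counts, assuming the reporter exposes operator_name and subtask_index as labels and that your source operator's name contains "Source" (both of these are assumptions; the label and operator names may differ in your setup):

  # output records per source subtask; a subtask without an assigned partition should stay at 0
  sum by (subtask_index) (
    flink_taskmanager_job_task_operator_numRecordsOut{operator_name=~".*Source.*"}
  )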

Cheers,
Gordon


On 13 June 2018 at 2:05:17 PM, Julio Biason (julio.bia...@azion.com) wrote:

Hi Gordon,

We have Kafka 0.10.1.1 running and use the flink-connector-kafka-0.10 driver.

There are a bunch of flink_taskmanager_job_task_operator_* metrics, including 
some about the committed offset for each partition. It seems I have 4 different 
records_lag_max with different attempt_id, though, 3 with -Inf and 1 with a 
value -- which means I'll need to understand Prometheus a bit better to extract 
this properly.

I was also checking our Grafana and the metric we were using was 
"flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max", actually. 
"flink_taskmanager_job_task_operator_records_lag_max" seems to be new (with the 
attempt thingy).

On the "KafkaConsumer" front, but it only has the "commited_offset" for each 
partition.

On Wed, Jun 13, 2018 at 5:41 AM, Tzu-Li (Gordon) Tai <tzuli...@apache.org> 
wrote:
Hi,

Which Kafka version are you using?

AFAIK, the only recent changes to Kafka connector metrics in the 1.4.x series 
would be FLINK-8419 [1].
The ‘records_lag_max’ metric is a Kafka-shipped metric simply forwarded from 
the internally used Kafka client, so nothing should have been affected.

Do you see other metrics under the pattern of 
‘flink_taskmanager_job_task_operator_*’? All Kafka-shipped metrics should still 
follow this pattern.
If not, could you find the ‘records_lag_max’ metric (or any other Kafka-shipped 
metrics [2]) under the user scope ‘KafkaConsumer’?
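
One quick way to check is a name selector in the Prometheus expression browser, for example (a sketch; adjust the prefix if your metrics scope format is configured differently):

  # list every series currently exported under the operator scope
  {__name__=~"flink_taskmanager_job_task_operator_.*"}

That should show everything under that pattern, including any Kafka-shipped metrics.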

The above should provide more insight into what may be wrong here.

- Gordon

[1] https://issues.apache.org/jira/browse/FLINK-8419
[2] https://docs.confluent.io/current/kafka/monitoring.html#fetch-metrics

On 12 June 2018 at 11:47:51 PM, Julio Biason (julio.bia...@azion.com) wrote:

Hey guys,

I just updated our Flink install from 1.4.0 to 1.4.2, but our Prometheus 
monitoring is not getting the current Kafka lag.

After updating to 1.4.2 and symlinking opt/flink-metrics-prometheus-1.4.2.jar 
into lib/, I got the metrics back on 
Prometheus, but the most important one, 
flink_taskmanager_job_task_operator_records_lag_max is now returning -Inf.
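
For reference, the reporter is enabled in flink-conf.yaml along these lines (a sketch of the standard setup; 9249 is the reporter's default port, ours may differ):

  # flink-conf.yaml
  metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
  metrics.reporter.prom.port: 9249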

Did I miss something?

--
Julio Biason, Software Engineer
AZION  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554



--
Julio Biason, Software Engineer
AZION  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554
