Julio Biason created FLINK-9998:
-----------------------------------

             Summary: FlinkKafkaConsumer produces lag -Inf when the pipeline lags
                 Key: FLINK-9998
                 URL: https://issues.apache.org/jira/browse/FLINK-9998
             Project: Flink
          Issue Type: Bug
          Components: Kafka Connector
    Affects Versions: 1.4.0
            Reporter: Julio Biason


I reported this on the list, but now I have enough information to understand 
what's going on.

Sometimes, the kafkaConsumer will report a lag 
(flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max) of -Inf.

The problem seems to be related to the capture time.

If the pipeline (defined with EXACTLY_ONCE) starts lagging at some point, there 
won't be enough information captured in a certain period and the reported lag 
becomes -Inf.
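
One way a windowed maximum can end up as -Inf: if the maximum is computed by 
folding from negative infinity, a window with no samples simply returns the 
starting value. The sketch below only illustrates that arithmetic; it is not 
the connector's or the Kafka client's code, and `WindowedMaxDemo` and 
`windowMax` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration, not Flink or Kafka client code.
final class WindowedMaxDemo {

    // Maximum over the lag samples captured in one window,
    // folding from negative infinity.
    static double windowMax(List<Double> samples) {
        double max = Double.NEGATIVE_INFINITY;
        for (double sample : samples) {
            max = Math.max(max, sample);
        }
        return max;
    }

    public static void main(String[] args) {
        // Window with samples: prints 500.0
        System.out.println(windowMax(Arrays.asList(120.0, 500.0, 80.0)));
        // Window with no samples (nothing was captured): prints -Infinity
        System.out.println(windowMax(Collections.<Double>emptyList()));
    }
}
```

If the real metric behaves like this, -Inf would mean "nothing was captured in 
this window" rather than "the lag is very low".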

Example: We had an external SQL Sink, pointing to an RDS source, but with a 
cluster outside AWS. This produced a flush time of about 2 minutes for 500 
records (captured because we added a metric around `upload.executeBatch()` 
inside JDBCOutputFormat). Although that is absurd (and a separate problem), 
during this time the metric would report `-Inf` and return a proper value once 
the stream finished.

So it seems the lag, instead of being a captured value that is kept in memory 
and updated from time to time, is recalculated from time to time (just because 
no record was processed in a certain period, it doesn't mean the lag went 
down).
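
A minimal sketch of the "captured value kept in memory" behaviour described 
above, assuming the raw lag can be read as a double. `LastKnownLagGauge` is a 
hypothetical wrapper, not an existing Flink or Kafka class, and this is not a 
proposed patch: it only shows the idea of holding the last finite reading when 
the underlying value is -Inf.

```java
import java.util.function.DoubleSupplier;

// Hypothetical wrapper, not part of Flink or the Kafka client.
final class LastKnownLagGauge {

    // Assumed source of the raw lag value (may yield -Infinity or NaN).
    private final DoubleSupplier rawLag;

    // Last finite lag seen; NaN until the first finite reading arrives.
    private double lastKnown = Double.NaN;

    LastKnownLagGauge(DoubleSupplier rawLag) {
        this.rawLag = rawLag;
    }

    // Returns the raw lag when it is finite; otherwise repeats the
    // last finite value instead of reporting -Infinity.
    synchronized double getValue() {
        double current = rawLag.getAsDouble();
        if (Double.isFinite(current)) {
            lastKnown = current;
        }
        return lastKnown;
    }
}
```

With a supplier that yields 500.0, then -Infinity, then 20.0, successive calls 
to `getValue()` return 500.0, 500.0 and 20.0. The trade-off is that a stale 
value can hide a consumer that has genuinely stopped, so a real fix would 
likely need some way to mark the reading as old.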



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
