Hi Padarn, 

We configure our Flink KafkaConsumer with setCommitOffsetsOnCheckpoints(true). 
With that setting, the offsets are committed to Kafka on each checkpoint for the 
consumer group of the application. We have external monitoring for our Kafka 
consumer groups (just a small script) that writes Kafka information such as 
startOffset, endOffset and the current committed position, for all consumer 
groups and for each topic and partition, to our metrics DB. I like this approach 
to monitoring because it is largely independent of Flink and therefore reliable 
at detecting problems when Flink is too slow. Of course, we also rely heavily on 
Flink's internal metrics, but for the first check of "is everything ok?" we look 
at the Kafka topic metrics and see "there are XX events coming in and there is 
no lag (backpressure) => all fine". 
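
A minimal sketch of the lag calculation such a script might perform, assuming 
the three offsets have already been fetched from Kafka (how they are fetched is 
not shown). The names PartitionOffsets, partition_lag and lag_per_topic are 
hypothetical and chosen for illustration; they are not taken from the script 
described above. 

```python
from typing import Dict, NamedTuple, Optional, Tuple

# (topic, partition) pair; a hypothetical alias used only in this sketch.
TopicPartition = Tuple[str, int]


class PartitionOffsets(NamedTuple):
    start_offset: int                # earliest offset still retained
    end_offset: int                  # offset the next produced record will get
    committed_offset: Optional[int]  # None if the group never committed


def partition_lag(offsets: PartitionOffsets) -> int:
    """Number of records between the group's position and the end of the log."""
    if offsets.committed_offset is None:
        # Assumption: a group without a commit is treated as starting
        # from the earliest retained record.
        position = offsets.start_offset
    else:
        # Retention may have deleted records past the committed offset.
        position = max(offsets.committed_offset, offsets.start_offset)
    return max(0, offsets.end_offset - position)


def lag_per_topic(offsets: Dict[TopicPartition, PartitionOffsets]) -> Dict[str, int]:
    """Sum the per-partition lag for each topic of one consumer group."""
    totals: Dict[str, int] = {}
    for (topic, _partition), partition_offsets in offsets.items():
        totals[topic] = totals.get(topic, 0) + partition_lag(partition_offsets)
    return totals
```

Because the offsets are only committed on checkpoints, a lag computed this way 
advances in steps of the checkpoint interval, so a short-lived non-zero value 
between two checkpoints is expected and not necessarily a sign of backpressure. 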

Best regards 
Theo 


From: "Padarn Wilson" <padarn.wil...@grab.com> 
To: "Robert Metzger" <rmetz...@apache.org>, "user" <user@flink.apache.org> 
Sent: Tuesday, 16 June 2020 02:52:16 
Subject: Re: [External] Measuring Kafka consumer lag 

Thanks Robert. 
Yes, we monitor many of the Flink internal metrics, which is why I was surprised 
that we were unable to notice the warning signs before our consumers notified 
us. 

It would be nice to measure the topic offset vs. the consumer group offset of 
the Flink consumer. 

On Tue, Jun 16, 2020 at 1:57 AM Robert Metzger <rmetz...@apache.org> wrote: 



Hi Padarn, 
I usually recommend the approach you described: accessing/monitoring the lag 
via Flink's metrics system. Sometimes it also makes sense to consider 
application level metrics. 
I checked YouTube for past Flink Forward talks, but I couldn't find a video. 
I'm sure there were users talking about best practices for monitoring Flink in 
the past ... 

Best, 
Robert 

On Sun, Jun 14, 2020 at 5:47 AM Padarn Wilson <padarn.wil...@grab.com> wrote: 


Hi all, 
I'm looking for some advice on how other people measure consumer lag for Kafka 
consumers. Recently we had an application that looked like it was performing 
identically to before, but all of a sudden the throughput of the job decreased 
dramatically. However, this was not visible in our Flink metrics, only in the 
lag between wall-clock time and watermark time that our consumers were measuring. 

How do people approach measuring this? 

Thanks, 
Padarn 


By communicating with Grab Inc and/or its subsidiaries, associate companies and 
jointly controlled entities (“Grab Group”), you are deemed to have consented to 
the processing of your personal data as set out in the Privacy Notice which can 
be viewed at https://grab.com/privacy/ 

This email contains confidential information and is only for the intended 
recipient(s). If you are not the intended recipient(s), please do not 
disseminate, distribute or copy this email. Please notify Grab Group immediately 
if you have received this by mistake and delete this email from your system. 
Email transmission cannot be guaranteed to be secure or error-free as any 
information therein could be intercepted, corrupted, lost, destroyed, delayed 
or incomplete, or contain viruses. Grab Group do not accept liability for any 
errors or omissions in the contents of this email arises as a result of email 
transmission. All intellectual property rights in this email and attachments 
therein shall remain vested in Grab Group, unless otherwise provided by law. 



