Thanks for the information Tom and Jagadish.
Jeremiah Adams Software Engineer www.helixeducation.com Blog | Twitter | Facebook | LinkedIn ________________________________________ From: Jagadish Venkatraman <jagadish1...@gmail.com> Sent: Tuesday, November 27, 2018 4:43 PM To: dev@samza.apache.org Subject: Re: Alerting and Monitoring Samza Checkpointing? Hi Jeremiah, +1 to what Tom said. Samza currently does not rely on Kafka consumer's checkpointing behavior and exposes its own notion of a "lag". This is reported as a per-partition metric under KafkaSystemConsumerMetrics#messagesBehindHighWatermark <https://url.emailprotection.link/?aZyQRg2CGut2qgyHrdHxA3r2wRZBhFBnHgQFe8bv7-el-kky1mI0NNorGT67_4hGvk9kwfq-xqe3AhS5fbDhgMvvevjSSie9YcK39B_XuWMa1wwlETzquMJOzILPI_fs8kXNOSL8pFBntugBWc-9uZMmpGC-UEQpMb9CvPN_cxKSus8rQ50_dgENdDG7FYPqW7BSwuNTl7DzgPysJqxE0ow~~> . My first recommendation would be to setup alerts on this metric and monitor it. To use *Burrow* for lag monitoring, you can extend the *KafkaSystemConsumer* to implement the CheckpointListener <https://url.emailprotection.link/?a5r3dp0BEHNX0237G8v6KFo8_GB1EZ1iyx01zxvfwK5Rm6d4ZFP_sbdggncCCDswySzk_Y1fqB5Ud0NAEk1IuUhjU20WoKjGH5uErkSWavdUPur-7LHWIppwLwsNRbGWmTvYjsmOVPtSptgIn0w625ZGv_jYlGxBoTiXzqhFLDOQ~> interface. This interface allows you to intercept Samza's checkpointing sequence and plug-in your own logic. An example implementation of the CheckpointListener could instantiate a Kafka consumer and periodically commit Samza's checkpointed offsets to it. Please let me know if you have any questions. Best, Jagadish On Tue, Nov 27, 2018 at 10:50 AM Jeremiah Adams <jad...@helixeducation.com> wrote: > > I am referring to the "Lag" that can exist when a Consumer Offset is > significantly less than the Log Size. This difference is Lag and is often > symptomatic of a problem - processing has stopped or being overwhelmed etc. > > Our Legacy Node.js system uses Consumer Groups, (same as say older > Spark). To get the Offset, we can use kafka-consumer-groups.sh tool to get > the offset. For Ops related work we use Kafkamon for these to get a UI up > for Ops folks. > > Our newer stuff uses Samza and I see zero Consumer Groups. Instead I see > checkpoint topics (example: > __samza_checkpoint_ver_1_for_generic-delivery_1). I can consume this topic > and get the current offset by partition, but I don't have the log size, so > cannot compute the lag. All I can do is see these numbers increment but > know clue how behind my process is. > > I just took Linkedin's Burrow > (https://url.emailprotection.link/?aZyQRg2CGut2qgyHrdHxA3nJh1kgchfH4Ntw9gzf6uyPMb70s0sXRMiq-yoBXqdnS8SqL9elFNrlOevL0tB3-NQ~~) > for a > test drive locally, hoping it would solve my problem due to it looking at > the internal consumers. However, I have the same problem - can't get data > on a consumer group that doesn't exist. > > > > > > Jeremiah Adams > Software Engineer > https://url.emailprotection.link/?ahfhEufaAWbezBrUFPG98ZJcterGfIerU3ZwsA3Gv_C0~ > Blog | Twitter | Facebook | LinkedIn > > ________________________________________ > From: Tom Davis <t...@recursivedream.com> > Sent: Monday, November 26, 2018 6:59 PM > To: dev@samza.apache.org > Subject: Re: Alerting and Monitoring Samza Checkpointing? > > Have you looked into KafkaSystemConsumerMetrics? Is the meaning of "lag" > there different from what you mean? > > > Jeremiah Adams <jad...@helixeducation.com> writes: > > > We are replacing a node.js app that consumed topics on a Kafka cluster > with > > Samza jobs. We use kafka-offsets to trigger alerts based on message lag. > e.g., > > message lag is greater than 10, wake up support persons. > > > > > > Samza doesn't use the same mechanism for offset storage and the tools for > > examining a topic's checkpoint aren't readily useful for application > > consumption. > > > > > > Can some of you share your approaches to monitoring and alerting on > consumer > > lag? > > > > > > Regards. > > > > > > Jeremiah Adams > > Software Engineer > > > https://url.emailprotection.link/?ahfhEufaAWbezBrUFPG98ZJcterGfIerU3ZwsA3Gv_C0~ > < > https://url.emailprotection.link/?a49H2rNGIIBtQOw6md8OcHp-qKE3Xn2gNiZ3dlqAeSDA~ > > > > Blog< > https://url.emailprotection.link/?a49H2rNGIIBtQOw6md8OcHgFEZu-KYuiu8doY66NWwmmyWxz7kC-27Yfnbdgd2wyh5gjXUa6LMT_NRXsj1g1VVg~~> > | > > Twitter< > https://url.emailprotection.link/?a0Q7ct5_6cOdbJ86kpWB0zx6RbtgugTVC7lU_W7za50jLdZQGpLgVlR1V06zckSaM5oOKb6QBo46Qp9xt0Tt7Aw~~> > | > > Facebook< > https://url.emailprotection.link/?aAmyAO_nS_C1aDgBLeKyGTu0tksTt1_mn2PcS8KJXNJPM04iRHKgX96qGgENV-dMSER5wl8zDVRr3RsS0OmcF9A~~> > | > > LinkedIn< > https://url.emailprotection.link/?aanlcNI-cN74Gdz-TD332xAl6lHu7TRNICWoHUFjYf-KlBjrCGHoYR65b3rl-OyW10nWFv6hwYvUSoVHL4b3vGA~~ > > > -- Jagadish V, Graduate Student, Department of Computer Science, Stanford University