Good day,

I'm looking into using SimpleConsumer#getOffsetsBefore and offsets
committed in ZooKeeper for monitoring the lag of a consumer group.

Our current use case is a service that continuously consumes messages from
a large number of topics and persists them to S3 at somewhat regular
intervals (depending on elapsed time and the total size of consumed
messages for each partition). Offsets are committed to ZooKeeper only
after the messages have been persisted to S3.
The partitions carry varying loads, so a simple threshold on the number of
messages we're lagging behind would be cumbersome to maintain across that
many topics, and would most likely trigger unnecessary alerts.

Currently our broker configuration specifies log.roll.hours=1 and
log.segment.bytes set to 1 GB. My proposed solution is to have a separate
service that iterates through all topics/partitions, calls
#getOffsetsBefore with a timestamp from one (1) or two (2) hours ago, and
compares the first returned offset with the one saved in ZooKeeper. From
my testing, that first offset looks to be the one closest in time, i.e.
from the log segment closest to the given timestamp.
It feels like a pretty solid solution, given that we just want a rough
estimate of how much we're lagging behind in time, so that we know (again,
roughly) how much time we have to fix whatever is broken before the log
segments are deleted by Kafka.
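A rough sketch of the comparison in Java, assuming the offsets returned by
#getOffsetsBefore (in descending order, first = closest to the timestamp,
as observed above) and the offset committed in ZooKeeper have already been
fetched. The Kafka and ZooKeeper calls are left out, and the class and
method names are made up for illustration:

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper: compares the committed offset of one partition with
// the offsets returned by getOffsetsBefore for one or more time windows.
public final class PartitionLagCheck {

    public enum Status { OK, BEHIND_WINDOW, NO_MARKER }

    private PartitionLagCheck() {}

    // offsetsBefore: the offsets returned for (now - window), assumed to be in
    // descending order so that the first one is closest to the timestamp.
    public static Status check(List<Long> offsetsBefore, long committedOffset) {
        if (offsetsBefore == null || offsetsBefore.isEmpty()) {
            // Nothing older than the timestamp was returned, so there is no marker.
            return Status.NO_MARKER;
        }
        long marker = offsetsBefore.get(0);
        // Committed offset still below the marker: lagging roughly the window or more.
        return committedOffset < marker ? Status.BEHIND_WINDOW : Status.OK;
    }

    // markerByWindowHours: window size in hours -> first offset returned for it.
    // Returns the largest window the consumer is behind, or 0 if none.
    public static int hoursBehindAtLeast(SortedMap<Integer, Long> markerByWindowHours,
                                         long committedOffset) {
        int result = 0;
        for (Map.Entry<Integer, Long> e : markerByWindowHours.entrySet()) {
            if (committedOffset < e.getValue()) {
                result = e.getKey();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(check(List.of(1200L, 800L, 0L), 1500L)); // OK
        System.out.println(check(List.of(1200L, 800L, 0L), 900L));  // BEHIND_WINDOW
        System.out.println(check(List.of(), 900L));                 // NO_MARKER

        SortedMap<Integer, Long> markers = new TreeMap<>();
        markers.put(1, 1200L); // first offset for now - 1 hour
        markers.put(2, 800L);  // first offset for now - 2 hours
        System.out.println(hoursBehindAtLeast(markers, 500L));      // 2
    }
}
```

Since the marker comes from a log segment (roughly one hour each with
log.roll.hours=1), "behind the 1-hour marker" only means roughly an hour or
more of lag, not an exact figure, which matches the rough estimate we're
after.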

Is there anyone doing monitoring similar to this? Are there any obvious
downsides of this approach that I'm not thinking about? Thoughts on
alternatives?

Best regards,
Mathias
