Good day,

I'm looking into using SimpleConsumer#getOffsetsBefore and offsets committed in ZooKeeper for monitoring the lag of a consumer group.
Our current use case: we have a service that continuously consumes messages from a large number of topics and persists them to S3 at somewhat regular intervals (the interval depends on elapsed time and on the total size of consumed messages for each partition). Offsets are committed to ZooKeeper after the messages have been persisted to S3.

The partitions carry varying load, so a simple threshold on the number of messages we're lagging behind would be cumbersome to maintain given the number of topics, and would most likely produce unnecessary alerts.

Our broker configuration currently specifies log.roll.hours=1 and log.segment.bytes=1GB. My proposed solution is a separate service that iterates through all topics/partitions, calls #getOffsetsBefore with a timestamp one or two hours in the past, and compares the first offset returned (which, from my testing, appears to be the offset closest in time, i.e. from the log segment closest to the given timestamp) with the offset saved in ZooKeeper.

It feels like a pretty solid solution, given that we only want a rough estimate of how far behind we are in time, so that we know (again, roughly) how much time we have to fix whatever is broken before Kafka deletes the log segments.

Is anyone doing monitoring similar to this? Are there any obvious downsides to this approach that I'm not thinking of? Thoughts on alternatives?

Best regards,
Mathias
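
P.S. To make the comparison concrete, here is a minimal sketch in Java of the check I have in mind. It deliberately leaves out the Kafka and ZooKeeper calls: it takes the offsets that #getOffsetsBefore returned for the "N hours ago" timestamp and the offset read from ZooKeeper as plain inputs. The names LagCheck, Status and check are made up for illustration, and the sketch assumes the returned offsets are ordered newest first, which is what my testing suggests but which I have not confirmed.

```java
import java.util.OptionalLong;

// Hypothetical helper, not part of Kafka: decides whether a consumer group
// is further behind than the time threshold used for getOffsetsBefore.
class LagCheck {
    enum Status { OK, LAGGING, UNKNOWN }

    /**
     * @param offsetsBefore   offsets returned by getOffsetsBefore for the
     *                        threshold timestamp; assumed newest first
     * @param committedOffset offset stored in ZooKeeper for the partition,
     *                        empty if the group has never committed
     */
    static Status check(long[] offsetsBefore, OptionalLong committedOffset) {
        // No segment old enough, or no commit yet: nothing to compare.
        if (offsetsBefore.length == 0 || !committedOffset.isPresent()) {
            return Status.UNKNOWN;
        }
        // First entry: start of the segment closest to the threshold.
        long boundary = offsetsBefore[0];
        // A committed offset below the boundary means messages older than
        // (roughly) the threshold have not been persisted yet.
        return committedOffset.getAsLong() < boundary ? Status.LAGGING : Status.OK;
    }

    public static void main(String[] args) {
        // Made-up offsets for one partition.
        long[] offsetsTwoHoursAgo = {500L, 100L};
        System.out.println(check(offsetsTwoHoursAgo, OptionalLong.of(450L))); // prints LAGGING
    }
}
```

The monitoring service would run this once per topic/partition and alert on LAGGING. UNKNOWN is kept separate so that partitions without a commit, or without a segment older than the threshold, can be handled however seems appropriate.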