Sean Glover created KAFKA-8656:
----------------------------------
Summary: Kafka Consumer Record Latency Metric
Key: KAFKA-8656
URL: https://issues.apache.org/jira/browse/KAFKA-8656
Project: Kafka
Issue Type: New Feature
Components: metrics
Reporter: Sean Glover
Assignee: Sean Glover
Consumer lag is a useful metric for monitoring how many records are queued to be
processed. We can look at the lag of an individual partition or aggregate it
across partitions. For example, we may want to monitor the maximum lag of any
partition in our consumer subscription so we can identify hot partitions caused
by an inadequate producer partitioning strategy. We may also want to monitor the
sum of lag across all partitions so we have a sense of our total backlog of
messages to consume. Lag in offsets is useful when you have a
good understanding of your messages and processing characteristics, but it
doesn’t tell us how far behind _in time_ we are. That time-based delay is known
as wait time in queueing theory, or more informally as latency.
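The two aggregations described above (maximum lag of any partition, and total lag across partitions) can be sketched as follows. This is only an illustration: `LagAggregates` and its methods are hypothetical names, not part of the Kafka client, and the per-partition lag values are assumed to have been collected already.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Kafka client API.
final class LagAggregates {
    private LagAggregates() {}

    // Largest lag (in offsets) of any single partition; 0 when no partitions are assigned.
    static long maxLag(Map<Integer, Long> lagByPartition) {
        return lagByPartition.values().stream()
                .mapToLong(Long::longValue)
                .max()
                .orElse(0L);
    }

    // Total backlog (in offsets) across all assigned partitions.
    static long totalLag(Map<Integer, Long> lagByPartition) {
        return lagByPartition.values().stream()
                .mapToLong(Long::longValue)
                .sum();
    }
}
```

A partition whose lag sits far above the others in `maxLag` would be a candidate hot partition, while `totalLag` tracks the overall backlog.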
The latency of a message can be defined as the time elapsed between when the
message was first produced and when it is received by a consumer. The
latency of records in a partition correlates with lag, but a larger lag doesn’t
necessarily mean a larger latency. For example, a topic consumed by two
separate application consumer groups A and B may have similar lag, but
different latency per partition. Application A is a consumer which performs
CPU intensive business logic on each message it receives. It’s distributed
across many consumer group members to keep up with the load, but since it
processes each message more slowly, messages in each partition wait longer
before they are consumed. Meanwhile, Application B is a consumer which performs
a simple ETL operation to land streaming data in another system, such as HDFS.
It may have similar lag to Application A, but because it processes each message
faster its latency per partition is significantly lower.
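As a minimal sketch of the definition above, latency can be computed as the receive time minus the record's produce timestamp. The class and method names here are hypothetical. With the real client the produce timestamp would presumably come from `ConsumerRecord.timestamp()` (epoch milliseconds), and clamping negative values to zero to absorb clock skew between producer and consumer hosts is an assumption of this sketch, not something the issue specifies.

```java
// Hypothetical helper for illustration only; not part of the Kafka client API.
final class RecordLatency {
    private RecordLatency() {}

    // Latency in milliseconds: time the consumer received the record minus the
    // time the record was produced. Both arguments are epoch milliseconds.
    // Assumption: negative results (clock skew) are clamped to zero.
    static long latencyMs(long producedAtMs, long receivedAtMs) {
        return Math.max(0L, receivedAtMs - producedAtMs);
    }
}
```

Two consumer groups with the same offset lag can report very different values here, because the result depends on how long records have been waiting rather than on how many are waiting.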
If the Kafka Consumer reported a latency metric, it would be easier to build
Service Level Agreements (SLAs) based on non-functional requirements of the
streaming system. For example: the system must never have a latency greater
than 10 minutes. This SLA could be used in monitoring alerts or as input to
automatic scaling solutions.
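A check for the example SLA might look like the sketch below. `LatencySla` is a hypothetical name, and how the observed latency is obtained (for instance, from the proposed metric) is left open.

```java
import java.time.Duration;

// Hypothetical SLA check for illustration only; not part of Kafka.
final class LatencySla {
    private final Duration maxLatency;

    LatencySla(Duration maxLatency) {
        this.maxLatency = maxLatency;
    }

    // True when the observed latency is strictly greater than the allowed maximum.
    boolean isViolated(Duration observedLatency) {
        return observedLatency.compareTo(maxLatency) > 0;
    }
}
```

With `new LatencySla(Duration.ofMinutes(10))`, an alerting rule or autoscaler could call `isViolated` on each reported latency value and react when it returns true.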
[KIP-488|https://cwiki.apache.org/confluence/display/KAFKA/488%3A+Kafka+Consumer+Record+Latency+Metric]
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)