Re: [DISCUSS] KIP-196: Add metrics to Kafka Connect framework

Randall Hauch Tue, 12 Sep 2017 10:48:40 -0700

On Tue, Sep 12, 2017 at 10:36 AM, Roger Hoover <roger.hoo...@gmail.com>
wrote:


> Randall/Ewen,
>
> I think the timing info is still useful even if it's measured since the
> last rebalance.  How else do you know where time is being spent?
>

I think Ewen's concern (correct me if I'm wrong) is that measuring
time-based metrics might result in excessive performance degradation,
especially when batch sizes are small.


>
> The use case for seeing the batch size is that you generally have two knobs
> to configure - max batch size and max wait time.  The batch size metrics
> would tell you how full your batches are based on your current linger time
> so you can adjust the config.
>

It does seem that batch sizes are useful, and the KIP includes these
("batch-size-max" and "batch-size-avg").


>
> Cheers,
>
> Roger
>
> On Mon, Sep 11, 2017 at 7:50 PM, Ewen Cheslack-Postava <e...@confluent.io>
> wrote:
>
> > re: questions about additional metrics, I think we'll undoubtedly find
> more
> > that people want in practice, but as I mentioned earlier I think it's
> > better to add the ones we know we need and then fill out the rest as we
> > figure it out. So, e.g., batch size metrics sound like they could be
> > useful, but I'd probably wait until we have a clear use case. It seems
> > likely that it could be useful in diagnosing slow connectors (e.g. the
> > implementation just does something inefficient), but I'm not really sure
> > about that yet.
> >
> > -Ewen
> >
> > On Mon, Sep 11, 2017 at 7:11 PM, Randall Hauch <rha...@gmail.com> wrote:
> >
> > > Based on Roger and Ewen's feedback, I removed the aggregate metrics as
> > they
> > > would be difficult to make use of without extra work. This simplified
> > > things a great deal, and I took the opportunity to reorganize the
> groups
> > of
> > > metrics. Also, based upon Ewen's concerns regarding measuring
> > > times/durations, I removed all time-related metrics except for the
> offset
> > > commits and rebalances, which are infrequent enough to warrant the
> > capture
> > > of percentiles. Roger asked about capturing batch size metrics for
> source
> > > and sink tasks, and offset lag metrics for sink tasks. Finally, Ewen
> > > pointed out that all count/total metrics are only valid since the most
> > > recent rebalance and are therefore less meaningful, and were removed.
> > >
> > > On Mon, Sep 11, 2017 at 6:50 PM, Randall Hauch <rha...@gmail.com>
> wrote:
> > >
> > > > Thanks, Ewen. Comments inline below.
> > > >
> > > > On Mon, Sep 11, 2017 at 5:46 PM, Ewen Cheslack-Postava <
> > > e...@confluent.io>
> > > > wrote:
> > > >
> > > >> Randall,
> > > >>
> > > >> A couple of questions:
> > > >>
> > > >> * Some metrics don't seem to have unique names? e.g.
> > > >> source-record-produce-rate and source-record-produce-total seem like
> > > they
> > > >> are duplicated. Looks like maybe just an oversight that the second
> > ones
> > > >> should be changed from "produce" to "write".
> > > >>
> > > >
> > > > Nice catch. You are correct - should be "write" instead of
> "produce". I
> > > > will correct.
> > > >
> > > >
> > > >> * I think there's a stray extra character in a couple of
> > > >> places: kafka.connect:type=source-task-metrics,name=source-record-
> > > >> produce-total,worker=([-.\w]+)l,connector=([-.\w]+),task=([\d]+)
> > > >> has an extra char after the worker name.
> > > >>
> > > >
> > > > Thanks. Removed in 2 places.
> > > >
> > > >
> > > >> * Are the produce totals actually useful given rebalancing would
> > cancel
> > > >> them out anyway? Doesn't seem like you could do much with them.
> > > >>
> > > >
> > > > Yes, the totals would be since the last rebalance. Maybe that isn't
> > that
> > > > useful. Might be better to capture the offsets and lag as Roger was
> > > > suggestion. Thoughts?
> > > >
> > > >
> > > >> * Why do transformations get their own metric but not converters?
> And
> > > are
> > > >> we concerned at all about the performance impact of getting such
> fine
> > > >> grained info? Getting current time isn't free and we've seen before
> > that
> > > >> we
> > > >> ended up w/ accidental performance regressions as we tried to check
> it
> > > too
> > > >> frequently to enforce timeouts fine grained in the producer (iirc).
> > > >> Batching helps w/ this, but on the consumer side, a
> max.poll.records=1
> > > >> setting could put you in a bad place, especially since transforms
> > might
> > > be
> > > >> very lightweight (or nothing) and converters are expected to be
> > > relatively
> > > >> cheap as well.
> > > >>
> > > >
> > > > We could remove the read, transform, and put time-based metrics for
> > sink
> > > > tasks, and poll, transform, and write time-based metrics. Can/should
> > they
> > > > be replaced with anything else?
> > > >
> > > >
> > > >> * If we include the worker id everywhere and don't have metrics
> > without
> > > >> that included, isn't that a pain for users that dump this data into
> > some
> > > >> other system? They have to know which worker the connector/task is
> > > >> currently on *or* need to do extra work to merge the metrics from
> > across
> > > >> machines. Including versions with the worker ID can make sense for
> > > >> completeness and accuracy (e.g. technically there are still very
> slim
> > > >> risks
> > > >> of having a task running twice due to zombies), but it seems like
> bad
> > > >> usability for the common case.
> > > >>
> > > >
> > > > Part of the reason was also to help identify where each of the
> metrics
> > > > came from, but per the next comment this may not be as useful,
> either.
> > > > So remove the worker ID in all the task and connector metric names?
> > What
> > > > about the worker metrics?
> > > >
> > > >
> > > >> * Is aggregating things like source record rate at the (worker,
> > > connector)
> > > >> level really useful since you're just going to need to do additional
> > > >> aggregation anyway once you've collected metrics across all workers?
> > I'd
> > > >> rather add a smaller number of metrics w/ clear use cases than just
> > try
> > > to
> > > >> be exhaustive and then have to maintain stuff that nobody actually
> > uses.
> > > >>
> > > >
> > > > Yes, the connector aggregate metrics are maybe not as useful if you
> > also
> > > > have to aggregate them from different workers. Removing them probably
> > > also
> > > > reduces the risk of them being misinterpretted.
> > > >
> > > >
> > > >> * You have status for connectors but not for tasks. Any reason why?
> > > Seems
> > > >> like it'd make sense to expose both, especially since users
> generally
> > > care
> > > >> about task status more than connector status (not many connectors
> > > actually
> > > >> run a monitoring thread.)
> > > >>
> > > >
> > > > Ack.
> > > >
> > > >
> > > >> * Is number of tasks for each connector a useful metric? Not sure
> > > whether
> > > >> someone would find this useful or not. Probably not for alerts, but
> > > might
> > > >> be useful to be able to check it via your metrics dashboard.
> > > >>
> > > >
> > > > Seems like it might be useful, at least in terms of tracking the
> number
> > > of
> > > > tasks over time. Might not be as useful for connectors that have
> > > relatively
> > > > static tasks, but it would be more interesting/useful for connectors
> > that
> > > > create tasks dynamically and periodically request task
> > reconfigurations.
> > > >
> > > >
> > > >> * Same questions re: granularity of sink tasks/connectors timing and
> > > >> whether the connectors need all the roll-ups of individual (worker,
> > > task)
> > > >> values to (worker, connector) level.
> > > >>
> > > >
> > > > I'm fine with taking out the aggregates to keep things simple and
> > prevent
> > > > misunderstanding.
> > > >
> > > >
> > > >> * If we expose the who the worker currently thinks is leader, it
> might
> > > >> also
> > > >> make sense to expose the underlying epoch. Not actually sure if we
> > > expose
> > > >> that for the consumer today, but it's an indicator of who is
> properly
> > up
> > > >> to
> > > >> date.
> > > >>
> > > >
> > > > Ack.
> > > >
> > > >
> > > >> * Why worker-level offset commit stats? It's not clear to me that
> > these
> > > >> are
> > > >> useful without considering the specific connector.
> > > >>
> > > >
> > > > So would they make more sense on the tasks? Again, on the worker
> > they're
> > > > aggregates.
> > > >
> > > >
> > > >>
> > > >> -Ewen
> > > >>
> > > >>
> > > >> On Mon, Sep 11, 2017 at 9:43 AM, Randall Hauch <rha...@gmail.com>
> > > wrote:
> > > >>
> > > >> > Thanks for reviewing. Responses inline below.
> > > >> >
> > > >> > On Mon, Sep 11, 2017 at 11:22 AM, Roger Hoover <
> > > roger.hoo...@gmail.com>
> > > >> > wrote:
> > > >> >
> > > >> > > Randall,
> > > >> > >
> > > >> > > Thank you for the KIP.  This should improve visibility
> greatly.  I
> > > >> had a
> > > >> > > few questions/ideas for more metrics.
> > > >> > >
> > > >> > >
> > > >> > >    1. What's the relationship between the worker state and the
> > > >> connector
> > > >> > >    status?  Does the 'paused' status at the Connector level
> > include
> > > >> the
> > > >> > > time
> > > >> > >    that worker is 'rebalancing'?
> > > >> > >
> > > >> >
> > > >> > The worker state metric simply reports whether the worker is
> running
> > > or
> > > >> > rebalancing. This state is independent of how many connectors are
> > > >> > deployed/running/paused. During a rebalance, the connectors are
> > being
> > > >> > stopped and restarted but are effectively not running.
> > > >> >
> > > >> >
> > > >> > >    2. Are the "Source Connector" metrics like record rate an
> > > >> aggregation
> > > >> > of
> > > >> > >    the "Source Task" metrics?
> > > >> > >
> > > >> >
> > > >> > Yes.
> > > >> >
> > > >> >
> > > >> > >       - How much value is there is monitoring at the "Source
> > > >> Connector"
> > > >> > >       level (other than status) if the number of constituent
> tasks
> > > may
> > > >> > > change
> > > >> > >       over time?
> > > >> > >
> > > >> >
> > > >> > The task metrics allow you to know whether the tasks are evenly
> > loaded
> > > >> and
> > > >> > each making progress. The aggregate connector metrics tell you how
> > > much
> > > >> > work has been performed by all the tasks in that worker. Both are
> > > useful
> > > >> > IMO.
> > > >> >
> > > >> >
> > > >> > >       - I'm imagining that it's most useful to collect metrics
> at
> > > the
> > > >> > task
> > > >> > >       level as the task-level metrics should be stable
> regardless
> > of
> > > >> > tasks
> > > >> > >       shifting to different workers
> > > >> > >
> > > >> >
> > > >> > Correct, this is where the most value is because it is the most
> fine
> > > >> > grained.
> > > >> >
> > > >> >
> > > >> > >       - If so, can we duplicate the Connector Status down at the
> > > task
> > > >> > level
> > > >> > >          so that all important metrics can be tracked by task?
> > > >> > >
> > > >> >
> > > >> > Possibly. The challenge is that the threads running the tasks are
> > > >> blocked
> > > >> > when a connector is paused.
> > > >> >
> > > >> >
> > > >> > >          3. For the Sink Task metrics
> > > >> > >       - Can we add offset lag and timestamp lag on commit?
> > > >> > >          - After records are flushed/committed
> > > >> > >             - what is the diff between the record timestamps and
> > > >> commit
> > > >> > >             time (histogram)?  this is a measure of end-to-end
> > > >> pipeline
> > > >> > > latency
> > > >> > >             - what is the diff between record offsets and latest
> > > >> offset
> > > >> > of
> > > >> > >             their partition at commit time (histogram)? this is
> a
> > > >> > > measure of whether
> > > >> > >             this particular task is keeping up
> > > >> > >
> > > >> >
> > > >> > Yeah, possibly. Will have to compare with the consumer metrics to
> > see
> > > >> what
> > > >> > we can get.
> > > >> >
> > > >> >
> > > >> > >          - How about flush error rate?  Assuming the sink
> > connectors
> > > >> are
> > > >> > >       using retries, it would be helpful to know how many errors
> > > >> they're
> > > >> > > seeing
> > > >> > >
> > > >> >
> > > >> > We could add a metric to track how many times the framework
> > receives a
> > > >> > retry exception and then retries, but the connectors may also do
> > this
> > > on
> > > >> > their own.
> > > >> >
> > > >> >
> > > >> > >       - Can we tell at the framework level how many records were
> > > >> inserted
> > > >> > >       vs updated vs deleted?
> > > >> > >
> > > >> >
> > > >> > No, there's no distinction in the Connect framework.
> > > >> >
> > > >> >
> > > >> > >       - Batching stats
> > > >> > >          - Histogram of flush batch size
> > > >> > >          - Counts of flush trigger method (time vs max batch
> size)
> > > >> > >
> > > >> >
> > > >> > Should be able to add these.
> > > >> >
> > > >> >
> > > >> > >
> > > >> > > Cheers,
> > > >> > >
> > > >> > > Roger
> > > >> > >
> > > >> > > On Sun, Sep 10, 2017 at 8:45 AM, Randall Hauch <
> rha...@gmail.com>
> > > >> wrote:
> > > >> > >
> > > >> > > > Thanks, Gwen.
> > > >> > > >
> > > >> > > > That's a great idea, so I've changed the KIP to add those
> > metrics.
> > > >> I've
> > > >> > > > also made a few other changes:
> > > >> > > >
> > > >> > > >
> > > >> > > >    1. The context of all metrics is limited to the activity
> > within
> > > >> the
> > > >> > > >    worker. This wasn't clear before, so I changed the
> motivation
> > > and
> > > >> > > metric
> > > >> > > >    descriptions to explicitly state this.
> > > >> > > >    2. Added the worker ID to all MBean attributes. In addition
> > to
> > > >> > > hopefully
> > > >> > > >    making this same scope obvious from within JMX or other
> > metric
> > > >> > > reporting
> > > >> > > >    system. This is also similar to how the Kafka producer and
> > > >> consumer
> > > >> > > > metrics
> > > >> > > >    include the client ID in their MBean attributes. Hopefully
> > this
> > > >> does
> > > >> > > not
> > > >> > > >    negatively impact or complicate how external reporting
> > systems'
> > > >> > > > aggregate
> > > >> > > >    metrics from multiple workers.
> > > >> > > >    3. Stated explicitly that aggregating metrics across
> workers
> > > was
> > > >> out
> > > >> > > of
> > > >> > > >    scope of this KIP.
> > > >> > > >    4. Added metrics to report the connector class and version
> > for
> > > >> both
> > > >> > > sink
> > > >> > > >    and source connectors.
> > > >> > > >
> > > >> > > > Check this KIP's history for details of these changes.
> > > >> > > >
> > > >> > > > Please let me know if you have any other suggestions. I hope
> to
> > > >> start
> > > >> > the
> > > >> > > > voting soon!
> > > >> > > >
> > > >> > > > Best regards,
> > > >> > > >
> > > >> > > > Randall
> > > >> > > >
> > > >> > > > On Thu, Sep 7, 2017 at 9:35 PM, Gwen Shapira <
> g...@confluent.io
> > >
> > > >> > wrote:
> > > >> > > >
> > > >> > > > > Thanks for the KIP, Randall. Those are badly needed!
> > > >> > > > >
> > > >> > > > > Can we have two metrics with record rate per task? One
> before
> > > SMT
> > > >> and
> > > >> > > one
> > > >> > > > > after?
> > > >> > > > > We can have cases where we read 5000 rows from JDBC but
> write
> > 5
> > > to
> > > >> > > Kafka,
> > > >> > > > > or read 5000 records from Kafka and write 5 due to
> filtering.
> > I
> > > >> think
> > > >> > > its
> > > >> > > > > important to know both numbers.
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > Gwen
> > > >> > > > >
> > > >> > > > > On Thu, Sep 7, 2017 at 7:50 PM, Randall Hauch <
> > rha...@gmail.com
> > > >
> > > >> > > wrote:
> > > >> > > > >
> > > >> > > > > > Hi everyone.
> > > >> > > > > >
> > > >> > > > > > I've created a new KIP to add metrics to the Kafka Connect
> > > >> > framework:
> > > >> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > >> > > > > > 196%3A+Add+metrics+to+Kafka+Connect+framework
> > > >> > > > > >
> > > >> > > > > > The KIP approval deadline is looming, so if you're
> > interested
> > > in
> > > >> > > Kafka
> > > >> > > > > > Connect metrics please review and provide feedback as soon
> > as
> > > >> > > possible.
> > > >> > > > > I'm
> > > >> > > > > > interested not only in whether the metrics are sufficient
> > and
> > > >> > > > > appropriate,
> > > >> > > > > > but also in whether the MBean naming conventions are okay.
> > > >> > > > > >
> > > >> > > > > > Best regards,
> > > >> > > > > >
> > > >> > > > > > Randall
> > > >> > > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > --
> > > >> > > > > *Gwen Shapira*
> > > >> > > > > Product Manager | Confluent
> > > >> > > > > 650.450.2760 | @gwenshap
> > > >> > > > > Follow us: Twitter <https://twitter.com/ConfluentInc> |
> blog
> > > >> > > > > <http://www.confluent.io/blog>
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-196: Add metrics to Kafka Connect framework

Reply via email to