Okay, I think I've incorporated all feedback except for Gwen and Roger, who would like to have timing metrics. Given the deadline and Ewen's concern about degraded performance, I think it's prudent to leave those out of this KIP and proceed as is.
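For anyone who wants to see where the overhead would come from, below is a rough, back-of-the-envelope sketch; it's only my own illustration, not code from the KIP or from Connect. Timing each record means wrapping it in a pair of System.nanoTime() calls, and the relative cost is worst exactly in the case Ewen described: tiny batches (e.g. max.poll.records=1) with transforms or converters that do almost nothing.

import java.util.concurrent.TimeUnit;

// Back-of-the-envelope comparison of processing records with and without
// per-record timing. The "transform" below is a stand-in for a very cheap
// SMT or converter; it is not Connect code.
public class TimingOverheadSketch {

    public static void main(String[] args) {
        final int records = 10_000_000;
        long checksum = 0;

        // Baseline: process records without any timing.
        long start = System.nanoTime();
        for (int i = 0; i < records; i++) {
            checksum += transform(i);
        }
        long baselineNanos = System.nanoTime() - start;

        // Per-record timing: bracket every record with nanoTime() calls, as a
        // fine-grained time-based metric would have to.
        long measuredNanos = 0;
        start = System.nanoTime();
        for (int i = 0; i < records; i++) {
            long t0 = System.nanoTime();
            checksum += transform(i);
            measuredNanos += System.nanoTime() - t0;
        }
        long timedNanos = System.nanoTime() - start;

        System.out.printf("baseline:               %d ms%n", TimeUnit.NANOSECONDS.toMillis(baselineNanos));
        System.out.printf("with per-record timing: %d ms%n", TimeUnit.NANOSECONDS.toMillis(timedNanos));
        System.out.println("checksum=" + checksum + ", time attributed to transform: "
                + TimeUnit.NANOSECONDS.toMillis(measuredNanos) + " ms");
    }

    // Stand-in for a very lightweight transformation.
    private static long transform(long value) {
        return value ^ (value << 1);
    }
}

Whether that overhead is actually significant would need benchmarking on the real Connect code paths; the sketch only shows where the extra calls would sit.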
On Tue, Sep 12, 2017 at 12:48 PM, Randall Hauch <rha...@gmail.com> wrote:

> On Tue, Sep 12, 2017 at 10:36 AM, Roger Hoover <roger.hoo...@gmail.com> wrote:
>
> > Randall/Ewen,
> >
> > I think the timing info is still useful even if it's measured since the last rebalance. How else do you know where time is being spent?
>
> I think Ewen's concern (correct me if I'm wrong) is that measuring time-based metrics might result in excessive performance degradation, especially when batch sizes are small.
>
> > The use case for seeing the batch size is that you generally have two knobs to configure - max batch size and max wait time. The batch size metrics would tell you how full your batches are based on your current linger time so you can adjust the config.
>
> It does seem that batch sizes are useful, and the KIP includes these ("batch-size-max" and "batch-size-avg").
>
> > Cheers,
> >
> > Roger
> >
> > On Mon, Sep 11, 2017 at 7:50 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
> >
> > > re: questions about additional metrics, I think we'll undoubtedly find more that people want in practice, but as I mentioned earlier I think it's better to add the ones we know we need and then fill out the rest as we figure it out. So, e.g., batch size metrics sound like they could be useful, but I'd probably wait until we have a clear use case. It seems likely that it could be useful in diagnosing slow connectors (e.g. the implementation just does something inefficient), but I'm not really sure about that yet.
> > >
> > > -Ewen
> > >
> > > On Mon, Sep 11, 2017 at 7:11 PM, Randall Hauch <rha...@gmail.com> wrote:
> > >
> > > > Based on Roger and Ewen's feedback, I removed the aggregate metrics as they would be difficult to make use of without extra work. This simplified things a great deal, and I took the opportunity to reorganize the groups of metrics. Also, based upon Ewen's concerns regarding measuring times/durations, I removed all time-related metrics except for the offset commits and rebalances, which are infrequent enough to warrant the capture of percentiles. Roger asked about capturing batch size metrics for source and sink tasks, and offset lag metrics for sink tasks. Finally, Ewen pointed out that all count/total metrics are only valid since the most recent rebalance and are therefore less meaningful, and were removed.
> > > >
> > > > On Mon, Sep 11, 2017 at 6:50 PM, Randall Hauch <rha...@gmail.com> wrote:
> > > >
> > > > > Thanks, Ewen. Comments inline below.
> > > > >
> > > > > On Mon, Sep 11, 2017 at 5:46 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
> > > > >
> > > > > > Randall,
> > > > > >
> > > > > > A couple of questions:
> > > > > >
> > > > > > * Some metrics don't seem to have unique names? e.g. source-record-produce-rate and source-record-produce-total seem like they are duplicated. Looks like maybe just an oversight that the second ones should be changed from "produce" to "write".
> > > > >
> > > > > Nice catch. You are correct - should be "write" instead of "produce". I will correct.
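[An aside from me, not part of the quoted discussion: once the names are corrected, it's easy to sanity-check them over JMX. Here's a rough sketch of browsing the proposed source-task MBeans; it assumes the kafka.connect:type=source-task-metrics,... object-name pattern from the current draft, and the final KIP text is authoritative for the exact names and attributes.]

import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Lists every MBean registered under the proposed source-task-metrics type,
// e.g. kafka.connect:type=source-task-metrics,name=source-record-write-total,
// worker=...,connector=...,task=..., along with its readable attributes.
public class ConnectMetricsLookup {

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();

        // Match any worker, connector, and task.
        ObjectName pattern = new ObjectName("kafka.connect:type=source-task-metrics,*");
        Set<ObjectName> names = server.queryNames(pattern, null);

        for (ObjectName name : names) {
            System.out.println(name);
            for (MBeanAttributeInfo attr : server.getMBeanInfo(name).getAttributes()) {
                if (attr.isReadable()) {
                    System.out.println("  " + attr.getName() + " = " + server.getAttribute(name, attr.getName()));
                }
            }
        }
    }
}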
> > > > > > * I think there's a stray extra character in a couple of places: kafka.connect:type=source-task-metrics,name=source-record-produce-total,worker=([-.\w]+)l,connector=([-.\w]+),task=([\d]+) has an extra char after the worker name.
> > > > >
> > > > > Thanks. Removed in 2 places.
> > > > >
> > > > > > * Are the produce totals actually useful given rebalancing would cancel them out anyway? Doesn't seem like you could do much with them.
> > > > >
> > > > > Yes, the totals would be since the last rebalance. Maybe that isn't that useful. Might be better to capture the offsets and lag as Roger was suggesting. Thoughts?
> > > > >
> > > > > > * Why do transformations get their own metric but not converters? And are we concerned at all about the performance impact of getting such fine grained info? Getting current time isn't free and we've seen before that we ended up w/ accidental performance regressions as we tried to check it too frequently to enforce timeouts fine grained in the producer (iirc). Batching helps w/ this, but on the consumer side, a max.poll.records=1 setting could put you in a bad place, especially since transforms might be very lightweight (or nothing) and converters are expected to be relatively cheap as well.
> > > > >
> > > > > We could remove the read, transform, and put time-based metrics for sink tasks, and the poll, transform, and write time-based metrics for source tasks. Can/should they be replaced with anything else?
> > > > >
> > > > > > * If we include the worker id everywhere and don't have metrics without that included, isn't that a pain for users that dump this data into some other system? They have to know which worker the connector/task is currently on *or* need to do extra work to merge the metrics from across machines. Including versions with the worker ID can make sense for completeness and accuracy (e.g. technically there are still very slim risks of having a task running twice due to zombies), but it seems like bad usability for the common case.
> > > > >
> > > > > Part of the reason was also to help identify where each of the metrics came from, but per the next comment this may not be as useful, either. So remove the worker ID in all the task and connector metric names? What about the worker metrics?
> > > > >
> > > > > > * Is aggregating things like source record rate at the (worker, connector) level really useful since you're just going to need to do additional aggregation anyway once you've collected metrics across all workers? I'd rather add a smaller number of metrics w/ clear use cases than just try to be exhaustive and then have to maintain stuff that nobody actually uses.
> > > > >
> > > > > Yes, the connector aggregate metrics are maybe not as useful if you also have to aggregate them from different workers. Removing them probably also reduces the risk of them being misinterpreted.
> > > > >
> > > > > > * You have status for connectors but not for tasks. Any reason why?
> > > > > > Seems like it'd make sense to expose both, especially since users generally care about task status more than connector status (not many connectors actually run a monitoring thread.)
> > > > >
> > > > > Ack.
> > > > >
> > > > > > * Is number of tasks for each connector a useful metric? Not sure whether someone would find this useful or not. Probably not for alerts, but might be useful to be able to check it via your metrics dashboard.
> > > > >
> > > > > Seems like it might be useful, at least in terms of tracking the number of tasks over time. Might not be as useful for connectors that have relatively static tasks, but it would be more interesting/useful for connectors that create tasks dynamically and periodically request task reconfigurations.
> > > > >
> > > > > > * Same questions re: granularity of sink tasks/connectors timing and whether the connectors need all the roll-ups of individual (worker, task) values to (worker, connector) level.
> > > > >
> > > > > I'm fine with taking out the aggregates to keep things simple and prevent misunderstanding.
> > > > >
> > > > > > * If we expose who the worker currently thinks is leader, it might also make sense to expose the underlying epoch. Not actually sure if we expose that for the consumer today, but it's an indicator of who is properly up to date.
> > > > >
> > > > > Ack.
> > > > >
> > > > > > * Why worker-level offset commit stats? It's not clear to me that these are useful without considering the specific connector.
> > > > >
> > > > > So would they make more sense on the tasks? Again, on the worker they're aggregates.
> > > > >
> > > > > > -Ewen
> > > > > >
> > > > > > On Mon, Sep 11, 2017 at 9:43 AM, Randall Hauch <rha...@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks for reviewing. Responses inline below.
> > > > > > >
> > > > > > > On Mon, Sep 11, 2017 at 11:22 AM, Roger Hoover <roger.hoo...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Randall,
> > > > > > > >
> > > > > > > > Thank you for the KIP. This should improve visibility greatly. I had a few questions/ideas for more metrics.
> > > > > > > >
> > > > > > > > 1. What's the relationship between the worker state and the connector status? Does the 'paused' status at the Connector level include the time that worker is 'rebalancing'?
> > > > > > >
> > > > > > > The worker state metric simply reports whether the worker is running or rebalancing. This state is independent of how many connectors are deployed/running/paused. During a rebalance, the connectors are being stopped and restarted but are effectively not running.
> > > > > > >
> > > > > > > > 2. Are the "Source Connector" metrics like record rate an aggregation of the "Source Task" metrics?
> > > > > > >
> > > > > > > Yes.
> > > > > > >
> > > > > > > >    - How much value is there in monitoring at the "Source Connector" level (other than status) if the number of constituent tasks may change over time?
> > > > > > >
> > > > > > > The task metrics allow you to know whether the tasks are evenly loaded and each making progress. The aggregate connector metrics tell you how much work has been performed by all the tasks in that worker. Both are useful IMO.
> > > > > > >
> > > > > > > >    - I'm imagining that it's most useful to collect metrics at the task level as the task-level metrics should be stable regardless of tasks shifting to different workers
> > > > > > >
> > > > > > > Correct, this is where the most value is because it is the most fine grained.
> > > > > > >
> > > > > > > >    - If so, can we duplicate the Connector Status down at the task level so that all important metrics can be tracked by task?
> > > > > > >
> > > > > > > Possibly. The challenge is that the threads running the tasks are blocked when a connector is paused.
> > > > > > >
> > > > > > > > 3. For the Sink Task metrics
> > > > > > > >    - Can we add offset lag and timestamp lag on commit?
> > > > > > > >    - After records are flushed/committed
> > > > > > > >       - what is the diff between the record timestamps and commit time (histogram)? this is a measure of end-to-end pipeline latency
> > > > > > > >       - what is the diff between record offsets and latest offset of their partition at commit time (histogram)? this is a measure of whether this particular task is keeping up
> > > > > > >
> > > > > > > Yeah, possibly. Will have to compare with the consumer metrics to see what we can get.
> > > > > > >
> > > > > > > >    - How about flush error rate? Assuming the sink connectors are using retries, it would be helpful to know how many errors they're seeing
> > > > > > >
> > > > > > > We could add a metric to track how many times the framework receives a retry exception and then retries, but the connectors may also do this on their own.
> > > > > > >
> > > > > > > >    - Can we tell at the framework level how many records were inserted vs updated vs deleted?
> > > > > > >
> > > > > > > No, there's no distinction in the Connect framework.
> > > > > > >
> > > > > > > >    - Batching stats
> > > > > > > >       - Histogram of flush batch size
> > > > > > > >       - Counts of flush trigger method (time vs max batch size)
> > > > > > >
> > > > > > > Should be able to add these.
> > > > > > >
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > > > Roger
> > > > > > > >
> > > > > > > > On Sun, Sep 10, 2017 at 8:45 AM, Randall Hauch <rha...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Thanks, Gwen.
> > > > > > > > >
> > > > > > > > > That's a great idea, so I've changed the KIP to add those metrics.
> > > > > > > > > I've also made a few other changes:
> > > > > > > > >
> > > > > > > > > 1. The context of all metrics is limited to the activity within the worker. This wasn't clear before, so I changed the motivation and metric descriptions to explicitly state this.
> > > > > > > > > 2. Added the worker ID to all MBean attributes, which hopefully makes this scope obvious from within JMX or other metric reporting systems. This is also similar to how the Kafka producer and consumer metrics include the client ID in their MBean attributes. Hopefully this does not negatively impact or complicate how external reporting systems aggregate metrics from multiple workers.
> > > > > > > > > 3. Stated explicitly that aggregating metrics across workers was out of scope of this KIP.
> > > > > > > > > 4. Added metrics to report the connector class and version for both sink and source connectors.
> > > > > > > > >
> > > > > > > > > Check this KIP's history for details of these changes.
> > > > > > > > >
> > > > > > > > > Please let me know if you have any other suggestions. I hope to start the voting soon!
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > >
> > > > > > > > > Randall
> > > > > > > > >
> > > > > > > > > On Thu, Sep 7, 2017 at 9:35 PM, Gwen Shapira <g...@confluent.io> wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for the KIP, Randall. Those are badly needed!
> > > > > > > > > >
> > > > > > > > > > Can we have two metrics with record rate per task? One before SMT and one after? We can have cases where we read 5000 rows from JDBC but write 5 to Kafka, or read 5000 records from Kafka and write 5 due to filtering. I think it's important to know both numbers.
> > > > > > > > > >
> > > > > > > > > > Gwen
> > > > > > > > > >
> > > > > > > > > > On Thu, Sep 7, 2017 at 7:50 PM, Randall Hauch <rha...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi everyone.
> > > > > > > > > > >
> > > > > > > > > > > I've created a new KIP to add metrics to the Kafka Connect framework: https://cwiki.apache.org/confluence/display/KAFKA/KIP-196%3A+Add+metrics+to+Kafka+Connect+framework
> > > > > > > > > > >
> > > > > > > > > > > The KIP approval deadline is looming, so if you're interested in Kafka Connect metrics please review and provide feedback as soon as possible. I'm interested not only in whether the metrics are sufficient and appropriate, but also in whether the MBean naming conventions are okay.
>> > > >> > > > > > >> > > >> > > > > > Best regards, >> > > >> > > > > > >> > > >> > > > > > Randall >> > > >> > > > > > >> > > >> > > > > >> > > >> > > > > >> > > >> > > > > >> > > >> > > > > -- >> > > >> > > > > *Gwen Shapira* >> > > >> > > > > Product Manager | Confluent >> > > >> > > > > 650.450.2760 | @gwenshap >> > > >> > > > > Follow us: Twitter <https://twitter.com/ConfluentInc> | >> blog >> > > >> > > > > <http://www.confluent.io/blog> >> > > >> > > > > >> > > >> > > > >> > > >> > > >> > > >> > >> > > >> >> > > > >> > > > >> > > >> > >> > >