Randall,

A couple of questions:

* Some metrics don't seem to have unique names? e.g.
source-record-produce-rate and source-record-produce-total seem like they
are duplicated. Looks like maybe just an oversight that the second ones
should be changed from "produce" to "write".
* I think there's a stray extra character in a couple of
places: 
kafka.connect:type=source-task-metrics,name=source-record-produce-total,worker=([-.\w]+)l,connector=([-.\w]+),task=([\d]+)
has an extra char after the worker name.
* Are the produce totals actually useful given rebalancing would cancel
them out anyway? Doesn't seem like you could do much with them.
* Why do transformations get their own metric but not converters? And are
we concerned at all about the performance impact of getting such fine
grained info? Getting current time isn't free and we've seen before that we
ended up w/ accidental performance regressions as we tried to check it too
frequently to enforce timeouts fine grained in the producer (iirc).
Batching helps w/ this, but on the consumer side, a max.poll.records=1
setting could put you in a bad place, especially since transforms might be
very lightweight (or nothing) and converters are expected to be relatively
cheap as well.
* If we include the worker id everywhere and don't have metrics without
that included, isn't that a pain for users that dump this data into some
other system? They have to know which worker the connector/task is
currently on *or* need to do extra work to merge the metrics from across
machines. Including versions with the worker ID can make sense for
completeness and accuracy (e.g. technically there are still very slim risks
of having a task running twice due to zombies), but it seems like bad
usability for the common case.
* Is aggregating things like source record rate at the (worker, connector)
level really useful since you're just going to need to do additional
aggregation anyway once you've collected metrics across all workers? I'd
rather add a smaller number of metrics w/ clear use cases than just try to
be exhaustive and then have to maintain stuff that nobody actually uses.
* You have status for connectors but not for tasks. Any reason why? Seems
like it'd make sense to expose both, especially since users generally care
about task status more than connector status (not many connectors actually
run a monitoring thread.)
* Is number of tasks for each connector a useful metric? Not sure whether
someone would find this useful or not. Probably not for alerts, but might
be useful to be able to check it via your metrics dashboard.
* Same questions re: granularity of sink tasks/connectors timing and
whether the connectors need all the roll-ups of individual (worker, task)
values to (worker, connector) level.
* If we expose the who the worker currently thinks is leader, it might also
make sense to expose the underlying epoch. Not actually sure if we expose
that for the consumer today, but it's an indicator of who is properly up to
date.
* Why worker-level offset commit stats? It's not clear to me that these are
useful without considering the specific connector.

-Ewen


On Mon, Sep 11, 2017 at 9:43 AM, Randall Hauch <rha...@gmail.com> wrote:

> Thanks for reviewing. Responses inline below.
>
> On Mon, Sep 11, 2017 at 11:22 AM, Roger Hoover <roger.hoo...@gmail.com>
> wrote:
>
> > Randall,
> >
> > Thank you for the KIP.  This should improve visibility greatly.  I had a
> > few questions/ideas for more metrics.
> >
> >
> >    1. What's the relationship between the worker state and the connector
> >    status?  Does the 'paused' status at the Connector level include the
> > time
> >    that worker is 'rebalancing'?
> >
>
> The worker state metric simply reports whether the worker is running or
> rebalancing. This state is independent of how many connectors are
> deployed/running/paused. During a rebalance, the connectors are being
> stopped and restarted but are effectively not running.
>
>
> >    2. Are the "Source Connector" metrics like record rate an aggregation
> of
> >    the "Source Task" metrics?
> >
>
> Yes.
>
>
> >       - How much value is there is monitoring at the "Source Connector"
> >       level (other than status) if the number of constituent tasks may
> > change
> >       over time?
> >
>
> The task metrics allow you to know whether the tasks are evenly loaded and
> each making progress. The aggregate connector metrics tell you how much
> work has been performed by all the tasks in that worker. Both are useful
> IMO.
>
>
> >       - I'm imagining that it's most useful to collect metrics at the
> task
> >       level as the task-level metrics should be stable regardless of
> tasks
> >       shifting to different workers
> >
>
> Correct, this is where the most value is because it is the most fine
> grained.
>
>
> >       - If so, can we duplicate the Connector Status down at the task
> level
> >          so that all important metrics can be tracked by task?
> >
>
> Possibly. The challenge is that the threads running the tasks are blocked
> when a connector is paused.
>
>
> >          3. For the Sink Task metrics
> >       - Can we add offset lag and timestamp lag on commit?
> >          - After records are flushed/committed
> >             - what is the diff between the record timestamps and commit
> >             time (histogram)?  this is a measure of end-to-end pipeline
> > latency
> >             - what is the diff between record offsets and latest offset
> of
> >             their partition at commit time (histogram)? this is a
> > measure of whether
> >             this particular task is keeping up
> >
>
> Yeah, possibly. Will have to compare with the consumer metrics to see what
> we can get.
>
>
> >          - How about flush error rate?  Assuming the sink connectors are
> >       using retries, it would be helpful to know how many errors they're
> > seeing
> >
>
> We could add a metric to track how many times the framework receives a
> retry exception and then retries, but the connectors may also do this on
> their own.
>
>
> >       - Can we tell at the framework level how many records were inserted
> >       vs updated vs deleted?
> >
>
> No, there's no distinction in the Connect framework.
>
>
> >       - Batching stats
> >          - Histogram of flush batch size
> >          - Counts of flush trigger method (time vs max batch size)
> >
>
> Should be able to add these.
>
>
> >
> > Cheers,
> >
> > Roger
> >
> > On Sun, Sep 10, 2017 at 8:45 AM, Randall Hauch <rha...@gmail.com> wrote:
> >
> > > Thanks, Gwen.
> > >
> > > That's a great idea, so I've changed the KIP to add those metrics. I've
> > > also made a few other changes:
> > >
> > >
> > >    1. The context of all metrics is limited to the activity within the
> > >    worker. This wasn't clear before, so I changed the motivation and
> > metric
> > >    descriptions to explicitly state this.
> > >    2. Added the worker ID to all MBean attributes. In addition to
> > hopefully
> > >    making this same scope obvious from within JMX or other metric
> > reporting
> > >    system. This is also similar to how the Kafka producer and consumer
> > > metrics
> > >    include the client ID in their MBean attributes. Hopefully this does
> > not
> > >    negatively impact or complicate how external reporting systems'
> > > aggregate
> > >    metrics from multiple workers.
> > >    3. Stated explicitly that aggregating metrics across workers was out
> > of
> > >    scope of this KIP.
> > >    4. Added metrics to report the connector class and version for both
> > sink
> > >    and source connectors.
> > >
> > > Check this KIP's history for details of these changes.
> > >
> > > Please let me know if you have any other suggestions. I hope to start
> the
> > > voting soon!
> > >
> > > Best regards,
> > >
> > > Randall
> > >
> > > On Thu, Sep 7, 2017 at 9:35 PM, Gwen Shapira <g...@confluent.io>
> wrote:
> > >
> > > > Thanks for the KIP, Randall. Those are badly needed!
> > > >
> > > > Can we have two metrics with record rate per task? One before SMT and
> > one
> > > > after?
> > > > We can have cases where we read 5000 rows from JDBC but write 5 to
> > Kafka,
> > > > or read 5000 records from Kafka and write 5 due to filtering. I think
> > its
> > > > important to know both numbers.
> > > >
> > > >
> > > > Gwen
> > > >
> > > > On Thu, Sep 7, 2017 at 7:50 PM, Randall Hauch <rha...@gmail.com>
> > wrote:
> > > >
> > > > > Hi everyone.
> > > > >
> > > > > I've created a new KIP to add metrics to the Kafka Connect
> framework:
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > > 196%3A+Add+metrics+to+Kafka+Connect+framework
> > > > >
> > > > > The KIP approval deadline is looming, so if you're interested in
> > Kafka
> > > > > Connect metrics please review and provide feedback as soon as
> > possible.
> > > > I'm
> > > > > interested not only in whether the metrics are sufficient and
> > > > appropriate,
> > > > > but also in whether the MBean naming conventions are okay.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Randall
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > *Gwen Shapira*
> > > > Product Manager | Confluent
> > > > 650.450.2760 | @gwenshap
> > > > Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
> > > > <http://www.confluent.io/blog>
> > > >
> > >
> >
>

Reply via email to