On Mon, Sep 11, 2017 at 4:50 PM, Randall Hauch <rha...@gmail.com> wrote:
> Thanks, Ewen. Comments inline below.
>
> On Mon, Sep 11, 2017 at 5:46 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
>
> > Randall,
> >
> > A couple of questions:
> >
> > * Some metrics don't seem to have unique names? e.g. source-record-produce-rate and source-record-produce-total seem like they are duplicated. Looks like maybe just an oversight that the second ones should be changed from "produce" to "write".
>
> Nice catch. You are correct - should be "write" instead of "produce". I will correct.
>
> > * I think there's a stray extra character in a couple of places: kafka.connect:type=source-task-metrics,name=source-record-produce-total,worker=([-.\w]+)l,connector=([-.\w]+),task=([\d]+) has an extra char after the worker name.
>
> Thanks. Removed in 2 places.
>
> > * Are the produce totals actually useful given rebalancing would cancel them out anyway? Doesn't seem like you could do much with them.
>
> Yes, the totals would be since the last rebalance. Maybe that isn't that useful. Might be better to capture the offsets and lag as Roger was suggesting. Thoughts?

Sounds like you've since removed it. re: offsets/lag, is this something that makes sense for source connectors? I don't think I quite understand the suggestion.

> > * Why do transformations get their own metric but not converters? And are we concerned at all about the performance impact of getting such fine-grained info? Getting the current time isn't free, and we've seen before that we ended up w/ accidental performance regressions as we tried to check it too frequently to enforce fine-grained timeouts in the producer (iirc). Batching helps w/ this, but on the consumer side, a max.poll.records=1 setting could put you in a bad place, especially since transforms might be very lightweight (or nothing) and converters are expected to be relatively cheap as well.
>
> We could remove the read, transform, and put time-based metrics for sink tasks, and the poll, transform, and write time-based metrics for source tasks. Can/should they be replaced with anything else?

We can start with any breakdown we want. I think the only tricky issue that arises is if we decide we *do* want to include more detail. Then the metrics become a bit confusing (or the naming is really important) because there is some overlap in the metrics (e.g. if we converted transformation/conversion/send to "processing" time, but then later broke them down into those components). But I think that's easily resolved with a nice little image showing the finest-granularity components and the aggregate components as labels over their constituent parts.

> > * If we include the worker id everywhere and don't have metrics without that included, isn't that a pain for users that dump this data into some other system? They have to know which worker the connector/task is currently on *or* need to do extra work to merge the metrics from across machines. Including versions with the worker ID can make sense for completeness and accuracy (e.g. technically there are still very slim risks of having a task running twice due to zombies), but it seems like bad usability for the common case.
>
> Part of the reason was also to help identify where each of the metrics came from, but per the next comment this may not be as useful, either. So remove the worker ID in all the task and connector metric names? What about the worker metrics?
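For illustration, here is a minimal sketch of the "extra work to merge the metrics from across machines" that having the worker id in every MBean name implies: a consumer of these metrics has to wildcard the worker property and then aggregate the matches itself. The ObjectName pattern follows the draft naming in the KIP, and "my-connector" is just a placeholder connector name.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ConnectMBeanQuerySketch {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Wildcard the worker property because the caller doesn't know (or care)
        // which worker currently runs each task of "my-connector".
        ObjectName pattern = new ObjectName(
                "kafka.connect:type=source-task-metrics,name=source-record-write-total,"
                        + "worker=*,connector=my-connector,task=*");
        for (ObjectName mbean : server.queryNames(pattern, null)) {
            // One match per (worker, task); the values still need to be summed externally.
            System.out.println(mbean);
        }
    }
}
```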
Does worker id work as a tag to keep it isolated w/o including it in the name? tbh, JMX-style metric naming always seemed overly complicated and messy, and I've never fully grokked the details of Kafka metrics as a result. (To me, this is also a huge usability issue w/ Kafka as it stands today -- Kafka metrics don't seem to integrate naturally with almost anything that doesn't assume JMX/Java-style metrics.)

> > * Is aggregating things like source record rate at the (worker, connector) level really useful since you're just going to need to do additional aggregation anyway once you've collected metrics across all workers? I'd rather add a smaller number of metrics w/ clear use cases than just try to be exhaustive and then have to maintain stuff that nobody actually uses.
>
> Yes, the connector aggregate metrics are maybe not as useful if you also have to aggregate them from different workers. Removing them probably also reduces the risk of them being misinterpreted.

I swear I was not at all motivated by reducing the size of the patch that will need to be reviewed for this KIP :)

> > * You have status for connectors but not for tasks. Any reason why? Seems like it'd make sense to expose both, especially since users generally care about task status more than connector status (not many connectors actually run a monitoring thread).
>
> Ack.
>
> > * Is number of tasks for each connector a useful metric? Not sure whether someone would find this useful or not. Probably not for alerts, but might be useful to be able to check it via your metrics dashboard.
>
> Seems like it might be useful, at least in terms of tracking the number of tasks over time. Might not be as useful for connectors that have relatively static tasks, but it would be more interesting/useful for connectors that create tasks dynamically and periodically request task reconfigurations.

Yeah, most will be static. It mostly seems useful a) just for historical reference, b) possibly to alert on for crazy configs (e.g. tasks.max=1000000 from a newbie user), and c) possibly to alert on changes in general, just so you know that connectors are changing.

> > * Same questions re: granularity of sink task/connector timing and whether the connectors need all the roll-ups of individual (worker, task) values to the (worker, connector) level.
>
> I'm fine with taking out the aggregates to keep things simple and prevent misunderstanding.
>
> > * If we expose who the worker currently thinks is the leader, it might also make sense to expose the underlying epoch. Not actually sure if we expose that for the consumer today, but it's an indicator of who is properly up to date.
>
> Ack.
>
> > * Why worker-level offset commit stats? It's not clear to me that these are useful without considering the specific connector.
>
> So would they make more sense on the tasks? Again, on the worker they're aggregates.

Yeah, task level could make sense. I was just trying to think through how I would put each of the metrics to use, and I wasn't sure what I could infer from the worker-level commit stats since sink connectors vary a lot in how they process commits.

-Ewen

> > -Ewen
> >
> > On Mon, Sep 11, 2017 at 9:43 AM, Randall Hauch <rha...@gmail.com> wrote:
> >
> > > Thanks for reviewing. Responses inline below.
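For the "worker id as a tag" idea above, a minimal sketch of what that could look like with Kafka's own metrics library and its JmxReporter: the id lands in the MBean's key properties rather than in the metric name itself, so external systems can keep or drop it when they aggregate. The group/metric names, tag keys, and sample values are placeholders borrowed from this thread, not the final KIP definitions.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.common.metrics.JmxReporter;
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Rate;
import org.apache.kafka.common.metrics.stats.Total;

public class SourceTaskMetricsSketch {
    public static void main(String[] args) {
        Metrics metrics = new Metrics();
        metrics.addReporter(new JmxReporter("kafka.connect"));

        // Worker id, connector name, and task id become ObjectName key properties
        // (tags) instead of being baked into the metric name.
        Map<String, String> tags = new HashMap<>();
        tags.put("worker", "worker-1:8083");
        tags.put("connector", "my-connector");
        tags.put("task", "0");

        Sensor sensor = metrics.sensor("source-record-write");
        sensor.add(metrics.metricName("source-record-write-rate",
                "source-task-metrics", "Records written per second", tags), new Rate());
        sensor.add(metrics.metricName("source-record-write-total",
                "source-task-metrics", "Records written since the last rebalance", tags), new Total());

        sensor.record(100);  // e.g. a batch of 100 records was written
        metrics.close();
    }
}
```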
> > >
> > > On Mon, Sep 11, 2017 at 11:22 AM, Roger Hoover <roger.hoo...@gmail.com> wrote:
> > >
> > > > Randall,
> > > >
> > > > Thank you for the KIP. This should improve visibility greatly. I had a few questions/ideas for more metrics.
> > > >
> > > > 1. What's the relationship between the worker state and the connector status? Does the 'paused' status at the Connector level include the time that the worker is 'rebalancing'?
> > >
> > > The worker state metric simply reports whether the worker is running or rebalancing. This state is independent of how many connectors are deployed/running/paused. During a rebalance, the connectors are being stopped and restarted but are effectively not running.
> > >
> > > > 2. Are the "Source Connector" metrics like record rate an aggregation of the "Source Task" metrics?
> > >
> > > Yes.
> > >
> > > >    - How much value is there in monitoring at the "Source Connector" level (other than status) if the number of constituent tasks may change over time?
> > >
> > > The task metrics allow you to know whether the tasks are evenly loaded and each making progress. The aggregate connector metrics tell you how much work has been performed by all the tasks in that worker. Both are useful IMO.
> > >
> > > >    - I'm imagining that it's most useful to collect metrics at the task level, as the task-level metrics should be stable regardless of tasks shifting to different workers
> > >
> > > Correct, this is where the most value is because it is the most fine grained.
> > >
> > > >    - If so, can we duplicate the Connector Status down at the task level so that all important metrics can be tracked by task?
> > >
> > > Possibly. The challenge is that the threads running the tasks are blocked when a connector is paused.
> > >
> > > > 3. For the Sink Task metrics
> > > >    - Can we add offset lag and timestamp lag on commit?
> > > >      - After records are flushed/committed
> > > >        - what is the diff between the record timestamps and commit time (histogram)? this is a measure of end-to-end pipeline latency
> > > >        - what is the diff between record offsets and the latest offset of their partition at commit time (histogram)? this is a measure of whether this particular task is keeping up
> > >
> > > Yeah, possibly. Will have to compare with the consumer metrics to see what we can get.
> > >
> > > >    - How about flush error rate? Assuming the sink connectors are using retries, it would be helpful to know how many errors they're seeing
> > >
> > > We could add a metric to track how many times the framework receives a retry exception and then retries, but the connectors may also do this on their own.
> > >
> > > >    - Can we tell at the framework level how many records were inserted vs updated vs deleted?
> > >
> > > No, there's no distinction in the Connect framework.
> > >
> > > >    - Batching stats
> > > >      - Histogram of flush batch size
> > > >      - Counts of flush trigger method (time vs max batch size)
> > >
> > > Should be able to add these.
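To make the offset/timestamp lag idea above concrete, here is a rough sketch of how a sink task's commit path could derive both lags. The class, method, and map names are hypothetical; the end offsets and last record timestamps would have to come from the consumer and the delivered SinkRecords, respectively.

```java
import java.util.Map;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

/**
 * Illustrative only: deriving per-partition lags at commit time.
 * "endOffsets" would come from the consumer's endOffsets() call and
 * "lastRecordTimestamps" from the SinkRecords most recently passed to put().
 */
public class SinkLagSketch {

    public static void recordCommitLags(Map<TopicPartition, OffsetAndMetadata> committed,
                                        Map<TopicPartition, Long> endOffsets,
                                        Map<TopicPartition, Long> lastRecordTimestamps) {
        long now = System.currentTimeMillis();
        for (Map.Entry<TopicPartition, OffsetAndMetadata> entry : committed.entrySet()) {
            TopicPartition tp = entry.getKey();

            // Offset lag: distance between the committed offset and the log end offset,
            // i.e. whether this particular task is keeping up.
            Long endOffset = endOffsets.get(tp);
            if (endOffset != null) {
                long offsetLag = endOffset - entry.getValue().offset();
                System.out.println(tp + " offset lag: " + offsetLag);
            }

            // Timestamp lag: wall-clock time at commit minus the record timestamp,
            // a rough end-to-end pipeline latency measure.
            Long recordTimestamp = lastRecordTimestamps.get(tp);
            if (recordTimestamp != null) {
                long timestampLagMs = now - recordTimestamp;
                System.out.println(tp + " timestamp lag ms: " + timestampLagMs);
            }
        }
    }
}
```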
> > > >
> > > > Cheers,
> > > >
> > > > Roger
> > > >
> > > > On Sun, Sep 10, 2017 at 8:45 AM, Randall Hauch <rha...@gmail.com> wrote:
> > > >
> > > > > Thanks, Gwen.
> > > > >
> > > > > That's a great idea, so I've changed the KIP to add those metrics. I've also made a few other changes:
> > > > >
> > > > > 1. The context of all metrics is limited to the activity within the worker. This wasn't clear before, so I changed the motivation and metric descriptions to explicitly state this.
> > > > > 2. Added the worker ID to all MBean attributes, which should hopefully make this scope obvious from within JMX or any other metric reporting system. This is also similar to how the Kafka producer and consumer metrics include the client ID in their MBean attributes. Hopefully this does not negatively impact or complicate how external reporting systems aggregate metrics from multiple workers.
> > > > > 3. Stated explicitly that aggregating metrics across workers is out of scope of this KIP.
> > > > > 4. Added metrics to report the connector class and version for both sink and source connectors.
> > > > >
> > > > > Check this KIP's history for details of these changes.
> > > > >
> > > > > Please let me know if you have any other suggestions. I hope to start the voting soon!
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Randall
> > > > >
> > > > > On Thu, Sep 7, 2017 at 9:35 PM, Gwen Shapira <g...@confluent.io> wrote:
> > > > >
> > > > > > Thanks for the KIP, Randall. Those are badly needed!
> > > > > >
> > > > > > Can we have two metrics with record rate per task? One before SMT and one after? We can have cases where we read 5000 rows from JDBC but write 5 to Kafka, or read 5000 records from Kafka and write 5 due to filtering. I think it's important to know both numbers.
> > > > > >
> > > > > > Gwen
> > > > > >
> > > > > > On Thu, Sep 7, 2017 at 7:50 PM, Randall Hauch <rha...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi everyone.
> > > > > > >
> > > > > > > I've created a new KIP to add metrics to the Kafka Connect framework:
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-196%3A+Add+metrics+to+Kafka+Connect+framework
> > > > > > >
> > > > > > > The KIP approval deadline is looming, so if you're interested in Kafka Connect metrics please review and provide feedback as soon as possible. I'm interested not only in whether the metrics are sufficient and appropriate, but also in whether the MBean naming conventions are okay.
> > > > > > >
> > > > > > > Best regards,
> > > > > > >
> > > > > > > Randall
> > > > > >
> > > > > > --
> > > > > > *Gwen Shapira*
> > > > > > Product Manager | Confluent
> > > > > > 650.450.2760 | @gwenshap
> > > > > > Follow us: Twitter <https://twitter.com/ConfluentInc> | blog <http://www.confluent.io/blog>
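As a footnote to Gwen's before/after-SMT suggestion quoted above, a minimal sketch of how a task could record both rates around its transformation chain. The sensor and metric names are placeholders, not what the KIP ultimately defines, and the chain-application logic is only an approximation of what the framework does.

```java
import java.util.List;
import java.util.stream.Collectors;

import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Rate;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.transforms.Transformation;

/** Illustrative only: per-task record rates before and after SMTs. */
public class TransformRateSketch {
    private final Sensor beforeTransform;
    private final Sensor afterTransform;
    private final List<Transformation<SourceRecord>> transformations;

    public TransformRateSketch(Metrics metrics, List<Transformation<SourceRecord>> transformations) {
        this.transformations = transformations;
        this.beforeTransform = metrics.sensor("source-record-poll");
        this.beforeTransform.add(
                metrics.metricName("source-record-poll-rate", "source-task-metrics"), new Rate());
        this.afterTransform = metrics.sensor("source-record-write");
        this.afterTransform.add(
                metrics.metricName("source-record-write-rate", "source-task-metrics"), new Rate());
    }

    public List<SourceRecord> applyTransformations(List<SourceRecord> polled) {
        beforeTransform.record(polled.size());   // rate before SMTs are applied
        List<SourceRecord> result = polled.stream()
                .map(this::applyChain)
                .filter(r -> r != null)           // SMTs may filter records out entirely
                .collect(Collectors.toList());
        afterTransform.record(result.size());     // rate after SMTs
        return result;
    }

    private SourceRecord applyChain(SourceRecord record) {
        for (Transformation<SourceRecord> t : transformations) {
            if (record == null) break;
            record = t.apply(record);
        }
        return record;
    }
}
```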