Requests to the broker are generally substantial batches of data; you don't
get the same guarantee for Connect's processing batches, both because source
connectors can hand you batches of whatever size they want and because the
consumer's max.poll.records can be overridden.

Both SMTs and converters are a concern because each can be cheap enough that
just reading the clock in between them could dwarf the cost of applying
them.

Another thought re: rebalance metrics: we already get some info via
AbstractCoordinator, and those metrics actually provide a bit more detail in
some ways (e.g. join & sync vs. the entire rebalance). I'm not sure whether
we want to effectively duplicate some of that info so it can all be found
under Connect names, or rely on the existing metrics for some of these.

-Ewen

On Tue, Sep 12, 2017 at 2:05 PM, Roger Hoover <roger.hoo...@gmail.com>
wrote:

> Ewen,
>
> I don't know the details of the perf concern.  How is it that the Kafka
> broker can keep latency stats per request without suffering too much
> performance?  Maybe SMTs are the only concern b/c they are per-message.  If
> so, let's remove those and keep timing info for everything else like
> flushes, which are batch-based.
>
>
> On Tue, Sep 12, 2017 at 1:32 PM, Ewen Cheslack-Postava <e...@confluent.io>
> wrote:
>
> > On Tue, Sep 12, 2017 at 10:55 AM, Gwen Shapira <g...@confluent.io>
> wrote:
> >
> > > Ewen, you gave a nice talk at Kafka Summit where you warned about the
> > > danger of SMTs that slow down the data pipe. If we don't provide the
> > > time metrics, how will users know when their SMTs are causing
> > > performance issues?
> > >
> >
> > Metrics aren't the only way to gain insight into performance, and
> > always-on measurement, even when the data isn't actually being used, may
> > not make sense. SMT authors are much better off starting out with a JMH
> > or similar benchmark. What I was referring to in the talk is more about
> > understanding that the processing for SMTs is entirely synchronous, which
> > means certain classes of operations will just generally be a bad idea,
> > e.g. anything that goes out over the network to another service. You
> > don't even really need performance info to determine that that type of
> > transformation will cause problems.
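> >
> > As a concrete (if minimal) sketch, a JMH harness for an SMT could look
> > like the following, where MyTransform is a hypothetical custom SMT, not
> > something from the framework:
> >
> >     import java.util.Collections;
> >
> >     import org.apache.kafka.connect.data.Schema;
> >     import org.apache.kafka.connect.source.SourceRecord;
> >     import org.apache.kafka.connect.transforms.Transformation;
> >     import org.openjdk.jmh.annotations.Benchmark;
> >     import org.openjdk.jmh.annotations.Scope;
> >     import org.openjdk.jmh.annotations.Setup;
> >     import org.openjdk.jmh.annotations.State;
> >
> >     @State(Scope.Thread)
> >     public class SmtBenchmark {
> >         private Transformation<SourceRecord> transform;
> >         private SourceRecord record;
> >
> >         @Setup
> >         public void setup() {
> >             // Hypothetical SMT under test.
> >             transform = new MyTransform<>();
> >             transform.configure(Collections.emptyMap());
> >             record = new SourceRecord(null, null, "topic", 0,
> >                     Schema.STRING_SCHEMA, "key",
> >                     Schema.STRING_SCHEMA, "value");
> >         }
> >
> >         // Measures just the synchronous transform call, in isolation.
> >         @Benchmark
> >         public SourceRecord apply() {
> >             return transform.apply(record);
> >         }
> >     }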
> >
> > But my point wasn't that timing info isn't useful. It's that we know
> > getting timestamps is pretty expensive, and we'll already be doing so
> > elsewhere (e.g. if a source record doesn't include a timestamp). For some
> > use cases, such as ByteArrayConverter + no SMTs + lightweight processing
> > (e.g. the record just gets handed to a background thread that deals with
> > sending the data), it wouldn't be out of the question that adding 4 or so
> > more calls to get timestamps could become a bottleneck. I don't know that
> > it would, but we have definitely seen the issue come up before, so I
> > would be conservative about adding the metrics unless we had some numbers
> > showing it doesn't matter, or doesn't matter much.
> >
> > In general, I don't think metrics that require always-on measurement are
> > a good way to get fine grained performance information. Instrumenting
> > different phases that imply different types of performance problems can
> > be helpful (e.g. "processing time" that should be CPU/memory throughput
> > bound vs. "send time" that, at least for many connectors, is more likely
> > to be IO bound), but if you want finer-grained details, you probably
> > either want something that can be toggled on/off temporarily or just use
> > a tool that's really designed for the job, i.e. a profiler like perf.
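> >
> > For the toggle approach, one sketch (illustrative only, not something
> > the KIP proposes) is to guard the clock reads behind a flag that can be
> > flipped at runtime, e.g. via JMX, so that disabled timing costs almost
> > nothing:
> >
> >     import java.util.function.LongConsumer;
> >
> >     public class StageTimer {
> >         // Flipped at runtime, e.g. through a JMX operation.
> >         private volatile boolean enabled = false;
> >
> >         public void setEnabled(boolean enabled) {
> >             this.enabled = enabled;
> >         }
> >
> >         // Returns a start token, or -1 when timing is disabled.
> >         public long start() {
> >             return enabled ? System.nanoTime() : -1L;
> >         }
> >
> >         // Records the elapsed time only if timing was on at start().
> >         public void stop(long startToken, LongConsumer recordNanos) {
> >             if (startToken >= 0) {
> >                 recordNanos.accept(System.nanoTime() - startToken);
> >             }
> >         }
> >     }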
> >
> > -Ewen
> >
> >
> > >
> > > Gwen
> > >
> > > On Mon, Sep 11, 2017 at 7:50 PM Ewen Cheslack-Postava <e...@confluent.io>
> > > wrote:
> > >
> > > > re: questions about additional metrics, I think we'll undoubtedly
> > > > find more that people want in practice, but as I mentioned earlier I
> > > > think it's better to add the ones we know we need and then fill out
> > > > the rest as we figure it out. So, e.g., batch size metrics sound like
> > > > they could be useful, but I'd probably wait until we have a clear use
> > > > case. It seems likely that it could be useful in diagnosing slow
> > > > connectors (e.g. the implementation just does something inefficient),
> > > > but I'm not really sure about that yet.
> > > >
> > > > -Ewen
> > > >
> > > > On Mon, Sep 11, 2017 at 7:11 PM, Randall Hauch <rha...@gmail.com>
> > wrote:
> > > >
> > > > > Based on Roger and Ewen's feedback, I removed the aggregate
> > > > > metrics as they would be difficult to make use of without extra
> > > > > work. This simplified things a great deal, and I took the
> > > > > opportunity to reorganize the groups of metrics. Also, based upon
> > > > > Ewen's concerns regarding measuring times/durations, I removed all
> > > > > time-related metrics except for the offset commits and rebalances,
> > > > > which are infrequent enough to warrant the capture of percentiles.
> > > > > Roger had asked about capturing batch size metrics for source and
> > > > > sink tasks, and offset lag metrics for sink tasks. Finally, Ewen
> > > > > pointed out that all count/total metrics are only valid since the
> > > > > most recent rebalance and are therefore less meaningful, so those
> > > > > were removed as well.
> > > > >
> > > > > On Mon, Sep 11, 2017 at 6:50 PM, Randall Hauch <rha...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Thanks, Ewen. Comments inline below.
> > > > > >
> > > > > > On Mon, Sep 11, 2017 at 5:46 PM, Ewen Cheslack-Postava <e...@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > >> Randall,
> > > > > >>
> > > > > >> A couple of questions:
> > > > > >>
> > > > > >> * Some metrics don't seem to have unique names? e.g.
> > > > > >> source-record-produce-rate and source-record-produce-total seem
> > > > > >> like they are duplicated. Looks like maybe just an oversight
> > > > > >> that the second ones should be changed from "produce" to
> > > > > >> "write".
> > > > > >>
> > > > > >
> > > > > > Nice catch. You are correct - should be "write" instead of
> > > > > > "produce". I will correct.
> > > > > >
> > > > > >
> > > > > >> * I think there's a stray extra character in a couple of places:
> > > > > >> kafka.connect:type=source-task-metrics,name=source-record-produce-total,worker=([-.\w]+)l,connector=([-.\w]+),task=([\d]+)
> > > > > >> has an extra char after the worker name.
> > > > > >>
> > > > > >
> > > > > > Thanks. Removed in 2 places.
> > > > > >
> > > > > >
> > > > > >> * Are the produce totals actually useful given rebalancing
> > > > > >> would cancel them out anyway? Doesn't seem like you could do
> > > > > >> much with them.
> > > > > >>
> > > > > >
> > > > > > Yes, the totals would be since the last rebalance. Maybe that
> > > > > > isn't that useful. Might be better to capture the offsets and
> > > > > > lag as Roger was suggesting. Thoughts?
> > > > > >
> > > > > >
> > > > > >> * Why do transformations get their own metric but not
> > > > > >> converters? And are we concerned at all about the performance
> > > > > >> impact of getting such fine grained info? Getting current time
> > > > > >> isn't free, and we've seen before that we ended up w/ accidental
> > > > > >> performance regressions as we tried to check it too frequently
> > > > > >> to enforce fine grained timeouts in the producer (iirc).
> > > > > >> Batching helps w/ this, but on the consumer side, a
> > > > > >> max.poll.records=1 setting could put you in a bad place,
> > > > > >> especially since transforms might be very lightweight (or
> > > > > >> nothing) and converters are expected to be relatively cheap as
> > > > > >> well.
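> > > > > >>
> > > > > >> For reference, the override in question is the worker-level
> > > > > >> consumer property pass-through, e.g.:
> > > > > >>
> > > > > >>     # worker config: shrinks every sink-task batch to one record
> > > > > >>     consumer.max.poll.records=1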
> > > > > >>
> > > > > >
> > > > > > We could remove the read, transform, and put time-based metrics
> > > > > > for sink tasks, and the poll, transform, and write time-based
> > > > > > metrics for source tasks. Can/should they be replaced with
> > > > > > anything else?
> > > > > >
> > > > > >
> > > > > >> * If we include the worker id everywhere and don't have metrics
> > > > > >> without that included, isn't that a pain for users that dump
> > > > > >> this data into some other system? They have to know which worker
> > > > > >> the connector/task is currently on *or* need to do extra work to
> > > > > >> merge the metrics from across machines. Including versions with
> > > > > >> the worker ID can make sense for completeness and accuracy (e.g.
> > > > > >> technically there are still very slim risks of having a task
> > > > > >> running twice due to zombies), but it seems like bad usability
> > > > > >> for the common case.
> > > > > >>
> > > > > >
> > > > > > Part of the reason was also to help identify where each of the
> > > > > > metrics came from, but per the next comment this may not be as
> > > > > > useful, either. So remove the worker ID in all the task and
> > > > > > connector metric names? What about the worker metrics?
> > > > > >
> > > > > >
> > > > > >> * Is aggregating things like source record rate at the (worker,
> > > > > >> connector) level really useful since you're just going to need
> > > > > >> to do additional aggregation anyway once you've collected
> > > > > >> metrics across all workers? I'd rather add a smaller number of
> > > > > >> metrics w/ clear use cases than just try to be exhaustive and
> > > > > >> then have to maintain stuff that nobody actually uses.
> > > > > >>
> > > > > >
> > > > > > Yes, the connector aggregate metrics are maybe not as useful if
> > > > > > you also have to aggregate them from different workers. Removing
> > > > > > them probably also reduces the risk of them being misinterpreted.
> > > > > >
> > > > > >
> > > > > >> * You have status for connectors but not for tasks. Any reason
> > > > > >> why? Seems like it'd make sense to expose both, especially since
> > > > > >> users generally care about task status more than connector
> > > > > >> status (not many connectors actually run a monitoring thread.)
> > > > > >>
> > > > > >
> > > > > > Ack.
> > > > > >
> > > > > >
> > > > > >> * Is number of tasks for each connector a useful metric? Not
> > > > > >> sure whether someone would find this useful or not. Probably not
> > > > > >> for alerts, but might be useful to be able to check it via your
> > > > > >> metrics dashboard.
> > > > > >>
> > > > > >
> > > > > > Seems like it might be useful, at least in terms of tracking the
> > > > > > number of tasks over time. Might not be as useful for connectors
> > > > > > that have relatively static tasks, but it would be more
> > > > > > interesting/useful for connectors that create tasks dynamically
> > > > > > and periodically request task reconfigurations.
> > > > > >
> > > > > >
> > > > > >> * Same questions re: granularity of sink tasks/connectors
> > > > > >> timing and whether the connectors need all the roll-ups of
> > > > > >> individual (worker, task) values to (worker, connector) level.
> > > > > >>
> > > > > >
> > > > > > I'm fine with taking out the aggregates to keep things simple
> > > > > > and prevent misunderstanding.
> > > > > >
> > > > > >
> > > > > >> * If we expose who the worker currently thinks is leader, it
> > > > > >> might also make sense to expose the underlying epoch. Not
> > > > > >> actually sure if we expose that for the consumer today, but it's
> > > > > >> an indicator of who is properly up to date.
> > > > > >>
> > > > > >
> > > > > > Ack.
> > > > > >
> > > > > >
> > > > > >> * Why worker-level offset commit stats? It's not clear to me
> > > > > >> that these are useful without considering the specific
> > > > > >> connector.
> > > > > >>
> > > > > >
> > > > > > So would they make more sense on the tasks? Again, on the worker
> > > > > > they're aggregates.
> > > > > >
> > > > > >
> > > > > >>
> > > > > >> -Ewen
> > > > > >>
> > > > > >>
> > > > > >> On Mon, Sep 11, 2017 at 9:43 AM, Randall Hauch <rha...@gmail.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Thanks for reviewing. Responses inline below.
> > > > > >> >
> > > > > >> > On Mon, Sep 11, 2017 at 11:22 AM, Roger Hoover <roger.hoo...@gmail.com>
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> > > Randall,
> > > > > >> > >
> > > > > >> > > Thank you for the KIP.  This should improve visibility
> > > > > >> > > greatly.  I had a few questions/ideas for more metrics.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >    1. What's the relationship between the worker state and
> > > > > >> > >    the connector status?  Does the 'paused' status at the
> > > > > >> > >    Connector level include the time that worker is
> > > > > >> > >    'rebalancing'?
> > > > > >> > >
> > > > > >> >
> > > > > >> > The worker state metric simply reports whether the worker is
> > > > > >> > running or rebalancing. This state is independent of how many
> > > > > >> > connectors are deployed/running/paused. During a rebalance,
> > > > > >> > the connectors are being stopped and restarted but are
> > > > > >> > effectively not running.
> > > > > >> >
> > > > > >> >
> > > > > >> > >    2. Are the "Source Connector" metrics like record rate
> > > > > >> > >    an aggregation of the "Source Task" metrics?
> > > > > >> > >
> > > > > >> >
> > > > > >> > Yes.
> > > > > >> >
> > > > > >> >
> > > > > >> > >       - How much value is there in monitoring at the
> > > > > >> > >       "Source Connector" level (other than status) if the
> > > > > >> > >       number of constituent tasks may change over time?
> > > > > >> > >
> > > > > >> >
> > > > > >> > The task metrics allow you to know whether the tasks are
> > > > > >> > evenly loaded and each making progress. The aggregate
> > > > > >> > connector metrics tell you how much work has been performed by
> > > > > >> > all the tasks in that worker. Both are useful IMO.
> > > > > >> >
> > > > > >> >
> > > > > >> > >       - I'm imagining that it's most useful to collect
> > > > > >> > >       metrics at the task level as the task-level metrics
> > > > > >> > >       should be stable regardless of tasks shifting to
> > > > > >> > >       different workers
> > > > > >> > >
> > > > > >> >
> > > > > >> > Correct, this is where the most value is because it is the
> > > > > >> > most fine grained.
> > > > > >> >
> > > > > >> >
> > > > > >> > >       - If so, can we duplicate the Connector Status down
> > > > > >> > >          at the task level so that all important metrics
> > > > > >> > >          can be tracked by task?
> > > > > >> > >
> > > > > >> >
> > > > > >> > Possibly. The challenge is that the threads running the tasks
> > > > > >> > are blocked when a connector is paused.
> > > > > >> >
> > > > > >> >
> > > > > >> > >          3. For the Sink Task metrics
> > > > > >> > >       - Can we add offset lag and timestamp lag on commit?
> > > > > >> > >          - After records are flushed/committed
> > > > > >> > >             - what is the diff between the record timestamps
> > > > > >> > >             and commit time (histogram)?  this is a measure
> > > > > >> > >             of end-to-end pipeline latency
> > > > > >> > >             - what is the diff between record offsets and
> > > > > >> > >             latest offset of their partition at commit time
> > > > > >> > >             (histogram)? this is a measure of whether this
> > > > > >> > >             particular task is keeping up
> > > > > >> > >
> > > > > >> >
> > > > > >> > Yeah, possibly. Will have to compare with the consumer
> > > > > >> > metrics to see what we can get.
> > > > > >> >
> > > > > >> >
> > > > > >> > >          - How about flush error rate?  Assuming the sink
> > > > > >> > >          connectors are using retries, it would be helpful
> > > > > >> > >          to know how many errors they're seeing
> > > > > >> > >
> > > > > >> >
> > > > > >> > We could add a metric to track how many times the framework
> > > > > >> > receives a retry exception and then retries, but the
> > > > > >> > connectors may also do this on their own.
> > > > > >> >
> > > > > >> >
> > > > > >> > >       - Can we tell at the framework level how many records
> > > > > >> > >       were inserted vs updated vs deleted?
> > > > > >> > >
> > > > > >> >
> > > > > >> > No, there's no distinction in the Connect framework.
> > > > > >> >
> > > > > >> >
> > > > > >> > >       - Batching stats
> > > > > >> > >          - Histogram of flush batch size
> > > > > >> > >          - Counts of flush trigger method (time vs max
> > > > > >> > >          batch size)
> > > > > >> > >
> > > > > >> >
> > > > > >> > Should be able to add these.
> > > > > >> >
> > > > > >> >
> > > > > >> > >
> > > > > >> > > Cheers,
> > > > > >> > >
> > > > > >> > > Roger
> > > > > >> > >
> > > > > >> > > On Sun, Sep 10, 2017 at 8:45 AM, Randall Hauch <rha...@gmail.com>
> > > > > >> > > wrote:
> > > > > >> > >
> > > > > >> > > > Thanks, Gwen.
> > > > > >> > > >
> > > > > >> > > > That's a great idea, so I've changed the KIP to add those
> > > > > >> > > > metrics. I've also made a few other changes:
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > >    1. The context of all metrics is limited to the
> > > > > >> > > >    activity within the worker. This wasn't clear before,
> > > > > >> > > >    so I changed the motivation and metric descriptions to
> > > > > >> > > >    explicitly state this.
> > > > > >> > > >    2. Added the worker ID to all MBean attributes, which
> > > > > >> > > >    hopefully also makes this same scope obvious from
> > > > > >> > > >    within JMX or other metric reporting systems. This is
> > > > > >> > > >    also similar to how the Kafka producer and consumer
> > > > > >> > > >    metrics include the client ID in their MBean
> > > > > >> > > >    attributes. Hopefully this does not negatively impact
> > > > > >> > > >    or complicate how external reporting systems aggregate
> > > > > >> > > >    metrics from multiple workers.
> > > > > >> > > >    3. Stated explicitly that aggregating metrics across
> > > > > >> > > >    workers was out of scope of this KIP.
> > > > > >> > > >    4. Added metrics to report the connector class and
> > > > > >> > > >    version for both sink and source connectors.
> > > > > >> > > >
> > > > > >> > > > Check this KIP's history for details of these changes.
> > > > > >> > > >
> > > > > >> > > > Please let me know if you have any other suggestions. I
> > > > > >> > > > hope to start the voting soon!
> > > > > >> > > >
> > > > > >> > > > Best regards,
> > > > > >> > > >
> > > > > >> > > > Randall
> > > > > >> > > >
> > > > > >> > > > On Thu, Sep 7, 2017 at 9:35 PM, Gwen Shapira <g...@confluent.io>
> > > > > >> > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > Thanks for the KIP, Randall. Those are badly needed!
> > > > > >> > > > >
> > > > > >> > > > > Can we have two metrics with record rate per task? One
> > > > > >> > > > > before SMT and one after?
> > > > > >> > > > > We can have cases where we read 5000 rows from JDBC but
> > > > > >> > > > > write 5 to Kafka, or read 5000 records from Kafka and
> > > > > >> > > > > write 5 due to filtering. I think it's important to know
> > > > > >> > > > > both numbers.
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > > Gwen
> > > > > >> > > > >
> > > > > >> > > > > On Thu, Sep 7, 2017 at 7:50 PM, Randall Hauch <rha...@gmail.com>
> > > > > >> > > > > wrote:
> > > > > >> > > > >
> > > > > >> > > > > > Hi everyone.
> > > > > >> > > > > >
> > > > > >> > > > > > I've created a new KIP to add metrics to the Kafka
> > > > > >> > > > > > Connect framework:
> > > > > >> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-196%3A+Add+metrics+to+Kafka+Connect+framework
> > > > > >> > > > > >
> > > > > >> > > > > > The KIP approval deadline is looming, so if you're
> > > > > >> > > > > > interested in Kafka Connect metrics please review and
> > > > > >> > > > > > provide feedback as soon as possible. I'm interested
> > > > > >> > > > > > not only in whether the metrics are sufficient and
> > > > > >> > > > > > appropriate, but also in whether the MBean naming
> > > > > >> > > > > > conventions are okay.
> > > > > >> > > > > >
> > > > > >> > > > > > Best regards,
> > > > > >> > > > > >
> > > > > >> > > > > > Randall
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > > --
> > > > > >> > > > > *Gwen Shapira*
> > > > > >> > > > > Product Manager | Confluent
> > > > > >> > > > > 650.450.2760 | @gwenshap
> > > > > >> > > > > Follow us: Twitter <https://twitter.com/ConfluentInc> |
> > > > > >> > > > > blog <http://www.confluent.io/blog>
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
