Thanks for reviewing. Responses inline below.

On Mon, Sep 11, 2017 at 11:22 AM, Roger Hoover <roger.hoo...@gmail.com>
wrote:

> Randall,
>
> Thank you for the KIP.  This should improve visibility greatly.  I had a
> few questions/ideas for more metrics.
>
>
>    1. What's the relationship between the worker state and the connector
>    status?  Does the 'paused' status at the Connector level include the
>    time that the worker is 'rebalancing'?
>

The worker state metric simply reports whether the worker is running or
rebalancing. This state is independent of how many connectors are
deployed/running/paused. During a rebalance, the connectors are being
stopped and restarted but are effectively not running.
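To illustrate (the class and method names below are made up for this sketch, not part of the KIP or the Connect API), the worker state is effectively a two-valued gauge that flips during a rebalance:

```java
// Sketch only: worker state as a two-valued gauge, independent of how many
// connectors are deployed, running, or paused. All names are illustrative.
public class WorkerState {

    public enum State { RUNNING, REBALANCING }

    private volatile State state = State.RUNNING;

    // During a rebalance the worker reports REBALANCING while its
    // connectors and tasks are stopped and restarted.
    public void onRebalanceStart() { state = State.REBALANCING; }
    public void onRebalanceEnd()   { state = State.RUNNING; }

    public State current() { return state; }

    public static void main(String[] args) {
        WorkerState w = new WorkerState();
        w.onRebalanceStart();
        System.out.println(w.current()); // prints REBALANCING
        w.onRebalanceEnd();
        System.out.println(w.current()); // prints RUNNING
    }
}
```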


>    2. Are the "Source Connector" metrics like record rate an aggregation of
>    the "Source Task" metrics?
>

Yes.


>       - How much value is there in monitoring at the "Source Connector"
>       level (other than status) if the number of constituent tasks may
>       change over time?
>

The task metrics allow you to know whether the tasks are evenly loaded and
each making progress. The aggregate connector metrics tell you how much
work has been performed by all the tasks in that worker. Both are useful
IMO.
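As a sketch of that aggregation (the method and map keys are hypothetical, not Connect APIs), the connector-level rate within one worker is just the sum over its constituent tasks:

```java
import java.util.Map;

// Sketch only: connector-level record rate derived by summing the per-task
// rates reported within one worker. Names are hypothetical, not part of
// the Connect framework.
public class ConnectorAggregation {

    // taskRates maps task id -> records/sec reported by that task
    public static double connectorRate(Map<String, Double> taskRates) {
        return taskRates.values().stream().mapToDouble(Double::doubleValue).sum();
    }

    public static void main(String[] args) {
        Map<String, Double> rates = Map.of("my-source-0", 120.0,
                                           "my-source-1", 80.0);
        System.out.println(connectorRate(rates)); // prints 200.0
    }
}
```

The task map changes as tasks are reassigned, which is exactly why the per-task numbers are stable for monitoring while the connector-level sum still shows total work done on that worker.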


>       - I'm imagining that it's most useful to collect metrics at the task
>       level as the task-level metrics should be stable regardless of tasks
>       shifting to different workers
>

Correct, this is where the most value is because it is the most
fine-grained.


>       - If so, can we duplicate the Connector Status down at the task level
>          so that all important metrics can be tracked by task?
>

Possibly. The challenge is that the threads running the tasks are blocked
when a connector is paused.


>          3. For the Sink Task metrics
>       - Can we add offset lag and timestamp lag on commit?
>          - After records are flushed/committed
>             - what is the diff between the record timestamps and commit
>             time (histogram)?  this is a measure of end-to-end pipeline
>             latency
>             - what is the diff between record offsets and latest offset of
>             their partition at commit time (histogram)?  this is a measure
>             of whether this particular task is keeping up
>

Yeah, possibly. Will have to compare with the consumer metrics to see what
we can get.
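To make the two proposed lags concrete, they could be computed per record at commit time roughly like this (all names are illustrative, not framework APIs; each value would feed a histogram sensor):

```java
// Sketch of the two lag measurements proposed above, computed per record
// at commit time. All names here are illustrative, not Connect APIs.
public class SinkLag {

    // End-to-end pipeline latency: wall-clock commit time minus the
    // record's timestamp.
    public static long timestampLagMs(long commitTimeMs, long recordTimestampMs) {
        return commitTimeMs - recordTimestampMs;
    }

    // Whether this task is keeping up: the partition's log-end offset at
    // commit time minus the record's offset.
    public static long offsetLag(long logEndOffset, long recordOffset) {
        return logEndOffset - recordOffset;
    }

    public static void main(String[] args) {
        // A record produced at t=1000 ms, committed by the sink at t=1750 ms.
        System.out.println(timestampLagMs(1750L, 1000L)); // prints 750
        // Partition log-end offset is 5000; this record was offset 4990.
        System.out.println(offsetLag(5000L, 4990L));      // prints 10
    }
}
```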


>          - How about flush error rate?  Assuming the sink connectors are
>          using retries, it would be helpful to know how many errors
>          they're seeing
>

We could add a metric to track how many times the framework receives a
retry exception and then retries, but the connectors may also do this on
their own.


>       - Can we tell at the framework level how many records were inserted
>       vs updated vs deleted?
>

No, there's no distinction in the Connect framework.


>       - Batching stats
>          - Histogram of flush batch size
>          - Counts of flush trigger method (time vs max batch size)
>

Should be able to add these.
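A rough sketch of what those batching stats could track (names are hypothetical; a real implementation would feed Kafka histogram sensors rather than keep a running mean):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed batching stats: record each flush's batch size
// and count why the flush fired. Names are hypothetical, not Connect APIs.
public class FlushStats {

    public enum Trigger { TIMEOUT, MAX_BATCH_SIZE }

    private final Map<Trigger, Long> triggerCounts = new HashMap<>();
    private long flushes = 0;
    private long totalRecords = 0;

    // Called once per flush with the batch size and the trigger reason.
    public void recordFlush(int batchSize, Trigger trigger) {
        flushes++;
        totalRecords += batchSize;
        triggerCounts.merge(trigger, 1L, Long::sum);
    }

    public double meanBatchSize() {
        return flushes == 0 ? 0.0 : (double) totalRecords / flushes;
    }

    public long count(Trigger trigger) {
        return triggerCounts.getOrDefault(trigger, 0L);
    }

    public static void main(String[] args) {
        FlushStats stats = new FlushStats();
        stats.recordFlush(500, Trigger.MAX_BATCH_SIZE);
        stats.recordFlush(120, Trigger.TIMEOUT);
        System.out.println(stats.meanBatchSize());        // prints 310.0
        System.out.println(stats.count(Trigger.TIMEOUT)); // prints 1
    }
}
```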


>
> Cheers,
>
> Roger
>
> On Sun, Sep 10, 2017 at 8:45 AM, Randall Hauch <rha...@gmail.com> wrote:
>
> > Thanks, Gwen.
> >
> > That's a great idea, so I've changed the KIP to add those metrics. I've
> > also made a few other changes:
> >
> >
> >    1. The context of all metrics is limited to the activity within the
> >    worker. This wasn't clear before, so I changed the motivation and
> >    metric descriptions to explicitly state this.
> >    2. Added the worker ID to all MBean attributes. In addition to
> >    hopefully making this scope obvious from within JMX or other metric
> >    reporting systems, this is similar to how the Kafka producer and
> >    consumer metrics include the client ID in their MBean attributes.
> >    Hopefully this does not negatively impact or complicate how external
> >    reporting systems aggregate metrics from multiple workers.
> >    3. Stated explicitly that aggregating metrics across workers is out
> >    of scope of this KIP.
> >    4. Added metrics to report the connector class and version for both
> >    sink and source connectors.
> >
> > Check this KIP's history for details of these changes.
> >
> > Please let me know if you have any other suggestions. I hope to start the
> > voting soon!
> >
> > Best regards,
> >
> > Randall
> >
> > On Thu, Sep 7, 2017 at 9:35 PM, Gwen Shapira <g...@confluent.io> wrote:
> >
> > > Thanks for the KIP, Randall. Those are badly needed!
> > >
> > > Can we have two metrics with record rate per task? One before SMT and
> > > one after?
> > > We can have cases where we read 5000 rows from JDBC but write 5 to
> > > Kafka, or read 5000 records from Kafka and write 5 due to filtering.
> > > I think it's important to know both numbers.
> > >
> > >
> > > Gwen
> > >
> > > On Thu, Sep 7, 2017 at 7:50 PM, Randall Hauch <rha...@gmail.com>
> wrote:
> > >
> > > > Hi everyone.
> > > >
> > > > I've created a new KIP to add metrics to the Kafka Connect framework:
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > 196%3A+Add+metrics+to+Kafka+Connect+framework
> > > >
> > > > The KIP approval deadline is looming, so if you're interested in
> > > > Kafka Connect metrics please review and provide feedback as soon
> > > > as possible. I'm interested not only in whether the metrics are
> > > > sufficient and appropriate, but also in whether the MBean naming
> > > > conventions are okay.
> > > >
> > > > Best regards,
> > > >
> > > > Randall
> > > >
> > >
> > >
> > >
> > > --
> > > *Gwen Shapira*
> > > Product Manager | Confluent
> > > 650.450.2760 | @gwenshap
> > > Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
> > > <http://www.confluent.io/blog>
> > >
> >
>
