On Mon, Sep 11, 2017 at 4:50 PM, Randall Hauch <rha...@gmail.com> wrote:
> Thanks, Ewen. Comments inline below.
>
> On Mon, Sep 11, 2017 at 5:46 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
>
> > Randall,
> >
> > A couple of questions:
> >
> > * Some metrics don't seem to have unique names? e.g. source-record-produce-rate and source-record-produce-total seem like they are duplicated. Looks like maybe just an oversight that the second ones should be changed from "produce" to "write".
>
> Nice catch. You are correct - should be "write" instead of "produce". I will correct.
>
> > * I think there's a stray extra character in a couple of places: kafka.connect:type=source-task-metrics,name=source-record-produce-total,worker=([-.\w]+)l,connector=([-.\w]+),task=([\d]+) has an extra char after the worker name.
>
> Thanks. Removed in 2 places.
>
> > * Are the produce totals actually useful given rebalancing would cancel them out anyway? Doesn't seem like you could do much with them.
>
> Yes, the totals would be since the last rebalance. Maybe that isn't that useful. Might be better to capture the offsets and lag as Roger was suggesting. Thoughts?

Sounds like you've since removed it. re: offsets/lag, is this something that makes sense for source connectors? I don't think I quite understand the suggestion.

> > * Why do transformations get their own metric but not converters? And are we concerned at all about the performance impact of getting such fine-grained info? Getting the current time isn't free, and we've seen before that we ended up w/ accidental performance regressions as we tried to check it too frequently to enforce fine-grained timeouts in the producer (iirc). Batching helps w/ this, but on the consumer side, a max.poll.records=1 setting could put you in a bad place, especially since transforms might be very lightweight (or nothing) and converters are expected to be relatively cheap as well.
>
> We could remove the read, transform, and put time-based metrics for sink tasks, and the poll, transform, and write time-based metrics for source tasks. Can/should they be replaced with anything else?

We can start with any breakdown we want. I think the only tricky issue that arises is if we decide we *do* want to include more detail. Then the metrics become a bit confusing (or the naming is really important) because there is some overlap in the metrics (e.g. if we converted transformation/conversion/send to "processing" time, but then later broke them down into those components). But I think that's easily resolved with a nice little image showing the finest-granularity components and the aggregate components as labels over their constituent parts.

> > * If we include the worker id everywhere and don't have metrics without that included, isn't that a pain for users that dump this data into some other system? They have to know which worker the connector/task is currently on *or* need to do extra work to merge the metrics from across machines. Including versions with the worker ID can make sense for completeness and accuracy (e.g. technically there are still very slim risks of having a task running twice due to zombies), but it seems like bad usability for the common case.
>
> Part of the reason was also to help identify where each of the metrics came from, but per the next comment this may not be as useful, either. So remove the worker ID in all the task and connector metric names? What about the worker metrics?
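For illustration, here is a minimal sketch of the "extra work to merge the metrics from across machines" that having the worker id in every MBean name implies: a consumer of these metrics has to wildcard the worker property and then aggregate the matches itself. The ObjectName pattern follows the draft naming in the KIP, and "my-connector" is just a placeholder connector name.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ConnectMBeanQuerySketch {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Wildcard the worker property because the caller doesn't know (or care)
        // which worker currently runs each task of "my-connector".
        ObjectName pattern = new ObjectName(
                "kafka.connect:type=source-task-metrics,name=source-record-write-total,"
                        + "worker=*,connector=my-connector,task=*");
        for (ObjectName mbean : server.queryNames(pattern, null)) {
            // One match per (worker, task); the values still need to be summed externally.
            System.out.println(mbean);
        }
    }
}
```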
Does worker id work as a tag to keep it isolated w/o including it in the name? tbh, JMX-style metric naming always seemed overly complicated and messy, and I've never fully grokked the details of Kafka metrics as a result. (To me, this is also a huge usability issue w/ Kafka as it stands today -- Kafka metrics don't seem to integrate naturally with almost anything that doesn't assume JMX/Java-style metrics.)

> > * Is aggregating things like source record rate at the (worker, connector) level really useful since you're just going to need to do additional aggregation anyway once you've collected metrics across all workers? I'd rather add a smaller number of metrics w/ clear use cases than just try to be exhaustive and then have to maintain stuff that nobody actually uses.
>
> Yes, the connector aggregate metrics are maybe not as useful if you also have to aggregate them from different workers. Removing them probably also reduces the risk of them being misinterpreted.

I swear I was not at all motivated by reducing the size of the patch that will need to be reviewed for this KIP :)

> > * You have status for connectors but not for tasks. Any reason why? Seems like it'd make sense to expose both, especially since users generally care about task status more than connector status (not many connectors actually run a monitoring thread).
>
> Ack.
>
> > * Is number of tasks for each connector a useful metric? Not sure whether someone would find this useful or not. Probably not for alerts, but might be useful to be able to check it via your metrics dashboard.
>
> Seems like it might be useful, at least in terms of tracking the number of tasks over time. Might not be as useful for connectors that have relatively static tasks, but it would be more interesting/useful for connectors that create tasks dynamically and periodically request task reconfigurations.

Yeah, most will be static. It mostly seems useful a) just for historical reference, b) possibly to alert on for crazy configs (e.g. tasks.max=1000000 from a newbie user), and c) possibly to alert on changes in general, just so you know that connectors are changing.

> > * Same questions re: granularity of sink task/connector timing and whether the connectors need all the roll-ups of individual (worker, task) values to the (worker, connector) level.
>
> I'm fine with taking out the aggregates to keep things simple and prevent misunderstanding.
>
> > * If we expose who the worker currently thinks is the leader, it might also make sense to expose the underlying epoch. Not actually sure if we expose that for the consumer today, but it's an indicator of who is properly up to date.
>
> Ack.
>
> > * Why worker-level offset commit stats? It's not clear to me that these are useful without considering the specific connector.
>
> So would they make more sense on the tasks? Again, on the worker they're aggregates.

Yeah, task level could make sense. I was just trying to think through how I would put each of the metrics to use, and I wasn't sure what I could infer from the worker-level commit stats since sink connectors vary a lot in how they process commits.

-Ewen

> > -Ewen
> >
> > On Mon, Sep 11, 2017 at 9:43 AM, Randall Hauch <rha...@gmail.com> wrote:
> >
> > > Thanks for reviewing. Responses inline below.
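For the "worker id as a tag" idea above, a minimal sketch of what that could look like with Kafka's own metrics library and its JmxReporter: the id lands in the MBean's key properties rather than in the metric name itself, so external systems can keep or drop it when they aggregate. The group/metric names, tag keys, and sample values are placeholders borrowed from this thread, not the final KIP definitions.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.common.metrics.JmxReporter;
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Rate;
import org.apache.kafka.common.metrics.stats.Total;

public class SourceTaskMetricsSketch {
    public static void main(String[] args) {
        Metrics metrics = new Metrics();
        metrics.addReporter(new JmxReporter("kafka.connect"));

        // Worker id, connector name, and task id become ObjectName key properties
        // (tags) instead of being baked into the metric name.
        Map<String, String> tags = new HashMap<>();
        tags.put("worker", "worker-1:8083");
        tags.put("connector", "my-connector");
        tags.put("task", "0");

        Sensor sensor = metrics.sensor("source-record-write");
        sensor.add(metrics.metricName("source-record-write-rate",
                "source-task-metrics", "Records written per second", tags), new Rate());
        sensor.add(metrics.metricName("source-record-write-total",
                "source-task-metrics", "Records written since the last rebalance", tags), new Total());

        sensor.record(100);  // e.g. a batch of 100 records was written
        metrics.close();
    }
}
```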
> > >
> > > On Mon, Sep 11, 2017 at 11:22 AM, Roger Hoover <roger.hoo...@gmail.com> wrote:
> > >
> > > > Randall,
> > > >
> > > > Thank you for the KIP. This should improve visibility greatly. I had a few questions/ideas for more metrics.
> > > >
> > > > 1. What's the relationship between the worker state and the connector status? Does the 'paused' status at the Connector level include the time that the worker is 'rebalancing'?
> > >
> > > The worker state metric simply reports whether the worker is running or rebalancing. This state is independent of how many connectors are deployed/running/paused. During a rebalance, the connectors are being stopped and restarted but are effectively not running.
> > >
> > > > 2. Are the "Source Connector" metrics like record rate an aggregation of the "Source Task" metrics?
> > >
> > > Yes.
> > >
> > > >    - How much value is there in monitoring at the "Source Connector" level (other than status) if the number of constituent tasks may change over time?
> > >
> > > The task metrics allow you to know whether the tasks are evenly loaded and each making progress. The aggregate connector metrics tell you how much work has been performed by all the tasks in that worker. Both are useful IMO.
> > >
> > > >    - I'm imagining that it's most useful to collect metrics at the task level, as the task-level metrics should be stable regardless of tasks shifting to different workers
> > >
> > > Correct, this is where the most value is because it is the most fine grained.
> > >
> > > >    - If so, can we duplicate the Connector Status down at the task level so that all important metrics can be tracked by task?
> > >
> > > Possibly. The challenge is that the threads running the tasks are blocked when a connector is paused.
> > >
> > > > 3. For the Sink Task metrics
> > > >    - Can we add offset lag and timestamp lag on commit?
> > > >      - After records are flushed/committed
> > > >        - what is the diff between the record timestamps and commit time (histogram)? this is a measure of end-to-end pipeline latency
> > > >        - what is the diff between record offsets and the latest offset of their partition at commit time (histogram)? this is a measure of whether this particular task is keeping up
> > >
> > > Yeah, possibly. Will have to compare with the consumer metrics to see what we can get.
> > >
> > > >    - How about flush error rate? Assuming the sink connectors are using retries, it would be helpful to know how many errors they're seeing
> > >
> > > We could add a metric to track how many times the framework receives a retry exception and then retries, but the connectors may also do this on their own.
> > >
> > > >    - Can we tell at the framework level how many records were inserted vs updated vs deleted?
> > >
> > > No, there's no distinction in the Connect framework.
> > >
> > > >    - Batching stats
> > > >      - Histogram of flush batch size
> > > >      - Counts of flush trigger method (time vs max batch size)
> > >
> > > Should be able to add these.
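To make the offset/timestamp lag idea above concrete, here is a rough sketch of how a sink task's commit path could derive both lags. The class, method, and map names are hypothetical; the end offsets and last record timestamps would have to come from the consumer and the delivered SinkRecords, respectively.

```java
import java.util.Map;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

/**
 * Illustrative only: deriving per-partition lags at commit time.
 * "endOffsets" would come from the consumer's endOffsets() call and
 * "lastRecordTimestamps" from the SinkRecords most recently passed to put().
 */
public class SinkLagSketch {

    public static void recordCommitLags(Map<TopicPartition, OffsetAndMetadata> committed,
                                        Map<TopicPartition, Long> endOffsets,
                                        Map<TopicPartition, Long> lastRecordTimestamps) {
        long now = System.currentTimeMillis();
        for (Map.Entry<TopicPartition, OffsetAndMetadata> entry : committed.entrySet()) {
            TopicPartition tp = entry.getKey();

            // Offset lag: distance between the committed offset and the log end offset,
            // i.e. whether this particular task is keeping up.
            Long endOffset = endOffsets.get(tp);
            if (endOffset != null) {
                long offsetLag = endOffset - entry.getValue().offset();
                System.out.println(tp + " offset lag: " + offsetLag);
            }

            // Timestamp lag: wall-clock time at commit minus the record timestamp,
            // a rough end-to-end pipeline latency measure.
            Long recordTimestamp = lastRecordTimestamps.get(tp);
            if (recordTimestamp != null) {
                long timestampLagMs = now - recordTimestamp;
                System.out.println(tp + " timestamp lag ms: " + timestampLagMs);
            }
        }
    }
}
```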
> > > >
> > > > Cheers,
> > > >
> > > > Roger
> > > >
> > > > On Sun, Sep 10, 2017 at 8:45 AM, Randall Hauch <rha...@gmail.com> wrote:
> > > >
> > > > > Thanks, Gwen.
> > > > >
> > > > > That's a great idea, so I've changed the KIP to add those metrics. I've also made a few other changes:
> > > > >
> > > > > 1. The context of all metrics is limited to the activity within the worker. This wasn't clear before, so I changed the motivation and metric descriptions to explicitly state this.
> > > > > 2. Added the worker ID to all MBean attributes, which should hopefully make this scope obvious from within JMX or any other metric reporting system. This is also similar to how the Kafka producer and consumer metrics include the client ID in their MBean attributes. Hopefully this does not negatively impact or complicate how external reporting systems aggregate metrics from multiple workers.
> > > > > 3. Stated explicitly that aggregating metrics across workers is out of scope of this KIP.
> > > > > 4. Added metrics to report the connector class and version for both sink and source connectors.
> > > > >
> > > > > Check this KIP's history for details of these changes.
> > > > >
> > > > > Please let me know if you have any other suggestions. I hope to start the voting soon!
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Randall
> > > > >
> > > > > On Thu, Sep 7, 2017 at 9:35 PM, Gwen Shapira <g...@confluent.io> wrote:
> > > > >
> > > > > > Thanks for the KIP, Randall. Those are badly needed!
> > > > > >
> > > > > > Can we have two metrics with record rate per task? One before SMT and one after? We can have cases where we read 5000 rows from JDBC but write 5 to Kafka, or read 5000 records from Kafka and write 5 due to filtering. I think it's important to know both numbers.
> > > > > >
> > > > > > Gwen
> > > > > >
> > > > > > On Thu, Sep 7, 2017 at 7:50 PM, Randall Hauch <rha...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi everyone.
> > > > > > >
> > > > > > > I've created a new KIP to add metrics to the Kafka Connect framework:
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-196%3A+Add+metrics+to+Kafka+Connect+framework
> > > > > > >
> > > > > > > The KIP approval deadline is looming, so if you're interested in Kafka Connect metrics please review and provide feedback as soon as possible. I'm interested not only in whether the metrics are sufficient and appropriate, but also in whether the MBean naming conventions are okay.
> > > > > > >
> > > > > > > Best regards,
> > > > > > >
> > > > > > > Randall
> > > > > >
> > > > > > --
> > > > > > *Gwen Shapira*
> > > > > > Product Manager | Confluent
> > > > > > 650.450.2760 | @gwenshap
> > > > > > Follow us: Twitter <https://twitter.com/ConfluentInc> | blog <http://www.confluent.io/blog>
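As a footnote to Gwen's before/after-SMT suggestion quoted above, a minimal sketch of how a task could record both rates around its transformation chain. The sensor and metric names are placeholders, not what the KIP ultimately defines, and the chain-application logic is only an approximation of what the framework does.

```java
import java.util.List;
import java.util.stream.Collectors;

import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Rate;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.transforms.Transformation;

/** Illustrative only: per-task record rates before and after SMTs. */
public class TransformRateSketch {
    private final Sensor beforeTransform;
    private final Sensor afterTransform;
    private final List<Transformation<SourceRecord>> transformations;

    public TransformRateSketch(Metrics metrics, List<Transformation<SourceRecord>> transformations) {
        this.transformations = transformations;
        this.beforeTransform = metrics.sensor("source-record-poll");
        this.beforeTransform.add(
                metrics.metricName("source-record-poll-rate", "source-task-metrics"), new Rate());
        this.afterTransform = metrics.sensor("source-record-write");
        this.afterTransform.add(
                metrics.metricName("source-record-write-rate", "source-task-metrics"), new Rate());
    }

    public List<SourceRecord> applyTransformations(List<SourceRecord> polled) {
        beforeTransform.record(polled.size());   // rate before SMTs are applied
        List<SourceRecord> result = polled.stream()
                .map(this::applyChain)
                .filter(r -> r != null)           // SMTs may filter records out entirely
                .collect(Collectors.toList());
        afterTransform.record(result.size());     // rate after SMTs
        return result;
    }

    private SourceRecord applyChain(SourceRecord record) {
        for (Transformation<SourceRecord> t : transformations) {
            if (record == null) break;
            record = t.apply(record);
        }
        return record;
    }
}
```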