Okay, I think I've incorporated all feedback except for Gwen and Roger, who would like to have timing metrics. Given the deadline and Ewen's concern about degraded performance, I think it's prudent to leave those out of this KIP and proceed as is.
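For anyone who wants to see where the overhead would come from, below is a rough, back-of-the-envelope sketch; it's only my own illustration, not code from the KIP or from Connect. Timing each record means wrapping it in a pair of System.nanoTime() calls, and the relative cost is worst exactly in the case Ewen described: tiny batches (e.g. max.poll.records=1) with transforms or converters that do almost nothing.

import java.util.concurrent.TimeUnit;

// Back-of-the-envelope comparison of processing records with and without
// per-record timing. The "transform" below is a stand-in for a very cheap
// SMT or converter; it is not Connect code.
public class TimingOverheadSketch {

    public static void main(String[] args) {
        final int records = 10_000_000;
        long checksum = 0;

        // Baseline: process records without any timing.
        long start = System.nanoTime();
        for (int i = 0; i < records; i++) {
            checksum += transform(i);
        }
        long baselineNanos = System.nanoTime() - start;

        // Per-record timing: bracket every record with nanoTime() calls, as a
        // fine-grained time-based metric would have to.
        long measuredNanos = 0;
        start = System.nanoTime();
        for (int i = 0; i < records; i++) {
            long t0 = System.nanoTime();
            checksum += transform(i);
            measuredNanos += System.nanoTime() - t0;
        }
        long timedNanos = System.nanoTime() - start;

        System.out.printf("baseline:               %d ms%n", TimeUnit.NANOSECONDS.toMillis(baselineNanos));
        System.out.printf("with per-record timing: %d ms%n", TimeUnit.NANOSECONDS.toMillis(timedNanos));
        System.out.println("checksum=" + checksum + ", time attributed to transform: "
                + TimeUnit.NANOSECONDS.toMillis(measuredNanos) + " ms");
    }

    // Stand-in for a very lightweight transformation.
    private static long transform(long value) {
        return value ^ (value << 1);
    }
}

Whether that overhead is actually significant would need benchmarking on the real Connect code paths; the sketch only shows where the extra calls would sit.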
On Tue, Sep 12, 2017 at 12:48 PM, Randall Hauch <rha...@gmail.com> wrote:

> On Tue, Sep 12, 2017 at 10:36 AM, Roger Hoover <roger.hoo...@gmail.com> wrote:
>
> > Randall/Ewen,
> >
> > I think the timing info is still useful even if it's measured since the last rebalance. How else do you know where time is being spent?
>
> I think Ewen's concern (correct me if I'm wrong) is that measuring time-based metrics might result in excessive performance degradation, especially when batch sizes are small.
>
> > The use case for seeing the batch size is that you generally have two knobs to configure - max batch size and max wait time. The batch size metrics would tell you how full your batches are based on your current linger time so you can adjust the config.
>
> It does seem that batch sizes are useful, and the KIP includes these ("batch-size-max" and "batch-size-avg").
>
> > Cheers,
> >
> > Roger
> >
> > On Mon, Sep 11, 2017 at 7:50 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
> >
> > > re: questions about additional metrics, I think we'll undoubtedly find more that people want in practice, but as I mentioned earlier I think it's better to add the ones we know we need and then fill out the rest as we figure it out. So, e.g., batch size metrics sound like they could be useful, but I'd probably wait until we have a clear use case. It seems likely that it could be useful in diagnosing slow connectors (e.g. the implementation just does something inefficient), but I'm not really sure about that yet.
> > >
> > > -Ewen
> > >
> > > On Mon, Sep 11, 2017 at 7:11 PM, Randall Hauch <rha...@gmail.com> wrote:
> > >
> > > > Based on Roger and Ewen's feedback, I removed the aggregate metrics as they would be difficult to make use of without extra work. This simplified things a great deal, and I took the opportunity to reorganize the groups of metrics. Also, based upon Ewen's concerns regarding measuring times/durations, I removed all time-related metrics except for the offset commits and rebalances, which are infrequent enough to warrant the capture of percentiles. Roger asked about capturing batch size metrics for source and sink tasks, and offset lag metrics for sink tasks. Finally, Ewen pointed out that all count/total metrics are only valid since the most recent rebalance and are therefore less meaningful, and were removed.
> > > >
> > > > On Mon, Sep 11, 2017 at 6:50 PM, Randall Hauch <rha...@gmail.com> wrote:
> > > >
> > > > > Thanks, Ewen. Comments inline below.
> > > > >
> > > > > On Mon, Sep 11, 2017 at 5:46 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
> > > > >
> > > > > > Randall,
> > > > > >
> > > > > > A couple of questions:
> > > > > >
> > > > > > * Some metrics don't seem to have unique names? e.g. source-record-produce-rate and source-record-produce-total seem like they are duplicated. Looks like maybe just an oversight that the second ones should be changed from "produce" to "write".
> > > > >
> > > > > Nice catch. You are correct - should be "write" instead of "produce". I will correct.
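[An aside from me, not part of the quoted discussion: once the names are corrected, it's easy to sanity-check them over JMX. Here's a rough sketch of browsing the proposed source-task MBeans; it assumes the kafka.connect:type=source-task-metrics,... object-name pattern from the current draft, and the final KIP text is authoritative for the exact names and attributes.]

import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Lists every MBean registered under the proposed source-task-metrics type,
// e.g. kafka.connect:type=source-task-metrics,name=source-record-write-total,
// worker=...,connector=...,task=..., along with its readable attributes.
public class ConnectMetricsLookup {

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();

        // Match any worker, connector, and task.
        ObjectName pattern = new ObjectName("kafka.connect:type=source-task-metrics,*");
        Set<ObjectName> names = server.queryNames(pattern, null);

        for (ObjectName name : names) {
            System.out.println(name);
            for (MBeanAttributeInfo attr : server.getMBeanInfo(name).getAttributes()) {
                if (attr.isReadable()) {
                    System.out.println("  " + attr.getName() + " = " + server.getAttribute(name, attr.getName()));
                }
            }
        }
    }
}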
> > > > > > * I think there's a stray extra character in a couple of places: kafka.connect:type=source-task-metrics,name=source-record-produce-total,worker=([-.\w]+)l,connector=([-.\w]+),task=([\d]+) has an extra char after the worker name.
> > > > >
> > > > > Thanks. Removed in 2 places.
> > > > >
> > > > > > * Are the produce totals actually useful given rebalancing would cancel them out anyway? Doesn't seem like you could do much with them.
> > > > >
> > > > > Yes, the totals would be since the last rebalance. Maybe that isn't that useful. Might be better to capture the offsets and lag as Roger was suggesting. Thoughts?
> > > > >
> > > > > > * Why do transformations get their own metric but not converters? And are we concerned at all about the performance impact of getting such fine grained info? Getting current time isn't free and we've seen before that we ended up w/ accidental performance regressions as we tried to check it too frequently to enforce timeouts fine grained in the producer (iirc). Batching helps w/ this, but on the consumer side, a max.poll.records=1 setting could put you in a bad place, especially since transforms might be very lightweight (or nothing) and converters are expected to be relatively cheap as well.
> > > > >
> > > > > We could remove the read, transform, and put time-based metrics for sink tasks, and the poll, transform, and write time-based metrics for source tasks. Can/should they be replaced with anything else?
> > > > >
> > > > > > * If we include the worker id everywhere and don't have metrics without that included, isn't that a pain for users that dump this data into some other system? They have to know which worker the connector/task is currently on *or* need to do extra work to merge the metrics from across machines. Including versions with the worker ID can make sense for completeness and accuracy (e.g. technically there are still very slim risks of having a task running twice due to zombies), but it seems like bad usability for the common case.
> > > > >
> > > > > Part of the reason was also to help identify where each of the metrics came from, but per the next comment this may not be as useful, either. So remove the worker ID in all the task and connector metric names? What about the worker metrics?
> > > > >
> > > > > > * Is aggregating things like source record rate at the (worker, connector) level really useful since you're just going to need to do additional aggregation anyway once you've collected metrics across all workers? I'd rather add a smaller number of metrics w/ clear use cases than just try to be exhaustive and then have to maintain stuff that nobody actually uses.
> > > > >
> > > > > Yes, the connector aggregate metrics are maybe not as useful if you also have to aggregate them from different workers. Removing them probably also reduces the risk of them being misinterpreted.
> > > > >
> > > > > > * You have status for connectors but not for tasks. Any reason why?
> > > > > > Seems like it'd make sense to expose both, especially since users generally care about task status more than connector status (not many connectors actually run a monitoring thread.)
> > > > >
> > > > > Ack.
> > > > >
> > > > > > * Is number of tasks for each connector a useful metric? Not sure whether someone would find this useful or not. Probably not for alerts, but might be useful to be able to check it via your metrics dashboard.
> > > > >
> > > > > Seems like it might be useful, at least in terms of tracking the number of tasks over time. Might not be as useful for connectors that have relatively static tasks, but it would be more interesting/useful for connectors that create tasks dynamically and periodically request task reconfigurations.
> > > > >
> > > > > > * Same questions re: granularity of sink tasks/connectors timing and whether the connectors need all the roll-ups of individual (worker, task) values to (worker, connector) level.
> > > > >
> > > > > I'm fine with taking out the aggregates to keep things simple and prevent misunderstanding.
> > > > >
> > > > > > * If we expose who the worker currently thinks is leader, it might also make sense to expose the underlying epoch. Not actually sure if we expose that for the consumer today, but it's an indicator of who is properly up to date.
> > > > >
> > > > > Ack.
> > > > >
> > > > > > * Why worker-level offset commit stats? It's not clear to me that these are useful without considering the specific connector.
> > > > >
> > > > > So would they make more sense on the tasks? Again, on the worker they're aggregates.
> > > > >
> > > > > > -Ewen
> > > > > >
> > > > > > On Mon, Sep 11, 2017 at 9:43 AM, Randall Hauch <rha...@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks for reviewing. Responses inline below.
> > > > > > >
> > > > > > > On Mon, Sep 11, 2017 at 11:22 AM, Roger Hoover <roger.hoo...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Randall,
> > > > > > > >
> > > > > > > > Thank you for the KIP. This should improve visibility greatly. I had a few questions/ideas for more metrics.
> > > > > > > >
> > > > > > > > 1. What's the relationship between the worker state and the connector status? Does the 'paused' status at the Connector level include the time that worker is 'rebalancing'?
> > > > > > >
> > > > > > > The worker state metric simply reports whether the worker is running or rebalancing. This state is independent of how many connectors are deployed/running/paused. During a rebalance, the connectors are being stopped and restarted but are effectively not running.
> > > > > > >
> > > > > > > > 2. Are the "Source Connector" metrics like record rate an aggregation of the "Source Task" metrics?
> > > > > > >
> > > > > > > Yes.
> > > > > > >
> > > > > > > >    - How much value is there in monitoring at the "Source Connector" level (other than status) if the number of constituent tasks may change over time?
> > > > > > >
> > > > > > > The task metrics allow you to know whether the tasks are evenly loaded and each making progress. The aggregate connector metrics tell you how much work has been performed by all the tasks in that worker. Both are useful IMO.
> > > > > > >
> > > > > > > >    - I'm imagining that it's most useful to collect metrics at the task level as the task-level metrics should be stable regardless of tasks shifting to different workers
> > > > > > >
> > > > > > > Correct, this is where the most value is because it is the most fine grained.
> > > > > > >
> > > > > > > >    - If so, can we duplicate the Connector Status down at the task level so that all important metrics can be tracked by task?
> > > > > > >
> > > > > > > Possibly. The challenge is that the threads running the tasks are blocked when a connector is paused.
> > > > > > >
> > > > > > > > 3. For the Sink Task metrics
> > > > > > > >    - Can we add offset lag and timestamp lag on commit?
> > > > > > > >    - After records are flushed/committed
> > > > > > > >       - what is the diff between the record timestamps and commit time (histogram)? this is a measure of end-to-end pipeline latency
> > > > > > > >       - what is the diff between record offsets and latest offset of their partition at commit time (histogram)? this is a measure of whether this particular task is keeping up
> > > > > > >
> > > > > > > Yeah, possibly. Will have to compare with the consumer metrics to see what we can get.
> > > > > > >
> > > > > > > >    - How about flush error rate? Assuming the sink connectors are using retries, it would be helpful to know how many errors they're seeing
> > > > > > >
> > > > > > > We could add a metric to track how many times the framework receives a retry exception and then retries, but the connectors may also do this on their own.
> > > > > > >
> > > > > > > >    - Can we tell at the framework level how many records were inserted vs updated vs deleted?
> > > > > > >
> > > > > > > No, there's no distinction in the Connect framework.
> > > > > > >
> > > > > > > >    - Batching stats
> > > > > > > >       - Histogram of flush batch size
> > > > > > > >       - Counts of flush trigger method (time vs max batch size)
> > > > > > >
> > > > > > > Should be able to add these.
> > > > > > >
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > > > Roger
> > > > > > > >
> > > > > > > > On Sun, Sep 10, 2017 at 8:45 AM, Randall Hauch <rha...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Thanks, Gwen.
> > > > > > > > >
> > > > > > > > > That's a great idea, so I've changed the KIP to add those metrics.
> > > > > > > > > I've also made a few other changes:
> > > > > > > > >
> > > > > > > > > 1. The context of all metrics is limited to the activity within the worker. This wasn't clear before, so I changed the motivation and metric descriptions to explicitly state this.
> > > > > > > > > 2. Added the worker ID to all MBean attributes, which hopefully makes this scope obvious from within JMX or other metric reporting systems. This is also similar to how the Kafka producer and consumer metrics include the client ID in their MBean attributes. Hopefully this does not negatively impact or complicate how external reporting systems aggregate metrics from multiple workers.
> > > > > > > > > 3. Stated explicitly that aggregating metrics across workers was out of scope of this KIP.
> > > > > > > > > 4. Added metrics to report the connector class and version for both sink and source connectors.
> > > > > > > > >
> > > > > > > > > Check this KIP's history for details of these changes.
> > > > > > > > >
> > > > > > > > > Please let me know if you have any other suggestions. I hope to start the voting soon!
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > >
> > > > > > > > > Randall
> > > > > > > > >
> > > > > > > > > On Thu, Sep 7, 2017 at 9:35 PM, Gwen Shapira <g...@confluent.io> wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for the KIP, Randall. Those are badly needed!
> > > > > > > > > >
> > > > > > > > > > Can we have two metrics with record rate per task? One before SMT and one after? We can have cases where we read 5000 rows from JDBC but write 5 to Kafka, or read 5000 records from Kafka and write 5 due to filtering. I think it's important to know both numbers.
> > > > > > > > > >
> > > > > > > > > > Gwen
> > > > > > > > > >
> > > > > > > > > > On Thu, Sep 7, 2017 at 7:50 PM, Randall Hauch <rha...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi everyone.
> > > > > > > > > > >
> > > > > > > > > > > I've created a new KIP to add metrics to the Kafka Connect framework: https://cwiki.apache.org/confluence/display/KAFKA/KIP-196%3A+Add+metrics+to+Kafka+Connect+framework
> > > > > > > > > > >
> > > > > > > > > > > The KIP approval deadline is looming, so if you're interested in Kafka Connect metrics please review and provide feedback as soon as possible. I'm interested not only in whether the metrics are sufficient and appropriate, but also in whether the MBean naming conventions are okay.
>> > > >> > > > > > >> > > >> > > > > > Best regards, >> > > >> > > > > > >> > > >> > > > > > Randall >> > > >> > > > > > >> > > >> > > > > >> > > >> > > > > >> > > >> > > > > >> > > >> > > > > -- >> > > >> > > > > *Gwen Shapira* >> > > >> > > > > Product Manager | Confluent >> > > >> > > > > 650.450.2760 | @gwenshap >> > > >> > > > > Follow us: Twitter <https://twitter.com/ConfluentInc> | >> blog >> > > >> > > > > <http://www.confluent.io/blog> >> > > >> > > > > >> > > >> > > > >> > > >> > > >> > > >> > >> > > >> >> > > > >> > > > >> > > >> > >> > >