Thanks, Ewen. I agree with you on the overhead of measuring time for SMTs and converters. I'd still argue for keeping other metrics like flush time, because even with small batches the measurement overhead should be small compared to the cost of writing to a sink.
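(A rough sketch of that argument in code, using Kafka's metrics library — the sensor and metric names here are illustrative, not from the KIP. The timestamp pair is taken once per flush, so its cost is amortized across the whole batch written to the sink.)

    import org.apache.kafka.common.metrics.Metrics;
    import org.apache.kafka.common.metrics.Sensor;
    import org.apache.kafka.common.metrics.stats.Avg;
    import org.apache.kafka.common.metrics.stats.Max;
    import org.apache.kafka.common.utils.Time;

    public class FlushTimer {
        private final Time time = Time.SYSTEM;
        private final Sensor flushTime;

        public FlushTimer(Metrics metrics) {
            flushTime = metrics.sensor("flush-time");
            flushTime.add(metrics.metricName("flush-time-avg-ms", "sink-task-metrics"), new Avg());
            flushTime.add(metrics.metricName("flush-time-max-ms", "sink-task-metrics"), new Max());
        }

        public void timedFlush(Runnable flush) {
            // Two timestamp calls per batch, not per record.
            long start = time.milliseconds();
            flush.run(); // e.g. task.flush(currentOffsets)
            flushTime.record(time.milliseconds() - start);
        }
    }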
On Tue, Sep 12, 2017 at 3:06 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:

> Requests are generally substantial batches of data; you are not
> guaranteed that for the processing batches, both because source
> connectors can hand you batches of whatever size they want and because
> the consumer's max.poll.records can be overridden.
>
> Both SMTs and converters are a concern because they can both be
> relatively cheap, such that just checking the time in between them
> could possibly dwarf the cost of applying them.
>
> Also, another thought re: rebalance metrics: we are already getting
> some info via AbstractCoordinator, and those actually provide a bit
> more detail in some ways (e.g. join & sync vs the entire rebalance).
> Not sure if we want to effectively duplicate some info so it can all be
> located under Connect names, or rely on the existing metrics for some
> of these.
>
> -Ewen
>
> On Tue, Sep 12, 2017 at 2:05 PM, Roger Hoover <roger.hoo...@gmail.com> wrote:
>
> > Ewen,
> >
> > I don't know the details of the perf concern. How is it that the
> > Kafka broker can keep latency stats per request without suffering too
> > much performance? Maybe SMTs are the only concern because they are
> > per-message. If so, let's remove those and keep timing info for
> > everything else, like flushes, which are batch-based.
> >
> > On Tue, Sep 12, 2017 at 1:32 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
> >
> > > On Tue, Sep 12, 2017 at 10:55 AM, Gwen Shapira <g...@confluent.io> wrote:
> > >
> > > > Ewen, you gave a nice talk at Kafka Summit where you warned about
> > > > the danger of SMTs that slow down the data pipe. If we don't
> > > > provide the time metrics, how will users know when their SMTs are
> > > > causing performance issues?
> > >
> > > Metrics aren't the only way to gain insight about performance, and
> > > always measuring this even when it's not necessarily being used may
> > > not make sense. SMT authors are much better off starting out with a
> > > JMH or similar benchmark. What I was referring to in the talk is
> > > more about understanding that the processing for SMTs is entirely
> > > synchronous, and that means certain classes of operations will just
> > > generally be a bad idea, e.g. anything that goes out over the
> > > network to another service. You don't even really need performance
> > > info to determine that that type of transformation will cause
> > > problems.
> > >
> > > But my point wasn't that timing info isn't useful. It's that we
> > > know that getting timestamps is pretty expensive and we'll already
> > > be doing so elsewhere (e.g. if a source record doesn't include a
> > > timestamp). For some use cases, such as ByteArrayConverter + no
> > > SMTs + lightweight processing (e.g. the record just gets handed to
> > > a background thread that deals with sending the data), it wouldn't
> > > be out of the question that adding 4 or so more calls to get
> > > timestamps could become a bottleneck. Since I don't know whether it
> > > would, but we have definitely seen the issue come up before, I
> > > would be conservative in adding the metrics unless we had some
> > > numbers showing it doesn't matter, or doesn't matter much.
> > >
> > > In general, I don't think metrics that require always-on
> > > measurement are a good way to get fine-grained performance
> > > information. Instrumenting different phases that imply different
> > > types of performance problems can be helpful (e.g. "processing
> > > time" that should be CPU/memory-throughput bound vs. "send time"
> > > that, at least for many connectors, is more likely to be IO bound),
> > > but if you want finer-grained details, you probably either want
> > > something that can be toggled on/off temporarily, or just use a
> > > tool that's really designed for the job, i.e. a profiler like perf.
> > >
> > > -Ewen
> > > >
> > > > Gwen
> > > >
> > > > On Mon, Sep 11, 2017 at 7:50 PM Ewen Cheslack-Postava <e...@confluent.io> wrote:
> > > >
> > > > > re: questions about additional metrics, I think we'll
> > > > > undoubtedly find more that people want in practice, but as I
> > > > > mentioned earlier I think it's better to add the ones we know
> > > > > we need and then fill out the rest as we figure it out. So,
> > > > > e.g., batch size metrics sound like they could be useful, but
> > > > > I'd probably wait until we have a clear use case. It seems
> > > > > likely that they could be useful in diagnosing slow connectors
> > > > > (e.g. the implementation just does something inefficient), but
> > > > > I'm not really sure about that yet.
> > > > >
> > > > > -Ewen
> > > > >
> > > > > On Mon, Sep 11, 2017 at 7:11 PM, Randall Hauch <rha...@gmail.com> wrote:
> > > > >
> > > > > > Based on Roger and Ewen's feedback, I removed the aggregate
> > > > > > metrics, as they would be difficult to make use of without
> > > > > > extra work. This simplified things a great deal, and I took
> > > > > > the opportunity to reorganize the groups of metrics. Also,
> > > > > > based upon Ewen's concerns regarding measuring
> > > > > > times/durations, I removed all time-related metrics except
> > > > > > for the offset commits and rebalances, which are infrequent
> > > > > > enough to warrant the capture of percentiles. Roger asked
> > > > > > about capturing batch size metrics for source and sink
> > > > > > tasks, and offset lag metrics for sink tasks. Finally, Ewen
> > > > > > pointed out that all count/total metrics are only valid
> > > > > > since the most recent rebalance and are therefore less
> > > > > > meaningful, so those were removed.
> > > > > >
> > > > > > On Mon, Sep 11, 2017 at 6:50 PM, Randall Hauch <rha...@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks, Ewen. Comments inline below.
> > > > > > >
> > > > > > > On Mon, Sep 11, 2017 at 5:46 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
> > > > > > >
> > > > > > > > Randall,
> > > > > > > >
> > > > > > > > A couple of questions:
> > > > > > > >
> > > > > > > > * Some metrics don't seem to have unique names? e.g.
> > > > > > > > source-record-produce-rate and
> > > > > > > > source-record-produce-total seem like they are
> > > > > > > > duplicated. Looks like maybe just an oversight that the
> > > > > > > > second ones should be changed from "produce" to "write".
> > > > > > >
> > > > > > > Nice catch. You are correct - should be "write" instead of
> > > > > > > "produce". I will correct.
> > > > > > >
> > > > > > > > * I think there's a stray extra character in a couple of
> > > > > > > > places:
> > > > > > > > kafka.connect:type=source-task-metrics,name=source-record-produce-total,worker=([-.\w]+)l,connector=([-.\w]+),task=([\d]+)
> > > > > > > > has an extra char after the worker name.
> > > > > > >
> > > > > > > Thanks. Removed in 2 places.
> > > > > > > > * Are the produce totals actually useful given
> > > > > > > > rebalancing would cancel them out anyway? Doesn't seem
> > > > > > > > like you could do much with them.
> > > > > > >
> > > > > > > Yes, the totals would be since the last rebalance. Maybe
> > > > > > > that isn't that useful. Might be better to capture the
> > > > > > > offsets and lag, as Roger was suggesting. Thoughts?
> > > > > > >
> > > > > > > > * Why do transformations get their own metric but not
> > > > > > > > converters? And are we concerned at all about the
> > > > > > > > performance impact of getting such fine-grained info?
> > > > > > > > Getting current time isn't free, and we've seen before
> > > > > > > > that we ended up w/ accidental performance regressions
> > > > > > > > as we tried to check it too frequently to enforce
> > > > > > > > fine-grained timeouts in the producer (iirc). Batching
> > > > > > > > helps w/ this, but on the consumer side, a
> > > > > > > > max.poll.records=1 setting could put you in a bad place,
> > > > > > > > especially since transforms might be very lightweight
> > > > > > > > (or nothing) and converters are expected to be
> > > > > > > > relatively cheap as well.
> > > > > > >
> > > > > > > We could remove the read, transform, and put time-based
> > > > > > > metrics for sink tasks, and the poll, transform, and write
> > > > > > > time-based metrics for source tasks. Can/should they be
> > > > > > > replaced with anything else?
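(For context on the max.poll.records=1 case: a Connect worker config can override the settings of the consumers feeding sink tasks via the consumer. prefix, so a single line turns per-batch costs into per-record costs.)

    # Excerpt from a Connect worker config (e.g. connect-distributed.properties).
    # Worker-level overrides with the "consumer." prefix apply to the consumers
    # that feed all sink tasks; max.poll.records=1 is the pathological case Ewen
    # describes, where per-batch timestamp calls become per-record calls.
    consumer.max.poll.records=1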
> > > > > > > > * If we include the worker id everywhere and don't have
> > > > > > > > metrics without that included, isn't that a pain for
> > > > > > > > users that dump this data into some other system? They
> > > > > > > > have to know which worker the connector/task is
> > > > > > > > currently on *or* need to do extra work to merge the
> > > > > > > > metrics from across machines. Including versions with
> > > > > > > > the worker ID can make sense for completeness and
> > > > > > > > accuracy (e.g. technically there are still very slim
> > > > > > > > risks of having a task running twice due to zombies),
> > > > > > > > but it seems like bad usability for the common case.
> > > > > > >
> > > > > > > Part of the reason was also to help identify where each of
> > > > > > > the metrics came from, but per the next comment this may
> > > > > > > not be as useful, either. So remove the worker ID in all
> > > > > > > the task and connector metric names? What about the worker
> > > > > > > metrics?
> > > > > > >
> > > > > > > > * Is aggregating things like source record rate at the
> > > > > > > > (worker, connector) level really useful since you're
> > > > > > > > just going to need to do additional aggregation anyway
> > > > > > > > once you've collected metrics across all workers? I'd
> > > > > > > > rather add a smaller number of metrics w/ clear use
> > > > > > > > cases than just try to be exhaustive and then have to
> > > > > > > > maintain stuff that nobody actually uses.
> > > > > > >
> > > > > > > Yes, the connector aggregate metrics are maybe not as
> > > > > > > useful if you also have to aggregate them from different
> > > > > > > workers. Removing them probably also reduces the risk of
> > > > > > > them being misinterpreted.
> > > > > > >
> > > > > > > > * You have status for connectors but not for tasks. Any
> > > > > > > > reason why? Seems like it'd make sense to expose both,
> > > > > > > > especially since users generally care about task status
> > > > > > > > more than connector status (not many connectors actually
> > > > > > > > run a monitoring thread.)
> > > > > > >
> > > > > > > Ack.
> > > > > > >
> > > > > > > > * Is number of tasks for each connector a useful metric?
> > > > > > > > Not sure whether someone would find this useful or not.
> > > > > > > > Probably not for alerts, but might be useful to be able
> > > > > > > > to check it via your metrics dashboard.
> > > > > > >
> > > > > > > Seems like it might be useful, at least in terms of
> > > > > > > tracking the number of tasks over time. Might not be as
> > > > > > > useful for connectors that have relatively static tasks,
> > > > > > > but it would be more interesting/useful for connectors
> > > > > > > that create tasks dynamically and periodically request
> > > > > > > task reconfigurations.
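(A sketch of what such a task-count metric could look like, implemented as a gauge with Kafka's metrics library; the group, metric, and tag names are illustrative, not from the KIP.)

    import java.util.Collections;
    import java.util.concurrent.atomic.AtomicInteger;
    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.common.metrics.Metrics;

    public class TaskCountGauge {
        private final AtomicInteger taskCount = new AtomicInteger();

        public TaskCountGauge(Metrics metrics, String connector) {
            MetricName name = metrics.metricName("task-count", "connector-metrics",
                    Collections.singletonMap("connector", connector));
            // Re-evaluated on each metrics poll, so a dashboard can watch the
            // count change as a connector requests task reconfiguration.
            metrics.addMetric(name, (config, now) -> taskCount.get());
        }

        public void onTasksReconfigured(int newCount) {
            taskCount.set(newCount);
        }
    }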
> > > > > > > > * Same questions re: granularity of sink
> > > > > > > > tasks/connectors timing and whether the connectors need
> > > > > > > > all the roll-ups of individual (worker, task) values to
> > > > > > > > (worker, connector) level.
> > > > > > >
> > > > > > > I'm fine with taking out the aggregates to keep things
> > > > > > > simple and prevent misunderstanding.
> > > > > > >
> > > > > > > > * If we expose who the worker currently thinks is
> > > > > > > > leader, it might also make sense to expose the
> > > > > > > > underlying epoch. Not actually sure if we expose that
> > > > > > > > for the consumer today, but it's an indicator of who is
> > > > > > > > properly up to date.
> > > > > > >
> > > > > > > Ack.
> > > > > > >
> > > > > > > > * Why worker-level offset commit stats? It's not clear
> > > > > > > > to me that these are useful without considering the
> > > > > > > > specific connector.
> > > > > > >
> > > > > > > So would they make more sense on the tasks? Again, on the
> > > > > > > worker they're aggregates.
> > > > > > >
> > > > > > > > -Ewen
> > > > > > > >
> > > > > > > > On Mon, Sep 11, 2017 at 9:43 AM, Randall Hauch <rha...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Thanks for reviewing. Responses inline below.
> > > > > > > > >
> > > > > > > > > On Mon, Sep 11, 2017 at 11:22 AM, Roger Hoover <roger.hoo...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Randall,
> > > > > > > > > >
> > > > > > > > > > Thank you for the KIP. This should improve
> > > > > > > > > > visibility greatly. I had a few questions/ideas for
> > > > > > > > > > more metrics.
> > > > > > > > > >
> > > > > > > > > > 1. What's the relationship between the worker state
> > > > > > > > > > and the connector status? Does the 'paused' status
> > > > > > > > > > at the Connector level include the time that worker
> > > > > > > > > > is 'rebalancing'?
> > > > > > > > >
> > > > > > > > > The worker state metric simply reports whether the
> > > > > > > > > worker is running or rebalancing. This state is
> > > > > > > > > independent of how many connectors are
> > > > > > > > > deployed/running/paused. During a rebalance, the
> > > > > > > > > connectors are being stopped and restarted but are
> > > > > > > > > effectively not running.
> > > > > > > > >
> > > > > > > > > > 2. Are the "Source Connector" metrics like record
> > > > > > > > > > rate an aggregation of the "Source Task" metrics?
> > > > > > > > >
> > > > > > > > > Yes.
> > > > > > > > >
> > > > > > > > > > - How much value is there in monitoring at the
> > > > > > > > > > "Source Connector" level (other than status) if the
> > > > > > > > > > number of constituent tasks may change over time?
> > > > > > > > >
> > > > > > > > > The task metrics allow you to know whether the tasks
> > > > > > > > > are evenly loaded and each making progress. The
> > > > > > > > > aggregate connector metrics tell you how much work has
> > > > > > > > > been performed by all the tasks in that worker. Both
> > > > > > > > > are useful IMO.
> > > > > > > > >
> > > > > > > > > > - I'm imagining that it's most useful to collect
> > > > > > > > > > metrics at the task level, as the task-level metrics
> > > > > > > > > > should be stable regardless of tasks shifting to
> > > > > > > > > > different workers.
> > > > > > > > >
> > > > > > > > > Correct, this is where the most value is because it is
> > > > > > > > > the most fine grained.
> > > > > > > > >
> > > > > > > > > > - If so, can we duplicate the Connector Status down
> > > > > > > > > > at the task level so that all important metrics can
> > > > > > > > > > be tracked by task?
> > > > > > > > >
> > > > > > > > > Possibly. The challenge is that the threads running
> > > > > > > > > the tasks are blocked when a connector is paused.
> > > > > > > > >
> > > > > > > > > > 3. For the Sink Task metrics:
> > > > > > > > > > - Can we add offset lag and timestamp lag on commit?
> > > > > > > > > > After records are flushed/committed:
> > > > > > > > > > - What is the diff between the record timestamps
> > > > > > > > > > and commit time (histogram)? This is a measure of
> > > > > > > > > > end-to-end pipeline latency.
> > > > > > > > > > - What is the diff between record offsets and the
> > > > > > > > > > latest offset of their partition at commit time
> > > > > > > > > > (histogram)? This is a measure of whether this
> > > > > > > > > > particular task is keeping up.
> > > > > > > > >
> > > > > > > > > Yeah, possibly. Will have to compare with the consumer
> > > > > > > > > metrics to see what we can get.
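(A sketch of the timestamp-lag histogram Roger describes, recorded at commit time using the Percentiles stat from Kafka's metrics library; the buffer size, bounds, and metric names are illustrative.)

    import org.apache.kafka.common.metrics.Metrics;
    import org.apache.kafka.common.metrics.Sensor;
    import org.apache.kafka.common.metrics.stats.Percentile;
    import org.apache.kafka.common.metrics.stats.Percentiles;
    import org.apache.kafka.common.metrics.stats.Percentiles.BucketSizing;
    import org.apache.kafka.common.utils.Time;
    import org.apache.kafka.connect.sink.SinkRecord;

    public class CommitLagRecorder {
        private final Sensor timestampLag;

        public CommitLagRecorder(Metrics metrics) {
            timestampLag = metrics.sensor("commit-timestamp-lag");
            // End-to-end latency distribution: record timestamp vs. commit time.
            timestampLag.add(new Percentiles(4096, 60_000.0, BucketSizing.LINEAR,
                new Percentile(metrics.metricName("timestamp-lag-ms-p50", "sink-task-metrics"), 50),
                new Percentile(metrics.metricName("timestamp-lag-ms-p99", "sink-task-metrics"), 99)));
        }

        // Called for each record of a flushed batch, at commit time.
        public void onCommit(SinkRecord record, long latestOffset) {
            long now = Time.SYSTEM.milliseconds();
            if (record.timestamp() != null)
                timestampLag.record(now - record.timestamp());
            // Roger's second histogram, offset lag, would be:
            //   latestOffset - record.kafkaOffset()
        }
    }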
> > > > > > > > > > - How about flush error rate? Assuming the sink
> > > > > > > > > > connectors are using retries, it would be helpful to
> > > > > > > > > > know how many errors they're seeing.
> > > > > > > > >
> > > > > > > > > We could add a metric to track how many times the
> > > > > > > > > framework receives a retry exception and then retries,
> > > > > > > > > but the connectors may also do this on their own.
> > > > > > > > >
> > > > > > > > > > - Can we tell at the framework level how many
> > > > > > > > > > records were inserted vs updated vs deleted?
> > > > > > > > >
> > > > > > > > > No, there's no distinction in the Connect framework.
> > > > > > > > >
> > > > > > > > > > - Batching stats
> > > > > > > > > > - Histogram of flush batch size
> > > > > > > > > > - Counts of flush trigger method (time vs max batch
> > > > > > > > > > size)
> > > > > > > > >
> > > > > > > > > Should be able to add these.
> > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > >
> > > > > > > > > > Roger
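(A sketch of where such a framework-level retry counter could hook in. It counts only the RetriableExceptions that reach the framework — retries inside the connector are invisible to it, as Randall notes. Names are illustrative, and a real implementation would back off and bound the retries.)

    import java.util.Collection;
    import org.apache.kafka.common.metrics.Metrics;
    import org.apache.kafka.common.metrics.Sensor;
    import org.apache.kafka.common.metrics.stats.Rate;
    import org.apache.kafka.common.metrics.stats.Total;
    import org.apache.kafka.connect.errors.RetriableException;
    import org.apache.kafka.connect.sink.SinkRecord;
    import org.apache.kafka.connect.sink.SinkTask;

    public class RetryCountingDelivery {
        private final Sensor retries;

        public RetryCountingDelivery(Metrics metrics) {
            retries = metrics.sensor("put-retries");
            retries.add(metrics.metricName("put-retry-rate", "sink-task-metrics"), new Rate());
            retries.add(metrics.metricName("put-retry-total", "sink-task-metrics"), new Total());
        }

        public void deliver(SinkTask task, Collection<SinkRecord> batch) {
            while (true) {
                try {
                    task.put(batch);
                    return;
                } catch (RetriableException e) {
                    // Count each retry exception surfaced to the framework,
                    // then retry the same batch (after a back-off, in practice).
                    retries.record(1.0);
                }
            }
        }
    }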
> > > > > > > > > > On Sun, Sep 10, 2017 at 8:45 AM, Randall Hauch <rha...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks, Gwen.
> > > > > > > > > > >
> > > > > > > > > > > That's a great idea, so I've changed the KIP to
> > > > > > > > > > > add those metrics. I've also made a few other
> > > > > > > > > > > changes:
> > > > > > > > > > >
> > > > > > > > > > > 1. The context of all metrics is limited to the
> > > > > > > > > > > activity within the worker. This wasn't clear
> > > > > > > > > > > before, so I changed the motivation and metric
> > > > > > > > > > > descriptions to explicitly state this.
> > > > > > > > > > > 2. Added the worker ID to all MBean attributes,
> > > > > > > > > > > hopefully making this same scope obvious from
> > > > > > > > > > > within JMX or other metric reporting systems.
> > > > > > > > > > > This is similar to how the Kafka producer and
> > > > > > > > > > > consumer metrics include the client ID in their
> > > > > > > > > > > MBean attributes. Hopefully this does not
> > > > > > > > > > > negatively impact or complicate how external
> > > > > > > > > > > reporting systems aggregate metrics from multiple
> > > > > > > > > > > workers.
> > > > > > > > > > > 3. Stated explicitly that aggregating metrics
> > > > > > > > > > > across workers was out of scope of this KIP.
> > > > > > > > > > > 4. Added metrics to report the connector class and
> > > > > > > > > > > version for both sink and source connectors.
> > > > > > > > > > >
> > > > > > > > > > > Check this KIP's history for details of these
> > > > > > > > > > > changes.
> > > > > > > > > > >
> > > > > > > > > > > Please let me know if you have any other
> > > > > > > > > > > suggestions. I hope to start the voting soon!
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > >
> > > > > > > > > > > Randall
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Sep 7, 2017 at 9:35 PM, Gwen Shapira <g...@confluent.io> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks for the KIP, Randall. Those are badly
> > > > > > > > > > > > needed!
> > > > > > > > > > > >
> > > > > > > > > > > > Can we have two metrics with record rate per
> > > > > > > > > > > > task? One before SMT and one after? We can have
> > > > > > > > > > > > cases where we read 5000 rows from JDBC but
> > > > > > > > > > > > write 5 to Kafka, or read 5000 records from
> > > > > > > > > > > > Kafka and write 5 due to filtering. I think it's
> > > > > > > > > > > > important to know both numbers.
> > > > > > > > > > > >
> > > > > > > > > > > > Gwen
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Sep 7, 2017 at 7:50 PM, Randall Hauch <rha...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi everyone.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I've created a new KIP to add metrics to the
> > > > > > > > > > > > > Kafka Connect framework:
> > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-196%3A+Add+metrics+to+Kafka+Connect+framework
> > > > > > > > > > > > >
> > > > > > > > > > > > > The KIP approval deadline is looming, so if
> > > > > > > > > > > > > you're interested in Kafka Connect metrics
> > > > > > > > > > > > > please review and provide feedback as soon as
> > > > > > > > > > > > > possible. I'm interested not only in whether
> > > > > > > > > > > > > the metrics are sufficient and appropriate,
> > > > > > > > > > > > > but also in whether the MBean naming
> > > > > > > > > > > > > conventions are okay.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Randall
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > *Gwen Shapira*
> > > > > > > > > > > > Product Manager | Confluent
> > > > > > > > > > > > 650.450.2760 | @gwenshap
> > > > > > > > > > > > Follow us: Twitter <https://twitter.com/ConfluentInc> | blog <http://www.confluent.io/blog>
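(A sketch of the before/after rates Gwen asks for: count once before the transformation chain and once after, treating a null result as a filtered record — which is exactly the 5000-in/5-out case she describes. Names are illustrative, not from the KIP.)

    import org.apache.kafka.common.metrics.Metrics;
    import org.apache.kafka.common.metrics.Sensor;
    import org.apache.kafka.common.metrics.stats.Rate;
    import org.apache.kafka.connect.connector.ConnectRecord;
    import org.apache.kafka.connect.transforms.Transformation;

    public class TransformStageCounter<R extends ConnectRecord<R>> {
        private final Sensor before;
        private final Sensor after;

        public TransformStageCounter(Metrics metrics) {
            before = metrics.sensor("records-before-smt");
            after = metrics.sensor("records-after-smt");
            before.add(metrics.metricName("record-rate-before-smt", "task-metrics"), new Rate());
            after.add(metrics.metricName("record-rate-after-smt", "task-metrics"), new Rate());
        }

        // Wraps a transformation; an SMT that returns null drops the record,
        // so the two rates diverge when records are filtered out.
        public R apply(Transformation<R> smt, R record) {
            before.record(1.0);
            R result = smt.apply(record);
            if (result != null)
                after.record(1.0);
            return result;
        }
    }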