Thanks Sophie, I hope this isn't too nit-picky, but is there a reason to choose "avg" instead of "mean"? Maybe this is too paranoid, and I might be oversensitive because of the mistake I just made earlier, but it strikes me that "avg" is actually ambiguous, as it refers to a family of statistics, whereas "mean" is specific. I see other Kafka metrics with "avg", but none with "mean"; was that the reason? If so, I'm +1.
Regarding the names of the percentiles, I actually couldn't find _any_ other metrics that use percentiles. Was there a reason to choose "99th" as opposed to "p99" or any other scheme? This is not a criticism; I'm just asking for consistency's sake.

Thanks again,
-John

On Wed, May 13, 2020, at 19:19, Sophie Blee-Goldman wrote:
> Alright, I can get behind adding the min metric for the sake of pretty graphs (and trivial computation).
>
> I'm still on the fence regarding the mean (or 50th percentile), but I can see how users might expect it and find it a bit disorienting not to have. So the updated proposed metrics are:
>
> - record-staleness-max [ms]
> - record-staleness-99th [ms] *(99th percentile)*
> - record-staleness-75th [ms] *(75th percentile)*
> - record-staleness-avg [ms] *(mean)*
> - record-staleness-min [ms]
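For anyone following along, here is a rough, purely illustrative sketch of how a set of stats like this could be wired up with the org.apache.kafka.common.metrics API that Streams uses internally. None of this code is from the KIP; the sensor/group names, histogram size, and value range are made-up placeholders.

    import org.apache.kafka.common.metrics.Metrics;
    import org.apache.kafka.common.metrics.Sensor;
    import org.apache.kafka.common.metrics.stats.Avg;
    import org.apache.kafka.common.metrics.stats.Max;
    import org.apache.kafka.common.metrics.stats.Min;
    import org.apache.kafka.common.metrics.stats.Percentile;
    import org.apache.kafka.common.metrics.stats.Percentiles;
    import org.apache.kafka.common.metrics.stats.Percentiles.BucketSizing;

    public class StalenessSensorSketch {
        public static void main(final String[] args) {
            final Metrics metrics = new Metrics();
            final Sensor staleness = metrics.sensor("record-staleness");
            final String group = "stream-task-metrics";   // placeholder group name

            // min / avg / max are plain single-value stats
            staleness.add(metrics.metricName("record-staleness-min", group), new Min());
            staleness.add(metrics.metricName("record-staleness-avg", group), new Avg());
            staleness.add(metrics.metricName("record-staleness-max", group), new Max());

            // percentiles are estimated from a fixed-size histogram; the 1000-byte size
            // and 0..600000 ms value range are arbitrary choices for this sketch
            staleness.add(new Percentiles(1000, 0, 600_000, BucketSizing.CONSTANT,
                new Percentile(metrics.metricName("record-staleness-99th", group), 99),
                new Percentile(metrics.metricName("record-staleness-75th", group), 75)));

            // per record, the task would then do something like:
            staleness.record(123.0);   // wallclock ms minus record timestamp ms
            metrics.close();
        }
    }

The point is just that max/avg/min come essentially for free, while the percentiles are approximations whose cost depends on the histogram configuration, which is why the overhead question raised below matters.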
>
> On Wed, May 13, 2020 at 4:42 PM John Roesler <vvcep...@apache.org> wrote:
>
> > Oh boy, I never miss an opportunity to embarrass myself. I guess the mean seems more interesting to me than the median, but neither is as interesting as the higher percentiles (99th and max).
> >
> > Min isn’t really important for any SLAs, but it does round out the mental picture of the distribution. I’ve always graphed min along with the other metrics to help me understand how fast the system can be, which helps in optimization decisions. It’s also a relatively inexpensive metric to compute, so it might be nice to just throw it in.
> >
> > On Wed, May 13, 2020, at 18:18, Sophie Blee-Goldman wrote:
> > > G1:
> > > I was considering it as the "end-to-end latency *up* to the specific task", but I'm happy with "record-staleness" if that drives the point home better. So it's the "staleness of the record when it is received by that task" -- will update the KIP.
> > >
> > > B1/J:
> > > I'm struggling to imagine a case where the min would actually be useful, rather than just intellectually interesting. I don't feel strongly that we shouldn't add it, but that's why I didn't include it from the start. Can you enlighten me with an example?
> > >
> > > I was also vaguely concerned about the overhead of adding multiple percentile metrics. Do we have any data to indicate what kind of performance hit we take on metrics computation?
> > >
> > > Also, not to be too pedantic, but the 50th percentile would be the median, not the mean. Would you propose to add the mean *and* the 50th percentile, or just one of the two?
> > >
> > > Thanks all!
> > > Sophie
> > >
> > > On Wed, May 13, 2020 at 3:34 PM John Roesler <vvcep...@apache.org> wrote:
> > >
> > > > Hello all, and thanks for the KIP, Sophie.
> > > >
> > > > Just some comments on the discussion so far:
> > > >
> > > > B2/G1:
> > > > In principle, it shouldn't matter whether we report "spans" or "end-to-end" latency. But in practice, some of the spans are pretty difficult to really measure (like time spent waiting in the topics, or time from the event happening to the ETL producer choosing to send it, or time spent in send/receive buffers, etc., etc.).
> > > >
> > > > In other words, it's practically easier to compute spans by subtracting e2e latencies than it is to compute e2e latencies by adding spans. You can even just consider that the span computation from e2e always just involves subtracting two numbers, whereas computing e2e latency from spans involves adding _all_ the spans leading up to the end you care about.
> > > >
> > > > It seems like people really prefer to have spans when they are debugging latency problems, whereas e2e latency is a more general measurement that basically every person/application cares about and should be monitoring.
> > > >
> > > > Altogether, it really seems to provide more value to more people if we report e2e latencies. Regarding "record-staleness" as a name, I think I have no preference; I'd defer to other people's intuition.
> > > >
> > > > G2:
> > > > I think the processor-node metric is nice, since the inside of a task can introduce a significant amount of latency in some cases. Plus, it's a more direct measurement, if you really wanted to know (for the purposes of IQ or something) how long it takes source events to "show up" at the store.
> > > >
> > > > I think actually recording it at every processor could be expensive, but we already record a bunch of metrics at the node level.
> > > >
> > > > B1:
> > > > I think 50% could be reasonable to record also. Even if it's a poor metric for operational purposes, a lot of people might expect to see "mean". Actually, I was surprised not to see "min". Is there a reason to leave it off?
> > > >
> > > > I might suggest: min, mean (50th), 75th, 99th, max
> > > >
> > > > B3:
> > > > I agree we should include late records (though not the ones we drop). It may be spiky, but only when there are legitimately some records with a high end-to-end latency, which is the whole point of these metrics.
> > > >
> > > > That's it! I don't think I have any other feedback, other than a request to also report "min".
> > > >
> > > > Thanks,
> > > > -John
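To make the subtraction argument above concrete, a trivial illustration with made-up numbers (none of this is from the KIP):

    public class SpanFromE2eSketch {
        public static void main(final String[] args) {
            // e2e latency of the same record observed at two points in the topology,
            // in ms since the record's event timestamp (made-up numbers)
            final long e2eAtSourceTask = 40;
            final long e2eAtStateStore = 260;

            // the "span" between the two points is just a difference of two e2e values
            final long span = e2eAtStateStore - e2eAtSourceTask;   // 220 ms
            System.out.println("source -> store span: " + span + " ms");

            // going the other direction, the e2e latency at the store would require
            // summing every span upstream of it (producer, topic dwell time, poll,
            // processing, ...), several of which are hard to measure directly
        }
    }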
> > > >
> > > > On Wed, May 13, 2020, at 16:58, Guozhang Wang wrote:
> > > > > Thanks Sophie for the KIP, a few quick thoughts:
> > > > >
> > > > > 1) The end-to-end latency includes both the processing latency of the task and the latency spent sitting in intermediate topics. I have a similar feeling to what Boyang mentioned above: the latency metric of a task A actually measures the latency of the sub-topology up to, but not including, the processing of A, which is a bit weird.
> > > > >
> > > > > Maybe my feeling comes from the name "latency" itself, since today we already have several "latency" metrics which measure elapsed system time for processing a record, etc., while here we are comparing the system wallclock time with the record timestamp.
> > > > >
> > > > > Maybe we can consider renaming it to "record-staleness" (note we already have a "record-lateness" metric), in which case recording it at the system time before we start processing the record sounds more natural.
> > > > >
> > > > > 2) With that in mind, I'm wondering if the processor-node-level DEBUG metric is worth adding, given that we already have a task-level processing latency metric. Basically, a specific node's e2e latency is similar to the task-level e2e latency + task-level processing latency. Personally I think having a task-level record-staleness metric is sufficient.
> > > > >
> > > > > Guozhang
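In code terms, the "record-staleness" measurement Guozhang describes boils down to something like the sketch below; the class and parameter names are placeholders for illustration, not the actual Streams implementation.

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.metrics.Sensor;
    import org.apache.kafka.common.utils.Time;

    public class StalenessMeasurementSketch {
        // staleness = wallclock time at the moment the task (or node) receives the
        // record, minus the record's event timestamp
        static void recordStaleness(final Time time,
                                    final Sensor stalenessSensor,
                                    final ConsumerRecord<?, ?> record) {
            final long stalenessMs = time.milliseconds() - record.timestamp();
            stalenessSensor.record(stalenessMs);
        }
    }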
> > > > >
> > > > > On Wed, May 13, 2020 at 11:46 AM Sophie Blee-Goldman <sop...@confluent.io> wrote:
> > > > >
> > > > > > 1. I felt that 50% was not a particularly useful gauge for this specific metric, as it's presumably most useful at putting an *upper* bound on the latency you can reasonably expect to see. I chose percentiles that would hopefully give a good sense of what *most* records will experience, and what *close to all* records will.
> > > > > >
> > > > > > However, I'm not married to these specific numbers and could be convinced. Would be especially interested in hearing from users on this.
> > > > > >
> > > > > > 2. I'm inclined not to include the "hop-to-hop latency" in this KIP since users can always compute it themselves by subtracting the previous node's end-to-end latency. I guess we could do it either way since you can always compute one from the other, but I think the end-to-end latency feels more valuable, as its main motivation is not to debug bottlenecks in the topology but to give users a sense of how long it takes a record to be reflected in certain parts of the topology. For example, this might be useful for users who are wondering roughly when a record that was just produced will be included in their IQ results. Debugging is just a nice side effect -- but maybe I didn't make that clear enough in the KIP's motivation.
> > > > > >
> > > > > > 3. Good question, I should address this in the KIP. The short answer is "yes", we will include late records. I added a paragraph to the end of the Proposed Changes section explaining the reasoning here; please let me know if you have any concerns.
> > > > > >
> > > > > > 4. Assuming you're referring to the existing metric "process-latency", that metric reflects the time for the literal Node#process method to run, whereas this metric would always be measured relative to the event timestamp.
> > > > > >
> > > > > > That said, the naming collision there is pretty confusing, so I've renamed the metrics in this KIP to "end-to-end-latency", which I feel better reflects the nature of the metric anyway.
> > > > > >
> > > > > > Thanks for the feedback!
> > > > > >
> > > > > > On Wed, May 13, 2020 at 10:21 AM Boyang Chen <reluctanthero...@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks for the KIP, Sophie. Getting the E2E latency is important for understanding the bottlenecks of the application.
> > > > > > >
> > > > > > > A couple of questions and ideas:
> > > > > > >
> > > > > > > 1. Could you clarify the rationale for picking the 75th and 99th percentiles and max? Normally I see cases where we use the 50th and 90th percentiles as well in production systems.
> > > > > > >
> > > > > > > 2. The current latency being computed is cumulative, i.e. if a record goes through A -> B -> C, then P(C) = T(B->C) + P(B) = T(B->C) + T(A->B) + T(A) and so on, where P() represents the captured latency and T() represents the time for transiting the record between two nodes, including processing time. For monitoring purposes, maybe having T(B->C) and T(A->B) is more natural to view as "hop-to-hop latency"; otherwise, if there is a spike in T(A->B), both P(B) and P(C) are affected at the same time. In the same spirit, the E2E latency is meaningful only when the record exits from the sink, as this marks the whole time the record spent inside the funnel. Do you think we could have separate treatment for sink nodes and other nodes, so that other nodes only count the time since receiving the record from the last hop? I'm not proposing a solution here, I just want to discuss this alternative to see if it is reasonable.
> > > > > > >
> > > > > > > 3. As we are going to monitor late-arrival records as well, they would create some really spiky graphs when the out-of-order records are interleaved with on-time records. Should we also supply a smoothed version of the latency metrics, or should users just take care of it themselves?
> > > > > > >
> > > > > > > 4. Regarding this new metric, we haven't discussed its relation to our existing processing latency metrics; could you add some context comparing the two and a simple `when to use which` guide?
> > > > > > >
> > > > > > > Boyang
> > > > > > >
> > > > > > > On Tue, May 12, 2020 at 7:28 PM Sophie Blee-Goldman <sop...@confluent.io> wrote:
> > > > > > >
> > > > > > > > Hey all,
> > > > > > > >
> > > > > > > > I'd like to kick off discussion on KIP-613, which aims to add end-to-end latency metrics to Streams. Please take a look:
> > > > > > > >
> > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-613%3A+Add+end-to-end+latency+metrics+to+Streams
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Sophie
> > > > >
> > > > > --
> > > > > -- Guozhang
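Finally, since much of the motivation above is about what users would actually do with these numbers (IQ freshness, SLAs, dashboards), here is a hedged sketch of how an application could poll them through the existing KafkaStreams#metrics() API once the final names are settled. The "record-staleness" prefix below is only a guess at a naming scheme that is still under discussion.

    import java.util.Map;
    import org.apache.kafka.common.Metric;
    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.streams.KafkaStreams;

    public class EndToEndLatencyPoller {
        // assumes an already-started KafkaStreams instance
        static void logEndToEndLatency(final KafkaStreams streams) {
            final Map<MetricName, ? extends Metric> all = streams.metrics();
            for (final Map.Entry<MetricName, ? extends Metric> entry : all.entrySet()) {
                // match on whichever final name the KIP lands on ("record-staleness-*"
                // or "*-end-to-end-latency"); this prefix is only a placeholder
                if (entry.getKey().name().startsWith("record-staleness")) {
                    System.out.println(entry.getKey().name() + " = " + entry.getValue().metricValue());
                }
            }
        }
    }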