Alrighty! The KIP is updated. Thanks again, all, for the feedback.

-John
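As a concrete illustration of the approach the thread converges on below (a single aggregate skipped-records count with no "reason" tag, plus a warn-level log carrying the topic/partition/offset and reason of each skip), a minimal sketch in plain Java might look as follows. All class and method names here are hypothetical, not the actual Streams internals:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;

// Hypothetical sketch: one aggregate skipped-records counter (no "reason"
// tag), with the per-skip detail pushed into a warn-level log line instead.
public class SkippedRecordSketch {
    private static final Logger LOG = Logger.getLogger(SkippedRecordSketch.class.getName());

    private final AtomicLong skippedRecords = new AtomicLong();

    // Called at each place a record is about to be dropped.
    public void recordSkip(String topic, int partition, long offset, String reason) {
        skippedRecords.incrementAndGet(); // single metric, cheap to alert on
        LOG.warning(String.format(
                "Skipping record. topic=[%s] partition=[%d] offset=[%d] reason=[%s]",
                topic, partition, offset, reason)); // detail lives in the log
    }

    public long skippedRecordsTotal() {
        return skippedRecords.get();
    }
}
```

The counter answers "did anything get skipped?" for alerting, while the log line preserves enough context to find the culprit without running at DEBUG all the time.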
On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <j...@confluent.io> wrote:

Hey Guozhang and Bill,

Ok, I'll update the KIP. At the risk of disturbing consensus, I'd like to put it in the task instead of the thread so that it'll show up in the TopologyTestDriver metrics as well.

I'm leaning toward keeping the scope where it is right now, but if others want to advocate for tossing in some more metrics, we can go that route.

Thanks all,
-John

On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bbej...@gmail.com> wrote:

Thanks for the KIP John, and sorry for the late comments.

I'm on the fence about providing single-level metrics, but I think we'll have that discussion outside of this KIP.

> * maintain one skipped-record metric (could be per-thread, per-task, or per-processor-node) with no "reason"
> * introduce a warn-level log detailing the topic/partition/offset and reason of the skipped record

I'm +1 on both of these suggestions.

Finally, we have had requests in the past for some metrics around when a persistent store removes an expired window. Would adding that to our metrics stretch the scope of this KIP too much?

Thanks again, and overall I'm +1 on this KIP.

Bill

On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <wangg...@gmail.com> wrote:

The proposal sounds good to me. About "maintain only one level of metrics", maybe we can discuss that separately from this KIP, since that would be a larger scope of discussion. I agree that if we are going to maintain only one level of metrics it should be the lowest level, and we would let users do the roll-ups themselves, but I'm still not fully convinced that we should just provide single-level metrics, because 1) I think for different metrics people may be interested to investigate into different granularities, e.g.
for poll / commit rate these are at the lowest task-level metrics, while for process-rate / skip-rate they can be as low as processor-node metrics, and 2) user-side roll-ups may not be very straightforward. But for 2), if someone can provide an efficient and easy implementation of that, I can be persuaded :)

For now I'm thinking we can add the metric at the thread level: either finer-grained ones with a "reason" tag plus an aggregated one without the tag, or just a single aggregated metric without the tag; either looks good to me.

Guozhang

On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <j...@confluent.io> wrote:

Hey Guozhang,

Thanks for the reply. Regarding JMX, I can dig it. I'll provide a list in the KIP. I was also thinking we'd better start a documentation page with the metrics listed.

I'd have no problem logging a warning when we skip records. On the metric front, really I'm just pushing for us to maintain only one level of metrics. If that's more or less granular (i.e., maybe we don't have a metric per reason and log the reason instead), that's fine by me. I just don't think it provides a lot of extra value per complexity (interface and implementation) to maintain roll-ups at the thread level in addition to lower-level metrics.

How about this instead:
* maintain one skipped-record metric (could be per-thread, per-task, or per-processor-node) with no "reason"
* introduce a warn-level log detailing the topic/partition/offset and reason of the skipped record

If you like that, I can update the KIP.
Thanks,
-John

On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <wangg...@gmail.com> wrote:

> One thing you mention is the notion of setting alerts on coarser metrics being easier than finer ones. All the metric alerting systems I have used make it equally easy to alert on metrics by-tag or over tags. So my experience doesn't say that this is a use case. Were you thinking of an alerting system that makes such a pre-aggregation valuable?

For the commonly used JMX reporter, tags will be encoded directly as part of the object name, and if users want to monitor them they need to know these values beforehand. That is also why I think we do want to list all the possible values of the reason tags in the KIP.

> In my email in response to Matthias, I gave an example of the kind of scenario that would lead me as an operator to run with DEBUG on all the time, since I wouldn't be sure, having seen a skipped record once, that it would ever happen again. The solution is to capture all the available information about the reason and location of skips all the time.

That is a good point.
I think we can either expose all levels of metrics by default, or only expose the lowest-level metrics and get rid of the other levels to let users do the roll-ups themselves (which will be a much larger scope for discussion), or we can encourage users to not depend purely on metrics for such troubleshooting: that is to say, users would only be alerted based on metrics, and we can log an info/warn log4j entry each time we are about to skip a record, in all the relevant places, so that upon being notified users can look into the logs to find the details on where / when it happens. WDYT?

Guozhang

On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <j...@confluent.io> wrote:

Hey Guozhang,

Thanks for the review.

1. Matthias raised the same question about the "reason" tag values. I can list all possible values of the "reason" tag, but I'm thinking this level of detail may not be KIP-worthy; maybe the code and documentation review would be sufficient. If you all disagree and would like it included in the KIP, I can certainly do that.

If we do provide roll-up metrics, I agree with the pattern of keeping the same name but eliminating the tags for the dimensions that were rolled up.

2. I'm not too sure that implementation efficiency really becomes a factor in choosing whether to (by default) update one coarse metric at the thread level or one granular metric at the processor-node level, since it's just one metric being updated either way.
I do agree that if we were to update the granular metrics and multiple roll-ups, then we should consider the efficiency.

I agree it's probably not necessary to surface the metrics for all nodes regardless of whether they can or do skip records. Perhaps we can lazily register the metrics.

In my email in response to Matthias, I gave an example of the kind of scenario that would lead me as an operator to run with DEBUG on all the time, since I wouldn't be sure, having seen a skipped record once, that it would ever happen again. The solution is to capture all the available information about the reason and location of skips all the time.

One thing you mention is the notion of setting alerts on coarser metrics being easier than finer ones. All the metric alerting systems I have used make it equally easy to alert on metrics by-tag or over tags. So my experience doesn't say that this is a use case. Were you thinking of an alerting system that makes such a pre-aggregation valuable?

Thanks again,
-John

On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <wangg...@gmail.com> wrote:

Hello John,

Thanks for the KIP. Some comments:

1. Could you list all the possible values of the "reason" tag? In the JIRA ticket I left some potential reasons, but I'm not clear whether you're going to categorize each of them as a separate reason, or whether there are any additional ones you have in mind.
Also, I'm wondering if we should add another metric that does not have the reason tag but aggregates among all possible reasons. This is for users to easily set their alerting notifications in their monitoring systems (otherwise they have to write one notification rule per reason).

2. Note that the processor-node metrics are actually "per-thread, per-task, per-processor-node", and today we only set the per-thread metrics as INFO while leaving the lower two layers as DEBUG. I agree with your argument that we are missing the per-client roll-up metrics today, but I'm not convinced that the right way to approach it would be "just providing the lowest-level metrics only".

Note that the recording implementation of these three levels is different internally today: we did not just do the rolling up to generate the higher-level metrics from the lower-level ones; we record them separately, which means that if we turn on multiple levels of metrics, we may duplicate the collection of some metrics. One can argue that this is not the best way to represent multi-level metrics collecting and reporting, but by only enabling thread-level metrics as INFO today, that implementation could be more efficient than collecting the metrics only at the lowest level and then doing the roll-up calculations outside of the metrics classes.
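To make the user-side roll-up question concrete: if only the lowest-level metrics were exposed, a client-side aggregation over them might look like the sketch below. It is deliberately simplified; the real KafkaStreams#metrics() returns Map<MetricName, ? extends Metric>, whereas here the keys are flattened into "thread/task/node/metric" strings and all names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a user-side roll-up: sum the per-processor-node
// "skipped-records-total" values into one total per thread.
// Keys are simplified to "thread/task/node/metric" strings.
public class SkipRollup {
    public static Map<String, Double> perThreadTotals(Map<String, Double> metrics) {
        Map<String, Double> totals = new HashMap<>();
        for (Map.Entry<String, Double> e : metrics.entrySet()) {
            String[] parts = e.getKey().split("/");
            if (parts.length < 4 || !"skipped-records-total".equals(parts[3])) {
                continue; // ignore unrelated metrics
            }
            totals.merge(parts[0], e.getValue(), Double::sum); // roll up by thread tag
        }
        return totals;
    }
}
```

The aggregation itself is only a few lines; the friction is more that every user would have to rediscover and maintain this logic, and the tag layout it assumes, themselves.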
Plus, today not all processor nodes may possibly skip records; AFAIK we will only skip records at the source, sink, window, and aggregation processor nodes, so adding a metric per processor looks like overkill to me as well. On the other hand, from the user's perspective the "reason" tag may be sufficient for them to narrow down where inside the topology records are being dropped on the floor. So I think the "per-thread, per-task" level metrics should be sufficient for troubleshooting in DEBUG mode, and we can add another "per-thread" level metric as INFO which is turned on by default. So under normal execution users still only need INFO-level metrics for alerting (e.g. set alerts on all skipped-records metrics being non-zero), and then upon troubleshooting they can turn on DEBUG metrics to look into which task is actually causing the skipped records.

Guozhang

On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <matth...@confluent.io> wrote:

Thanks for the KIP John.

Reading the material on the related Jiras, I am wondering what `reason` tags you want to introduce? Can you elaborate? The KIP should list those IMHO.

About the fine-grained metrics vs the roll-up: you say that

> the coarse metric aggregates across two dimensions simultaneously

Can you elaborate why this is an issue?
I am not convinced atm that we should put the fine-grained metrics into INFO level and remove the roll-up at the thread level.

> Given that they have to do this sum to get a usable top-level view

This is a fair concern, but I don't share the conclusion. Offering a built-in `KafkaStreams` "client" roll-up out of the box might be a better solution. In the past we did not offer this due to performance concerns, but we could allow an "opt-in" mechanism. If you disagree, can you provide some reasoning and add it to the "Rejected alternatives" section?

To rephrase: I understand the issue about the missing top-level view, but instead of going more fine-grained, we should consider adding this top-level view and adding/keeping the fine-grained metrics at DEBUG level only.

I am +1 to add TopologyTestDriver#metrics() and to remove old metrics directly as you suggested.

-Matthias

On 3/28/18 6:42 PM, Ted Yu wrote:

Looks good to me.

On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <j...@confluent.io> wrote:

Hello all,

I am proposing KIP-274 to improve the metrics around skipped records in Streams.
Please find the details here:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-274%3A+Kafka+Streams+Skipped+Records+Metrics

Please let me know what you think!

Thanks,
-John

--
-- Guozhang
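One footnote on the JMX point raised in the thread: the JMX reporter encodes each metric tag into the MBean ObjectName, so per-reason metrics yield one name per reason value (which a dashboard must know up front), while an aggregate metric without the tag has a single stable name. The sketch below illustrates this; the domain and property keys are illustrative, not the exact names Streams registers. It also shows that a monitoring system supporting ObjectName patterns can still match all reasons at once:

```java
import javax.management.ObjectName;

// Illustrative only: how metric tags show up in JMX object names.
public class JmxNameSketch {
    // One ObjectName per (thread, reason) pair: charting these individually
    // requires knowing every reason value beforehand.
    public static ObjectName perReasonName(String threadId, String reason) throws Exception {
        return new ObjectName(String.format(
                "kafka.streams:type=stream-metrics,thread-id=%s,reason=%s", threadId, reason));
    }

    // One stable name per thread when the metric carries no "reason" tag.
    public static ObjectName aggregateName(String threadId) throws Exception {
        return new ObjectName(String.format(
                "kafka.streams:type=stream-metrics,thread-id=%s", threadId));
    }

    // Pattern matching still lets a monitoring system cover all reasons at once.
    public static boolean matches(ObjectName pattern, ObjectName concrete) {
        return pattern.apply(concrete);
    }
}
```

Whether a given alerting system makes such pattern queries convenient is exactly the question debated upthread.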