Ilya Maximets <i.maxim...@ovn.org> writes: > On 6/20/23 16:10, Aaron Conole wrote: >> Adrian Moreno <amore...@redhat.com> writes: >> >>> On 6/19/23 10:36, Eelco Chaudron wrote: >>>> On 16 Jun 2023, at 19:19, Aaron Conole wrote: >>>> >>>>> Martin Kennelly <mkenn...@redhat.com> writes: >>>>> >>>>>> Hey ovs community, >>>>>> >>>>>> I am a developer working on ovn-kubernetes and I want to >>>>>> programmatically consume long poll information >>>>>> i.e: >>>>>> ovs|00211|timeval(handler25)|WARN|Unreasonably long 52388ms poll >>>>>> interval (752ms user, 209ms system) >>>>>> >>>>>> This is currently exposed via journal logs but it's not practical >>>>>> to consume it there programmatically and I was >>>>>> hoping you could add it to coverage metrics. >>>>> >>>>> I think it could be useful. I do want to be careful about exposing >>>>> these kinds of data in a way that could be misinterpreted. Already, >>>>> that log in particular gets misinterpreted quite a bit, and RH gets >>>>> customers claiming OVS is misbehaving when they've oversubscribed the >>>>> system. >>>> +1 >>>> >>> >>> Maybe it's a good time to start documenting coverage counters? >> >> I agree - we should have at least some kind of documentation. Actually, >> it would be really nice if we could do something during >> COVERAGE_DEFINE() that would be like: >> >> COVERAGE_DEFINE(ctr, "description") >> >> and then we can generate documentation from the COVERAGE_DEFINE()s as >> well as querying for it with an ovs-appctl command. >> >> That might be trying to be too fancy, though. > > > The problem with documenting coverage counters is that we can't > and shouldn't claim that these are in any way a stable API. > They are mainly for developers. Code can change and there might > be no meaningful way preserve current counters or their names. > They can change in each release including minor ones without > prior notice.
Agreed that these are developer focused counters, and not proper KPIs. Even with that, there can be some information for others to take some useful information back to the developers with. For example, we can see how often this particular event is happening to correlate with system issues. I guess that is why I prefer the in-code documentation to doing some kind of "traditional" man-page for these as well. > Best regards, Ilya Maximets. > >> >>>>> Mechanically, it would be pretty simple to do something like: >>>>> >>>>> --- >>>>> diff --git a/lib/timeval.c b/lib/timeval.c >>>>> index 193c7bab17..00e5f2a74d 100644 >>>>> --- a/lib/timeval.c >>>>> +++ b/lib/timeval.c >>>>> @@ -40,6 +40,7 @@ >>>>> #include "openvswitch/vlog.h" >>>>> >>>>> VLOG_DEFINE_THIS_MODULE(timeval); >>>>> +COVERAGE_DEFINE(long_poll_interval); >>>>> >>>>> #if !defined(HAVE_CLOCK_GETTIME) >>>>> typedef unsigned int clockid_t; >>>>> @@ -645,6 +646,8 @@ log_poll_interval(long long int last_wakeup) >>>>> struct rusage rusage; >>>>> >>>>> if (!getrusage_thread(&rusage)) { >>>>> + COVERAGE_INC(long_poll_interval); >>>>> + >>>>> VLOG_WARN("Unreasonably long %lldms poll interval" >>>>> " (%lldms user, %lldms system)", >>>>> interval, >>>>> --- >>>>> >>>>> This would at least expose the coverage data via the coverage framework >>>>> and it can be queried via ovs-appctl. Actually, the advantage here is >>>>> that the coverage counter can track some details about X/sec over the >>>>> last 5 seconds, minute, hour, in addition to the total, so we can see >>>>> whether the condition is ongoing. >>>> _______________________________________________ >>>> dev mailing list >>>> d...@openvswitch.org >>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >>>> >> >> _______________________________________________ >> dev mailing list >> d...@openvswitch.org >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >> _______________________________________________ discuss mailing list disc...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-discuss