Re: [E] Re: Design Review Request: Log Throttling

Brian Neradt Fri, 30 Oct 2020 10:05:05 -0700

Hi [email protected],

I've documented our design options and played around with a few prototypes.
See the content below. Please let me know if I'm missing anything or if
people have other thoughts or ideas. The draft PR in its current state
represents the recommended option number 4:
https://github.com/apache/trafficserver/pull/7279



Introduction

It has been observed by ops and devs working with prod boxes that certain
logs are annoyingly noisy. Many of these logs relate to the logging
mechanism itself having issues. If a particular message cannot be logged,
then a warning or error message (depending on where the problem is found)
is emitted. Thus, if a particular message results in a line that is too
long, a warning is emitted. If the pipe reader for a log pipe is down,
every single logging event will result in a warning or error. These issues
can result in hundreds or thousands of warnings or errors per second, which
makes reading the logs very difficult and results in log space filling up
with such messages.

Options
1. Specific Throttling Per Call Site

Apply log throttling on a per call site basis to specific, noisy logs.
SiteThrottledWarn, etc. are introduced and used at the call sites observed
to be noisy. To be clear: no string matches are done; throttling is applied
based upon the call site, not the content of the messages. Thus, for
example, the following messages, each from the same call site, would be
throttled even though they differ slightly in content:

[Oct 29 02:18:29.501] [LOG_FLUSH] WARNING: File:<some_path>/log.pipe
was closed, have dropped (288) bytes.
[Oct 29 02:18:29.501] [LOG_FLUSH] WARNING: File:<some_path>/log.pipe
was closed, have dropped (281) bytes.
[Oct 29 02:18:29.501] [LOG_FLUSH] WARNING: File:<some_path>/log.pipe
was closed, have dropped (278) bytes.

The fact that these are throttled is by design. It is messages like this
that are precisely the reason this throttling feature is being considered.

As an example, we would implement the site throttled versions like this:
*PerCallSiteThrottling*
#define SiteThrottledDiagsError(level, ...)                           \
  do {                                                                \
    static constexpr const SourceLocation loc = MakeSourceLocation(); \
    static LogMessage log_message{IS_THROTTLED};                      \
    log_message.message(level, loc, __VA_ARGS__);                     \
  } while (0)

#define SiteThrottledStatus(...) SiteThrottledDiagsError(DL_Status,
__VA_ARGS__)   // Log information
#define SiteThrottledNote(...) SiteThrottledDiagsError(DL_Note,
__VA_ARGS__)       // Log significant information
#define SiteThrottledWarning(...) SiteThrottledDiagsError(DL_Warning,
__VA_ARGS__) // Log concerning information
// etc.
Advantages

   - Low risk: only logs tagged for per-site throttling are modified in any
   way.
   - Solves the log flush (and family) of noisy logs by marking each of
   these for per-site throttling.

Disadvantages

   - This produces two sets of APIs: one set for
   throttling (SiteThrottledWarn, etc.) alongside the current set without
   throttling (Warn, etc.).
   - A LogMessage object is not applied to each. This is unfortunate
   because having a state for every message could be helpful for future
   configuration development.
   - It’s hard to predict, a priori, which logs this should be applied to.
   We’d have to turn the unthrottled versions to throttled ones as we observe
   logs being annoying noisy.
   - The user cannot apply throttling across log messages, only to those
   that are compiled in with the Throttled versions of the logs.

2. Per Call Site Throttling for all Logs

With this option, we simply apply the above per-site log throttling to all
logs.
Advantages

   - Simplified API (as compared to option 1): the current Error, Warn,
   etc., functions do not change in their API. Only their implementation
   changes in order to apply throttling to them.
   - All logs receive this per-site throttling, so it is not required to
   find out which logs require throttling.

Disadvantages

   - Certain log sites that differ in a useful way between log invocations
   have useful information occluded. Notice that since it is per log site
   without respect to content, the following message about loading plugins is
   skipped for seven separate plugins. For this message, this is clearly
   obscuring possibly useful information:

      15 [Oct 21 21:22:58.843] traffic_server NOTE: Skipped the
following message 7 times.
      16 [Oct 21 21:22:58.843] traffic_server NOTE: loading plugin
'<plugin_path>/txn_box.so'

3a. Apply Throttling to all Logs using String Matching

Apply log throttling to all logs, as above, but only throttle consecutive
logs that exactly match each other.
Advantages

   - Logs that have significant differences are not throttled, and thus
   significant log information is not skipped.
   - Logs that do match are throttled generically, system wide, without
   modifying the logs per call site.

Disadvantages

   - This option introduces string storage and string compare to every log
   entry. This potentially introduces memory and/or performance issues.
   - The whole problem that instigates this issue is not solved. Observe
   that log warnings and errors about specific log events not being emitted
   are unique even though associated with a single log site. For instance, in
   the following example, the log contains the number of bytes in the message
   that could not be emitted. This option would not effectively throttle these
   messages.

        [Oct 14 21:54:38.145] [LOG_FLUSH] ERROR: Failed to write log to
<some_path>/log.pipe: [tried 403, wrote 0, Broken pipe]
3b. Same as 3a, but Modify Noisy Logs to Not Be Unique

This option, as a variant of 3a above, seeks to address the problem that
the noisy logs for which this throttling feature is being introduced, are
each unique but not in a way that is necessarily helpful. To address this,
the noisy log messages can be modified to not be unique across calls,
allowing throttling to apply to them. Thus, for example, the above message
can be modified to read as such:

        [Oct 14 21:54:38.145] [LOG_FLUSH] ERROR: Failed to write log to
<some_path>/log.pipe: [Broken pipe]

Since these messages would be identical between calls, they would thus be
throttled via string match throttling.
Advantages

   - Unlike option 3a, this addresses the problematic noisy logs.
   - As with 3a, this allows for generic throttling of all logs.

Disadvantages

   - This has the same potential issues concerning performance and memory
   as 3a. In fact, this option was prototyped and found to introduce
   significant performance degradation.
   - Some information from the modified logs will no longer be emitted
   because they were simplified to match each other. In the example above, the
   number of bytes attempted to be written and the number written are no
   longer in the emitted log, for example. This loss of information may
   theoretically be problematic in some situations.

4. Apply LogMessage Generically, but Throttle Specifically

The API for this option is the same as option 1, but in the implementation
a LogMessage object is associated with every log site, not just the
throttled ones. However, only the LogMessages for the SiteThrottledWarn,
etc., macros will be configured for throttling. Thus, in addition to the
site throttled macros introduced in option 1, the current non-throttled
version would look like this (note that log_message is created and used,
but not configured with IS_THROTTLED:
*NonThrottledMacros*
#define DiagsError(level, ...)                                        \
  do {                                                                \
    static constexpr const SourceLocation loc = MakeSourceLocation(); \
    static LogMessage log_message;                                    \
    log_message.message(level, loc, __VA_ARGS__);                     \
  } while (0)

#define Status(...) DiagsError(DL_Status, __VA_ARGS__)       // Log
information
#define Note(...) DiagsError(DL_Note, __VA_ARGS__)           // Log
significant information
// etc.

Advantages

   - The problem with the noisy log warnings that is the catalyst for this
   feature is addressed because they would be throttled with
   SiteThrottledWarn, etc.
   - A LogMessage object is applied to every log, which opens the
   possibility for per-log configuration for future configuration feature work.
   - Non-throttled messages will not do string compares or much of any
   computation, thus avoiding performance or memory issues for the
   non-throttled messages seen in option 3a and 3b above.

Disadvantages

   - Throttling needs to be applied per log site as needed. This is the
   same disadvantage as listed in option 1 in which we will have to figure
   out, over time, which call sites this should be applied to.
   - New API: again, as with option 1, this results in new logging API
   being introduced.

Suggestion

Assimilating the above yields this:

   1. Options 2 (apply per-site throttling to all logs) and 3a (apply
   string-matched throttling to all logs) are both attractive because they
   involve no new API. Instead, the current API has some form of throttling
   applied to it. Despite their appeal, however, they cannot be used. Option 2
   results in information loss as certain log sites are repeated with messages
   that have important, unique information. Option 3a doesn't work because it
   doesn't solve the noisy logs which occasions this feature in the first
   place.
   2. Option 3b attempts to address the weakness of 3a by removing content
   from the problematic noisy logs, making them match each other. In theory
   these simplified logs should be fine (arguably, for instance, the number of
   bytes not logged are not needed with every message, spewn out a thousand or
   so times a second). However, this implementation requires the storage of
   every previous log message for every site and a new string compare with
   every message against the previous message at that call site. When tested
   in production this resulted in very significant performance degradation.

That leaves us with options 1 and 4, both of which apply throttling to only
the call sites that we wish to target with throttling. The only difference
between them is in implementation: option 4 applies the LogMessage object
to every call site, not just the ones targeted for throttling. This leaves
the door open for future configuration of the various logs, not just the
throttled ones. Because of this advantage, option 4 is my recommendation.


Thanks!



On Tue, Oct 20, 2020 at 1:59 PM Brian Neradt <[email protected]> wrote:

> Thank you Alan for the forward thinking observations. I agree that it is
> wise to implement this in such a way that it at least doesn't preclude the
> possibility of individual log configurability that you're talking about. As
> we discussed offline, this change moves us in that direction by
> establishing an individual log object per message. Each of these could, in
> the future, be exposed in such a way to make them individually configurable.
>
> In the meantime, per Leif and Sudheer's suggestions, I'll update the draft
> PR to apply the throttling implementation to each Error, Warning, etc.
> message instead of adding the Throttled* version. I'll also work on a
> records.config item to allow the user to specify the process-wide
> throttling interval.
>
> Thanks,
>
> On Mon, Oct 19, 2020 at 12:25 PM Alan Carroll
> <[email protected]> wrote:
>
>> The primary case is a zero throttle, that is no throttling. It also seems
>> plausible to me that different production settings will yield different
>> levels of "noise" that the ops will want to tune for those local
>> conditions.
>>
>> On Mon, Oct 19, 2020 at 11:51 AM Sudheer Vinukonda
>> <[email protected]> wrote:
>>
>> >  Right, but I'm wondering what sort of customization is really needed or
>> > what the use cases are where you'd need different rates of throttling
>> for
>> > different log messages, unless the intent is not protecting the system
>> from
>> > flood of logging (which is basically the same for any kind of log).
>> >
>> >     On Monday, October 19, 2020, 09:48:33 AM PDT, Alan Carroll
>> > <[email protected]> wrote:
>> >
>> >  I'm addressing the question of "why not do this for all messages?". I
>> > think
>> > Neue Brie's approach for specific messages is fine. If, however, the
>> > general logging calls were replaced with it, then you have the
>> > customization problem.
>> >
>> > On Mon, Oct 19, 2020 at 10:41 AM Sudheer Vinukonda
>> > <[email protected]> wrote:
>> >
>> > >  Hmm, may be I'm not understanding the intent of the proposed
>> throttling
>> > > and the problem we are trying to solve well.
>> > > My understanding was that a flood of certain (any) kind of log can
>> result
>> > > in significant disk i/o and impact the system throughput and stability
>> > and
>> > > the proposed throttling is to be resilient against such issues. For
>> such
>> > > issues, I'm not sure if there's a lot of value in supporting advanced
>> > > customizations.
>> > > If the scope of the requirement is beyond resiliency and if there's a
>> > > legitimate use case where throttling when exceeding a rate of 10
>> > > occurrences/sec vs 5/sec matters for each given type of log message,
>> > then I
>> > > guess that's a different problem altogether.
>> > > Nevertheless, sounds like a good topic to discuss during the summit?
>> > >    On Monday, October 19, 2020, 08:23:12 AM PDT, Alan Carroll
>> > > <[email protected]> wrote:
>> > >
>> > >  I strongly suspect we might want it per logging message. At a
>> minimum I
>> > > don't think all messages should be throttled in this way, which is
>> also
>> > an
>> > > answer to Leif' question "how do I know which one to use"? Use the one
>> > that
>> > > throttles if you want throttling, use the other if you don't want
>> > > throttling. If you don't know that, you might want to think more
>> > carefully
>> > > about what you're logging.
>> > >
>> > > On Mon, Oct 19, 2020 at 9:09 AM Sudheer Vinukonda
>> > > <[email protected]> wrote:
>> > >
>> > > > Hmm..do you mean system wide value? We can still have the wrap
>> around
>> > > > frequency configurable, but unless we want it customized per log
>> line
>> > > (I’m
>> > > > not sure if this is really needed?), a system wide config may still
>> be
>> > > > possible to configure, no?
>> > > >
>> > > > > On Oct 19, 2020, at 7:04 AM, Alan Carroll
>> > > > <[email protected]> wrote:
>> > > > >
>> > > > > My concern at this point is the lack of configurability of the
>> > > throttle
>> > > > > time. The underlying implementation supports it but by wrapping it
>> > > > directly
>> > > > > in the logging call, only the default value can be used.
>> > > >
>> > > >
>>
>
>
> --
> "Come to Me, all who are weary and heavy-laden, and I will
> give you rest. Take My yoke upon you and learn from Me, for
> I am gentle and humble in heart, and you will find rest for
> your souls. For My yoke is easy and My burden is light."
>
>     ~ Matthew 11:28-30
>


-- 
"Come to Me, all who are weary and heavy-laden, and I will
give you rest. Take My yoke upon you and learn from Me, for
I am gentle and humble in heart, and you will find rest for
your souls. For My yoke is easy and My burden is light."

    ~ Matthew 11:28-30

Re: [E] Re: Design Review Request: Log Throttling

Reply via email to