Hi dev@trafficserver.apache.org, I've documented our design options and played around with a few prototypes. See the content below. Please let me know if I'm missing anything or if people have other thoughts or ideas. The draft PR in its current state represents the recommended option number 4: https://github.com/apache/trafficserver/pull/7279
Introduction It has been observed by ops and devs working with prod boxes that certain logs are annoyingly noisy. Many of these logs relate to the logging mechanism itself having issues. If a particular message cannot be logged, then a warning or error message (depending on where the problem is found) is emitted. Thus, if a particular message results in a line that is too long, a warning is emitted. If the pipe reader for a log pipe is down, every single logging event will result in a warning or error. These issues can result in hundreds or thousands of warnings or errors per second, which makes reading the logs very difficult and results in log space filling up with such messages. Options 1. Specific Throttling Per Call Site Apply log throttling on a per call site basis to specific, noisy logs. SiteThrottledWarn, etc. are introduced and used at the call sites observed to be noisy. To be clear: no string matches are done; throttling is applied based upon the call site, not the content of the messages. Thus, for example, the following messages, each from the same call site, would be throttled even though they differ slightly in content: [Oct 29 02:18:29.501] [LOG_FLUSH] WARNING: File:<some_path>/log.pipe was closed, have dropped (288) bytes. [Oct 29 02:18:29.501] [LOG_FLUSH] WARNING: File:<some_path>/log.pipe was closed, have dropped (281) bytes. [Oct 29 02:18:29.501] [LOG_FLUSH] WARNING: File:<some_path>/log.pipe was closed, have dropped (278) bytes. The fact that these are throttled is by design. It is messages like this that are precisely the reason this throttling feature is being considered. As an example, we would implement the site throttled versions like this: *PerCallSiteThrottling* #define SiteThrottledDiagsError(level, ...) \ do { \ static constexpr const SourceLocation loc = MakeSourceLocation(); \ static LogMessage log_message{IS_THROTTLED}; \ log_message.message(level, loc, __VA_ARGS__); \ } while (0) #define SiteThrottledStatus(...) SiteThrottledDiagsError(DL_Status, __VA_ARGS__) // Log information #define SiteThrottledNote(...) SiteThrottledDiagsError(DL_Note, __VA_ARGS__) // Log significant information #define SiteThrottledWarning(...) SiteThrottledDiagsError(DL_Warning, __VA_ARGS__) // Log concerning information // etc. Advantages - Low risk: only logs tagged for per-site throttling are modified in any way. - Solves the log flush (and family) of noisy logs by marking each of these for per-site throttling. Disadvantages - This produces two sets of APIs: one set for throttling (SiteThrottledWarn, etc.) alongside the current set without throttling (Warn, etc.). - A LogMessage object is not applied to each. This is unfortunate because having a state for every message could be helpful for future configuration development. - It’s hard to predict, a priori, which logs this should be applied to. We’d have to turn the unthrottled versions to throttled ones as we observe logs being annoying noisy. - The user cannot apply throttling across log messages, only to those that are compiled in with the Throttled versions of the logs. 2. Per Call Site Throttling for all Logs With this option, we simply apply the above per-site log throttling to all logs. Advantages - Simplified API (as compared to option 1): the current Error, Warn, etc., functions do not change in their API. Only their implementation changes in order to apply throttling to them. - All logs receive this per-site throttling, so it is not required to find out which logs require throttling. Disadvantages - Certain log sites that differ in a useful way between log invocations have useful information occluded. Notice that since it is per log site without respect to content, the following message about loading plugins is skipped for seven separate plugins. For this message, this is clearly obscuring possibly useful information: 15 [Oct 21 21:22:58.843] traffic_server NOTE: Skipped the following message 7 times. 16 [Oct 21 21:22:58.843] traffic_server NOTE: loading plugin '<plugin_path>/txn_box.so' 3a. Apply Throttling to all Logs using String Matching Apply log throttling to all logs, as above, but only throttle consecutive logs that exactly match each other. Advantages - Logs that have significant differences are not throttled, and thus significant log information is not skipped. - Logs that do match are throttled generically, system wide, without modifying the logs per call site. Disadvantages - This option introduces string storage and string compare to every log entry. This potentially introduces memory and/or performance issues. - The whole problem that instigates this issue is not solved. Observe that log warnings and errors about specific log events not being emitted are unique even though associated with a single log site. For instance, in the following example, the log contains the number of bytes in the message that could not be emitted. This option would not effectively throttle these messages. [Oct 14 21:54:38.145] [LOG_FLUSH] ERROR: Failed to write log to <some_path>/log.pipe: [tried 403, wrote 0, Broken pipe] 3b. Same as 3a, but Modify Noisy Logs to Not Be Unique This option, as a variant of 3a above, seeks to address the problem that the noisy logs for which this throttling feature is being introduced, are each unique but not in a way that is necessarily helpful. To address this, the noisy log messages can be modified to not be unique across calls, allowing throttling to apply to them. Thus, for example, the above message can be modified to read as such: [Oct 14 21:54:38.145] [LOG_FLUSH] ERROR: Failed to write log to <some_path>/log.pipe: [Broken pipe] Since these messages would be identical between calls, they would thus be throttled via string match throttling. Advantages - Unlike option 3a, this addresses the problematic noisy logs. - As with 3a, this allows for generic throttling of all logs. Disadvantages - This has the same potential issues concerning performance and memory as 3a. In fact, this option was prototyped and found to introduce significant performance degradation. - Some information from the modified logs will no longer be emitted because they were simplified to match each other. In the example above, the number of bytes attempted to be written and the number written are no longer in the emitted log, for example. This loss of information may theoretically be problematic in some situations. 4. Apply LogMessage Generically, but Throttle Specifically The API for this option is the same as option 1, but in the implementation a LogMessage object is associated with every log site, not just the throttled ones. However, only the LogMessages for the SiteThrottledWarn, etc., macros will be configured for throttling. Thus, in addition to the site throttled macros introduced in option 1, the current non-throttled version would look like this (note that log_message is created and used, but not configured with IS_THROTTLED: *NonThrottledMacros* #define DiagsError(level, ...) \ do { \ static constexpr const SourceLocation loc = MakeSourceLocation(); \ static LogMessage log_message; \ log_message.message(level, loc, __VA_ARGS__); \ } while (0) #define Status(...) DiagsError(DL_Status, __VA_ARGS__) // Log information #define Note(...) DiagsError(DL_Note, __VA_ARGS__) // Log significant information // etc. Advantages - The problem with the noisy log warnings that is the catalyst for this feature is addressed because they would be throttled with SiteThrottledWarn, etc. - A LogMessage object is applied to every log, which opens the possibility for per-log configuration for future configuration feature work. - Non-throttled messages will not do string compares or much of any computation, thus avoiding performance or memory issues for the non-throttled messages seen in option 3a and 3b above. Disadvantages - Throttling needs to be applied per log site as needed. This is the same disadvantage as listed in option 1 in which we will have to figure out, over time, which call sites this should be applied to. - New API: again, as with option 1, this results in new logging API being introduced. Suggestion Assimilating the above yields this: 1. Options 2 (apply per-site throttling to all logs) and 3a (apply string-matched throttling to all logs) are both attractive because they involve no new API. Instead, the current API has some form of throttling applied to it. Despite their appeal, however, they cannot be used. Option 2 results in information loss as certain log sites are repeated with messages that have important, unique information. Option 3a doesn't work because it doesn't solve the noisy logs which occasions this feature in the first place. 2. Option 3b attempts to address the weakness of 3a by removing content from the problematic noisy logs, making them match each other. In theory these simplified logs should be fine (arguably, for instance, the number of bytes not logged are not needed with every message, spewn out a thousand or so times a second). However, this implementation requires the storage of every previous log message for every site and a new string compare with every message against the previous message at that call site. When tested in production this resulted in very significant performance degradation. That leaves us with options 1 and 4, both of which apply throttling to only the call sites that we wish to target with throttling. The only difference between them is in implementation: option 4 applies the LogMessage object to every call site, not just the ones targeted for throttling. This leaves the door open for future configuration of the various logs, not just the throttled ones. Because of this advantage, option 4 is my recommendation. Thanks! On Tue, Oct 20, 2020 at 1:59 PM Brian Neradt <brian.ner...@gmail.com> wrote: > Thank you Alan for the forward thinking observations. I agree that it is > wise to implement this in such a way that it at least doesn't preclude the > possibility of individual log configurability that you're talking about. As > we discussed offline, this change moves us in that direction by > establishing an individual log object per message. Each of these could, in > the future, be exposed in such a way to make them individually configurable. > > In the meantime, per Leif and Sudheer's suggestions, I'll update the draft > PR to apply the throttling implementation to each Error, Warning, etc. > message instead of adding the Throttled* version. I'll also work on a > records.config item to allow the user to specify the process-wide > throttling interval. > > Thanks, > > On Mon, Oct 19, 2020 at 12:25 PM Alan Carroll > <solidwallofc...@verizonmedia.com.invalid> wrote: > >> The primary case is a zero throttle, that is no throttling. It also seems >> plausible to me that different production settings will yield different >> levels of "noise" that the ops will want to tune for those local >> conditions. >> >> On Mon, Oct 19, 2020 at 11:51 AM Sudheer Vinukonda >> <sudheervinuko...@yahoo.com.invalid> wrote: >> >> > Right, but I'm wondering what sort of customization is really needed or >> > what the use cases are where you'd need different rates of throttling >> for >> > different log messages, unless the intent is not protecting the system >> from >> > flood of logging (which is basically the same for any kind of log). >> > >> > On Monday, October 19, 2020, 09:48:33 AM PDT, Alan Carroll >> > <solidwallofc...@verizonmedia.com.invalid> wrote: >> > >> > I'm addressing the question of "why not do this for all messages?". I >> > think >> > Neue Brie's approach for specific messages is fine. If, however, the >> > general logging calls were replaced with it, then you have the >> > customization problem. >> > >> > On Mon, Oct 19, 2020 at 10:41 AM Sudheer Vinukonda >> > <sudheervinuko...@yahoo.com.invalid> wrote: >> > >> > > Hmm, may be I'm not understanding the intent of the proposed >> throttling >> > > and the problem we are trying to solve well. >> > > My understanding was that a flood of certain (any) kind of log can >> result >> > > in significant disk i/o and impact the system throughput and stability >> > and >> > > the proposed throttling is to be resilient against such issues. For >> such >> > > issues, I'm not sure if there's a lot of value in supporting advanced >> > > customizations. >> > > If the scope of the requirement is beyond resiliency and if there's a >> > > legitimate use case where throttling when exceeding a rate of 10 >> > > occurrences/sec vs 5/sec matters for each given type of log message, >> > then I >> > > guess that's a different problem altogether. >> > > Nevertheless, sounds like a good topic to discuss during the summit? >> > > On Monday, October 19, 2020, 08:23:12 AM PDT, Alan Carroll >> > > <solidwallofc...@verizonmedia.com.invalid> wrote: >> > > >> > > I strongly suspect we might want it per logging message. At a >> minimum I >> > > don't think all messages should be throttled in this way, which is >> also >> > an >> > > answer to Leif' question "how do I know which one to use"? Use the one >> > that >> > > throttles if you want throttling, use the other if you don't want >> > > throttling. If you don't know that, you might want to think more >> > carefully >> > > about what you're logging. >> > > >> > > On Mon, Oct 19, 2020 at 9:09 AM Sudheer Vinukonda >> > > <sudheervinuko...@yahoo.com.invalid> wrote: >> > > >> > > > Hmm..do you mean system wide value? We can still have the wrap >> around >> > > > frequency configurable, but unless we want it customized per log >> line >> > > (I’m >> > > > not sure if this is really needed?), a system wide config may still >> be >> > > > possible to configure, no? >> > > > >> > > > > On Oct 19, 2020, at 7:04 AM, Alan Carroll >> > > > <solidwallofc...@verizonmedia.com.invalid> wrote: >> > > > > >> > > > > My concern at this point is the lack of configurability of the >> > > throttle >> > > > > time. The underlying implementation supports it but by wrapping it >> > > > directly >> > > > > in the logging call, only the default value can be used. >> > > > >> > > > >> > > > -- > "Come to Me, all who are weary and heavy-laden, and I will > give you rest. Take My yoke upon you and learn from Me, for > I am gentle and humble in heart, and you will find rest for > your souls. For My yoke is easy and My burden is light." > > ~ Matthew 11:28-30 > -- "Come to Me, all who are weary and heavy-laden, and I will give you rest. Take My yoke upon you and learn from Me, for I am gentle and humble in heart, and you will find rest for your souls. For My yoke is easy and My burden is light." ~ Matthew 11:28-30