Our servers were being massively hammered yesterday afternoon and I
think this code in forkserver was causing problems:
    while ($running >= $MAXCONN) {
        ::log(LOGINFO,
            "Too many connections: $running >= $MAXCONN. Waiting one second.");
        sleep(1);
        $running = scalar keys %childstatus;
    }
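I haven't dug into whether forkserver reliably reaps elsewhere, but one
thing that might have helped yesterday is reaping exited children inside
that wait loop rather than only sleeping. A rough, self-contained sketch
(the %childstatus / $MAXCONN names just mirror the snippet above, and
the forked child is a stand-in for a real connection handler):

```perl
use strict;
use warnings;
use POSIX ':sys_wait_h';

my %childstatus;
my $MAXCONN = 1;

# Spawn one short-lived child so the wait loop has something to reap.
my $pid = fork();
die "fork failed: $!" unless defined $pid;
if ($pid == 0) { sleep 1; exit 0 }
$childstatus{$pid} = 1;

my $running = scalar keys %childstatus;
while ($running >= $MAXCONN) {
    # Reap finished children while waiting, instead of only sleeping;
    # a missed SIGCHLD can otherwise leave %childstatus stale and the
    # loop stuck at the cap even though children have exited.
    while ((my $kid = waitpid(-1, WNOHANG)) > 0) {
        delete $childstatus{$kid};
    }
    $running = scalar keys %childstatus;
    last if $running < $MAXCONN;
    sleep 1;
}
```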
Once I raised MAXCONN to a higher value, the load went down and mail
started flowing again (even though we were still being hit pretty hard).
I'm not sure if I need to deal with this now, since the high-perf
branch should make that code obsolete.
However, as part of trying to analyze what the hell was going on, it
occurred to me that the adaptive logging plugin (which I am running on
only 1 of our 2 MX boxes) was actually making it much harder to see what
was happening. The reason is that the plugin is designed to collect all
log entries and emit them only once the message has been either accepted
or rejected.
The problem I discovered is that when some messages take a long time to
complete (it seems some servers don't like to send QUIT right away),
there is no way of knowing what is going on, since the log entries are
all squirrelled away for later.
So, I am thinking of rewriting the adaptive logging to work like this
instead:
1) All log lines below the max level are saved
2) All log lines below the max level are immediately emitted
3) All log lines below the min level are emitted again, with a prefix,
but only when the message is accepted for delivery
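To make the proposal concrete, here is a rough sketch of the emit/save
logic. The level numbers, the "A:" prefix, and the sub names are all
made up for illustration, not the plugin's real interface:

```perl
use strict;
use warnings;

my $maxlevel = 6;   # emit immediately at or below this level
my $minlevel = 4;   # also save for re-emission at or below this level
my @saved;
my @output;         # stands in for the real log writer

sub log_line {
    my ($level, $line) = @_;
    return if $level > $maxlevel;               # too verbose, drop
    push @output, $line;                        # 2) emit immediately
    push @saved, $line if $level <= $minlevel;  # 1) save important lines
}

sub message_accepted {
    # 3) re-emit the saved lines with a prefix, so multilog can split
    # them into a separate "successes only" log set.
    push @output, "A: $_" for @saved;
    @saved = ();
}

log_line(5, "connect from mail.example.com");
log_line(4, "250 queued");
message_accepted();
```

With something like this, the first two lines appear the moment they
are logged, and the "A: 250 queued" duplicate only shows up once the
message is accepted.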
The net result is that multilog can be set up to take all log lines
immediately, filtering out the duplicated lines, in one set of log
files, while a second set keeps only the successful lines. That way the
behavior is much closer to the current logging (spew everything to
disk), so I can see what is happening in real time, but I can also log
just the successful lines to another file for long-term analysis.
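For the multilog side, I'm picturing two log services fed from the same
stream, along these lines. The "A: " prefix is whatever marker the
plugin ends up using, and the pattern syntax should be double-checked
against the daemontools docs before relying on it:

```shell
# First log service (log/run): everything in real time, minus the
# re-emitted duplicates, timestamped and written to ./main.
exec multilog '-A: *' t ./main

# Second log service (its own log/run): long-term "successes only" set;
# deselect all lines, then re-select just the prefixed ones.
# exec multilog '-*' '+A: *' t ./main
```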
Thoughts?
John