Hello,

I'm having problems with mail accumulating in the incoming queue under
heavy load (2500+ SMTPd processes). Once in a while the queue manager
stalls for a long time after trying to communicate with the "trace"
client, as shown in this system call trace from a cleanup process:

--
open("public/qmgr", O_WRONLY|O_NONBLOCK) = 14
fstat64(14, {st_mode=S_IFIFO|0622, st_size=0, ...}) = 0
lstat64("public/qmgr", {st_mode=S_IFIFO|0622, st_size=0, ...}) = 0
fcntl64(14, F_GETFL)                    = 0x801 (flags O_WRONLY|O_NONBLOCK)
fcntl64(14, F_SETFL, O_WRONLY|O_NONBLOCK) = 0
poll([{fd=14, events=POLLOUT}], 1, 10000) = 0
close(14)                               = 0
--
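
For readers who don't stare at system call traces every day, here is a
rough Python sketch of the sequence above. It is not Postfix code; the
function name trigger_fifo and the one-byte payload are my own
placeholders. It opens the FIFO non-blocking, waits up to timeout_ms for
it to become writable, and gives up if the reader never drains it. The
"poll(...) = 0" in the trace corresponds to the timeout branch, and the
10000 ms there looks like the 10s default of trigger_timeout.

```python
import os
import select


def trigger_fifo(path, timeout_ms=10000):
    """Try to wake the FIFO's reader; return True if a byte was written.

    Hypothetical helper for illustration only, not a Postfix API.
    """
    # O_NONBLOCK keeps open() and write() from hanging. open() raises
    # OSError (ENXIO) if no process has the FIFO open for reading.
    fd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
    try:
        poller = select.poll()
        poller.register(fd, select.POLLOUT)
        events = poller.poll(timeout_ms)
        # No POLLOUT event means the FIFO stayed full for the whole
        # timeout: this is the "poll(...) = 0" case in the trace.
        if not any(ev & select.POLLOUT for _fd, ev in events):
            return False
        try:
            os.write(fd, b"W")  # placeholder payload (assumption)
        except (BlockingIOError, BrokenPipeError):
            return False
        return True
    finally:
        os.close(fd)
```

While the reader keeps up, the call returns immediately. Once the FIFO
is full and nobody reads it, every caller sits in poll() for the whole
timeout.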

From what I've been able to piece together, the communication in this
case flows like this:

qmgr->trace->cleanup->qmgr

Files accumulating in the incoming queue in this situation have mode
0700. Since this indicates that they are ready to be moved to the
active queue, it hints at a problem with the queue manager. Of
course, there are plenty of resources (memory, CPU, I/O) still
available on the server.

I've tried setting trigger_timeout to 1s but it doesn't help very much.
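
In case it helps anyone reproduce this, below is a small Python sketch
of how the queue files can be tallied by permission bits. The function
name count_queue_modes is mine, and the path in the usage note assumes
the default queue_directory; adjust it for your installation.

```python
import os
import stat
from collections import Counter


def count_queue_modes(queue_dir):
    """Count regular files under queue_dir by permission bits.

    Hypothetical helper for illustration; keys look like '0o700'.
    """
    counts = Counter()
    for root, _dirs, files in os.walk(queue_dir):
        for name in files:
            try:
                st = os.lstat(os.path.join(root, name))
            except FileNotFoundError:
                continue  # file was moved between listing and stat
            if stat.S_ISREG(st.st_mode):
                counts[oct(stat.S_IMODE(st.st_mode))] += 1
    return counts
```

Run as a user that can read the queue, for example
count_queue_modes("/var/spool/postfix/incoming"). A growing '0o700'
count means complete messages are waiting for the queue manager, while
'0o600' files are still being written by cleanup.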

I found a very similar report from a while ago about the "bounce" client:

http://archives.neohapsis.com/archives/postfix/2000-12/0351.html

Wietse acknowledged the problem and released a solution a few days 
later. I quote him below:

"The problem is that qmgr blocks while bouncing. At present, the
bounce client interface is synchronous: when bouncing mail, the
qmgr has to wait until the bounce message is queued, which involves
another cleanup daemon process, which produces another qmgr trigger.

Normally, all this happens in a split second. However, if the qmgr
FIFO is filled up, the cleanup process that queues the bounce
message will block $trigger_timeout seconds while attempting to
trigger the qmgr. And since the qmgr is waiting for the bounce
message to be queued, qmgr also blocks for $trigger_timeout seconds,
which is undesirable.

So you guys have found a little deadlock that happens when mail
bounces while a lot of mail is being submitted so that the qmgr
FIFO fills up. Fortunately, Postfix has time limits on everything
so it survives the deadlock."
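
To make the pattern in the quote concrete, here is a toy Python model.
It is only a sketch under my own assumptions: a bounded queue.Queue
stands in for the qmgr FIFO, and the three function names are invented
for illustration. None of this is Postfix code. In the synchronous
case qmgr is stuck inside the request and cannot drain the FIFO, so the
trigger waits out the full timeout. In the asynchronous case qmgr hands
the request off, keeps reading, and the trigger goes through at once.

```python
import queue
import threading
import time


def cleanup_trigger(fifo, trigger_timeout):
    """Stand-in for cleanup triggering qmgr; gives up after the timeout."""
    try:
        fifo.put(b"W", timeout=trigger_timeout)
        return True
    except queue.Full:
        return False


def qmgr_sync_request(fifo, trigger_timeout):
    """qmgr waits for the request and cannot drain the FIFO meanwhile."""
    start = time.monotonic()
    delivered = cleanup_trigger(fifo, trigger_timeout)
    return delivered, time.monotonic() - start


def qmgr_async_request(fifo, trigger_timeout):
    """qmgr hands the request off and goes back to draining the FIFO."""
    result = {}

    def worker():
        result["delivered"] = cleanup_trigger(fifo, trigger_timeout)

    start = time.monotonic()
    t = threading.Thread(target=worker)
    t.start()
    fifo.get()  # qmgr reads one pending trigger, which frees a slot
    t.join()
    return result["delivered"], time.monotonic() - start
```

With a full queue, qmgr_sync_request(fifo, 10) blocks for the whole 10
seconds and the trigger is lost, whereas qmgr_async_request(fifo, 10)
returns almost immediately with the trigger delivered.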

I've checked the Postfix release log and found the following related
entries:

20001208
        Bugfix: while processing massive amounts of one-recipient
        mail, qmgr could deadlock for 10 seconds while sending a
        bounce message. All queue manager bounce send requests are
        now implemented asynchronously.  Files: global/abounce.[hc]
        (asynchronous bounce client), qmgr/qmgr_active.c.  Problem
        reported by El Bunzo (webpower.nl) and Tiger Technologies
        (tigertech.com).

20021116
        New trace service. This is used for reporting if a recipient
        is deliverable (sendmail -bv) and for producing a record
        of delivery attempts (sendmail -v). The report is sent via
        email, using the bounce daemon. Files: global/trace.[hc].
        This required replacing the bounce/defer logfile format by
        an extensible name=value format. Files: global/bounce_log.c,
        bounce/bounce_append_service.c.

So here's my question: would it be possible to make the trace client
interface asynchronous as well? I believe it would help a lot in this
case: when I disabled delivery status notifications, the problem
disappeared. With DSNs off, the only messages I see in the incoming
queue have mode 0600, which means the bottleneck (not a very good term,
since messages no longer accumulate) has shifted to the cleanup
process. Unfortunately, leaving DSNs off isn't an option for me.

Thank you very much,

Eduardo Stelmaszczyk
