Hello, I'm having problems with mail accumulating in the incoming queue under heavy load (2500+ smtpd processes). Once in a while the queue manager stalls for a long time after trying to communicate with the "trace" client, as shown in the system call trace of a cleanup process below:
--
open("public/qmgr", O_WRONLY|O_NONBLOCK) = 14
fstat64(14, {st_mode=S_IFIFO|0622, st_size=0, ...}) = 0
lstat64("public/qmgr", {st_mode=S_IFIFO|0622, st_size=0, ...}) = 0
fcntl64(14, F_GETFL) = 0x801 (flags O_WRONLY|O_NONBLOCK)
fcntl64(14, F_SETFL, O_WRONLY|O_NONBLOCK) = 0
poll([{fd=14, events=POLLOUT}], 1, 10000) = 0
close(14) = 0
--

From what I've been able to piece together, the communication in this case flows as follows:

qmgr->trace->cleanup->qmgr

Files accumulating in the incoming queue in this situation have mode 0700. Since this indicates that they are ready to be moved to the active queue, it hints at a problem with the queue manager. Of course, there are plenty of resources (memory, CPU, I/O) still available on the server. I've tried setting trigger_timeout to 1s, but it doesn't help very much.

I found a very similar report from a while ago about the "bounce" client:

http://archives.neohapsis.com/archives/postfix/2000-12/0351.html

Wietse acknowledged the problem and released a solution a few days later. I quote him below:

"The problem is that qmgr blocks while bouncing. At present, the bounce client interface is synchronous: when bouncing mail, the qmgr has to wait until the bounce message is queued, which involves another cleanup daemon process, which produces another qmgr trigger.

Normally, all this happens in a split second. However, if the qmgr FIFO is filled up, the cleanup process that queues the bounce message will block $trigger_timeout seconds while attempting to trigger the qmgr. And since the qmgr is waiting for the bounce message to be queued, qmgr also blocks for $trigger_timeout seconds, which is undesirable.

So you guys have found a little deadlock that happens when mail bounces while a lot of mail is being submitted so that the qmgr FIFO fills up. Fortunately, Postfix has time limits on everything so it survives the deadlock."
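The trigger step visible in the system call trace above (non-blocking open of the FIFO, then poll() for POLLOUT with a timeout) can be reproduced outside Postfix. What follows is only a minimal sketch in Python, not Postfix's actual C code: the names trigger_fifo, fill_fifo, drain_fifo and demo are hypothetical, the FIFO lives in a temporary directory, and a reader that never drains the FIFO stands in for a busy qmgr. It assumes Linux pipe semantics.

```python
import os
import select
import tempfile


def trigger_fifo(path, timeout_s):
    """Try to write one wakeup byte to the FIFO at `path`.

    Returns True if the byte was written, False if the FIFO did not
    become writable within timeout_s seconds (poll() returned nothing).
    """
    fd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
    try:
        poller = select.poll()
        poller.register(fd, select.POLLOUT)
        if not poller.poll(int(timeout_s * 1000)):
            return False  # timed out: the reader never made room
        os.write(fd, b"W")
        return True
    finally:
        os.close(fd)


def fill_fifo(path):
    """Write until the pipe buffer is full (stands in for a flood of triggers)."""
    fd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
    try:
        for chunk in (b"W" * 4096, b"W"):  # big chunks first, then single bytes
            while True:
                try:
                    os.write(fd, chunk)
                except BlockingIOError:
                    break
    finally:
        os.close(fd)


def drain_fifo(reader_fd):
    """Read everything currently buffered (stands in for qmgr catching up)."""
    while True:
        try:
            if not os.read(reader_fd, 65536):
                return  # EOF: no data left and no writer attached
        except BlockingIOError:
            return  # nothing left to read right now


def demo(timeout_s=0.2):
    """Return (result while the FIFO is full, result after draining it)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "qmgr")
        os.mkfifo(path, 0o622)
        # A reader must exist, or the O_WRONLY|O_NONBLOCK open fails with ENXIO.
        reader = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
        try:
            fill_fifo(path)
            while_full = trigger_fifo(path, timeout_s)
            drain_fifo(reader)
            after_drain = trigger_fifo(path, timeout_s)
        finally:
            os.close(reader)
    return while_full, after_drain


if __name__ == "__main__":
    print(demo())
```

The short timeout in demo() only keeps the run fast; in the trace above the same poll() waits the full 10000 ms before giving up, and that is the stall the caller inherits when it is itself being waited on synchronously.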
I've checked the Postfix release log and found the following related entries:

20001208

        Bugfix: while processing massive amounts of one-recipient mail, qmgr could deadlock for 10 seconds while sending a bounce message. All queue manager bounce send requests are now implemented asynchronously. Files: global/abounce.[hc] (asynchronous bounce client), qmgr/qmgr_active.c. Problem reported by El Bunzo (webpower.nl) and Tiger Technologies (tigertech.com).

20021116

        New trace service. This is used for reporting if a recipient is deliverable (sendmail -bv) and for producing a record of delivery attempts (sendmail -v). The report is sent via email, using the bounce daemon. Files: global/trace.[hc]. This required replacing the bounce/defer logfile format by an extensible name=value format. Files: global/bounce_log.c, bounce/bounce_append_service.c.

So here's my question: would it be possible to make the trace client interface asynchronous as well?

I believe it would help a lot in this case, since I've tried disabling delivery status notifications and the problem disappeared. With notifications disabled, the only messages I can see in the incoming queue have mode 0600, which means the bottleneck (not a very good term, since messages don't accumulate anymore) has shifted to the cleanup process. Unfortunately, leaving DSNs off isn't an option for me.

Thank you very much,
Eduardo Stelmaszczyk