On Mon, May 13, 2013 at 12:57:06PM -0600, Curtis wrote:

> We are seeing an intermittent issue in our Postfix logs where we see
> all outbound threads (smtp) stop delivering email or logging
> anything while the active queue continues to grow.

Just to make the language less jarring: Postfix is not "multi-threaded";
these are "processes", not "threads". Of course each process is its own
single thread, but that is not the usual way to talk about the situation.

> This indicates to me that all active smtp threads are hanging,
> since nothing from the smtp threads are recorded in the logs at all.

How long is the interval between the last outbound delivery before
everything freezes and the first delivery once the freeze ends? The
length of the delay may reveal its origin. Report both log messages.

> During this time, inbound email is coming in fine and smtpd continues
> to log activity, while the smtp threads slowly die one by one, over
> the course of several minutes.

Idle smtp(8) processes that are not given any work exit after 100s.
Otherwise, each process waits its turn to accept new delivery requests
by acquiring an exclusive lock on the /var/spool/postfix/pid/unix.smtp
file and then blocking to accept a queue manager request.

If your kernel's unix-domain socket code has a bug where a connection
request from the queue manager is "lost" (fails to be delivered to the
process that holds the lock and is making the accept(2) system call),
then you'll observe the freeze you're reporting until the queue manager
gives up and tries another connection. That timeout is 18000s, or
5 hours.

Another kernel bug I've seen on some systems loses data sent between
the queue manager and trivial-rewrite(8). This freezes the queue
manager, and mail delivery stops. A watchdog timer causes the queue
manager to exit after 1000s, and then delivery resumes.

> Once all smtp threads finally die, the
> number of smtp threads instantly jumps to the max of 110 and outbound
> email delivery (and logging) continues.

Does the queue manager's process ID change when this happens? Likely
your queue manager restarted. Any warnings, errors, panics, fatal
errors, etc., logged by the queue manager?

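If it helps to answer the two questions above (how long the freeze
lasts, and whether the queue manager PID changes), here is a rough
Python sketch. It is not part of Postfix. It assumes traditional syslog
lines of the form "May 13 12:57:06 host postfix/smtp[1234]: ..." with
the default syslog_name of "postfix", and the function names (parse,
longest_smtp_gap, qmgr_pids) are made up for illustration. Since
classic syslog timestamps carry no year, the year is passed in.

```python
import re
from datetime import datetime

# Assumed traditional syslog layout (no year in the timestamp):
#   "May 13 12:57:06 host postfix/smtp[1234]: QUEUEID: to=<...>, status=sent (...)"
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ "
    r"postfix/(?P<daemon>[\w-]+)\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)


def parse(line, year=2013):
    """Return (timestamp, daemon, pid, message), or None for other lines."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
    return ts, m["daemon"], int(m["pid"]), m["msg"]


def longest_smtp_gap(lines, year=2013):
    """Find the longest pause between consecutive smtp(8) delivery records.

    Returns (gap_seconds, line_before_gap, line_after_gap), or None when
    fewer than two delivery records ("status=...") are present.
    """
    best = None
    prev = None  # (timestamp, raw line) of the previous delivery record
    for line in lines:
        rec = parse(line, year)
        if rec is None or rec[1] != "smtp" or "status=" not in rec[3]:
            continue
        if prev is not None:
            gap = (rec[0] - prev[0]).total_seconds()
            if best is None or gap > best[0]:
                best = (gap, prev[1], line.rstrip("\n"))
        prev = (rec[0], line.rstrip("\n"))
    return best


def qmgr_pids(lines, year=2013):
    """List the distinct qmgr PIDs in order of first appearance."""
    pids = []
    for line in lines:
        rec = parse(line, year)
        if rec is not None and rec[1] == "qmgr" and rec[2] not in pids:
            pids.append(rec[2])
    return pids
```

Feed both functions the lines of the mail log covering one incident.
A gap close to 1000s together with more than one qmgr PID would be
consistent with the watchdog restart described above; a gap near
18000s would point at the lost-connection case instead. The two log
lines returned by longest_smtp_gap are the ones worth posting.
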
> We are going to try to catch it when it's actually happening so that
> we can run an strace on one of these hung smtp processes... but I'm
> curious if these symptoms are something others have seen before and
> could share some insights as to the possible cause. What I'm most
> curious about is why would Postfix wait for all existing smtp
> threads to die before spawning new threads to handle a rapidly
> growing active queue?

When you say "active" queue, you are perhaps speaking loosely. More
likely the queue manager is frozen and all the mail is piling up in
"incoming". In that case, once the freeze ends, the logs will show
lots of mail entering "active" (logged by qmgr) in parallel with the
start of deliveries.

-- 
	Viktor.