I've had Postfix seemingly stop responding 3 times on 2 systems over the weekend. In all cases, the system is accessible, but logging in is sluggish, however, load, iowait, memory usage, etc. all appear perfectly normal. Postfix doesn't log any unusual errors / warnings / etc., but simply seems to stop logging abruptly. So it seems like it may be a kernel param or ulimit issue, vs. Postfix config or performance constraint.
I saved netstat, ps, and lsof output to a file before restarting, and don't see anything unusual or close to limits in terms of # of open files, # of open connections, # of smtpd processes, etc.. The tcp connection opens, but no banner is sent if a new connection is initiated. The smtpds do not have the 'stress' parameter set to anything. The various smtpds each are set to '-' in master.cf, and main.cf / postconf reports: default_process_limit = 500 (in the most recent case, the *total* number of smtpd instances in the 'ps' output was 50, and connections are spread out across 25 / 465 / 587 and one other instance that's mapped to a different IP on port 25 from the load balancer, so I don't think it's a simple case of too low a process limit). This is on RHEL6 with the stock 2.6.6 package (version 6). I'd rather not post full postconf -n output at this time, though I'm happy to report on specific variables. Running ulimit -a as the postfix user reports: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 256302 max locked memory (kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 49152 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 10240 cpu time (seconds, -t) unlimited max user processes (-u) 10240 virtual memory (kbytes, -v) unlimited file locks (-x) 49152 It's not something I can reproduce on a test system, so any suggestions on the gentlest way to debug this in production? On the most recent occurance, I straced the master process, following child processes, and see this a bunch of times (before stopping postfix or rebooting the system). Any postfix-whisperers have an idea of what might be happening just based on this? write(104, "\0", 1) = 1 rt_sigreturn(0x68) = -1 EINTR (Interrupted system call) alarm(333) = 333 epoll_wait(14, {{EPOLLIN, {u32=11, u64=11530104173018218507}}}, 100, 9000) = 1 read(11, "\0", 1) = 1 read(11, 0x7ffffdfd329f, 1) = -1 EAGAIN (Resource temporarily unavailable) wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 30794 epoll_ctl(14, EPOLL_CTL_ADD, 71, {EPOLLIN, {u32=71, u64=11634484007298662471}}) = 0 wait4(-1, 0x7ffffdfd3778, WNOHANG, NULL) = 0 alarm(333) = 333 epoll_wait(14, {{EPOLLIN, {u32=87, u64=11634081860920803415}}}, 100, 9000) = 1 read(87, ")/\0\0u\t\0\0\0\0\0\0", 12) = 12 alarm(333) = 331 epoll_wait(14, {{EPOLLIN, {u32=87, u64=11634081860920803415}}}, 100, 7000) = 1 read(87, ")/\0\0u\t\0\0\1\0\0\0", 12) = 12 alarm(333) = 333 epoll_wait(14, {{EPOLLIN, {u32=87, u64=11634081860920803415}}}, 100, 7000) = 1 read(87, ")/\0\0u\t\0\0\0\0\0\0", 12) = 12 alarm(333) = 333 epoll_wait(14, {{EPOLLIN, {u32=87, u64=11634081860920803415}}}, 100, 7000) = 1 read(87, ")/\0\0u\t\0\0\1\0\0\0", 12) = 12 alarm(333) = 333 epoll_wait(14, 7ffffdfd32c0, 100, 7000) = -1 EINTR (Interrupted system call) --- SIGCHLD (Child exited) @ 0 (0) --- The last thing I see that looks remotely like a problem before the system stops logging / responding is: 2014-09-14T19:32:42.251092-07:00 XX postfix/smtpd[27487]: warning: network_biopair_interop: error writing 53 bytes to the network: Connection reset by peer 2014-09-14T19:32:42.251099-07:00 XX postfix/smtpd[27487]: warning: network_biopair_interop: error writing 37 bytes to the network: Broken pipe but these are fairly common, presumably because of SSL version mismatch problems. Then the last warning message I see is: 2014-09-14T19:34:07.911837-07:00 XX postfix/smtpd[24416]: warning: 92.64.225.1: hostname static.kpn.net verification failed: Name or service not known And the last log entry at all: 2014-09-14T19:34:07.911845-07:00 XX postfix/smtpd[24416]: connect from unknown[92.64.225.1] There's nothing else unusual that I can see. I'm not sure if it could be an rsyslog problem, however, usually when this happens, there are some other symptoms (kernel log and regular system log don't show any events around this time). w