I've had Postfix seemingly stop responding 3 times on 2 systems over the
weekend. In all cases, the system is accessible, but logging in is
sluggish, however, load, iowait, memory usage, etc. all appear perfectly
normal. Postfix doesn't log any unusual errors / warnings / etc., but
simply seems to stop logging abruptly. So it seems like it may be a
kernel param or ulimit issue, vs. Postfix config or performance
constraint.

I saved netstat, ps, and lsof output to a file before restarting, and
don't see anything unusual or close to limits in terms of # of open
files, # of open connections, # of smtpd processes, etc.. The tcp
connection opens, but no banner is sent if a new connection is
initiated. The smtpds do not have the 'stress' parameter set to
anything.

The various smtpds each are set to '-' in master.cf, and main.cf /
postconf reports:
default_process_limit = 500

(in the most recent case, the *total* number of smtpd instances in the
'ps' output was 50, and connections are spread out across 25 / 465 / 587
and one other instance that's mapped to a different IP on port 25 from
the load balancer, so I don't think it's a simple case of too low a
process limit).

This is on RHEL6 with the stock 2.6.6 package (version 6). I'd rather
not post full postconf -n output at this time, though I'm happy to
report on specific variables.

Running ulimit -a as the postfix user reports:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256302
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 49152
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 10240
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) 49152

It's not something I can reproduce on a test system, so any suggestions
on the gentlest way to debug this in production?

On the most recent occurance, I straced the master process, following
child processes, and see this a bunch of times (before stopping postfix
or rebooting the system). Any postfix-whisperers have an idea of what
might be happening just based on this?

write(104, "\0", 1)                     = 1
rt_sigreturn(0x68)                      = -1 EINTR (Interrupted system call)
alarm(333)                              = 333
epoll_wait(14, {{EPOLLIN, {u32=11, u64=11530104173018218507}}}, 100, 9000) = 1
read(11, "\0", 1)                       = 1
read(11, 0x7ffffdfd329f, 1)             = -1 EAGAIN (Resource temporarily 
unavailable)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 30794
epoll_ctl(14, EPOLL_CTL_ADD, 71, {EPOLLIN, {u32=71, u64=11634484007298662471}}) 
= 0
wait4(-1, 0x7ffffdfd3778, WNOHANG, NULL) = 0
alarm(333)                              = 333
epoll_wait(14, {{EPOLLIN, {u32=87, u64=11634081860920803415}}}, 100, 9000) = 1
read(87, ")/\0\0u\t\0\0\0\0\0\0", 12)   = 12
alarm(333)                              = 331
epoll_wait(14, {{EPOLLIN, {u32=87, u64=11634081860920803415}}}, 100, 7000) = 1
read(87, ")/\0\0u\t\0\0\1\0\0\0", 12)   = 12
alarm(333)                              = 333
epoll_wait(14, {{EPOLLIN, {u32=87, u64=11634081860920803415}}}, 100, 7000) = 1
read(87, ")/\0\0u\t\0\0\0\0\0\0", 12)   = 12
alarm(333)                              = 333
epoll_wait(14, {{EPOLLIN, {u32=87, u64=11634081860920803415}}}, 100, 7000) = 1
read(87, ")/\0\0u\t\0\0\1\0\0\0", 12)   = 12
alarm(333)                              = 333
epoll_wait(14, 7ffffdfd32c0, 100, 7000) = -1 EINTR (Interrupted system call)
--- SIGCHLD (Child exited) @ 0 (0) ---

The last thing I see that looks remotely like a problem before the
system stops logging / responding is:

2014-09-14T19:32:42.251092-07:00 XX postfix/smtpd[27487]: warning: 
network_biopair_interop: error writing 53 bytes to the network: Connection 
reset by peer
2014-09-14T19:32:42.251099-07:00 XX postfix/smtpd[27487]: warning: 
network_biopair_interop: error writing 37 bytes to the network: Broken pipe

but these are fairly common, presumably because of SSL version mismatch
problems. Then the last warning message I see is:

2014-09-14T19:34:07.911837-07:00 XX postfix/smtpd[24416]: warning: 92.64.225.1: 
hostname static.kpn.net verification failed: Name or service not known

And the last log entry at all:
2014-09-14T19:34:07.911845-07:00 XX postfix/smtpd[24416]: connect from 
unknown[92.64.225.1]

There's nothing else unusual that I can see. I'm not sure if it could be
an rsyslog problem, however, usually when this happens, there are some
other symptoms (kernel log and regular system log don't show any events
around this time).

w

Reply via email to