I recently upgraded a couple of servers from postfix 2.2 to 2.5. No configuration changes except those made by the upgrade scripts.
Now, during large mailings, the two new servers have frequent qmgr crashes, while the ones running 2.2 do not. The problem is qmgr runs up against the per-process open filehandle limit of 1024:

    postfix/qmgr[21445]: fatal: fcntl F_DUPFD 128: Too many open files

What I'm trying to understand is *why* it's hitting the limit. These servers are configured to have a maximum of 960 smtp processes. On the 2.2 servers, when under heavy load when a new mailing is being submitted, qmgr generally has a little over 960 open filehandles, so I assume (but do not know for sure) that during heavy activity it keeps one socket open to each smtp. On the 2.5 servers, however, it tends to climb slowly towards 900-ish and then suddenly spike up to 1024 and die, and I don't know what it wants all those extra filehandles for.

I wrote a script to monitor how many open filehandles qmgr and scache have, and to save lsof output to a file when it gets above 900. Here is typical output from a postfix 2.2 server:

    2008-10-30 15:16:26 qmgr: 972 scache: 310
    2008-10-30 15:16:36 qmgr: 970 scache: 479
    2008-10-30 15:16:46 qmgr: 973 scache: 518
    2008-10-30 15:16:50 qmgr: 974 scache: 538
    2008-10-30 15:17:05 qmgr: 971 scache: 583
    2008-10-30 15:17:09 qmgr: 970 scache: 593

... and here's a qmgr crash on one of the upgraded postfix 2.5 servers:

    2008-10-30 14:41:05 qmgr: 840 scache: 907
    2008-10-30 14:41:09 qmgr: 845 scache: 927
    2008-10-30 14:41:18 qmgr: 860 scache: 898
    2008-10-30 14:41:22 qmgr: 864 scache: 919
    2008-10-30 14:41:57 qmgr: 904 scache: 851
    2008-10-30 14:42:01 qmgr: 903 scache: 876
    2008-10-30 14:42:06 qmgr: 909 scache: 885
    2008-10-30 14:42:10 qmgr: 11 scache: 930
    2008-10-30 14:43:14 qmgr: 632 scache: 845

The qmgr crash in this case was logged at 14:42:09, so qmgr spiked from 909 to 1024 in about 3 seconds. That's typical.
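For anyone who wants to reproduce the measurements, a monitor of the kind described above could be sketched roughly as follows. This is not the exact script used here, only a minimal illustration. It assumes Linux with /proc mounted, Python 3, an installed lsof, and enough privilege (normally root) to read another user's /proc/<pid>/fd. All function names, the 5-second interval, and the /tmp dump location are hypothetical choices made for the example.

```python
import os
import subprocess
import time

THRESHOLD = 900  # save lsof output once a process group exceeds this many fds


def pids_by_name(name):
    """Return PIDs whose /proc/<pid>/comm equals name (Linux only)."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            pass  # process exited while we were scanning
    return pids


def count_open_fds(pid):
    """Count entries in /proc/<pid>/fd, i.e. the open descriptors of pid."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def sample(names=("qmgr", "scache")):
    """Return {process name: total open fds across matching PIDs}."""
    counts = {}
    for name in names:
        total = 0
        for pid in pids_by_name(name):
            try:
                total += count_open_fds(pid)
            except OSError:
                pass  # process went away or permission denied
        counts[name] = total
    return counts


def format_sample(counts, now=None):
    """Format one log line: '<date> <time> qmgr: N scache: M'."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", now or time.localtime())
    return stamp + " " + " ".join(f"{name}: {n}" for name, n in counts.items())


def over_threshold(counts, threshold=THRESHOLD):
    """Return the names whose fd count is strictly above the threshold."""
    return [name for name, n in counts.items() if n > threshold]


def monitor(interval=5, dump_dir="/tmp"):
    """Log a sample every `interval` seconds; dump lsof output when high."""
    while True:
        counts = sample()
        print(format_sample(counts), flush=True)
        for name in over_threshold(counts):
            for pid in pids_by_name(name):
                result = subprocess.run(["lsof", "-p", str(pid)],
                                        capture_output=True, text=True)
                path = os.path.join(
                    dump_dir, f"lsof-{name}-{pid}-{int(time.time())}.txt")
                with open(path, "w") as f:
                    f.write(result.stdout)
        time.sleep(interval)
```

Calling monitor() as root starts the loop. Because the spike from about 900 to 1024 takes only a few seconds, a short interval gives a better chance of capturing lsof output close to the crash.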
Saved qmgr output from the last sample looks like this:

    qmgr 10477 postfix 0u CHR 1,3 2176 /dev/null
    qmgr 10477 postfix 1u CHR 1,3 2176 /dev/null
    qmgr 10477 postfix 2u CHR 1,3 2176 /dev/null
    qmgr 10477 postfix 3r FIFO 0,7 1061266654 pipe
    qmgr 10477 postfix 4w FIFO 0,7 1061266654 pipe
    qmgr 10477 postfix 5u unix 0x0000010222b41980 1061266544 socket
    qmgr 10477 postfix 6u FIFO 0,18 1061266542 /var/spool/postfix/public/qmgr
    qmgr 10477 postfix 7u sock 0,4 1098866379 can't identify protocol
    qmgr 10477 postfix 8r 0000 0,8 0 1098866385 eventpoll
    qmgr 10477 postfix 9r DIR 0,18 1286400 38353 /var/spool/postfix/incoming
    qmgr 10477 postfix 10u unix 0x000001006c70ac80 1100120907 socket
    qmgr 10477 postfix 12r 0000 0,8 0 1061266532 eventpoll
    qmgr 10477 postfix 14u sock 0,4 1098866686 can't identify protocol
    qmgr 10477 postfix 128u unix 0x000001004c68f640 1100103289 socket
    qmgr 10477 postfix 129u unix 0x0000010023874380 1100117731 socket
    ...
    qmgr 10477 postfix 1012u unix 0x000001021833b080 1100120331 socket
    qmgr 10477 postfix 1013u unix 0x00000100916a1940 1100120998 socket
    qmgr 10477 postfix 1014u unix 0x0000010083283c40 1100116997 socket
    qmgr 10477 postfix 1015u unix 0x000001013d373c40 1100120955 socket
    qmgr 10477 postfix 1016u unix 0x00000100537f6640 1100121209 socket

Which is just what it always looks like except with a lot more unix sockets. I don't know how to determine what each socket is talking to. The number of smtp processes never exceeds 960, and there are usually 2 smtpd processes (this server does not handle incoming mail).

Any explanation for why, with postfix 2.5, qmgr occasionally tries to use a bunch of extra filehandles? Or something I can do to find out?

-- Cos