Hey,
a small not-quite but a bit postfix related issue.
We (or better said: an over-eager third party) have been running some
performance tests against our future outbound bulkmail platform (no, not
UCE, university stuff), which consists of multiple SLES11.1 VMs with 1GB
of RAM and 4 vCPU each, running Postfix 2.8.5 in a multi-instance setup.
It is a classic after-queue content filter with amavisd in between two
postfix instances doing DKIM signing.
bulkin --> amavis --> bulkout
bulkin sends everything to amavis, which does the signing and sends
everything to bulkout, which discards it (discard: entry in transport map).
The third-party has been bombarding the bulkin instance for several
hours with 100 parallel threads. There is no chance in hell amavis could
cope with that rate, which eventually lead to almost 2 million files in
the incoming queue. We expected the system to get slower and slower and
slower and eventually fail to accept new mails due to queue_minfree
being hit. But it happened much earlier, and in a very unexpected way
Oct 6 20:14:33 lxmhs45 postfix-bulkinhss/bounce[23308]: fatal: open
lock file pid/unix.defer: cannot create file exclusively: No space left
on device
Oct 6 20:15:35 lxmhs45 postfix-bulkinhss/bounce[23691]: fatal: open
lock file pid/unix.defer: cannot create file exclusively: No space left
on device
Oct 6 20:16:36 lxmhs45 postfix-bulkinhss/bounce[24479]: fatal: open
lock file pid/unix.defer: cannot create file exclusively: No space left
on device
Oct 6 20:17:37 lxmhs45 postfix-bulkinhss/bounce[24579]: fatal: open
lock file pid/unix.defer: cannot create file exclusively: No space left
on device
Oct 6 20:18:39 lxmhs45 postfix-bulkinhss/bounce[24684]: fatal: open
lock file pid/unix.defer: cannot create file exclusively: No space left
on device
Oct 6 20:19:40 lxmhs45 postfix-bulkinhss/bounce[24847]: fatal: open
lock file pid/unix.defer: cannot create file exclusively: No space left
on device
Oct 6 20:20:21 lxmhs45 postfix-bulkin/bounce[24949]: fatal: open file
defer CD0CF2054904: No space left on device
Oct 6 20:20:21 lxmhs45 postfix-bulkin/bounce[24950]: fatal: open file
defer DEBCD1AB98: No space left on device
Oct 6 20:20:21 lxmhs45 postfix-bulkin/bounce[24951]: fatal: open file
defer 1800D3019D35: No space left on device
Oct 6 20:20:41 lxmhs45 postfix-bulkinhss/bounce[24977]: fatal: open
lock file pid/unix.defer: cannot create file exclusively: No space left
on device
lxmhs45:/var/spool/postfix-bulk/postfix-bulkinhss # touch a
touch: cannot touch `a': No space left on device
lxmhs45:/var/spool/postfix-bulk/postfix-bulkinhss # df .
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdb 10475520 7471160 3004360 72%
/var/spool/postfix-bulk
lxmhs45:/var/spool/postfix-bulk/postfix-bulkinhss # df -i .
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sdb 10485760 1742528 8743232 17% /var/spool/postfix-bulk
Since both bulkin and bulkout instances are both on the same filesystem
and mail processing pretty much stopped, we had a nice little lockdown.
I had to stop accepting anything and manually move a few files away from
the queue filesystem to get things running again.
I'm not really asking for tuning advise here, I can think of a couple of
things to do (hash_queue_names = incoming, higher in_flow_delay, kick
the users in the groin for doing that). I'm trying to understand what
happened here.
We actually thought of the deadlock on queue full before, so bulkin has
queue_minfree of 2GB and bulkout has a queue_minfree of 1GB. So the
bulkin instance would stop accepting mails from outside way before it
could not pass them through the signing chain anymore. But obviously
that limit was too low, it started failing at 3GB "free".
Does anyone have a reasonable explaination here? My guess is XFS is
allocating a block of 4k for each message file of around 1.5k, but then
I'm still missing space (10GB / 4k makes 2.5 Mio possible files, the
filesystem has around 1.9 million files on it). Could the remaining
space have been eaten by structural information? And why is it not
reported in df?
Oh, and two problems I've noticed when debugging this:
queue_minfree seems to be a signed 32bit value (I tried to set it to
more than 2GB to stop accepting mail earlier and it failed horribly)
Oct 6 21:34:54 lxmhs45 postfix-bulkin/smtpd[13722]: fatal: bad
numerical configuration: queue_minfree = 5073741824
and postmulti seems to have a problem when the argument order is
different than documented
lxmhs45:/etc/postfix-bulkin # postmulti -i postfix-bulkout -p status
postfix-bulkout/postfix-script: the Postfix mail system is running: PID:
13718
lxmhs45:/etc/postfix-bulkin # postmulti -p status -i postfix-bulkout
postfix/postfix-script: the Postfix mail system is not running
postfix-bulkin/postfix-script: the Postfix mail system is running: PID:
13551
postfix-bulkout/postfix-script: the Postfix mail system is running: PID:
13718
postfix-bulkinhss/postfix-script: the Postfix mail system is not running
Is this expected?
Bernhard