Hey,

a small not-quite but a bit postfix related issue.

We (or better said: an over-eager third party) have been running some performance tests against our future outbound bulkmail platform (no, not UCE, university stuff), which consists of multiple SLES11.1 VMs with 1GB of RAM and 4 vCPU each, running Postfix 2.8.5 in a multi-instance setup. It is a classic after-queue content filter with amavisd in between two postfix instances doing DKIM signing.

bulkin --> amavis --> bulkout

bulkin sends everything to amavis, which does the signing and sends everything to bulkout, which discards it (discard: entry in transport map).

The third-party has been bombarding the bulkin instance for several hours with 100 parallel threads. There is no chance in hell amavis could cope with that rate, which eventually lead to almost 2 million files in the incoming queue. We expected the system to get slower and slower and slower and eventually fail to accept new mails due to queue_minfree being hit. But it happened much earlier, and in a very unexpected way

Oct 6 20:14:33 lxmhs45 postfix-bulkinhss/bounce[23308]: fatal: open lock file pid/unix.defer: cannot create file exclusively: No space left on device Oct 6 20:15:35 lxmhs45 postfix-bulkinhss/bounce[23691]: fatal: open lock file pid/unix.defer: cannot create file exclusively: No space left on device Oct 6 20:16:36 lxmhs45 postfix-bulkinhss/bounce[24479]: fatal: open lock file pid/unix.defer: cannot create file exclusively: No space left on device Oct 6 20:17:37 lxmhs45 postfix-bulkinhss/bounce[24579]: fatal: open lock file pid/unix.defer: cannot create file exclusively: No space left on device Oct 6 20:18:39 lxmhs45 postfix-bulkinhss/bounce[24684]: fatal: open lock file pid/unix.defer: cannot create file exclusively: No space left on device Oct 6 20:19:40 lxmhs45 postfix-bulkinhss/bounce[24847]: fatal: open lock file pid/unix.defer: cannot create file exclusively: No space left on device Oct 6 20:20:21 lxmhs45 postfix-bulkin/bounce[24949]: fatal: open file defer CD0CF2054904: No space left on device Oct 6 20:20:21 lxmhs45 postfix-bulkin/bounce[24950]: fatal: open file defer DEBCD1AB98: No space left on device Oct 6 20:20:21 lxmhs45 postfix-bulkin/bounce[24951]: fatal: open file defer 1800D3019D35: No space left on device Oct 6 20:20:41 lxmhs45 postfix-bulkinhss/bounce[24977]: fatal: open lock file pid/unix.defer: cannot create file exclusively: No space left on device

lxmhs45:/var/spool/postfix-bulk/postfix-bulkinhss # touch a
touch: cannot touch `a': No space left on device
lxmhs45:/var/spool/postfix-bulk/postfix-bulkinhss # df .
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdb 10475520 7471160 3004360 72% /var/spool/postfix-bulk
lxmhs45:/var/spool/postfix-bulk/postfix-bulkinhss # df -i .
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sdb             10485760 1742528 8743232   17% /var/spool/postfix-bulk

Since both bulkin and bulkout instances are both on the same filesystem and mail processing pretty much stopped, we had a nice little lockdown. I had to stop accepting anything and manually move a few files away from the queue filesystem to get things running again.

I'm not really asking for tuning advise here, I can think of a couple of things to do (hash_queue_names = incoming, higher in_flow_delay, kick the users in the groin for doing that). I'm trying to understand what happened here.

We actually thought of the deadlock on queue full before, so bulkin has queue_minfree of 2GB and bulkout has a queue_minfree of 1GB. So the bulkin instance would stop accepting mails from outside way before it could not pass them through the signing chain anymore. But obviously that limit was too low, it started failing at 3GB "free".

Does anyone have a reasonable explaination here? My guess is XFS is allocating a block of 4k for each message file of around 1.5k, but then I'm still missing space (10GB / 4k makes 2.5 Mio possible files, the filesystem has around 1.9 million files on it). Could the remaining space have been eaten by structural information? And why is it not reported in df?


Oh, and two problems I've noticed when debugging this:

queue_minfree seems to be a signed 32bit value (I tried to set it to more than 2GB to stop accepting mail earlier and it failed horribly)

Oct 6 21:34:54 lxmhs45 postfix-bulkin/smtpd[13722]: fatal: bad numerical configuration: queue_minfree = 5073741824



and postmulti seems to have a problem when the argument order is different than documented

lxmhs45:/etc/postfix-bulkin # postmulti -i postfix-bulkout -p status
postfix-bulkout/postfix-script: the Postfix mail system is running: PID: 13718
lxmhs45:/etc/postfix-bulkin # postmulti -p status -i postfix-bulkout
postfix/postfix-script: the Postfix mail system is not running
postfix-bulkin/postfix-script: the Postfix mail system is running: PID: 13551 postfix-bulkout/postfix-script: the Postfix mail system is running: PID: 13718
postfix-bulkinhss/postfix-script: the Postfix mail system is not running


Is this expected?

Bernhard

Reply via email to