Premature "No Space left on device" on XFS

Bernhard Schmidt Thu, 06 Oct 2011 12:37:50 -0700

Hey,

a small issue that is not quite, but a bit, Postfix related.

We (or better said: an over-eager third party) have been running some performance tests against our future outbound bulkmail platform (no, not UCE, university stuff), which consists of multiple SLES11.1 VMs with 1GB of RAM and 4 vCPU each, running Postfix 2.8.5 in a multi-instance setup. It is a classic after-queue content filter with amavisd in between two postfix instances doing DKIM signing.


bulkin --> amavis --> bulkout

bulkin sends everything to amavis, which does the signing and sends everything to bulkout, which discards it (discard: entry in transport map).

The third party has been bombarding the bulkin instance for several hours with 100 parallel threads. There is no chance in hell amavis could cope with that rate, which eventually led to almost 2 million files in the incoming queue. We expected the system to get slower and slower and slower and eventually fail to accept new mails due to queue_minfree being hit. But it happened much earlier, and in a very unexpected way:

Oct 6 20:14:33 lxmhs45 postfix-bulkinhss/bounce[23308]: fatal: open lock file pid/unix.defer: cannot create file exclusively: No space left on device
Oct 6 20:15:35 lxmhs45 postfix-bulkinhss/bounce[23691]: fatal: open lock file pid/unix.defer: cannot create file exclusively: No space left on device
Oct 6 20:16:36 lxmhs45 postfix-bulkinhss/bounce[24479]: fatal: open lock file pid/unix.defer: cannot create file exclusively: No space left on device
Oct 6 20:17:37 lxmhs45 postfix-bulkinhss/bounce[24579]: fatal: open lock file pid/unix.defer: cannot create file exclusively: No space left on device
Oct 6 20:18:39 lxmhs45 postfix-bulkinhss/bounce[24684]: fatal: open lock file pid/unix.defer: cannot create file exclusively: No space left on device
Oct 6 20:19:40 lxmhs45 postfix-bulkinhss/bounce[24847]: fatal: open lock file pid/unix.defer: cannot create file exclusively: No space left on device
Oct 6 20:20:21 lxmhs45 postfix-bulkin/bounce[24949]: fatal: open file defer CD0CF2054904: No space left on device
Oct 6 20:20:21 lxmhs45 postfix-bulkin/bounce[24950]: fatal: open file defer DEBCD1AB98: No space left on device
Oct 6 20:20:21 lxmhs45 postfix-bulkin/bounce[24951]: fatal: open file defer 1800D3019D35: No space left on device
Oct 6 20:20:41 lxmhs45 postfix-bulkinhss/bounce[24977]: fatal: open lock file pid/unix.defer: cannot create file exclusively: No space left on device


lxmhs45:/var/spool/postfix-bulk/postfix-bulkinhss # touch a
touch: cannot touch `a': No space left on device
lxmhs45:/var/spool/postfix-bulk/postfix-bulkinhss # df .
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdb              10475520   7471160   3004360  72% /var/spool/postfix-bulk

lxmhs45:/var/spool/postfix-bulk/postfix-bulkinhss # df -i .
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sdb             10485760 1742528 8743232   17% /var/spool/postfix-bulk
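The two df calls above can also be combined into a single check. The following is only a minimal sketch using Python's os.statvfs; the function name fs_usage and the default path are made up for illustration. It reports the same block and inode counters that df reads, so on this filesystem it would have shown free space and free inodes as well.

```python
import os

def fs_usage(path="."):
    """Return block and inode counters for the filesystem holding `path`.

    Hypothetical helper: it reads the same statvfs counters that
    `df` and `df -i` report.
    """
    st = os.statvfs(path)
    return {
        "total_kib": st.f_blocks * st.f_frsize // 1024,
        "avail_kib": st.f_bavail * st.f_frsize // 1024,
        "inodes_total": st.f_files,
        "inodes_free": st.f_ffree,
    }

if __name__ == "__main__":
    # Example: run from inside the queue directory, as in the transcript.
    for key, value in fs_usage(".").items():
        print(f"{key:>12}: {value}")
```

Since both counters looked healthy here, a check like this would not have caught the failure; it only confirms what df already says.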

Since the bulkin and bulkout instances are both on the same filesystem and mail processing pretty much stopped, we had a nice little lockdown. I had to stop accepting anything and manually move a few files away from the queue filesystem to get things running again.

I'm not really asking for tuning advice here, I can think of a couple of things to do (hash_queue_names = incoming, higher in_flow_delay, kick the users in the groin for doing that). I'm trying to understand what happened here.

We actually thought of the deadlock on queue full before, so bulkin has queue_minfree of 2GB and bulkout has a queue_minfree of 1GB. So the bulkin instance would stop accepting mails from outside way before it could not pass them through the signing chain anymore. But obviously that limit was too low, it started failing at 3GB "free".

Does anyone have a reasonable explanation here? My guess is XFS is allocating a block of 4k for each message file of around 1.5k, but then I'm still missing space (10GB / 4k makes 2.5 million possible files, the filesystem has around 1.9 million files on it). Could the remaining space have been eaten by structural information? And why is it not reported in df?
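As a back-of-the-envelope check of that guess, the df figures quoted earlier can be divided into 4 KiB blocks. This is only a sketch: the 4 KiB block size is an assumption (it was not verified on the filesystem), and the constant names are made up for illustration.

```python
# Figures copied from the df and df -i output in this post (1K-blocks).
TOTAL_KIB = 10475520
USED_KIB = 7471160
AVAIL_KIB = 3004360
INODES_USED = 1742528

BLOCK_KIB = 4  # assumption: 4 KiB filesystem blocks, not verified

def blocks(kib, block_kib=BLOCK_KIB):
    """Number of whole filesystem blocks that fit into `kib` KiB."""
    return kib // block_kib

total_blocks = blocks(TOTAL_KIB)   # blocks on the whole filesystem
used_blocks = blocks(USED_KIB)     # blocks df counts as used
free_blocks = blocks(AVAIL_KIB)    # blocks df counts as available
kib_per_inode = USED_KIB / INODES_USED  # average space used per inode

print(f"total 4k blocks: {total_blocks}")
print(f"used 4k blocks:  {used_blocks}")
print(f"free 4k blocks:  {free_blocks}")
print(f"KiB per inode:   {kib_per_inode:.2f}")
```

On these figures there are roughly 2.6 million 4 KiB blocks in total, about 1.87 million of them in use, and df still counts about 750,000 as free, so one block per message file alone does not account for the failure.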



Oh, and two problems I've noticed when debugging this:

queue_minfree seems to be a signed 32bit value (I tried to set it to more than 2GB to stop accepting mail earlier and it failed horribly)

Oct 6 21:34:54 lxmhs45 postfix-bulkin/smtpd[13722]: fatal: bad numerical configuration: queue_minfree = 5073741824

and postmulti seems to have a problem when the argument order is different than documented


lxmhs45:/etc/postfix-bulkin # postmulti -i postfix-bulkout -p status
postfix-bulkout/postfix-script: the Postfix mail system is running: PID: 13718

lxmhs45:/etc/postfix-bulkin # postmulti -p status -i postfix-bulkout
postfix/postfix-script: the Postfix mail system is not running
postfix-bulkin/postfix-script: the Postfix mail system is running: PID: 13551
postfix-bulkout/postfix-script: the Postfix mail system is running: PID: 13718
postfix-bulkinhss/postfix-script: the Postfix mail system is not running


Is this expected?

Bernhard
