On 10/9/2011 4:49 PM, karave...@mail.bg wrote:
> I run a couple of busy postfix MX servers with queues now on XFS:
> average: 400 deliveries per minute
> peak: 1200 deliveries per minute
>
> 4 months ago they were hosted on 8-core Xeon, 6x SAS 10k RAID 10
> machines. The spools were on ext4.
>
> When I switched the queue filesystem to XFS with the delaylog option
> (around 2.6.36), the load average dropped from 2.5 to 0.5.
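(Aside for readers following along: delaylog is an XFS mount option on
kernels of that era — I believe it became the default around 2.6.39.
A minimal sketch; the device and mount point are placeholders, not the
poster's actual paths:)

```shell
# Mount the queue filesystem with XFS delayed logging
# (explicit on pre-default kernels; device/path are placeholders):
mount -o delaylog /dev/sdb1 /var/spool/postfix

# Or persistently via /etc/fstab:
# /dev/sdb1  /var/spool/postfix  xfs  delaylog,noatime  0 0
```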
Nice. I wouldn't have expected quite that much gain with a Postfix
queue workload (my inbound flows are much smaller).

> Now, about the spools. They are managed by Cyrus, so not Maildir but
> close. We currently use 2 types of servers for spools:
> 24x 1T SATA disks in RAID5
> 12x 3T SATA disks in RAID5
> The mail spools and other mail-related filesystems are on XFS with
> the delaylog option. They run at an average of 200 TPS.
>
> Yes, the expunges take some time. But we run the task every night on
> 1/7 of the mailboxes, so every mailbox is expunged once a week. The
> expunge task runs for 2-3 hours on around 50k mailboxes.

Ouch. There are probably multiple contributors to that 2-3 hour run
time, but I'm wondering how much of it, if any, is due to a less than
optimal XFS configuration, such as using the inode64 allocator with
too many AGs, causing head thrashing. According to your formula below,
your 24-disk 7.2k SATA array would have been created with 52 AGs on a
dual-CPU system. 52 is excessive and could very well cause head
thrashing with this workload. 24 AGs would be more optimal for your
system. Care to share your xfs_info output? Maybe off list is best,
since we're OT.

On 10/9/2011 5:33 PM, karave...@mail.bg wrote:
> Setting a higher number of allocation groups per XFS
> filesystem helps a lot with concurrency. My rule of
> thumb (learnt from databases) is:
> number of spindles + 2 * number of CPUs.

Assuming you're using inode64, this may work OK up to a point. With
inode32 it is a very bad idea. Depending on your RAID
hardware/software and the spindle speed of the drives, at a certain
number of allocation groups your performance will begin to _degrade_
due to excessive head seeking, as AGs are spread evenly across the
platter. This obviously applies only to mechanical disks, not SSDs.

> About the fsck times. We experienced a couple of power
> failures and XFS comes up in 30-45 minutes (30T in
> RAID5 of 12 SATA disks).
> If the server is shut down
> correctly, it comes up in a second.

Interesting. That slow check time could also be a symptom of too many
AGs on an inode64 filesystem. xfs_check and xfs_repair walk the entire
directory structure using parallel threads. With inode64, metadata is
stored in all AGs, which, again, are spread evenly across the
effective platter. So you'd get head thrashing here as well, as
threads compete for head movement between AGs.

> We know that RAID5 is not the best option for write
> scalability, but the controller write cache helps a lot.

Especially if you have barriers disabled, which you should with a BBWC
and the individual drive caches disabled. You're still taking an RMW
beating on the expunge even with a good/big write cache. If you don't
mind my asking, what RAID HBA or SAN head are you using?

-- 
Stan
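(Editorial aside: the AG arithmetic in this thread can be sketched in
a few lines of shell. Stan's 52-AG figure for the 24-disk, dual-CPU
box implies the rule of thumb is being read as (spindles + 2) * CPUs;
the function below uses that reading, and the mkfs.xfs/xfs_info lines
are illustrative with a placeholder device:)

```shell
#!/bin/sh
# Rule of thumb as applied above: AGs = (spindles + 2) * CPUs
ag_count() {
    echo $(( ($1 + 2) * $2 ))
}

ag_count 24 2   # 24-disk array, dual CPU -> 52
ag_count 12 2   # 12-disk array, dual CPU -> 28

# Stan's suggestion instead: cap the AG count explicitly at mkfs time
# (placeholder device; mkfs destroys data on it):
# mkfs.xfs -d agcount=24 /dev/sdX
# xfs_info /mountpoint    # shows agcount/agsize of an existing filesystem
```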