On 10/9/2011 4:49 PM, karave...@mail.bg wrote:

> I run a couple of busy postfix MX servers with queues now on XFS:
> average: 400 deliveries per minute 
> peak: 1200 deliveries per minute.
> 
> 4 months ago they were hosted on 8 core Xeon, 6xSAS10k RAID 10 
> machines. The spools were on ext4.
> 
> When I switched the queue filesystem to XFS with the delaylog option 
> (around kernel 2.6.36), the load average dropped from 2.5 to 0.5.

Nice.  I wouldn't have expected quite that much gain with a Postfix
queue workload (my inbound flows are much smaller).
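For anyone who wants to try the same change: delaylog is just a mount
option on kernels of that era (it became the default later).  A minimal
sketch, assuming a hypothetical /dev/sdb1 holding the queue:

    # Remount the queue filesystem with delayed logging enabled.
    umount /var/spool/postfix
    mount -o delaylog,noatime /dev/sdb1 /var/spool/postfix

    # Or permanently, via /etc/fstab (device path is hypothetical):
    # /dev/sdb1  /var/spool/postfix  xfs  delaylog,noatime  0 0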

> Now, about the spools. They are managed by Cyrus, so not Maildir but 
> close. We currently use two types of servers for spools:
> 24 x 1 TB SATA disks in RAID5
> 12 x 3 TB SATA disks in RAID5
> The mail spools and other mail-related filesystems are on XFS with the 
> delaylog option. They run at an average of 200 TPS.
> 
> Yes, the expunges take some time. But we run the task every night for 
> 1/7 of the mailboxes, so every mailbox is expunged once a week. The 
> expunge task runs for 2-3 hours on around 50k mailboxes.

Ouch.  There are probably multiple contributors to that 2-3 hour run
time, but I'm wondering how much of it, if any, is due to a less than
optimal XFS configuration, such as using the inode64 allocator with too
many AGs, causing head thrashing.  According to your formula below, your
24-disk 7.2k SATA array would have been created with 52 AGs on a
dual-CPU system.  52 is excessive and could very well cause head
thrashing with this workload.  24 AGs would be much closer to optimal
for your system.  Care to share your xfs_info output?  Maybe off list
is best since we're OT.
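For anyone checking their own arrays: agcount is printed on the
meta-data line of xfs_info.  A quick way to pull it out, assuming a
hypothetical /var/spool/imap mount point:

    # Report the allocation group count of a mounted XFS filesystem.
    xfs_info /var/spool/imap | grep -o 'agcount=[0-9]*'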


On 10/9/2011 5:33 PM, karave...@mail.bg wrote:

> Setting a higher number of allocation groups per XFS
> filesystem helps a lot for concurrency. My rule of
> thumb (learnt from databases) is:
> (number of spindles + 2) * number of CPUs.

Assuming you're using inode64, this may work ok up to a point.  With
inode32 this is a very bad idea.  Depending on your RAID
hardware/software and the spindle speed of the drives, at a certain
number of allocation groups your performance will begin to _degrade_ due
to excessive head seeking, as AGs are spread evenly across the platter.
This obviously applies only to mechanical disks, not SSDs.
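If you'd rather pin the AG count explicitly than let a spindle/CPU rule
drive it up, mkfs.xfs takes it directly.  A sketch, assuming a
hypothetical /dev/sdb for the 24-disk array and a 64k RAID chunk
(adjust su/sw to your real geometry):

    # One AG per spindle, plus stripe geometry matching the RAID5
    # layout (23 data disks out of 24, 64k chunk size assumed).
    mkfs.xfs -d agcount=24,su=64k,sw=23 /dev/sdb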

> About the fsck times. We have experienced a couple of power
> failures, and XFS comes up in 30-45 minutes (30 TB in
> RAID5 of 12 SATA disks). If the server is shut down
> correctly, it comes up in a second.

Interesting.  This slow check time could also be a symptom of too many
AGs on an inode64 filesystem.  xfs_check and xfs_repair walk the entire
directory structure using parallel threads.  With inode64, metadata is
stored in all AGs, which, again, are spread evenly across the effective
platter.  So you'd get head thrashing as threads compete for head
movement between AGs.
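If you want to see where those 30-45 minutes go without risking
anything, a read-only repair pass walks the same metadata.  A sketch,
device path hypothetical:

    # -n (no modify) checks but never writes; time it to compare runs.
    umount /var/spool/imap
    time xfs_repair -n /dev/sdb

    # If your xfsprogs is new enough, -o ag_stride=N also lets you
    # experiment with how the repair threads are spread across AGs.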

> We know that RAID5 is not the best option for write
> scalability, but the controller write cache helps a lot.

Especially if you have barriers disabled, which you should with a BBWC
and the individual drive caches disabled.  You're still taking an RMW
beating on the expunge even with good/big write cache.  If you don't
mind me asking, what RAID HBA or SAN head are you using?
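For completeness, the combination described above would look something
like this in /etc/fstab (a sketch; device name hypothetical, and only
sane with a battery-backed write cache):

    # Barriers off is safe ONLY behind a BBWC with the drive caches
    # off; the drive caches themselves are usually disabled from the
    # RAID controller's own management tool.
    # /dev/sdb1  /var/spool/imap  xfs  delaylog,nobarrier,noatime  0 0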

-- 
Stan
