On 05/28/12 20:06, Iwan Aucamp wrote:
I'm getting sub-optimal performance with an mmap-based database
(mongodb) running on ZFS on Solaris 10u9.
The system is a Sun-Fire X4270-M2 with 2 x X5680 CPUs and 72 GB (6 * 8GB + 6 * 4GB)
of RAM (populated so it runs at 1333 MHz) and 2 * 300GB 15K RPM disks.
- a few mongodb instances are running with moderate IO and a total
rss of 50 GB
- a service which logs quite excessively (5 GB every 20 mins) is also
running (max 2 GB RAM use) - log files are later compressed with
bzip2.
Database performance is quite horrid though - it seems that ZFS does
not know how to balance memory between the page cache and the ARC,
and the ARC wins most of the time.
I'm thinking of doing the following:
- relocating the mmapped (mongo) data to a ZFS filesystem with only a
metadata cache
- reducing the ZFS ARC to 16 GB
Are there any other recommendations, and is the above likely to improve
performance?
1. Upgrade to S10 Update 10 - this has various performance improvements,
in particular related to database-type workloads (but I don't know
anything about mongodb).
2. Reduce the ARC size so RSS + ARC + other memory users < RAM size.
I assume the RSS includes whatever caching the database does. In
theory, a database should be able to work out what's worth caching
better than any filesystem can guess from underneath it, so you want to
configure more memory in the DB's cache than in the ARC. (The default
ARC tuning is unsuitable for a database server.)
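If you do cap the ARC, on Solaris 10 that's normally done with the
zfs_arc_max tunable in /etc/system plus a reboot. The value below is
only an illustration, matching the 16 GB figure you mentioned:

   set zfs:zfs_arc_max = 0x400000000

(0x400000000 is 16 GB expressed in bytes.)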
3. If the database has some concept of blocksize or recordsize that it
uses to perform I/O, make sure the filesystems it is using are configured
with the same recordsize. The ZFS default recordsize (128kB) is usually
much bigger than database blocksizes. This is probably going to have
less impact with an mmapped database than a read(2)/write(2) database,
where it may prove better to match the filesystem's recordsize to the
system's page size (4kB, unless it's using some type of large pages). I
haven't tried playing with recordsize for memory-mapped I/O, so I'm
speculating here.
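If you do experiment with it, matching the recordsize to the page size
is a one-line change (the dataset name here is just a placeholder).
Note that recordsize only affects files written after the change, so
the database files would need to be copied or recreated to pick it up:

   zfs set recordsize=4k tank/mongodb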
Blocksize or recordsize may apply to the log file writer too, and it may
be that this needs a different recordsize and therefore has to be in a
different filesystem. If it uses write(2) or some variant rather than
mmap(2) and doesn't document this in detail, DTrace is your friend.
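A rough one-liner along these lines will show you the distribution of
write(2) sizes per process (let it run for a while, then Ctrl-C):

   dtrace -n 'syscall::write:entry { @[execname] = quantize(arg2); }'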
4. Keep plenty of free space in the zpool if you want good database
performance. If you're more than 60% full (S10U9) or 80% full (S10U10),
that could be a factor.
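You can check where you stand with:

   zpool list

The CAP column shows the percentage of the pool that's in use.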
Anyway, there are a few things to think about.
--
Andrew