On 05/28/12 20:06, Iwan Aucamp wrote:
I'm getting sub-optimal performance with an mmap-based database (mongodb) running on ZFS on Solaris 10u9.

System is a Sun-Fire X4270-M2 with 2 x X5680 CPUs and 72 GB (6 * 8 GB + 6 * 4 GB) of RAM (installed so it runs at 1333 MHz) and 2 * 300 GB 15K RPM disks.

- a few mongodb instances are running with moderate IO and a total RSS of 50 GB
- a service which logs quite excessively (5 GB every 20 mins) is also running (max 2 GB RAM use)
- log files are compressed to bzip2 after some time

Database performance is quite horrid though - it seems that ZFS does not know how to manage the allocation between the page cache and the ARC, and the ARC seems to win most of the time.

I'm thinking of doing the following:
- relocating the mmapped (mongo) data to a ZFS filesystem that caches only metadata
- reducing the ZFS ARC to 16 GB

Are there any other recommendations - and is the above likely to improve performance?

1. Upgrade to S10 Update 10 - this has various performance improvements, in particular related to database type loads (but I don't know anything about mongodb).

2. Reduce the ARC size so RSS + ARC + other memory users < RAM size.
I assume the RSS includes whatever caching the database does. In theory, a database should be able to work out what's worth caching better than any filesystem can guess from underneath it, so you want to configure more memory in the DB's cache than in the ARC. (The default ARC tuning is unsuitable for a database server.)
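For example, on Solaris 10 the ARC ceiling is capped with the zfs_arc_max tunable in /etc/system (the 16 GB value below is just the figure from your own plan, and a reboot is needed for it to take effect):

    * Cap the ZFS ARC at 16 GB (value is in bytes)
    set zfs:zfs_arc_max = 17179869184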

3. If the database has some concept of blocksize or recordsize that it uses to perform I/O, make sure the filesystems it is using are configured with the same recordsize. The ZFS default recordsize (128 kB) is usually much bigger than database blocksizes. This will probably have less impact with an mmapped database than with a read(2)/write(2) database; for mmapped files it may prove better to match the filesystem's recordsize to the system's page size (4 kB, unless it's using some type of large pages). I haven't tried playing with recordsize for memory-mapped I/O, so I'm speculating here.
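As a rough sketch (the pool and filesystem names are made up), recordsize is a per-filesystem property and only affects blocks written after the change, so existing data would need to be copied onto the tuned filesystem:

    zfs create -o recordsize=4k tank/mongo
    zfs set recordsize=4k tank/mongo    # existing filesystem; applies to newly written blocks only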

Blocksize or recordsize may apply to the log-file writer too, and it may be that it needs a different recordsize and therefore has to be in a different filesystem. If it uses write(2) or some variant rather than mmap(2) and doesn't document this in detail, DTrace is your friend.
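Something along these lines (the execname is a placeholder for whatever the log writer's process is actually called) would show which of those calls it is making:

    dtrace -n 'syscall::write:entry,syscall::mmap:entry /execname == "logwriter"/ { @[probefunc] = count(); }'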

4. Keep plenty of free space in the zpool if you want good database performance. If you're more than 60% full (S10U9) or 80% full (S10U10), that could be a factor.
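The CAP column of zpool list shows how full the pool is (pool name below is a placeholder):

    zpool list tank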

Anyway, there are a few things to think about.

--
Andrew
