I thought I'd share some lessons learned testing Oracle APS on Solaris 10 with ZFS as the backend storage. I just finished two months of performance tests on a V490 (32GB, 4 x 1.8GHz dual-core processors, 2 x Sun 2Gb HBAs on separate fabrics), varying how I managed the storage. Storage included EMC CX-600 disk (presented both as LUNs and as exported disks) and Pillar Axiom disk, using ZFS for all filesystems, some filesystems, and no filesystems, versus VxVM and UFS combinations. The specifics of my timing data won't be generally useful (I simply needed to drive down the run time of one large job), but this layout has been generally useful in keeping latency down for my ordinary ERP and other Oracle loads. These suggestions were taken from the performance wiki, from other mailing lists and posts made here, and from my own guesstimates. Rough command sketches for the layout and the tunables I mention follow the list.

- Everything on VxFS was pretty fast out of the box, but I expected that.
- Everything on vanilla UFS was a bit slower in filebench tests, but dragged out my plan during real loads.
- An 8k recordsize for the datafile filesystems (matching db_block_size) is essential. Both filebench and live testing bear this out.
- Separate the redo logs from the data pool. Redo logs blew chunks on every ZFS configuration, driving up my total process time in every case (the job is redo-intensive at important junctures).
- Redo logs on a RAID 10 LUN on EMC mounted with forcedirectio,noatime beat the same LUN under VxFS several times over (I didn't test Quick I/O, which we don't normally use anyway). Slicing and presenting LUNs from the same RAID group was faster than slicing a single LUN from the OS (for multiplexed redo logs: the primary member on one group, the secondary on the other), but it didn't get any faster or seriously drop my latency overhead when I used entirely separate RAID 10 groups. Using EMC LUNs was consistently faster than exporting the disks and building Veritas or DiskSuite volumes.
- Separate /backup onto its own pool. Huge difference during backups. I use low-priority Axiom disk here.
- Exporting disks from EMC and using those to build RAID 10 mirrors in ZFS. This is annoying, since I'd prefer to create the mirrors on EMC to take advantage of the hot spares and the backend processing, but the kernel still panics every time a single device that isn't redundant at the ZFS level backs up and causes a bus reset.
- For my particular test, 7 x RAID 10 (14 x 73GB 15k drives) ended up as fast or faster than the same number of drives split into EMC LUNs with VxFS on them. With 11 mirror pairs (22 drives) and /backup and the redo logs on the main pool, the drives always stay at high latency and performance craters during backups.
- I tried futzing with sd:sd_max_throttle, with values from 20 (low-water mark) to 64 (high-water mark without errors), and my particular process didn't seem to benefit. I left this value at 20, since EMC still recommends it.
- No particular advantage to PowerPath vs. MPxIO other than price.
- The set_arc.sh script to cap the ARC (when it worked, the first couple of times). The program grabs many GB of memory, so fighting the ARC cache for the right to mmap was a huge impediment.
- Pillar was pretty quick when I created multiple LUNs and strung them together as one big stripe, but it wasn't as consistent in I/O usage or overall time as EMC.
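To make the pool layout concrete, here's a minimal sketch of the setup described above, assuming the EMC-exported-disk variant. Pool names, mount points, and the cXtYdZ device names are placeholders rather than my actual config; the parts that mattered were the mirrored pairs, the 8k recordsize on the datafile filesystem, and keeping /backup in a pool of its own.

    # main data pool: mirrored pairs of EMC-exported disks (7 pairs in my test, three shown)
    zpool create datapool \
        mirror c2t0d0 c3t0d0 \
        mirror c2t1d0 c3t1d0 \
        mirror c2t2d0 c3t2d0

    # recordsize only affects files written after it is set, so set it
    # to match the 8k db_block_size before loading any datafiles
    zfs create datapool/oradata
    zfs set recordsize=8k datapool/oradata
    zfs set mountpoint=/u02/oradata datapool/oradata

    # backups go to a separate pool on the low-priority Axiom LUNs
    zpool create backpool mirror c4t0d0 c5t0d0
    zfs create backpool/backup
    zfs set mountpoint=/backup backpool/backup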
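The redo logs stayed off ZFS entirely, on UFS over a RAID 10 EMC LUN mounted with forcedirectio,noatime. Roughly like this, with a placeholder device and mount point:

    # one RAID 10 LUN per redo destination, plain UFS with direct I/O
    newfs /dev/rdsk/c2t16d0s6
    mount -F ufs -o forcedirectio,noatime /dev/dsk/c2t16d0s6 /u03/oraredo1

    # /etc/vfstab entry so the options survive a reboot
    /dev/dsk/c2t16d0s6  /dev/rdsk/c2t16d0s6  /u03/oraredo1  ufs  2  yes  forcedirectio,noatime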
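The two kernel tunables mentioned above go in /etc/system; the values here are examples only. Capping the ARC via zfs_arc_max needs a Solaris 10 update that supports it, otherwise you're stuck with the set_arc.sh / mdb approach I actually used.

    * EMC-recommended queue depth
    set sd:sd_max_throttle = 20

    * cap the ARC (8GB here) so the job can mmap its gigabytes of data
    * without fighting the cache for memory
    set zfs:zfs_arc_max = 0x200000000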
A couple of things that don't relate to the I/O layout but were important for APS:

- Sun systems need faster processors to run these jobs faster. Oracle beat our data-processing time by more than 4x on an 8 x 1GHz system (we had 1.5GHz USIV+), and this bugged the hell out of me until I found out they were running an HP9000 rp4440, which has a combined memory bandwidth of 12.9GB/s no matter how many processors are running. USIV+ maxes out around 2.4GB/s per processor, but aggregate bandwidth scales the more procs you have working. That's all swell for RDBMS loads talking to shared memory, but most of the time in APS is spent in a single-threaded job that loads many gigabytes into memory and then processes the data. For that case, big, unscalable memory bandwidth beat the hell out of scalable procs at higher MHz. Going from 1.3GHz procs to 1.8GHz cut total running time (even with the other improvements) by about 60%.
- MPSS with 4M pages vs. the normal 8k pages made no real difference. While trapstat -T wasn't really showing a high percentage of misses, the assumption was that anything letting the process read data into memory faster would help performance. Maybe if we could actually recompile the binary, but by setting the environment to use the preload library all we got was more 4M-page misses (sketch below).
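For the record, the MPSS run used the preload library rather than a recompile, along these lines (the job name is a stand-in for the real APS binary):

    # ask for 4M pages on heap and stack via the MPSS preload library
    LD_PRELOAD=mpss.so.1 MPSSHEAP=4M MPSSSTACK=4M ./aps_memory_job

    # check which page sizes the process actually received
    pmap -s <pid>

    # watch TLB miss activity while the job runs
    trapstat -T 5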