I thought I'd share some lessons learned testing Oracle APS on Solaris 10 with ZFS as the backend storage. I just finished two months of performance tests on a V490 (32GB, 4 x 1.8GHz dual-core processors, 2 x Sun 2Gb HBAs on separate fabrics), varying how I managed the storage. Storage included EMC CX-600 disk (presented both as LUNs and as exported disks) and Pillar Axiom disk, using ZFS for all filesystems, some filesystems, and no filesystems, versus VxVM and UFS combinations. The specifics of my timing data won't be generally useful (I simply needed to drive down the run time of one large job), but this layout has been generally useful in keeping latency down for my ordinary ERP and other Oracle loads. These suggestions were taken from the performance wiki, from other mailing lists and posts made here, and from my own guesstimates. Rough command sketches for the layout and the tunables I mention follow the list.

- Everything on VxFS was pretty fast out of the box, but I expected that.
- Everything on vanilla UFS was a bit slower in filebench tests, but dragged out my plan during real loads.
- An 8k recordsize for the datafile filesystems (matching db_block_size) is essential. Both filebench and live testing bear this out.
- Separate the redo logs from the data pool. Redo logs blew chunks on every ZFS configuration, driving up my total process time in every case (the job is redo-intensive at important junctures).
- Redo logs on a RAID 10 LUN on EMC mounted with forcedirectio,noatime beat the same LUN under VxFS several times over (I didn't test Quick I/O, which we don't normally use anyway). Slicing and presenting LUNs from the same RAID group was faster than slicing a single LUN from the OS (for multiplexed redo logs: the primary member on one group, the secondary on the other), but it didn't get any faster or seriously drop my latency overhead when I used entirely separate RAID 10 groups. Using EMC LUNs was consistently faster than exporting the disks and building Veritas or DiskSuite volumes.
- Separate /backup onto its own pool. Huge difference during backups. I use low-priority Axiom disk here.
- Exporting disks from EMC and using those to build RAID 10 mirrors in ZFS. This is annoying, since I'd prefer to create the mirrors on EMC to take advantage of the hot spares and the backend processing, but the kernel still panics every time a single device that isn't redundant at the ZFS level backs up and causes a bus reset.
- For my particular test, 7 x RAID 10 (14 x 73GB 15k drives) ended up as fast or faster than the same number of drives split into EMC LUNs with VxFS on them. With 11 mirror pairs (22 drives) and /backup and the redo logs on the main pool, the drives always stay at high latency and performance craters during backups.
- I tried futzing with sd:sd_max_throttle, with values from 20 (low-water mark) to 64 (high-water mark without errors), and my particular process didn't seem to benefit. I left this value at 20, since EMC still recommends it.
- No particular advantage to PowerPath vs. MPxIO other than price.
- The set_arc.sh script to cap the ARC (when it worked, the first couple of times). The program grabs many GB of memory, so fighting the ARC cache for the right to mmap was a huge impediment.
- Pillar was pretty quick when I created multiple LUNs and strung them together as one big stripe, but it wasn't as consistent in I/O usage or overall time as EMC.
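To make the pool layout concrete, here's a minimal sketch of the setup described above, assuming the EMC-exported-disk variant. Pool names, mount points, and the cXtYdZ device names are placeholders rather than my actual config; the parts that mattered were the mirrored pairs, the 8k recordsize on the datafile filesystem, and keeping /backup in a pool of its own.

    # main data pool: mirrored pairs of EMC-exported disks (7 pairs in my test, three shown)
    zpool create datapool \
        mirror c2t0d0 c3t0d0 \
        mirror c2t1d0 c3t1d0 \
        mirror c2t2d0 c3t2d0

    # recordsize only affects files written after it is set, so set it
    # to match the 8k db_block_size before loading any datafiles
    zfs create datapool/oradata
    zfs set recordsize=8k datapool/oradata
    zfs set mountpoint=/u02/oradata datapool/oradata

    # backups go to a separate pool on the low-priority Axiom LUNs
    zpool create backpool mirror c4t0d0 c5t0d0
    zfs create backpool/backup
    zfs set mountpoint=/backup backpool/backup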
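The redo logs stayed off ZFS entirely, on UFS over a RAID 10 EMC LUN mounted with forcedirectio,noatime. Roughly like this, with a placeholder device and mount point:

    # one RAID 10 LUN per redo destination, plain UFS with direct I/O
    newfs /dev/rdsk/c2t16d0s6
    mount -F ufs -o forcedirectio,noatime /dev/dsk/c2t16d0s6 /u03/oraredo1

    # /etc/vfstab entry so the options survive a reboot
    /dev/dsk/c2t16d0s6  /dev/rdsk/c2t16d0s6  /u03/oraredo1  ufs  2  yes  forcedirectio,noatime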
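The two kernel tunables mentioned above go in /etc/system; the values here are examples only. Capping the ARC via zfs_arc_max needs a Solaris 10 update that supports it, otherwise you're stuck with the set_arc.sh / mdb approach I actually used.

    * EMC-recommended queue depth
    set sd:sd_max_throttle = 20

    * cap the ARC (8GB here) so the job can mmap its gigabytes of data
    * without fighting the cache for memory
    set zfs:zfs_arc_max = 0x200000000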
A couple of things that don't relate to the I/O layout but were important for APS:

- Sun systems need faster processors to run these jobs faster. Oracle beat our data-processing time by more than 4x on an 8 x 1GHz system (we had 1.5GHz USIV+), and this bugged the hell out of me until I found out they were running an HP9000 rp4440, which has a combined memory bandwidth of 12.9GB/s no matter how many processors are running. USIV+ maxes out around 2.4GB/s per processor, but aggregate bandwidth scales the more procs you have working. That's all swell for RDBMS loads talking to shared memory, but most of the time in APS is spent in a single-threaded job that loads many gigabytes into memory and then processes the data. For that case, big, unscalable memory bandwidth beat the hell out of scalable procs at higher MHz. Going from 1.3GHz procs to 1.8GHz cut total running time (even with the other improvements) by about 60%.
- MPSS with 4M pages vs. the normal 8k pages made no real difference. While trapstat -T wasn't really showing a high percentage of misses, the assumption was that anything letting the process read data into memory faster would help performance. Maybe if we could actually recompile the binary, but by setting the environment to use the preload library all we got was more 4M-page misses (sketch below).
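For the record, the MPSS run used the preload library rather than a recompile, along these lines (the job name is a stand-in for the real APS binary):

    # ask for 4M pages on heap and stack via the MPSS preload library
    LD_PRELOAD=mpss.so.1 MPSSHEAP=4M MPSSSTACK=4M ./aps_memory_job

    # check which page sizes the process actually received
    pmap -s <pid>

    # watch TLB miss activity while the job runs
    trapstat -T 5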