Hi Jason, It seems to me that a full analysis would need some more detailed information, so to keep the ball rolling I'll respond in general terms.
Jason J. W. Williams wrote:
Hi Richard, Been watching the stats on the array and the cache hits are < 3% on these volumes. We're very write-heavy, and rarely write similar enough data twice. With random-access database data and sequential database log data on the same volume groups, it seems to me this was causing a lot of head repositioning.
In general, writes are buffered. For many database workloads, the sequential log writes won't be write cache hits and will be coalesced. There are several ways you could account for this, but suffice it to say that the read cache hit rate is the more interesting number when looking for performance improvement opportunities. The random reads are often cache misses, and adding prefetch is often a waste of resources -- the nature of the beast. For ZFS, all data writes should be sequential until you get near the capacity of the volume, at which point the search for free blocks may turn up blocks that are randomly dispersed. One way to look at this is that for new, not-yet-filled volumes, ZFS writes sequentially, unlike other file systems. Once the volume fills up, ZFS will write more like other file systems. Hence, your write performance with ZFS may change over time, though this will be somewhat mitigated by the RAID array's write buffer cache.
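If you want to keep an eye on that over time, the stock tools are enough; a quick sketch (the pool name "dbpool" is just a placeholder for whatever you have):

   zpool list                 # watch the CAP column -- allocation behavior changes as the pool fills
   zpool iostat -v dbpool 5   # per-vdev read/write ops and bandwidth at 5-second intervals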
By shutting down the slave database servers we cut the latency tremendously, which would seem to me to indicate a lot of contention. But I'm trying to come up to speed on this, so I may be wrong.
This is likely. Note that RAID controllers are really just servers which speak a block-level protocol to other hosts, and some RAID controllers are underpowered. ZFS on a modern server can generate a significant workload, enough to clobber a RAID array. For example, by default, ZFS will queue up to 35 outstanding I/Os per vdev before blocking. If you have one RAID array connected to 4 hosts, each host having 5 vdevs, then the RAID array would need to be able to handle 700 (35 * 4 * 5) concurrent I/Os. There are RAID arrays, which will remain nameless, that will not handle that workload very well. Under lab conditions you should be able to empirically determine the knee in the response-time curve as you add workload. To compound the problem, fibre channel has pitiful flow control, so it may also be necessary to throttle the outstanding I/Os at the source. I'm not sure what the current thinking is on tuning vq_max_pending (default 35) for ZFS; you might search the archives for it. [the intent is to have no tunables and let the system figure out what to do best]
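If you do decide to throttle at the source, I believe the knob behind vq_max_pending is currently exposed as zfs_vdev_max_pending -- treat that name as an assumption and check the archives before relying on it. A sketch, to cap it at 10 outstanding I/Os per vdev via /etc/system (takes effect at the next reboot):

   set zfs:zfs_vdev_max_pending = 10

or, to experiment on a live system without rebooting:

   echo zfs_vdev_max_pending/W0t10 | mdb -kw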
"iostat -xtcnz 5" showed the latency dropped from 200 to 20 once we cut the replication. Since the masters and slaves were using the same the volume groups and RAID-Z was striping across all of them on both the masters and slaves, I think this was a big problem.
It is hard for me to visualize your setup, but this is a tell-tale sign that you've overrun the RAID box. Changing the volume partitioning will likely help, perhaps tremendously.
 -- richard