Hi Richard,

Thank you for taking so much time on this! The array is a StorageTek
FLX210 so it is a bit underpowered...best we could afford at the time.

In terms of the load on it, we have two servers running Solaris 10.
Each physical server has two containers, each with a MySQL instance in
it. The primary physical server hosts the masters, and the secondary
physical server hosts the slaves. The slaves use MySQL binlog
replication to pick up the INSERTs/UPDATEs from the masters.
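
(For completeness, the replication wiring is just the stock MySQL
binlog setup, roughly like the sketch below; the server-ids, hostname,
credentials, and log coordinates are placeholders rather than our real
values.)

  # my.cnf in a master container
  [mysqld]
  server-id = 1
  log-bin   = mysql-bin

  # my.cnf in the corresponding slave container
  [mysqld]
  server-id = 2

  -- run on the slave to point it at its master and start replicating
  CHANGE MASTER TO MASTER_HOST='master1.example.com', MASTER_USER='repl',
      MASTER_PASSWORD='xxxx', MASTER_LOG_FILE='mysql-bin.000001',
      MASTER_LOG_POS=98;
  START SLAVE;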

Each physical server has 3 vdevs that are RAID-Z'd together into a
single zpool. We then lay out two file systems in the zpool, one per
container.
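
Roughly speaking, the per-server pool setup looks like the sketch
below. The c#t#d# names are made up (they stand in for the three LUNs
described in the next paragraph), and the filesystem names are only
illustrative:

  # one RAID-Z group built from the three LUNs; one pool per physical server
  zpool create dbpool raidz c2t0d0 c2t1d0 c2t2d0

  # one filesystem per container
  zfs create dbpool/zone1-mysql
  zfs create dbpool/zone2-mysql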

The vdevs in the zpools are actually LUNs, each one in a separate
volume group on the FLX210. So we have three volume groups on the RAID
array with two LUNs per VG. That's six LUNs in total, and each
physical server has a RAID-Z zpool built out of three of them (striped
across the volume groups). Each VG has 4 disks in it, which spreads
the load across all 12 drives we're using. Unfortunately, since both
servers are striping across the same 3 volume groups, I think that
caused our performance issue. Also, the volume groups were RAID-5, so
we had RAID-Z on top of RAID-5. That meant we could lose 3 disks and
still be OK in a worst-case scenario, but it's killing the
performance. The FLX210 doesn't have RAID ASICs, as I recently
learned. :-(

As a stopgap, we stopped the replication to the slaves and converted
the RAID array's volume groups to RAID-1. For now, that seems to have
reduced the issue tremendously.

Given the limited number of disks we have to work with, the new layout
we've decided on is:

5 volume groups:

*VG 1 (2 disk RAID-1): physical_server1_DB
***LUN 0: SQL Master 1 DB
***LUN 1: SQL Master 2 DB

*VG 2 (2 disk RAID-1): physical_server1_logs
***LUN 0: SQL Master 1 Logs
***LUN 1: SQL Master 2 Logs

*VG 3 (2 disk RAID-1): physical_server2_DB
***LUN 0: SQL Slave 1 DB
***LUN 1: SQL Slave 2 DB

*VG 4 (2 disk RAID-1): physical_server2_logs
***LUN 0: SQL Slave 1 Logs
***LUN 1: SQL Slave 2 Logs

*VG 5 (2 disk RAID-1): Windows Server LUNs
***LUN 0: Exchange Server LUN
***LUN 1: Maintenance LUN

Each SQL LUN will be a vdev in its own striped zpool, and we've got
two disks in reserve for additional capacity (not counting the 2 array
hot-spares). The main concern at the moment is that with the current
layout we waste very little space filesystem-wise (though we do lose a
lot of raw capacity to the RAID-Z-on-RAID-5 layering). The new layout,
however, would give the logs a ton of space of which they might use
10% (but we don't want to consolidate all the logs on the same VG lest
we get the contention problems back). It's a tough trade-off to make:
space for speed.
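
Concretely, on physical_server1 the plan amounts to something like the
sketch below (the pool names and c#t#d# devices are placeholders for
the LUNs in VG 1 and VG 2 above):

  # each LUN becomes the sole vdev of its own pool
  zpool create master1-db   c3t0d0     # VG 1, LUN 0: SQL Master 1 DB
  zpool create master2-db   c3t0d1     # VG 1, LUN 1: SQL Master 2 DB
  zpool create master1-logs c3t1d0     # VG 2, LUN 0: SQL Master 1 Logs
  zpool create master2-logs c3t1d1     # VG 2, LUN 1: SQL Master 2 Logs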

I'm somewhat of a mind to have all the logs use a single VG and see
how the performance fares. Add a second VG only if necessary.

Currently, we see about 40-70 IOPS per vdev, which averages out to
120-200 IOPS per VG with peaks around 300. About 60-70% of those IOPS
are writes as well.
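
(Host-side, the same numbers can be watched with the usual tools,
along the lines of the commands below; master1-db is one of the
placeholder pool names from the sketch above.)

  # per-LUN read/write ops and service times, 5-second samples
  iostat -xtcnz 5

  # per-pool and per-vdev operations and bandwidth
  zpool iostat -v master1-db 5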

One thing all those VGs in the new layout would let us do is figure
out how many of those IOPS are random, and how many are sequential log
writes.
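
A rough host-side way to poke at that question in the meantime is
DTrace's io provider: the DTraceToolkit's iopattern and seeksize.d
scripts report the random-vs-sequential mix directly, and even the
simple one-liner below (a per-device histogram of I/O sizes) gives a
feel for how much of each LUN's traffic is small page-sized DB I/O:

  # distribution of I/O sizes per device; Ctrl-C to print the histograms
  dtrace -n 'io:::start { @[args[1]->dev_statname] = quantize(args[0]->b_bcount); }'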

As always, advice/thoughts are appreciated.

Best Regards,
Jason

On 11/30/06, Richard Elling <[EMAIL PROTECTED]> wrote:
Hi Jason,
It seems to me that there is some detailed information which would
be needed for a full analysis.  So, to keep the ball rolling, I'll
respond generally.

Jason J. W. Williams wrote:
> Hi Richard,
>
> Been watching the stats on the array and the cache hits are < 3% on
> these volumes. We're very write-heavy, and rarely write similar enough
> data twice. With random-oriented database data and
> sequential-oriented database log data on the same volume groups, it
> seems to me this was causing a lot of head repositioning.

In general, writes are buffered.  For many database workloads, the
sequential log writes won't be write cache hits and will be coalesced.
There are several ways you could account for this, but suffice it to say
that the read cache hit rate is more interesting for performance
improvement opportunities.  The random reads are often cache misses,
and adding prefetch is often a waste of resources -- the nature of the
beast.

For ZFS, all data writes should be sequential until you get near the
capacity of the volume, at which point there will be a search for free
blocks which may be randomly dispersed.  One way to look at this is that
for new and not-yet-filled volumes, ZFS will write sequentially,
unlike other file systems.  Once the volume fills up, ZFS will write
more like other file systems.  Hence, your write performance with
ZFS may change over time, though this will be somewhat mitigated by
the RAID array's write buffer cache.
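
An easy way to keep an eye on that over time is simply pool occupancy,
for example:

  # the CAP column shows how full each pool is; allocation (and hence
  # write behavior) gets less sequential as it climbs
  zpool list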

> By shutting down the slave database servers we cut the latency
> tremendously, which would seem to me to indicate a lot of contention.
> But I'm trying to come up to speed on this, so I may be wrong.

This is likely.

Note that RAID controllers are really just servers which speak a
block-level protocol to other hosts.  Some RAID controllers are
underpowered.

ZFS on a modern server can create a significant workload.  This
can also clobber a RAID array.  For example, by default, ZFS will
queue up to 35 concurrent I/Os per vdev before blocking.  If you have one
RAID array which is connected to 4 hosts, each host having 5 vdevs,
then the RAID array would need to be able to handle 700 (35 * 4 * 5)
outstanding I/Os.  There are RAID arrays, which will remain nameless,
that will not handle that workload very well.  Under lab conditions
you should be able to empirically determine the knee in the response
time curve as you add workload.

To compound the problem, fibre channel has pitiful flow control.
Thus it may also be necessary to throttle the concurrent I/Os at the
source.  I'm not sure what the current thinking is on tuning
vq_max_pending (default 35) for ZFS; you might search for it in the
archives.  [the intent is to have no tunables, and let the system
figure out what to do best]
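
For what it's worth, the knob behind vq_max_pending is usually exposed
as the global zfs_vdev_max_pending; assuming your build has it (older
bits may not -- check the archives first), experiments tend to look
something like:

  # temporarily lower the per-vdev queue depth on a live system
  echo zfs_vdev_max_pending/W0t10 | mdb -kw

  # or persistently, by adding this line to /etc/system and rebooting
  set zfs:zfs_vdev_max_pending = 10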

> "iostat -xtcnz 5" showed the latency dropped from 200 to 20 once we
> cut the replication. Since the masters and slaves were using the same
> the volume groups and RAID-Z was striping across all of them on both
> the masters and slaves, I think this was a big problem.

It is hard for me to visualize your setup, but this is a tell-tale
sign that you've overrun the RAID box.  Changing the volume partitioning
will likely help, perhaps tremendously.
  -- richard
