more below...

On Nov 25, 2009, at 5:54 AM, Paul Kraus wrote:

Richard,
First, thank you for the detailed reply ... (comments in line below)

On Tue, Nov 24, 2009 at 6:31 PM, Richard Elling
<richard.ell...@gmail.com> wrote:
more below...

On Nov 24, 2009, at 9:29 AM, Paul Kraus wrote:

On Tue, Nov 24, 2009 at 11:03 AM, Richard Elling
<richard.ell...@gmail.com> wrote:

Try disabling prefetch.

Just tried it... no change in random read (still 17-18 MB/sec for a
single thread), but sequential read performance dropped from about 200
MB/sec to 100 MB/sec (as expected). The test case is a 3 GB file
accessed in 256 KB records. The ARC is capped at 1 GB for testing.
arcstat.pl shows that the vast majority (>95%) of reads miss
the cache.
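
For reference, a sketch of how those two settings are typically applied on
Solaris 10, assuming the stock tunable names (0x40000000 is 1 GB):

    * /etc/system -- disable ZFS file-level prefetch
    set zfs:zfs_prefetch_disable = 1
    * cap the ARC at 1 GB for testing
    set zfs:zfs_arc_max = 0x40000000

The prefetch toggle can also be flipped on a live system with
echo zfs_prefetch_disable/W0t1 | mdb -kw.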

hmmm... more testing needed. The question is whether the low
I/O rate is caused by zfs itself or by the application. Disabling prefetch
exposes the application, because zfs is no longer creating additional,
and perhaps unnecessary, read I/O.

The values reported by iozone are in pretty close agreement with what
we are seeing with iostat during the test runs. Compression is off on
zfs (the iozone test data compresses very well and yields bogus
results). I am looking for a good alternative to iozone for random
testing. I did put together a crude script to spawn many dd processes
accessing the block device itself, each with a different seek offset across
the range of the disk, and saw results much greater than iozone's
single-threaded random performance.
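
A crude script of that sort might look like the following (a sketch only;
the device path, offsets, and reader count are placeholders):

    #!/bin/ksh
    # spawn N dd readers against the raw device, each at a different offset
    DEV=/dev/rdsk/c2t0d0s0     # placeholder device
    BS=256k
    COUNT=4096                 # 1 GB per reader at 256 KB records
    N=16
    i=0
    while [ $i -lt $N ]; do
        # iseek is in units of BS, so the readers are spread across the LUN
        dd if=$DEV of=/dev/null bs=$BS iseek=$((i * 100000)) count=$COUNT &
        i=$((i + 1))
    done
    wait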

filebench is usually bundled in /usr/benchmarks or as a pkg.
vdbench is easy to use and very portable, www.vdbench.org
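
For the 256 KB random-read case, a minimal vdbench parameter file might look
something like this (the LUN path, thread counts, and run length are
placeholders):

    sd=sd1,lun=/dev/rdsk/c2t0d0s0,threads=32
    wd=rand_read,sd=sd1,xfersize=256k,rdpct=100,seekpct=100
    rd=run1,wd=rand_read,iorate=max,elapsed=60,interval=5,forthreads=(1,4,16,32)

The forthreads loop makes it easy to see how IOPS scale as the client offers
more concurrency.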

In your data, the sequential write, random write, and sequential read
all drive actv to 35; for the read, that is because prefetching is
enabled. We expect the writes to drive actv to 35 with a sustained
write workload of any flavor.

Understood. I tried tuning the queue size to 50 and observed that
actv went to 50 (with very little difference in performance), so I
returned it to the default of 35.
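
The queue depth in question is the per-vdev I/O limit (zfs_vdev_max_pending,
default 35 on this vintage of ZFS); a sketch of how it can be changed and
restored on a live system:

    # raise the per-vdev queue depth from 35 to 50
    echo zfs_vdev_max_pending/W0t50 | mdb -kw
    # and put it back to the default
    echo zfs_vdev_max_pending/W0t35 | mdb -kw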

Yep, bottleneck is on the back end (physical HDDs). For arrays with lots
of HDDs, this queue can be deeper, but the 3500 series is way too
small to see this.  If SSDs are used on the back end, then you can
revisit this.

From the data, it does look like the random read tests are converging
on the media capabilities of the disks in the array.  For the array you
can see the read-modify-write penalty of RAID-5 as well as the
caching and prefetching of reads.

Note: the physical I/Os are 128 KB, regardless of the iozone size
setting.  This is expected, since 128 KB is the default recordsize
limit for ZFS.
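
The limit is a per-dataset property; for example (pool/dataset name is a
placeholder):

    # show the current recordsize limit
    zfs get recordsize tank/fs
    # it can be lowered for small-record workloads such as databases
    zfs set recordsize=32K tank/fs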

The random read (with cache misses)
will stall the application, so it takes a lot of threads (>>16?) to keep
35 concurrent I/Os in the pipeline without prefetching.  The ZFS
prefetching algorithm is "intelligent" so it actually complicates the
interpretation of the data.

What bothers me is that iostat shows the 'disk' device as
not being saturated during the random read test. I'll post the iostat
output that I captured yesterday to http://www.ilk.org/~ppk/Geek/ You
can clearly see the various test phases (sequential write, rewrite,
sequential read, reread, random read, then random write).

Is this a single thread?  Usually this means that you aren't creating
enough load. ZFS won't be prefetching (as much) for a random
read workload, so iostat will expose client bottlenecks.

You're peaking at 658 256KB random IOPS for the 3511, or ~66
IOPS per drive.  Since ZFS will max out at 128KB per I/O, the disks
see something more than 66 IOPS each.  The IOPS data from
iostat would be a better metric to observe than bandwidth.  These
drives are good for about 80 random IOPS each, so you may be
close to disk saturation.  The iostat data for IOPS and svc_t will
confirm.
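
For example, something like

    iostat -xnz 5

shows per-device reads and writes per second (r/s, w/s), the queue depth
(actv), and service times (wsvc_t, asvc_t) at 5-second intervals, with idle
devices suppressed.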

But ... if I am saturating the 3511 with one thread, then why do I get
many times that performance with multiple threads?

The T2000 data (sheet 3) shows pretty consistently around
90 256KB IOPS per drive. As in the 3511 case, this is perhaps 20%
less than I would expect, possibly due to the measurement.

I ran the T2000 test to see if 10U8 behaved better and to make sure I
wasn't seeing an oddity of the 480 / 3511 case. I wanted to see if the
random read behavior was similar, and it was (in relative terms).

Also, the 3511 RAID-5 configuration will perform random reads at
around 1/2 IOPS capacity if the partition offset is 34.  This was the
default long ago.  The new default is 256.

Our 3511s have been running 421F (the latest) for a long time :-) We are
religious about keeping all the 3511 firmware current and matched.

The reason is that with
a 34-block offset, a larger I/O is almost guaranteed to straddle
two disks. You won't notice this as easily with a single thread,
but it will be measurable with more threads. Double-check the
offset with prtvtoc or format.

How do I check the offset ... format -> verify output from one of the partitions is below:

format> ver

Volume name = <        >
ascii name  = <SUN-StorEdge 3511-421F-517.23GB>
bytes/sector    =  512
sectors = 1084710911
accessible sectors = 1084710878
Part      Tag    Flag     First Sector        Size         Last Sector
  0        usr    wm               256     517.22GB         1084694494

This is it: First Sector = 256.  Good.

 1 unassigned    wm                 0            0                0
 2 unassigned    wm                 0            0                0
 3 unassigned    wm                 0            0                0
 4 unassigned    wm                 0            0                0
 5 unassigned    wm                 0            0                0
 6 unassigned    wm                 0            0                0
  8   reserved    wm        1084694495       8.00MB         1084710878

format>
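
For completeness, the same check with prtvtoc would be along these lines
(device name is a placeholder); the "First Sector" value for the data
partition is the number that matters:

    prtvtoc /dev/rdsk/c2t0d0s2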

Writes are a completely different matter.  ZFS has a tendency to
turn random writes into sequential writes, so it is pretty much
useless to look at random write data. The sequential writes
should easily blow through the cache on the 3511.

I am seeing cache utilization of 25-30% during write tests, with
occasional peaks close to 50%, which is expected since I am testing
against one partition on one logical drive.

 Squinting at the numbers, I would expect the array can do around
70 MB/s of writes, or 25 saturated 256KB write IOPS.

iostat and the 3511 transfer-rate monitor are showing peaks of 150-180
MB/sec with sustained throughput of 100 MB/sec.

[Richard tries to remember if the V480 uses schizo?]
[searching...]
[found it]
Ok, a quick browse shows that the V480 uses two schizo ASICs
as the UPA to PCI bridges. Don't expect more than 200 MB/s from
a schizo.
http://www.sun.com/processors/manuals/External_Schizo_PRM.pdf

 By contrast, the
T2000 JBOD data shows consistent IOPS at the disk level
and exposes the track cache effect on the sequential read test.

Yup, it is clear that we are easily hitting the read I/O limits of the
drives in the T2000.

Did I mention that I'm a member of BAARF?  www.baarf.com :-)

Not yet :-)

Hint: for performance work with HDDs, pay close attention to
IOPS, then convert to bandwidth for the PHB.
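
(For example, using the numbers above: 658 random IOPS x 256 KB per I/O is
roughly 164 MB/s.)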

PHB ???

Not a fan of Dilbert? :-)

I do look at IOPS, but what struck me as odd were the disparate results.

<snip>

b119 has improved stat() performance, which should noticeably improve
such backups. But eventually you may need to move to a multi-stage
backup, depending on your business requirements.

Due to contract issues (I am consulting at a government agency), we
cannot yet run OpenSolaris in production.

Look for CR6775100 to be rolled into a Solaris 10 patch.  It might take
another 6 months or so, if it gets backported.

On our previous server for this application (Apple G5) we had 4 TB of
data and about 50 million files (under HFS+) and a full backup took 3
WEEKS. We went the route of explicitly specifying each directory in
the NetBackup config and got _some_ reliability. Today we have about
22 TB in over 200 ZFS datasets (not evenly distributed,
unfortunately), the largest of which is about 3.5 TB and 30 million
files.

Yep, this is becoming more common as people build larger file systems.
I briefly describe the multistage backup here:
http://richardelling.blogspot.com/2009/08/backups-for-file-systems-with-millions.html
Of course, there are quite a few design details that will vary based on
business requirements...

BTW, our overall configuration is based on h/w we bought years ago and
are having to adapt as best we can. We are pushing to replace the
SE-3511 arrays with J4400 JBODs. Our current config has 11-disk RAID-5
sets and 1 hot spare per 3511 tray; we carve up 'standard' 512 GB
partitions, which we mirror at the zpool layer across 3511 arrays. We
just add additional mirror pairs as the data in each department grows,
keeping the two sides of each mirror on different arrays :-)
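
Growing a department's pool by another mirror pair looks something like this
(pool and device names are placeholders; the two devices come from different
3511 arrays):

    zpool add dept_pool mirror c4t0d0 c5t0d0
    zpool status dept_pool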

In general, RAID-5 (or raidz) performs poorly for random reads.  It
gets worse when the reads are small.
 -- richard

