more below...

On Nov 24, 2009, at 9:29 AM, Paul Kraus wrote:
On Tue, Nov 24, 2009 at 11:03 AM, Richard Elling
<richard.ell...@gmail.com> wrote:

Try disabling prefetch.
Just tried it... no change in random read (still 17-18 MB/sec for a
single thread), but sequential read performance dropped from about
200 MB/sec to 100 MB/sec (as expected). Test case is a 3 GB file
accessed in 256 KB records. ARC is set to a max of 1 GB for testing.
arcstat.pl shows that the vast majority (>95%) of reads are missing
the cache.
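
[For reference, the test case described above corresponds roughly to
an iozone invocation like the one below; the file path is a
placeholder and the exact flags used may have differed:

  # 3 GB file, 256 KB records: write, sequential read, random read/write
  iozone -s 3g -r 256k -i 0 -i 1 -i 2 -f /testpool/iozone.tmp

The -i 0 (write) pass creates the file that the read tests then use.]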
hmmm... more testing needed. The question is whether the low
I/O rate is caused by zfs itself or by the application. Disabling
prefetch exposes the application's own request rate, because zfs
is no longer creating additional, and perhaps unnecessary, read I/Os.
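
[A sketch of how prefetch can be toggled on OpenSolaris, assuming a
build that exposes the zfs_prefetch_disable tunable:

  # on a live system (reverts at reboot):
  echo zfs_prefetch_disable/W0t1 | mdb -kw

  # persistent: add to /etc/system (takes effect at next boot):
  set zfs:zfs_prefetch_disable = 1

Write 0 instead of 1 to re-enable it.]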

Your data showing the sequential write, random write, and
sequential read driving actv to 35 does so because prefetching is
enabled for the read.  We expect the writes to drive actv to 35 under
a sustained write workload of any flavor. A random read that misses
the cache stalls the application, so without prefetching it takes a
lot of threads (>>16?) to keep 35 concurrent I/Os in the pipeline:
each synchronous reader has at most one I/O outstanding at a time.
The ZFS prefetching algorithm is "intelligent," so it actually
complicates the interpretation of the data.

You're peaking at 658 256KB random IOPS for the 3511, or ~66
IOPS per drive.  Since ZFS issues at most 128KB per I/O, the disks
see something more than 66 IOPS each.  The IOPS data from
iostat is a better metric to observe than bandwidth.  These
drives are good for about 80 random IOPS each, so you may be
close to disk saturation.  The iostat data for IOPS and svc_t will
confirm.
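
[Something like this shows it, sampling once per second and skipping
idle devices:

  iostat -xnz 1

r/s + w/s per device is the IOPS, actv is the queue depth, and
asvc_t -- svc_t in the older -x output format -- is the active
service time.]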

The T2000 data (sheet 3) shows pretty consistently around
90 256KB IOPS per drive. Like the 3511 case, this is perhaps 20%
less than I would expect, possibly an artifact of the measurement.

Also, the 3511 RAID-5 configuration will perform random reads at
around half of its IOPS capacity if the partition offset is 34.  That
was the default long ago; the new default is 256. The reason is that
with a 34-block offset, you are almost guaranteed that a larger I/O
will straddle 2 disks.  You won't notice this as easily with a single
thread, but it will be measurable with more threads. Double-check the
offset with prtvtoc or format.
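
[For example (the device name is a placeholder):

  prtvtoc /dev/rdsk/c2t0d0s2

The "First Sector" column shows each partition's offset; 34 is the
old EFI-label default to watch out for.]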

Writes are a completely different matter.  ZFS has a tendency to
turn random writes into sequential writes, so it is pretty much
useless to look at random write data. The sequential writes
should easily blow through the cache on the 3511.  Squinting
my eyes, I would expect the array can do around 70 MB/s of
sustained writes, or about 25 saturated 256KB IOPS per drive.  By
contrast, the T2000 JBOD data shows consistent IOPS at the disk
level and exposes the track cache effect on the sequential read test.

Did I mention that I'm a member of BAARF?  www.baarf.com :-)

Hint: for performance work with HDDs, pay close attention to
IOPS, then convert to bandwidth for the PHB.
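
[The conversion is just bandwidth = IOPS x I/O size. With the numbers
above:

  658 IOPS x 256 KB ~= 165 MB/s
  70 MB/s / 256 KB  ~= 280 IOPS]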

The reason I don't think this is hitting our end users is that the
ARC cache hit ratio (reported by arc_summary.pl) is 95% on the
production system (I am working on our test system and am the only
one using it right now, so all the I/O load is from iozone).

I think my next step (beyond more poking with DTrace) is to try a
backup and see what I get for ARC hit ratio ... I expect it to be low,
but I may be surprised (then I have to figure out why backups are as
slow as they are). We are using NetBackup, and it takes about 3 days
to do a FULL backup of a 3.3 TB zfs with about 30 million files.
Differential incrementals take 16-22 hours (and almost no data
changes). The production server is an M4000 with 4 dual-core CPUs, 16
GB of memory, and about 25 TB of data overall. A big Samba file
server.
b119 has improved stat() performance, which should noticeably
improve such backups.  But eventually you may need to move
to a multi-stage backup, depending on your business requirements.
 -- richard
