comments below...

On Mar 21, 2012, at 10:40 AM, Marion Hakanson wrote:
> p...@kraus-haus.org said:
>> Without knowing the I/O pattern, saying 500 MB/sec. is meaningless.
>> Achieving 500 MB/sec. with 8KB files and lots of random accesses is
>> really hard, even with 20 HDDs. Achieving 500 MB/sec. of sequential
>> streaming of 100MB+ files is much easier.
>> . . .
>> For ZFS, performance is proportional to the number of vdevs NOT the
>> number of drives or the number of drives per vdev. See
>> https://docs.google.com/spreadsheet/ccc?key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc
>> for some testing I did a while back. I did not test sequential read
>> as that is not part of our workload.

Actually, few people have sequential workloads. Many think they do, but
I say prove it with iopattern.

>> . . .
>> I understand why the read performance scales with the number of
>> vdevs, but I have never really understood _why_ it does not also
>> scale with the number of drives in each vdev. When I did my testing
>> with 40 drives, I expected similar READ performance regardless of the
>> layout, but that was NOT the case.
>
> In your first paragraph you make the important point that
> "performance" is too ambiguous in this discussion. Yet in the 2nd &
> 3rd paragraphs above, you go back to using "performance" in its
> ambiguous form. I assume that by "performance" you are mostly
> focussing on random-read performance....
>
> My experience is that sequential read performance _does_ scale with
> the number of drives in each vdev. Both sequential and random write
> performance also scales in this manner (note that ZFS tends to save up
> small, random writes and flush them out in a sequential batch).

Yes. I wrote a small, random read performance model that considers the
various caches. It is described here:
http://info.nexenta.com/rs/nexenta/images/tech_brief_nexenta_performance.pdf
The spreadsheet shown in figure 3 is available for the asking (and it
works on your iPhone or iPad :-)

> Small, random read performance does not scale with the number of
> drives in each raidz[123] vdev because of the dynamic striping. In
> order to read a single logical block, ZFS has to read all the segments
> of that logical block, which have been spread out across multiple
> drives, in order to validate the checksum before returning that
> logical block to the application. This is why a single vdev's
> random-read performance is equivalent to the random-read performance
> of a single drive.

It is not as bad as that. The actual worst case number for a HDD with
zfs_vdev_max_pending of one is:

    average IOPS * ((D+P) / D)

where,
    D = number of data disks
    P = number of parity disks (1 for raidz, 2 for raidz2, 3 for raidz3)
    total disks per set = D + P

We did many studies that verified this. More recent studies show
zfs_vdev_max_pending has a huge impact on the average latency of HDDs,
which I also described in my talk at OpenStorage Summit last fall.
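To make that arithmetic concrete, here is a rough back-of-envelope
sketch of the worst-case expression above. The ~150 IOPS per 7200rpm
HDD and the example layouts are my own assumptions for illustration,
not measurements:

#!/usr/bin/env python3
# Back-of-envelope for the worst-case small, random read numbers above.
# Assumes zfs_vdev_max_pending = 1 and no cache hits.

DISK_IOPS = 150.0          # assumed average random IOPS of one HDD

def raidz_vdev_read_iops(d, p, disk_iops=DISK_IOPS):
    # worst-case small, random read IOPS of one raidz vdev:
    #   average IOPS * ((D + P) / D)
    return disk_iops * (d + p) / d

def pool_read_iops(n_vdevs, d, p, disk_iops=DISK_IOPS):
    # random reads scale with the number of top-level vdevs
    return n_vdevs * raidz_vdev_read_iops(d, p, disk_iops)

if __name__ == "__main__":
    # three ways to lay out the same 40 drives
    layouts = [("4 x raidz2 (8+2)",  4,  8, 2),
               ("2 x raidz2 (18+2)", 2, 18, 2),
               ("1 x raidz2 (38+2)", 1, 38, 2)]
    for name, n, d, p in layouts:
        print("%-20s ~%4.0f random-read IOPS"
              % (name, pool_read_iops(n, d, p)))

Same 40 spindles, roughly a 5x spread in small, random read throughput,
which is why the number of top-level vdevs matters far more than the
raw drive count for that workload.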
> p...@kraus-haus.org said:
>> The recommendation is to not go over 8 or so drives per vdev, but
>> that is a performance issue NOT a reliability one. I have also not
>> been able to duplicate others' observations that 2^N drives per vdev
>> is a magic number (4, 8, 16, etc). As you can see from the above,
>> even a 40 drive vdev works and is reliable, just (relatively) slow :-)

Paul, I have a considerable amount of data that refutes your findings.
Can we agree that YMMV, and varies dramatically depending on your
workload?

> Again, the "performance issue" you describe above is for the
> random-read case, not sequential. If you rarely experience
> small-random-read workloads, then raidz* will perform just fine. We
> often see 2000 MBytes/sec sequential read (and write) performance on a
> raidz3 pool consisting of three 12-disk vdevs (using 2TB drives).

Yes, this is relatively easy to see. I've seen 6 GBytes/sec for large
configs, but that begins to push the system boundaries in many ways.

> However, when a disk fails and must be resilvered, that's when you
> will run into the slow performance of the small, random read workload.
> This is why I use raidz2 or raidz3 on vdevs consisting of more than
> 6-7 drives, especially of the 1TB+ size. That way if it takes 200
> hours to resilver, you've still got a lot of redundancy in place.
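For anyone who wants to sanity-check those figures, here is a rough
sketch. The per-disk streaming rate, pool fill level, average block
sizes, and rebuild IOPS are assumptions I picked for illustration; only
the three 12-disk raidz3 vdevs and the 2TB drives come from Marion's
description:

#!/usr/bin/env python3
# Two quick sanity checks on the figures above.

TB = 1e12

def streaming_mb_s(n_vdevs, data_disks_per_vdev, per_disk_mb_s=80.0):
    # sequential bandwidth scales with the number of data disks, since
    # a full-stripe read or write touches all of them
    return n_vdevs * data_disks_per_vdev * per_disk_mb_s

def resilver_hours(allocated_bytes, avg_block_bytes, rebuild_iops):
    # if resilver is bounded by small, random reads (fragmented or busy
    # pool), time is roughly blocks-to-rebuild / blocks-repaired-per-sec
    blocks = allocated_bytes / avg_block_bytes
    return blocks / rebuild_iops / 3600.0

if __name__ == "__main__":
    # 3 vdevs of 12-disk raidz3 -> 9 data disks per vdev
    print("streaming estimate: ~%.0f MBytes/sec" % streaming_mb_s(3, 9))

    # resilvering a mostly full 2TB drive, bounded by random I/O
    for blk_kb in (8, 32, 128):
        for iops in (100, 300):
            h = resilver_hours(1.5 * TB, blk_kb * 1024, iops)
            print("avg block %3d KB at %3d IOPS -> ~%4.0f hours"
                  % (blk_kb, iops, h))

The streaming estimate lands near the 2000 MBytes/sec Marion reports,
while the resilver estimates show how the rebuild stretches into
hundreds of hours once the average block size is small -- hence raidz2
or raidz3 on wide vdevs.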
> Regards,
> Marion

-- 
DTrace Conference, April 3, 2012,
http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422