Ralf Bertling wrote:
> Hi list,
> as this matter pops up every now and then in posts on this list, I just 
> want to clarify that the real performance of RaidZ (in its current 
> implementation) is NOT something that follows from raidz-style 
> space-efficient redundancy or the copy-on-write design used in ZFS.
>
> In an M-way mirrored setup of N disks you get the write performance of 
> the worst disk and a read performance that is the sum of all disks 
> (for streaming and random workloads, while latency is not improved). 
> Apart from the write performance, you get very bad disk utilization 
> from that scenario.

I beg to differ with "very bad disk utilization."  IMHO you get perfect
disk utilization for M-way redundancy :-)
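To put rough numbers on what each of us means by "utilization" here,
a quick back-of-envelope sketch (the disk counts and sizes below are
made up purely for illustration):

# Back-of-envelope usable-capacity comparison; all numbers illustrative.
disks = 8            # disks in the vdev (assumed)
disk_tb = 1.0        # per-disk capacity in TB (assumed)

mirror_usable = disks / 2 * disk_tb      # 2-way mirror: half the raw space
raidz1_usable = (disks - 1) * disk_tb    # raidz1: one disk's worth of parity

print("raw:          %.1f TB" % (disks * disk_tb))
print("2-way mirror: %.1f TB usable" % mirror_usable)
print("raidz1:       %.1f TB usable" % raidz1_usable)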

> In Raid-Z currently we have to distinguish random reads from streaming 
> reads:
> - Write performance (with COW) is (N-M) * the worst single-disk write 
> performance, since all writes are streaming writes by design of ZFS 
> (which is N-M-1 times faster than mirrored)
> - Streaming read performance is N * the worst read performance of a single 
> disk (which is identical to mirrored if all disks have the same speed)
> - The problem with the current implementation is that all N-M data disks 
> in a vdev take part in reading a single byte from it, which in turn 
> limits the read to the performance of the slowest of the N-M disks in question.

You will not be able to predict real-world write or sequential
read performance with this simple analysis because there are
many caches involved.  The caching effects will dominate for
many cases.  ZFS actually works well with write caches, so
it will be doubly difficult to predict write performance.

You can predict small, random read workload performance,
though, because you can largely discount the caching effects
for most scenarios, especially JBODs.
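If you want a rough feel for that case, a minimal model (ignoring all
caching, per the caveat above; the per-disk IOPS figure and vdev
geometry are made-up illustration values) looks like this:

# Rough small-random-read model; ignores caches entirely (see caveat above).
# Per-disk IOPS and vdev geometry are assumed, illustrative values.
disk_iops = 150        # small random reads/s for one disk (assumed)
n, parity = 8, 1       # 8-disk raidz1 vs. 8 disks in 2-way mirror pairs

# Mirror pool: any disk can serve an independent small read.
mirror_pool_iops = n * disk_iops

# Current raidz: each logical read touches all data disks (n - parity of
# them) to return and checksum the whole block, so the vdev delivers
# roughly one disk's worth of random-read IOPS no matter how wide it is.
data_disks = n - parity
raidz_vdev_iops = disk_iops

print("mirror pool, %d disks:       ~%d random-read IOPS" % (n, mirror_pool_iops))
print("raidz1 vdev, %d data disks:  ~%d random-read IOPS" % (data_disks, raidz_vdev_iops))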

>
> Now let's see if this really has to be this way (this implies no, 
> doesn't it ;-)
> When reading small blocks of data (as opposed to the streams discussed 
> earlier) the requested data resides on a single disk, and thus reading 
> it does not require sending read commands to all disks in the vdev. 
> Without detailed knowledge of the ZFS code, I suspect the problem is 
> that the logical block size of any ZFS operation always uses the full 
> stripe. If true, I think this is a design error.

No, the reason is that the block is checksummed and we check
for errors upon read by verifying the checksum.  If you search
the zfs-discuss archives you will find this topic arises every 6
months or so.  Here is a more interesting thread on the subject,
dated November 2006:
http://mail.opensolaris.org/pipermail/zfs-discuss/2006-November/035711.html

You will also note that for fixed-record-length workloads, we
tend to recommend that the workload's block size be matched
to the ZFS recordsize. This will improve read efficiency, in general.
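A crude way to see why the mismatch hurts small reads (the sizes are
made up, and the dataset name in the comment is hypothetical):

# Crude read-amplification illustration for a fixed-record workload.
# Sizes are illustrative; real behaviour depends heavily on caching.
app_io_kb = 8          # application's fixed record size (assumed)
recordsize_kb = 128    # ZFS recordsize left at the default

# The full record is read and checksummed even for a small request.
print("default recordsize: %d KB read per %d KB request (~%dx amplification)"
      % (recordsize_kb, app_io_kb, recordsize_kb // app_io_kb))

# Matching the recordsize to the application, e.g.
#   zfs set recordsize=8K tank/db      (dataset name hypothetical)
# brings this back to roughly one record read per request.
print("matched recordsize: ~1x")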

> Without that, random reads to a raid-z are almost as fast as reads to 
> mirrored data. 
> The theoretical disadvantages come from disks that have different 
> speeds (probably insignificant in any real-life scenario) and the 
> statistical probability that by chance a few particular random reads 
> do in fact have to access the same disk drive to be fulfilled. (In a 
> mirrored setup, ZFS can choose from all idle devices, whereas in 
> RAID-Z it has to wait for the disk that holds the data to finish 
> processing its current requests.)
> Looking more closely, this effect mostly affects latency (not 
> throughput), as incoming random read requests should be distributed 
> across all devices ever more evenly as the queue of requests gets 
> longer (this would however require ZFS to reorder requests for 
> maximum performance).

ZFS does re-order I/O.  Array controllers re-order the re-ordered
I/O. Disks then re-order I/O, just to make sure it was re-ordered
again. So it is also difficult to develop meaningful models of disk
performance in these complex systems.

>
> Since this seems to be a real issue for many ZFS users, it would be 
> nice if someone who has more time than I do to look into the code 
> could comment on the amount of work required to boost RaidZ read 
> performance.

Periodically, someone offers to do this... but I haven't seen an
implementation.

>
> Doing so would significantly even out the tradeoff between read/write 
> performance and disk utilization.
> Obviously, if disk space (and the resulting electricity costs) does not 
> matter compared to getting maximum read performance, you will always 
> be best off with 3-way or even wider mirrors and a very large number of 
> vdevs in your pool.

Space, performance, reliability: pick two.

<sidebar>
The ZFS checksum has proven to be very effective at
identifying data corruption in systems.  In a traditional
RAID-5 implementation, like SVM, the data is assumed
to be correct if the read operation returned without an
error. If you try to make SVM more reliable by adding a
checksum, then you will end up at approximately the
same place ZFS is: by distrusting the hardware you take
a performance penalty, but improve your data reliability.
</sidebar>

>
> A further question that springs to mind is whether copies=N is also 
> used to improve read performance.

I have not measured copies=N performance changes, but
I do not expect them to change the read efficiency.  You
will still need to read the entire block to calculate the
checksum.
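A quick sketch of why I would not expect a read-side win from it
(numbers purely illustrative):

# copies=N sketch: it changes space consumed, not the per-read work.
# Numbers are purely illustrative.
data_gb = 100
copies = 2

print("space consumed with copies=%d: ~%d GB" % (copies, data_gb * copies))

# A read still fetches one full block and verifies its checksum; the
# extra copy is only consulted if that checksum fails.
print("per-read work: one full block read + checksum verification")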

> If so, you could have some read-optimized filesystems in a pool while 
> others use maximum storage efficiency (as for backups).

Hmmm... ok so how does a small, random read workload
requirement come from a backup system implementation?
I would expect backups to be single-threaded, sequential
workloads.  For example, many people use VTLs with ZFS:
http://www.sun.com/storagetek/tape_storage/tape_virtualization/vtl_value/features.xml
 -- richard
