Nicholas Lee wrote:
On Mon, Jun 22, 2009 at 4:24 PM, Stuart Anderson
<ander...@ligo.caltech.edu> wrote:
However, it is a bit disconcerting to have to run with reduced data
protection for an entire week. While I am certainly not going back to
UFS, it seems like it should be at least theoretically possible to do
this several orders of magnitude faster, e.g., what if every block on
the replacement disk had its RAIDZ2 data recomputed from the degraded
Maybe this is also saying that, for large disk sets, a single RAIDZ2
provides a false sense of security.
Nicholas
------------------------------------------------------------------------
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
I'm assuming the problem is that you are IOPS bound. Since you wrote
small files, ZFS uses small stripe sizes, which means that when you
need to do a full-stripe read to reconstruct the RAIDZ2 parity, you're
reading only a very small amount of data. You're IOPS bound on the
replacement disk.
For argument's sake, let's assume you have 4k stripe sizes. Thus, you do:
(1) 4k read across all disks
(2) checksum computation
(3) tiny write to re-silver disk
Assuming you might max out at 300 IOPS (not unreasonable for small reads
on SATA drives), that results in:
(300 / 2) x 4kB = 600kB/s.
That is, you can do 150 stripe reads and writes, each read/write pair
reconstructing the parity for 4k of data. And, that might be optimal.
At that rate, 1TB of data will take ( (1024 * 1024 * 1024 kB) /
600kB/s) = 1.8 million seconds =~ 500 hours.
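For anyone who wants to plug in their own numbers, here is a minimal sketch of that estimate. The 300 IOPS and 4kB figures are just the assumptions from this message, and the function names are made up for illustration; none of this reflects how ZFS actually schedules a resilver.

```python
# Back-of-envelope resilver estimate using the figures assumed in this message.
IOPS = 300              # assumed small-I/O ceiling for one SATA drive
STRIPE_KB = 4           # assumed stripe size in kB
KB_PER_TB = 1024 ** 3   # kB in 1TB (binary units)


def resilver_rate_kb_s(iops, stripe_kb, ios_per_stripe=2):
    """kB/s reconstructed when each stripe costs ios_per_stripe I/Os (read + write)."""
    return iops / ios_per_stripe * stripe_kb


def resilver_hours(data_kb, rate_kb_s):
    """Hours needed to reconstruct data_kb at the given rate."""
    return data_kb / rate_kb_s / 3600


rate = resilver_rate_kb_s(IOPS, STRIPE_KB)
hours = resilver_hours(KB_PER_TB, rate)
print(f"{rate:.0f} kB/s, {hours:.0f} hours for 1TB")
```

Changing STRIPE_KB or IOPS shows how sensitive the estimate is to the stripe size, which is the whole point of the argument above.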
I don't know about how ZFS does the actual reconstruction, but I have
two suggestions:
(1) if ZFS is doing a serial resilver (i.e. resilver stripe 1 before
doing stripe 2, etc.), would it be possible to NOT do a full stripe write
when doing the reconstruction? That is, only write the reconstructed
data back to the replacement disk? That would allow the "data" disks to
use their full IOPS reading, and the replacement disk its full IOPS
writing. It's still going to suck rocks, but only half as much.
(2) Multiple stripe-reconstruction would probably be better; that is,
ZFS should reconstruct several adjacent stripes together, up to some
reasonable total size (say 1MB or so). That way, you could get
reconstruction rates of 100MB/s (that is, reconstruct the parity for
100MB of data per second, NOT writing 100MB/s). 1TB of data @ 100MB/s
is only about 3 hours.
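To put rough numbers on the two suggestions, here is a small sketch comparing them against the baseline. It assumes the same 300 IOPS and 4kB stripes as above, plus a ~100MB/s streaming ceiling per drive; the min() cap is my own simplification, and the names are hypothetical. It is a toy model, not a description of the ZFS resilver code.

```python
# Rough comparison of the resilver strategies discussed in this message.
# All figures are assumptions from the message, not measurements.
IOPS = 300              # assumed small-I/O ceiling per SATA drive
STRIPE_KB = 4           # assumed stripe size in kB
SEQ_KB_S = 100 * 1024   # assumed ~100MB/s streaming rate per drive
KB_PER_TB = 1024 ** 3   # kB in 1TB (binary units)


def rate_kb_s(chunk_kb, ios_per_chunk):
    """IOPS-limited rate in kB/s, capped at the assumed streaming bandwidth."""
    return min(IOPS / ios_per_chunk * chunk_kb, SEQ_KB_S)


def hours_for_1tb(rate):
    """Hours to reconstruct 1TB at the given kB/s rate."""
    return KB_PER_TB / rate / 3600


strategies = {
    "baseline (read + write share IOPS)": rate_kb_s(STRIPE_KB, 2),
    "(1) write only to replacement disk": rate_kb_s(STRIPE_KB, 1),
    "(2) 1MB batches of adjacent stripes": rate_kb_s(1024, 2),
}
for name, rate in strategies.items():
    print(f"{name}: {rate:.0f} kB/s, {hours_for_1tb(rate):.1f} hours")
```

Under these assumptions the three cases come out at roughly 497, 249 and 2.9 hours for 1TB, which matches the "half as much" and "only about 3 hours" figures above.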
--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA