> I don't see how you can get both end-to-end data integrity and
> read avoidance.

Checksum the individual RAID-5 blocks, rather than the entire stripe?

In more detail: allow the pointer to the block to contain one checksum per device used (the count will vary if you're using a RAID-Z style algorithm). Checksum each device's data independently, so the pointer looks like one of:

  (a) an array of <device, offset, checksum> tuples
  (b) a <device, offset> tuple and an array of checksums

The latter is closer to what we have today with RAID-Z (allocate across all devices); the former is more flexible and might work better if the number of disks in the stripe can be changed.

Reading an entire stripe then requires reading all the data (as today) and verifying the individual checksums. If any checksum fails, reconstruct from the remaining blocks.

Reading part of a stripe requires reading only the data from the disks holding the requested data and verifying their individual checksums. If any checksum fails, fall back on reading the whole block (across all devices) and reconstructing.

Writing an entire stripe is pretty much as today: write all the data to the requested disks, but with individual checksums.

Writing a partial stripe is more interesting. With style (b) block pointers -- RAID-Z style -- you need to do a read/modify/write of the stripe to get all the new data into the right place. But with style (a) block pointers, you're back to RAID-5 style writes (read old data+parity or remaining data, write new data+parity). (You don't need to rewrite the whole stripe since the block pointer can refer to the original partial stripe.)
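
To make the partial-read path concrete, here is a rough Python sketch of a style (a) pointer with one checksum per device. Everything in it is an assumption for illustration only: the names Segment, write_stripe and read_chunk are made up, the "disks" are plain dicts mapping offset to bytes, CRC32 stands in for whatever checksum the filesystem would really use, and parity is a single XOR chunk stored as the last segment. It is not how ZFS or RAID-Z is implemented.

```python
import zlib
from dataclasses import dataclass
from functools import reduce


@dataclass
class Segment:
    """One <device, offset, checksum> tuple of a style (a) block pointer."""
    device: int
    offset: int
    checksum: int  # CRC32 of the chunk stored on this device (assumed checksum)


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length chunks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def write_stripe(disks, offset, chunks):
    """Write the data chunks plus one XOR parity chunk (last segment).

    Returns the block pointer: a list of Segment, one per device used.
    """
    pieces = list(chunks) + [xor_blocks(chunks)]
    pointer = []
    for dev, piece in enumerate(pieces):
        disks[dev][offset] = piece
        pointer.append(Segment(dev, offset, zlib.crc32(piece)))
    return pointer


def read_chunk(disks, pointer, index):
    """Read one chunk, touching a single device when its checksum verifies."""
    seg = pointer[index]
    data = disks[seg.device][seg.offset]
    if zlib.crc32(data) == seg.checksum:
        return data  # fast path: no other device was read

    # Slow path: read every other segment, verify each, and reconstruct.
    others = []
    for i, s in enumerate(pointer):
        if i == index:
            continue
        piece = disks[s.device][s.offset]
        if zlib.crc32(piece) != s.checksum:
            raise IOError("second bad chunk; single parity cannot recover")
        others.append(piece)
    return xor_blocks(others)


if __name__ == "__main__":
    disks = [{} for _ in range(4)]  # three data devices plus one parity device
    ptr = write_stripe(disks, 0, [b"aaaa", b"bbbb", b"cccc"])
    disks[1][0] = b"bXbb"           # simulate silent corruption on device 1
    print(read_chunk(disks, ptr, 1))
```

The point of the sketch is that the good case reads exactly one device and still gets an end-to-end check, because the checksum lives in the pointer rather than next to the data. Only a checksum failure pays for reading the rest of the stripe.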