> I don't see how you can get both end-to-end data integrity and
> read avoidance.

Checksum the individual RAID-5 blocks, rather than the entire stripe?

In more detail: Allow the pointer to the block to contain one checksum per 
device used (the count will vary if you're using a RAID-Z style algorithm). 
Checksum each device's data independently, so the pointer looks like one of:

  (a) array of <device, offset, checksum> tuples
  (b) <device, offset> tuple and array of checksums

The latter is closer to what we have today with RAID-Z (allocate across all 
devices); the former is more flexible and might work better if the number of 
disks in the stripe can be changed.
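
To make the two layouts concrete, here is a rough sketch in Python. All the names (Extent, PointerA, PointerB, locate_b, to_style_a) are made up for illustration and are not ZFS structures. For style (b) I'm assuming that piece i lives on the i-th device after the starting one, at the same offset; a real allocator might lay things out differently.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Extent:
    device: int      # index of the disk holding this piece
    offset: int      # byte offset on that disk
    checksum: bytes  # checksum of this piece alone


@dataclass
class PointerA:
    """Style (a): every piece carries its own location and checksum."""
    extents: List[Extent]


@dataclass
class PointerB:
    """Style (b): one starting location plus one checksum per device."""
    device: int
    offset: int
    checksums: List[bytes]


def locate_b(ptr: PointerB, i: int, ndevices: int) -> Tuple[int, int]:
    # Assumed layout: piece i sits on the i-th device after the starting
    # one (wrapping around), at the same offset.
    return ((ptr.device + i) % ndevices, ptr.offset)


def to_style_a(ptr: PointerB, ndevices: int) -> PointerA:
    # Style (b) expands into style (a) by spelling out each location.
    return PointerA([Extent(*locate_b(ptr, i, ndevices), checksum=c)
                     for i, c in enumerate(ptr.checksums)])
```

Style (b) can always be expanded into style (a) this way, but not the reverse, which is one way to see why (a) is the more flexible of the two.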

Reading an entire stripe then requires reading all the data (as today) and 
verifying the individual checksums. If any checksum fails, reconstruct from the 
remaining blocks.
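
A minimal sketch of the full-stripe read, in Python. The assumptions are mine: single XOR parity stored as the last column, SHA-256 standing in for whatever checksum is actually used, and the parity column having its own checksum like any other device. The function names are hypothetical.

```python
import hashlib
from functools import reduce


def checksum(block):
    # Stand-in checksum; any sufficiently strong function would do.
    return hashlib.sha256(block).digest()


def xor_blocks(blocks):
    # Byte-wise XOR of equally sized blocks.
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))


def read_full_stripe(columns, checksums):
    """columns: data columns followed by one XOR parity column, as read
    from disk. checksums: the expected per-column checksums taken from
    the block pointer. Returns the verified data columns."""
    bad = [i for i, (c, s) in enumerate(zip(columns, checksums))
           if checksum(c) != s]
    if not bad:
        return list(columns[:-1])
    if len(bad) > 1:
        raise IOError("more than one bad column; single parity cannot repair")
    i = bad[0]
    # XOR of all the other columns (data and parity) yields the missing one.
    rebuilt = xor_blocks([c for j, c in enumerate(columns) if j != i])
    if checksum(rebuilt) != checksums[i]:
        raise IOError("reconstruction did not match the stored checksum")
    repaired = list(columns)
    repaired[i] = rebuilt
    return repaired[:-1]
```

Because each column has its own checksum, the reader knows exactly which column is bad and does not have to guess which one to reconstruct.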

Reading part of a stripe requires reading only the data from the disks holding 
the requested data and verifying their individual checksums. If any checksum 
fails, fall back on reading the whole block (across all devices) and 
reconstructing.
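
The partial read could be sketched like this (Python again). read_column is a hypothetical callback that reads one column from its disk; as before I'm assuming single XOR parity in the last column and SHA-256 as a stand-in checksum.

```python
import hashlib


def checksum(block):
    # Stand-in checksum; any sufficiently strong function would do.
    return hashlib.sha256(block).digest()


def xor_blocks(blocks):
    # Byte-wise XOR of equally sized blocks.
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for k, byte in enumerate(block):
            out[k] ^= byte
    return bytes(out)


def read_partial(read_column, checksums, wanted):
    """Read only the wanted columns; touch the other disks only when a
    checksum fails. Returns a dict of column index -> verified bytes."""
    result = {}
    for i in wanted:
        block = read_column(i)
        if checksum(block) != checksums[i]:
            # Fall back: read every other column (including parity),
            # verify each one, and rebuild the bad column from them.
            others = []
            for j in range(len(checksums)):
                if j == i:
                    continue
                other = read_column(j)
                if checksum(other) != checksums[j]:
                    raise IOError("second bad column; cannot reconstruct")
                others.append(other)
            block = xor_blocks(others)
            if checksum(block) != checksums[i]:
                raise IOError("reconstruction did not match the checksum")
        result[i] = block
    return result
```

In the healthy case this issues one read per requested column, which is the read avoidance asked about; the whole-stripe read only happens on the error path.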

Writing an entire stripe is pretty much as today: write all the data to the 
requested disks, but with individual checksums.

Writing a partial stripe is more interesting. With style (b) block pointers -- 
RAID-Z style -- you need to do a read/modify/write of the stripe to get all the 
new data into the right place. But with style (a) block pointers, you're back 
to RAID-5 style writes (read either the old data and parity or the remaining 
data, then write the new data and parity). (You don't need to rewrite the whole 
stripe since the block pointer can refer to the original partial stripe.)
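
For the style (a) case, the RAID-5 style small write might look roughly like this in Python, using the "read old data+parity" variant. Same assumptions as usual: single XOR parity in the last column, SHA-256 as a stand-in checksum, and hypothetical function names. The function returns new lists rather than overwriting, on the assumption that the new pieces would be written to fresh locations and referenced from a new block pointer.

```python
import hashlib


def checksum(block):
    # Stand-in checksum; any sufficiently strong function would do.
    return hashlib.sha256(block).digest()


def xor(a, b):
    # Byte-wise XOR of two equally sized blocks.
    return bytes(x ^ y for x, y in zip(a, b))


def small_write(columns, checksums, index, new_data):
    """Replace one data column. Only the old data and the old parity are
    read; the other columns and their checksums are carried over as-is."""
    parity = len(columns) - 1
    old_data, old_parity = columns[index], columns[parity]
    # Verify what we read before folding it into the new parity.
    if checksum(old_data) != checksums[index]:
        raise IOError("old data failed its checksum")
    if checksum(old_parity) != checksums[parity]:
        raise IOError("old parity failed its checksum")
    # new parity = old parity XOR old data XOR new data
    new_parity = xor(xor(old_parity, old_data), new_data)
    new_columns = list(columns)
    new_columns[index] = new_data
    new_columns[parity] = new_parity
    new_checksums = list(checksums)
    new_checksums[index] = checksum(new_data)
    new_checksums[parity] = checksum(new_parity)
    return new_columns, new_checksums
```

Only two checksums change, so the new pointer can keep the untouched <device, offset, checksum> tuples exactly as they were.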
 
 
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
