Hello all, I have a new idea up for discussion.
Several RAID systems have implemented "spread" spare drives, in the sense that there is no idling disk waiting to receive a burst of resilver data; instead, the capacity of the spare disk is spread among all drives in the array. As a result, the healthy array gets one more working spindle and runs a little faster, and rebuild times often decrease since more spindles can participate in repairs at the same time. I don't think I've seen such an idea proposed for ZFS, and I do wonder if it is at all possible with variable-width stripes? Then again, if each disk is sliced into 200 metaslabs or so, implementing a spread spare on top of them seems like a no-brainer as well (a toy sketch of such a layout is in the P.S. below).

To be honest, I first saw this a long time ago in (Falcon?) RAID controllers, and recently in a USENIX presentation on IBM GPFS on YouTube. In the latter, the speaker goes into greater depth describing their "declustered RAID" approach: all blocks - spare, redundancy and data - are intermixed evenly across all drives, rather than confined to a single "group" or a mid-level VDEV as they would be in ZFS.
http://www.youtube.com/watch?v=2g5rx4gP6yU&feature=related

GPFS with declustered RAID not only decreases rebuild times and the impact of rebuilds on end-user operations, it also increases reliability: when multiple disks fail in a large RAID-6 or RAID-7 array (their example uses 47-disk sets), there is a smaller time window during which data is left in a "critical state" due to lack of redundancy, and there is less data overall in that state - so the system goes from critical to merely degraded (with some redundancy) in a few minutes.

Another thing they have in GPFS is temporary offlining of disks, so that they can catch up when reattached - only newer writes (bigger TXG numbers, in ZFS terms) are applied to the reinserted disks. I am not sure this exists in ZFS today, either (a sketch of the idea is in the P.P.S. below). This might simplify physical systems maintenance (as it does for IBM's boxes - see the presentation if interested) and allow quick recovery from temporarily unavailable disks, such as when a disk gets a bus reset and is unavailable for writes for a few seconds (or more) while the array keeps on writing.

I find these ideas cool. I do believe that IBM might get angry if ZFS development copy-pasted them "as is", but it might nonetheless get us inventing a similar wheel that would be a bit different ;) There are already several vendors doing this in some way, so perhaps there is no (patent) monopoly in place...

And I think all the magic of spread spares and/or "declustered RAID" would go into just another write-block allocator, in the same league that "raidz" or "mirror" are in nowadays. BTW, are such allocators pluggable (as software modules)? (See the P.P.P.S. for what I imagine such a plug-in interface might look like.)

What do you think - can and should such ideas find their way into ZFS? Or why not? Perhaps from theoretical or real-life experience with such storage approaches?

//Jim Klimov
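
P.S. To make the spread-spare idea concrete, here is a toy sketch in C
of a declustered layout. This is my own illustration with made-up names
and parameters - not GPFS's actual algorithm, and not existing ZFS code.
Each logical stripe of data + parity + spare columns is simply rotated
across all physical disks, so spare capacity and redundancy end up
intermixed evenly instead of living on dedicated drives:

    #include <stdio.h>

    #define NDISKS  7   /* physical disks in the set            */
    #define NDATA   4   /* data columns per stripe              */
    #define NPARITY 2   /* parity columns per stripe            */
    #define NSPARE  1   /* distributed spare columns per stripe */

    /* Map (stripe, column) to a physical disk by simple rotation. */
    static int
    column_to_disk(unsigned stripe, unsigned col)
    {
        return ((stripe + col) % NDISKS);
    }

    int
    main(void)
    {
        for (unsigned s = 0; s < NDISKS; s++) {
            printf("stripe %u: data on", s);
            for (unsigned c = 0; c < NDATA; c++)
                printf(" d%d", column_to_disk(s, c));
            printf(", parity on");
            for (unsigned c = NDATA; c < NDATA + NPARITY; c++)
                printf(" d%d", column_to_disk(s, c));
            printf(", spare on");
            for (unsigned c = NDATA + NPARITY; c < NDISKS; c++)
                printf(" d%d", column_to_disk(s, c));
            printf("\n");
        }
        return (0);
    }

Over any 7 consecutive stripes, every disk holds the same mix of data,
parity and spare blocks - which is exactly why a rebuild can read from,
and write to, all spindles at once. A real allocator would presumably
use a smarter permutation than plain rotation, to spread rebuild load
evenly across disk pairs.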
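
P.P.S. And a sketch of the catch-up-on-reattach idea, in (pseudo-)ZFS
terms. Since every block pointer records the TXG in which it was born,
a resilver for a temporarily absent disk would only need to walk blocks
born after the last TXG that disk is known to have written. All names
below are hypothetical, invented for illustration - this is not the
actual ZFS resilver code:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct blk_info {
        uint64_t bi_birth_txg;  /* TXG in which the block was written */
    } blk_info_t;

    typedef struct reattached_disk {
        uint64_t rd_last_txg;   /* highest TXG fully written to this
                                   disk before it went offline */
    } reattached_disk_t;

    /*
     * Blocks born at or before rd_last_txg are already on the disk and
     * can be skipped; only younger blocks need to be copied over.
     */
    static bool
    needs_catchup(const blk_info_t *bp, const reattached_disk_t *rd)
    {
        return (bp->bi_birth_txg > rd->rd_last_txg);
    }

The pool would keep writing while the disk is away; on reattach, a
traversal filtered by needs_catchup() replays just the missed TXGs in
minutes, instead of resilvering the whole disk.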
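
P.P.P.S. As for pluggability: as far as I understand, ZFS already
dispatches per-vdev-type behaviour through a table of function pointers
(vdev_ops_t), so in principle a "declustered" top-level vdev could slot
in the same way raidz and mirror do. The struct below is a simplified,
hypothetical illustration of such an interface - not the real
vdev_ops_t definition:

    #include <stdint.h>

    typedef struct vdev vdev_t;  /* opaque; stands in for a real vdev */

    typedef struct layout_ops {
        const char *lo_type;     /* e.g. "declustered"                */
        int  (*lo_open)(vdev_t *vd);
        void (*lo_close)(vdev_t *vd);
        /* Map a logical offset to a child disk and an offset on it. */
        void (*lo_map)(vdev_t *vd, uint64_t offset,
            int *child, uint64_t *child_offset);
        /* Pick where a new variable-width stripe goes (the allocator). */
        int  (*lo_alloc)(vdev_t *vd, uint64_t size, uint64_t *offset);
    } layout_ops_t;

If something like this were exposed to loadable modules, a spread-spare
or declustered allocator could be an add-on rather than a fork of the
raidz code.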