On Sat, Jan 7, 2012 at 7:37 PM, Richard Elling <richard.ell...@gmail.com> wrote:
> Hi Jim,
>
> On Jan 6, 2012, at 3:33 PM, Jim Klimov wrote:
>
> > Hello all,
> >
> > I have a new idea up for discussion.
> >
> > Several RAID systems have implemented "spread" spare drives,
> > in the sense that there is no idling disk waiting to receive
> > a burst of resilver data filling it up; instead, the capacity
> > of the spare disk is spread among all drives in the array.
> > As a result, the healthy array gets one more spindle and works
> > a little faster, and rebuild times are often decreased since
> > more spindles can participate in repairs at the same time.
>
> Xiotech has a distributed, relocatable model, but the FRU is the whole ISE.
> There have been other implementations of more distributed RAIDness in the
> past (RAID-1E, etc).
>
> The big question is whether they are worth the effort. Spares solve a
> serviceability problem and only impact availability in an indirect manner.
> For single-parity solutions, spares can make a big difference in MTTDL, but
> have almost no impact on MTTDL for double-parity solutions (e.g. raidz2).

I disagree. Dedicated spares impact far more than availability. During a
rebuild, performance is, in general, abysmal. ZIL and L2ARC will obviously
help (L2ARC more than ZIL), but at the end of the day, if we've got a 12-hour
rebuild (fairly conservative in the days of 2TB SATA drives), the performance
degradation is going to be very real for end users. With distributed parity
and spares, you should in theory be able to cut this down by an order of
magnitude (a rough back-of-the-envelope sketch follows at the end of this
message).

I feel as though you're brushing this off as not a big deal when it's an
EXTREMELY big deal (in my mind). In my opinion you can't just approach this
from an MTTDL perspective; you also need to take the user experience into
account. Just because I haven't lost data doesn't mean the system isn't
(essentially) unavailable (sorry for the double negative and the repeated
parentheses). If I can't use the system because performance is a fraction of
what it is during normal production, it might as well be an outage.

> > I don't think I've seen such an idea proposed for ZFS, and
> > I do wonder if it is at all possible with variable-width
> > stripes? Although if the disk is sliced into 200 metaslabs
> > or so, implementing a spread spare is a no-brainer as well.
>
> Put some thoughts down on paper and work through the math. If it all works
> out, let's implement it!
> -- richard

I realize it's not intentional, Richard, but that response is more than a bit
condescending. If he could just put it down on paper and code something up,
I strongly doubt he would be posting his thoughts here; he would be posting
results. The intention of his post, as far as I can tell, is to perhaps
inspire someone who CAN just write down the math and write up the code to do
so, or at least to have them review his thoughts and give him a dev's
perspective on how viable bringing something like this to ZFS is. I fear
responses like "the code is there, figure it out" make the *aris community
no better than the Linux one.

> > To be honest, I've seen this a long time ago in (Falcon?)
> > RAID controllers, and recently in a USENIX presentation of
> > IBM GPFS on YouTube. In the latter the speaker goes into
> > greater depth describing their "declustered RAID" approach
> > (as they call it: all blocks - spare, redundancy and data -
> > are intermixed evenly on all drives and not in a single
> > "group" or a mid-level VDEV as it would be for ZFS).
> > http://www.youtube.com/watch?v=2g5rx4gP6yU&feature=related
> >
> > GPFS with declustered RAID not only decreases rebuild times
> > and/or the impact of rebuilds on end-user operations, but it
> > also happens to increase reliability - in case of a
> > multiple-disk failure in a large RAID-6 or RAID-7 array (in
> > the example they use 47-disk sets), there is a smaller time
> > window during which the data is left in a "critical state"
> > due to lack of redundancy, and there is less data overall in
> > such a state - so the system goes from critical to simply
> > degraded (with some redundancy) in a few minutes.
> >
> > Another thing they have in GPFS is temporary offlining of
> > disks so that they can catch up when reattached - only newer
> > writes (bigger TXG numbers in ZFS terms) are added to
> > reinserted disks. I am not sure this exists in ZFS today,
> > either. This might simplify physical systems maintenance
> > (as it does for IBM boxes - see the presentation if
> > interested) and quick recovery from temporarily unavailable
> > disks, such as when a disk gets a bus reset and is
> > unavailable for writes for a few seconds (or more) while the
> > array keeps on writing.
> >
> > I find these ideas cool. I do believe that IBM might get
> > angry if ZFS development copy-pasted them "as is", but it
> > might nonetheless get us inventing a similar wheel that would
> > be a bit different ;) There are already several vendors doing
> > this in some way, so perhaps there is no (patent) monopoly in
> > place already...
> >
> > And I think all the magic of spread spares and/or
> > "declustered RAID" would go into just making another
> > write-block allocator in the same league "raidz" or "mirror"
> > are in nowadays... BTW, are such allocators pluggable (as
> > software modules)?
> >
> > What do you think - can and should such ideas find their way
> > into ZFS? Or why not? Perhaps from theoretical or real-life
> > experience with such storage approaches?
> >
> > //Jim Klimov
>
> --
> ZFS and performance consulting
> http://www.RichardElling.com
> illumos meetup, Jan 10, 2012, Menlo Park, CA
> http://www.meetup.com/illumos-User-Group/events/41665962/

As always, feel free to tell me why my rant is completely off base ;)

--Tim
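P.S. Here is the back-of-the-envelope sketch referenced above. The numbers are
assumptions, not measurements: ~2TB of data to reconstruct and ~50MB/s of
sustained resilver throughput per participating disk. It only illustrates the
arithmetic, not real ZFS resilver behavior:

# Back-of-the-envelope rebuild-time comparison only; this is arithmetic,
# not a model of real ZFS resilvering. Assumed numbers: ~2 TB of data to
# reconstruct, ~50 MB/s sustained resilver throughput per participating disk.

DATA_TB = 2.0                    # data to reconstruct (TB), per the example above
PER_DISK_MBS = 50.0              # assumed per-disk resilver throughput (MB/s)
DATA_MB = DATA_TB * 1024 * 1024  # TB -> MB (binary units)

def rebuild_hours(writers):
    """Hours to rewrite the lost data when `writers` disks share the writes."""
    return DATA_MB / (PER_DISK_MBS * writers) / 3600.0

print("dedicated spare, 1 writer : %5.1f h" % rebuild_hours(1))
print("declustered, 10 writers   : %5.1f h" % rebuild_hours(10))
print("declustered, 46 writers   : %5.1f h" % rebuild_hours(46))

That prints roughly 11.7 h, 1.2 h, and 0.3 h. The point is just that rebuild
time scales inversely with the number of disks that can absorb the
reconstructed data, which is where the order-of-magnitude claim comes from.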
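P.P.S. And a toy Python sketch of the declustered-layout idea Jim describes.
This is not GPFS's actual algorithm - the disk and stripe counts are made up
and the rotation is far simpler than what they present - it just shows that a
single disk failure pulls every surviving disk into the rebuild instead of
funneling all writes through one dedicated spare:

# Toy layout sketch, not GPFS's actual algorithm: stripe units (data,
# parity and spare space) are rotated across all disks instead of
# dedicating whole disks to spares. Disk/stripe counts are made up.

DISKS = 12         # disks in the hypothetical declustered group
STRIPE_WIDTH = 8   # units per stripe, e.g. 6 data + 2 parity
STRIPES = 200      # stripes laid out for the demonstration

# layout[d] = list of (stripe, unit) pairs stored on disk d
layout = {d: [] for d in range(DISKS)}
for s in range(STRIPES):
    for u in range(STRIPE_WIDTH):
        layout[(s + u) % DISKS].append((s, u))   # simple rotation

failed = 0
# Every stripe that had a unit on the failed disk needs repair; the
# surviving members of those stripes supply the reads, and the rebuilt
# units land in spare space spread over those same survivors.
affected = {s for (s, _u) in layout[failed]}
helpers = {d for d in range(DISKS) if d != failed
           and any(s in affected for (s, _u) in layout[d])}
print("stripes needing repair:", len(affected))
print("survivors pulled into the rebuild: %d of %d" % (len(helpers), DISKS - 1))

With a dedicated spare, the same failure would push every rebuild write
through one disk; here the repair reads and spare-space writes land on all
eleven survivors, which is what shrinks both the rebuild window and the
per-disk load during it.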
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss