On Wed, Oct 24, 2012 at 6:17 AM, Robin Axelsson
<gu99r...@student.chalmers.se> wrote:

> It would be interesting to know how you convert a raidz2 stripe to say a
> raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an extra
> parity drive by converting it to a raidz3 pool. I'm imagining that would
> be like creating a raidz1 pool on top of the leaf vdevs that constitutes
> the raidz2 pool and the new leaf vdev which results in an additional parity
> drive. It doesn't sound too difficult to do that. Actually, this way you
> could even get raidz4 or raidz5 pools. Question is though, how things would
> pan out performance wise, I would imagine that a 55 drive raidz25 pool is
> really taxing on the CPU.
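
As a quick sanity check of the layered-XOR idea quoted above, here is a
minimal Python sketch. It assumes plain byte-wise XOR parity as in RAID 4/5;
the block contents and the helper name xor_blocks are made up for
illustration and have nothing to do with the actual raidz code.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three made-up data blocks belonging to one stripe.
data = [b"\x11\x22\x33\x44", b"\xa0\xb1\xc2\xd3", b"\x0f\x1e\x2d\x3c"]

# Ordinary single XOR parity over the data blocks.
p = xor_blocks(data)

# One lost data block can be rebuilt from the survivors plus the parity.
rebuilt = xor_blocks([data[0], data[2], p])

# A second XOR parity layered over the data blocks AND the existing parity.
extra = xor_blocks(data + [p])

print(rebuilt == data[1])  # True
print(extra.hex())         # 00000000
```

The layered parity comes out as all zeros whatever the data is, so it carries
no information that could help after a second failure.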

Multiple parity is more complicated than that: an additional xor device
(a la traditional raid4) would end up with zeros everywhere, and couldn't
reconstruct your data after an additional failure. Look at "computing
parity" in http://en.wikipedia.org/wiki/Raid_6#RAID_6 . While in theory it
can extend to more than 3 parity blocks, it is unclear whether more than 3
would offer any serious additional benefit (using multiple raidz2 vdevs can
give you better IOPS than larger raidz3 vdevs, with little change in raw
space efficiency). There are also combinatorial implications to multiple
bit errors in a single data chunk with high parity levels, but that is
somewhat unlikely.

> Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a
> no-brainer; you just remove one drive from the pool and force zpool to
> accept the new state as "normal".

A degraded raidz2 vdev has to compute the missing block from parity on
nearly every read; this is not the normal state of raidz1. Changing the
parity level, either up or down, has similar complications in the on-disk
structure.

> But expanding a raidz pool with additional storage while preserving the
> parity structure sounds a little bit trickier. I don't think I have that
> knowledge to write a bpr rewriter although I'm reading Solaris Internals
> right now ;)

Unless raidz* did something radically different than raid5/6 (as in, not
having the parity blocks necessarily next to each other in the data chunk,
and having their positions recorded in the data chunk itself), the position
of the parity and data blocks would change. The "always consistent on disk"
approach of ZFS adds additional problems to this, which probably make it
impossible to rewrite the re-parity'ed chunk over the old chunk, meaning it
has to find some free space every time it wants to update a chunk to the
new parity level.

>> What you describe here is known as unionfs in Linux, among others.
>> I think there were RFEs or otherwise expressed desires to make that
>> in Solaris and later illumos (I did campaign for that sometime ago),
>> but AFAIK this was not yet done by anyone.
>>
> YES, UnionFS-like functionality is what I was talking about. It seems
> like it has been abandoned in favor of AuFS in the Linux and the BSD world.
> It seems to have functions that are a little overkill to use with zfs, such
> as copy-on-write. Perhaps a more simplistic implementation of it would be
> more suitable for zfs.

You could create zfs filesystems for subfolders in your "dataset" from the
separate pools, and give them mountpoints that put them into the same
directory. You would have to balance the data allocation between the pools
manually, though.

> Perhaps a similar functionality can be established through an abstraction
> layer behind network shares.
>
> In Windows this functionality is called 'disk pooling', btw.

In ZFS, disk pooling is done by "creating a zpool", emphasis on singular.

Do you actually expect a large portion of your disks to go offline
suddenly? I don't see a good way to handle this (good meaning there are no
missing files under the expected error conditions) that gets you more than
50% of your raw storage capacity (mirrors across the boundary of what you
expect to go down together). I doubt I would like the outcome of having
some software make arbitrary decisions of what real filesystem to put each
file on, and then having one filesystem fail, so if you really expect this,
you may be happier keeping the two pools separate and deciding where to put
stuff yourself (since if you are expecting a set of disks to fail, I expect
you would have some idea as to which ones it would be, for instance an
external enclosure).

If, on the other hand, you don't expect your hardware to drop an entire set
of disks for no good reason, making them into one large storage pool and
putting your filesystem in it will share your data transparently across all
disks without needing to set anything else up.

Tim
_______________________________________________
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss