(I looked at my email before checking here, so I'll just cut-and-paste the email response in here rather than send it. By the way, is there a way to view just the responses that have accumulated in this forum since I last visited - or just those I've never looked at before?)
Bill Moore wrote:
> On Fri, Sep 15, 2006 at 01:23:31AM -0700, can you guess? wrote:
>> Implementing it at the directory and file levels would be even more
>> flexible: redundancy strategy would no longer be tightly tied to path
>> location, but directories and files could themselves still inherit
>> defaults from the filesystem and pool when appropriate (but could be
>> individually handled when desirable).
>
> The problem boils down to not having a way to express your intent that
> works over NFS (where you're basically limited by POSIX) that you can
> use from any platform (esp. ones where ZFS isn't installed).  If you
> have some ideas, this is something we'd love to hear about.

Well, one idea is that it seems downright silly to gate ZFS facilities on two-decade-old network file access technology: sure, it's important to be able to *access* ZFS files using NFS, but does anyone really care if NFS can't express the full range of ZFS features - at least to the point of thinking those features should be suppressed as a result, rather than made available to local users plus any remote users employing some possibly-future mechanism that *can* support them?

That said, you could always adopt the ReiserFS approach of exposing file/directory metadata via extended path specifications in environments like NFS where richer forms of interaction aren't available: yes, it may feel a bit kludgey, but it gets the job done (a rough sketch of what I mean appears further down).

And even if you did nothing to help NFS, its users would still benefit from inheriting whatever arbitrarily fine-grained redundancy levels had been established via more comprehensive means: they just wouldn't be able to tweak redundancy levels themselves (any more, or any less, than they can today).

>> I've never understood why redundancy was a pool characteristic in ZFS
>> - and the addition of 'ditto blocks' and now this new proposal (both
>> of which introduce completely new forms of redundancy to compensate
>> for the fact that pool-level redundancy doesn't satisfy some needs)
>> just makes me more skeptical about it.
>
> We have thought long and hard about this problem and even know how to
> implement it (the name we've been using is Metaslab Grids, which isn't
> terribly descriptive, or as Matt put it "a bag o' disks").

Yes, 'a bag o' disks' - used intelligently at a higher level - is pretty much what I had in mind.

> There are two main problems with it, though.  One is failures.  The
> problem is that you want the set of disks implementing redundancy
> (mirror, RAID-Z, etc.) to be spread across fault domains (controller,
> cable, fans, power supplies, geographic sites) as much as possible.
> There is no generic mechanism to obtain this information and act upon
> it.  We could ask the administrator to supply it somehow, but such a
> description takes effort, is not easy, and prone to error.  That's why
> we have the model right now where the administrator specifies how they
> want the disks spread out across fault groups (vdevs).

Without having looked at the code I may be missing something here. Even with your current implementation, if there's indeed no automated way to obtain such information, the administrator has to exercise manual control over disk groupings anyway in order to attain higher availability by avoiding other single points of failure, rather than just guarding against unrecoverable data loss from disk failure.
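To make the 'extended path' suggestion above slightly more concrete, here's the flavor of thing I have in mind - purely hypothetical, of course: the magic '....meta' path component and the 'copies' attribute below are names I just made up, not anything ZFS (or ReiserFS, for that matter) actually provides. The point is only that a client restricted to plain POSIX calls over NFS could still express per-file intent:

    /*
     * Hypothetical sketch: set per-file redundancy intent over NFS using
     * nothing but open()/write() on a "magic" metadata sub-path.  The
     * "....meta" component and "copies" attribute are invented names.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Ask for two copies of this file's data via an extended path. */
        int fd = open("/export/home/bill/report.dat/....meta/copies",
                      O_WRONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        const char *val = "2\n";
        if (write(fd, val, strlen(val)) < 0)
            perror("write");
        close(fd);
        return 0;
    }

Kludgey, as I said, but it requires nothing beyond open() and write() on the wire.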
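And just to make 'such information' concrete before going on: the sort of description I'm picturing the administrator supplying (once) is no more than something like the following - my own sketch, with nothing ZFS-specific about the names or the structure:

    /*
     * Sketch (my own, not ZFS's) of a fault-domain description an
     * administrator might supply once, which an allocator could consult
     * when spreading redundant copies - whether across whole-disk vdevs
     * as today or at finer granularity.
     */
    #include <string.h>

    struct disk_location {
        const char *disk;        /* device name                        */
        const char *controller;  /* HBA/controller it hangs off        */
        const char *power;       /* power supply or PDU feeding it     */
        const char *site;        /* rack, room, or geographic site     */
    };

    /* Two copies of the same data should not share any of these domains. */
    static int share_fault_domain(const struct disk_location *a,
                                  const struct disk_location *b)
    {
        return strcmp(a->controller, b->controller) == 0 ||
               strcmp(a->power, b->power) == 0 ||
               strcmp(a->site, b->site) == 0;
    }

Whether that information ends up driving whole-disk vdev construction or per-block placement, the administrator's input is the same.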
Once that information has been made available to the system, letting it make use of it at a higher level rather than just aggregating entire physical disks should not entail additional administrator effort.

I admit that I haven't considered the problem in great detail, since my bias is toward solutions that scale up with redundant arrays of inexpensive nodes rather than with a small number of very large nodes (in part because a single large node can itself be a single point of failure even if many of its subsystems carefully avoid being so in the manner you suggest). Each such small node has a relatively low disk count and little or no internal redundancy, and thus constitutes its own little fault-containment environment, avoiding most such issues; as a bonus, such node sizes mesh well with the bandwidth available from very inexpensive Gigabit Ethernet interconnects and switches (even when streaming data sequentially, e.g., for video on demand) and allow fine-grained incremental system scaling (by the time faster interconnects become inexpensive, disk bandwidth should have increased enough that the balance will still be fairly good).

Still, if you can group whole disks intelligently in a large system so as to supplement simple redundancy with higher overall subsystem availability, then you ought to be able to use exactly the same information to make higher-level decisions about where to place redundant data at other than whole-disk granularity.

> The second problem comes back to accounting.  If you can specify, on a
> per-file or per-directory basis, what kind of replication you want, how
> do you answer the statvfs() question?  I think the recent "discussions"
> on this list illustrate the complexity and passion on both sides of the
> argument.

I rather liked the idea of using the filesystem *default* redundancy level as the basis for reporting free space, though in environments where different users were set up with different defaults, using the per-user default might make sense (then only if that default were manually changed, presumably by the user, would less obvious things happen).

Overall, I think free space should perhaps be reported on the basis of things the user does *not* have control over, such as the default flavor of redundancy established by an administrator - i.e., as the number of bytes the user could write using that default flavor (which is what I was starting to converge on just above). Then the user will mostly see only discrepancies caused by changes from that default that s/he has made, and should be able to understand them (well, if the user has personal 'temp' space the admin might have special-cased it as non-redundant, I suppose).

Then again, whenever one traverses a mount point today (not always an obvious transition) the whole world of free space (and, I'd expect, quota) changes anyway, and users don't seem to find that an insurmountable obstacle. So I find it difficult to see free-space reporting as any real show-stopper here regardless of how it's done (though, like most people who contributed to that topic, I have a preference).
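A minimal sketch of the scheme I'm suggesting, assuming a (hypothetical) per-filesystem 'default copies' setting - the names here are mine, not ZFS's:

    /*
     * Answer the statvfs() free-space question in terms of how much the
     * user could still write at the administrator's *default* redundancy
     * level; per-file overrides then show up only as files that consume
     * space faster (or slower) than their length suggests.
     */
    #include <stdint.h>

    struct fs_space {
        uint64_t raw_bytes_free;   /* unallocated bytes across the pool      */
        uint64_t frag_size;        /* statvfs f_frsize, e.g. 512             */
        uint32_t default_copies;   /* admin-set default redundancy: 1, 2, 3  */
    };

    /* f_bavail-style answer: fragments writable at the default redundancy. */
    static uint64_t default_free_frags(const struct fs_space *fs)
    {
        uint32_t copies = fs->default_copies ? fs->default_copies : 1;
        return (fs->raw_bytes_free / copies) / fs->frag_size;
    }

A per-file override to, say, three copies would then simply show up as that file consuming space three times as fast as its length grows - exactly the discrepancy the user opted into.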
>> (Not that I intend in any way to minimize the effort it might take to
>> change that decision now.)
>
> The effort is not actually that great.  All the hard problems we needed
> to solve in order to implement this were basically solved when we did
> the RAID-Z code.  As a matter of fact, you can see it in the on-disk
> specification as well.  In the DVA, you'll notice an 8-bit field labeled
> "GRID".  These are the bits that would describe, on a per-block basis,
> what kind of redundancy we used.

The only reason I can think of for establishing that per block (rather than per object) would be if you kept per-block access-rate information around so that you could distribute really hot blocks more widely. And given that such blocks would normally be in cache anyway, that only seems to make sense in a distributed environment (where you're trying to spread the load over multiple nodes more because of interconnect bandwidth limitations than disk bandwidth limitations - though even there you could do it at the cache level rather than the on-disk level, based on dynamic needs).

- bill
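P.S. For anyone following along who hasn't dug into the on-disk format document, here's roughly where that GRID field sits in the 128-bit DVA, as I read the spec - the macro definitions are just my own shorthand for the layout, not code lifted from the ZFS source:

    /*
     * Rough illustration of the DVA layout (two 64-bit words) as described
     * in the on-disk format document; the accessor macros are my own.
     */
    #include <stdint.h>

    typedef struct dva {
        uint64_t dva_word[2];
    } dva_t;

    /* word 0: | vdev (32) | GRID (8) | ASIZE (24) |        */
    /* word 1: | G (1)     |       offset (63)     |        */
    #define DVA_VDEV(d)    ((uint32_t)((d)->dva_word[0] >> 32))
    #define DVA_GRID(d)    ((uint8_t)(((d)->dva_word[0] >> 24) & 0xff))
    #define DVA_ASIZE(d)   ((d)->dva_word[0] & 0xffffff)
    #define DVA_GANG(d)    ((d)->dva_word[1] >> 63)
    #define DVA_OFFSET(d)  ((d)->dva_word[1] & ~(1ULL << 63))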