2012-08-25 0:42, Sašo Kiselkov wrote:
> Oh man, that's a million-billion points you made. I'll try to run through each quickly.
Thanks... I still do not have the feeling that you've fully got my idea, or, alternatively, that I correctly understand ARC :)
There is also a relatively large RAM pointer overhead for storing small pieces of data (such as metadata blocks sized one or a few sectors) in L2ARC. I expect this to be eliminated by storing and using such blocks directly from the pool (on SSD METAXELs): we get SSD-fast access to the blocks, and there is no expiration into L2ARC and back, with inefficiently-sized ARC pointers to remember.
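To put rough numbers on this - a back-of-envelope sketch, where the ~200-byte per-block ARC header is my assumption for illustration, not a figure taken from the code:

/*
 * Back-of-envelope estimate of RAM consumed by L2ARC bookkeeping for
 * small blocks.  The 200-byte per-block header size and the 100 GB
 * cache device size are assumptions for illustration only.
 */
#include <stdio.h>

int
main(void)
{
	const double hdr_bytes = 200.0;		/* assumed ARC header per L2ARC-resident block */
	const double l2arc_gb = 100.0;		/* assumed cache device size */
	const double blk_sizes[] = { 512, 4096, 131072 };	/* sector, 4K, 128K */

	for (int i = 0; i < 3; i++) {
		double nblocks = l2arc_gb * 1024 * 1024 * 1024 / blk_sizes[i];
		double ram_gb = nblocks * hdr_bytes / (1024 * 1024 * 1024);
		printf("%7.0f-byte blocks: %.1f million entries, "
		    "~%.1f GB of RAM, overhead ratio %.1f%%\n",
		    blk_sizes[i], nblocks / 1e6, ram_gb,
		    100.0 * hdr_bytes / blk_sizes[i]);
	}
	return (0);
}

Under these assumed numbers, a cache full of single-sector metadata blocks eats tens of gigabytes of RAM per 100 GB of L2ARC, while the same cache full of 128K blocks costs next to nothing - which is the asymmetry I'm getting at.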
...And these counter-arguments probably are THE point of deviation:
> However, metaxels and cache devices are essentially the same
> (many small random reads, infrequent large async writes).
> The difference between metaxel and cache, however, is cosmetic.
> You'd still need to reference metaxel data from ARC, so your
> savings would be very small. ZFS already is pretty efficient there.
No, you don't! "Republic credits WON'T do fine!" ;)

The way I understood ARC (without/before L2ARC), it either caches pool blocks or it doesn't. More precisely, there is also a cache of ghosts without the bulk block data, so we can account for misses of recently expired blocks of one of the two categories and adjust the cache subdivision towards MRU or MFU. Ultimately, those ghosts which were not requested also expire from the cache, and no reference to a recently-cached block remains.

With L2ARC, on the other hand, there is a list of pointers in the ARC so that it knows which blocks were cached on the SSD - and the lack of this list upon pool import is, in effect, the perceived emptiness of the L2ARC device. L2ARC's pointers are of comparable size to small metadata blocks, and *this* consideration IMHO makes it much more efficient to use L2ARC with larger cached blocks, especially on systems with limited RAM (which effectively limits the addressable L2ARC size, counted in number of blocks), with the added benefit that you can compress larger blocks in L2ARC.

This way, the *difference* between L2ARC and a METAXEL is that the latter is an ordinary pool tlvdev with a specially biased read priority and write filter. If a metadata block is read, it goes into the ARC. If it expires - there's a ghost for a while, and soon there is no memory that this block was ever cached - unlike L2ARC's list of pointers, which are only a couple of times smaller than a cached block of this type. But re-fetching metadata from an SSD METAXEL is faster when it is needed again.

> Also, you're wrong if you think the clientele of l2arc and
> metaxel would be different - it most likely wouldn't.

This only stresses the problem with L2ARC's shortcomings for metadata, the way I see them (if they do indeed exist), and in particular the fact that it chews your RAM a lot more than it could or should, for a mechanism meant to increase caching efficiency. If their clientele is indeed similar, and if metaxels would be more efficient for metadata storage, then you might not need L2ARC with its overheads, or not as much of it, and get a clear win in system resource consumption ;)

> How often do you expect cache devices to fail?

From what I hear, life expectancy of today's consumer-scale devices under heavy writes is short (1-3 years) - and L2ARC would likely exceed a METAXEL's write rates, due to the need to write the same metadata into L2ARC time and again, if it were not for the special throttling that limits L2ARC write bandwidth.

> So to sum up, you're applying raid to something that doesn't
> need it.

Well, metadata is kinda important - though here we do add a third copy where we previously made do with two. And you're not "required" to mirror it. On the other hand, if a METAXEL is a top-level vdev without special resilience to its failure/absence, as described in my first post, then its failure would formally be considered a fatal situation and bring down the whole pool - unlike problems with L2ARC or ZIL devices, which can be ignored at the admin's discretion.

> And how is that different to having a cache-sizing policy
> which selects how much each data type get allocated from
> a single common cache?
...
> All of this can be solved by cache sizing policies and
> l2arc persistency.

Ultimately, I don't disagree with this point :) But I do think that this might not be the optimal solution in terms of RAM requirements, coding complexity, etc.
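To illustrate what I mean by a "specially biased read priority and write filter" - a very rough sketch, where every type and function name is a made-up placeholder for the idea, not an existing ZFS interface:

/*
 * Sketch of the intended metaxel bias on the write and read paths.
 * All types and names here are hypothetical placeholders.
 */
#include <stddef.h>

typedef struct blkptr_stub {
	int	bp_is_metadata;		/* would come from the block's DMU object type */
	int	bp_ndvas;		/* number of copies written */
} blkptr_stub_t;

typedef struct vdev_stub {
	int	vd_is_metaxel;		/* flag decoded from the pool's metaxel GUID list */
} vdev_stub_t;

/* Write filter: metadata gets an extra copy placed on a metaxel, data does not. */
static vdev_stub_t *
metaxel_pick_extra_copy(const blkptr_stub_t *bp, vdev_stub_t **vdevs, int nvdevs)
{
	if (!bp->bp_is_metadata)
		return (NULL);			/* ordinary data: normal allocation only */
	for (int i = 0; i < nvdevs; i++)
		if (vdevs[i]->vd_is_metaxel)
			return (vdevs[i]);	/* place the additional DVA here */
	return (NULL);
}

/* Read bias: among the block's copies, prefer the one sitting on a metaxel (SSD). */
static int
metaxel_pick_read_copy(const blkptr_stub_t *bp, vdev_stub_t **copy_vdevs)
{
	for (int i = 0; i < bp->bp_ndvas; i++)
		if (copy_vdevs[i]->vd_is_metaxel)
			return (i);
	return (0);				/* fall back to the usual copy selection */
}

Nothing else in the caching machinery would need to know about metaxels - the ARC keeps treating such blocks like any other pool block.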
If you want to store some data long-term, as is my desire for the metadata - ZFS has mechanisms for that in the form of normal VDEVs (or subclassing that into metaxels) ;)
> *) implement a new vdev type (mirrored or straight metaxel)
> *) integrate all format changes to labels to describe these
One idea in the proposal - though I don't insist on sticking to it - is that the metaxel's job is described in the pool metadata (i.e. a read-only attribute which can be set during tlvdev creation/addition - metaxels:list-of-guids). Until the pool is imported, a metaxel looks like a normal single-disk/mirrored tlvdev in a normal pool.

This approach can limit importability of a pool with failed metaxels, unless we expect that and try to make sense of the other pool devices - essentially until we can decipher the nvlist and see that the absent device is a metaxel, so the error is deemed not fatal. However, this also requires no label changes or other incompatible on-disk format changes, the way I see it. As long as the metaxel is not faulted, any other ZFS implementation (like grub or an older livecd) can import this pool and read 1/3 of the metadata faster, on average ;)
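A minimal sketch of that import-time decision, assuming the proposed "metaxels" list-of-guids attribute exists in the pool config; the nvlist lookup is the real libnvpair call, everything else is a hypothetical name:

/*
 * Sketch: a missing top-level vdev is only non-fatal if its GUID is
 * listed in the pool's proposed "metaxels" attribute.  The attribute
 * name and the surrounding flow are assumptions for illustration.
 */
#include <sys/types.h>
#include <libnvpair.h>

static boolean_t
vdev_guid_is_metaxel(nvlist_t *pool_config, uint64_t missing_guid)
{
	uint64_t *guids;
	uint_t nguids;

	/* "metaxels" is the proposed read-only list-of-guids attribute. */
	if (nvlist_lookup_uint64_array(pool_config, "metaxels",
	    &guids, &nguids) != 0)
		return (B_FALSE);		/* no metaxels declared in this pool */

	for (uint_t i = 0; i < nguids; i++)
		if (guids[i] == missing_guid)
			return (B_TRUE);	/* absence degrades performance, not the pool */
	return (B_FALSE);
}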
> As noted before, you'll have to go through the code to look for paths
> which fetch metadata (mostly the object layer) and replace those with
> metaxel-aware calls. That's a lot of work for a POC.
Alas, for some years now I'm a lot less of a programmer and a lot more of a brainstormer ;) Still, judging from whatever experience I have, a working POC with some corners cut might be a matter of a week or two of coding... just to see whether the expected benefits over L2ARC do exist. The full-scale thing, yes, might take months or years even from a team of programmers ;)

Thanks,
//Jim