Erik Trimble wrote:
 On 9/22/2010 11:15 AM, Markus Kovero wrote:
Such a configuration was known to cause deadlocks. Even if it works now (which I don't expect to be the case), it will cause your data to be cached twice. The CPU utilization will also be much higher, etc.
All in all I strongly recommend against such setup.
--
Pawel Jakub Dawidek                       http://www.wheelsystems.com
p...@freebsd.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
Well, CPU utilization can be tuned down by disabling checksums in the inner pools, since checksumming is already done in the main pool. I'd be interested in bug IDs for the deadlock issues and anything related. Caching twice is not an issue; prefetching could be, and it can be disabled. I don't understand what makes it difficult for ZFS to handle this kind of setup. The main pool (testpool) should just allow any writes/reads to/from the volume, not caring what they are, whereas anotherpool would just work like any other pool built from any other devices. This is quite similar to an iSCSI-replicated mirror pool, where you have a redundant pool created from local and remote iSCSI volumes.

Yours
Markus Kovero
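
For concreteness, a minimal sketch of the nested setup being discussed, using the pool names from above (the zvol name, size, and device path are made up for illustration; the path shown is the Solaris one, while FreeBSD exposes zvols under /dev/zvol/<pool>/<volume>):

    # carve a volume out of the main pool (name and size are illustrative)
    zfs create -V 100G testpool/nestedvol

    # build the inner pool on top of that zvol
    zpool create anotherpool /dev/zvol/dsk/testpool/nestedvol

    # the tuning Markus mentions: no checksums on the inner pool,
    # since the outer pool already checksums the zvol's blocks
    zfs set checksum=off anotherpool

Note that prefetch, unlike checksums, is not a per-dataset property; on OpenSolaris it is a kernel tunable (zfs:zfs_prefetch_disable in /etc/system), so disabling it affects every pool on the host, not just the nested one.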

Actually, the mechanics of local pools nested inside local pools are significantly different from building a local pool out of remote volumes (potentially exported ZFS volumes).

And, no, you WOULDN'T want to turn off the "inside" pool's checksums. You're assuming the outside pool would take care of that, but that's a faulty assumption: it could only hold if the pools somehow understood they were being nested, and thus could "bypass" much of the caching and I/O infrastructure tied to the inner pool.

What is an example of where a checksummed outside pool would not be able to protect a non-checksummed inside pool? Would an intermittent RAM/motherboard/CPU failure that only corrupted the inner pool's block before it was passed to the outer pool (and did not corrupt the outer pool's block) be a valid example?

If checksums are desirable in this scenario, then redundancy would also be needed to recover from checksum failures.
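
One hedged illustration of what that redundancy could look like (volume names and sizes are made up): either build the inner pool as a mirror of two zvols from the outer pool, or keep ditto copies of every block so the inner pool can self-heal a failed checksum:

    # option 1: mirror two zvols carved from the outer pool
    zfs create -V 100G testpool/vol-a
    zfs create -V 100G testpool/vol-b
    zpool create anotherpool mirror \
        /dev/zvol/dsk/testpool/vol-a /dev/zvol/dsk/testpool/vol-b

    # option 2: single-zvol inner pool, but two copies of every block
    zfs set copies=2 anotherpool

Either way the data is stored twice in the outer pool, which is the price of being able to repair, rather than merely detect, a bad block in the inner pool.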



Pools understanding nesting would be a win. Another win that could build on such a pool-to-pool communication interface would be a ZFS client (shim? driver?) that extends ZFS checksum protection all the way out across the network to the workstations accessing ZFS pools. ZFS offers no protection against corruption between the CIFS/NFS server and the CIFS/NFS client. (In the current structure, the client would need to mount the pool directly.)

----
To quote myself from May 2010:

If someone wrote a "ZFS client", it would be possible to get over-the-wire data protection, continuous from the client computer all the way to the storage device. Right now there is data protection only from the server to the storage device. The best-protected apps are those running on the same server that has mounted the ZFS pool containing the data they need (in which case they are protected by ZFS checksums and by ECC RAM, if present).

A "ZFS client" would run on the computer connecting to the ZFS server, in order to extend ZFS's protection and detection out across the network.

In one model, the ZFS client could be a proxy for communication between the client and the server running ZFS. It would extend the filesystem checksumming across the network: verifying checksums locally as data was received, and calculating checksums locally before data was sent so the server could re-check them. Recoverable checksum failures would be transparent except for the performance loss; unrecoverable failures would be reported using the OS's standard message for unrecoverable data (Windows has one that it uses for bad sectors on drives and optical media). The local client checksum calculations would also be useful in detecting network failures and local hardware instability: if most or all clients start seeing checksum failures, look at the network; if only one client sees them, check that client's hardware.
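
No such client exists today, but the gap it would close can be approximated by hand with application-level checksums (the file and mount paths below are invented; use digest -a sha256 on Solaris in place of GNU sha256sum):

    # on the ZFS server, against the copy sitting in the pool
    sha256sum /tank/export/report.dat

    # on the workstation, against the same file read through the NFS/CIFS mount
    sha256sum /mnt/fileserver/report.dat

If the two digests differ while a zpool scrub on the server reports no errors, the corruption happened somewhere between the server and the client, which is exactly the stretch the proposed ZFS client would cover.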

An extension to the ZFS client model would allow multi-level ZFS systems to better coordinate their protection and recover from more scenarios. By multi-level ZFS, I mean ZFS stacked on ZFS, say via iSCSI. An example (I'm sure there are better ones): 3 servers, each with 3 data disks. Each disk is made into its own non-redundant pool (giving 9 non-redundant pools), and these pools are in turn shared via iSCSI. One of the servers then creates RAIDZ1 groups using one disk from each of the 3 servers. With a means for the ZFS systems to communicate, the failure of a non-redundant lower-level device need not trigger a system halt on the lower system, because it would know from the higher-level system that the device can be repaired/replaced using the higher-level redundancy.
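
A hedged command-level sketch of that topology (disk, pool, and LUN names are invented; the iSCSI export step is elided because it differs between the old shareiscsi property and COMSTAR):

    # on each of the three storage servers: one non-redundant pool per disk,
    # each backing a zvol that gets exported as an iSCSI LUN
    zpool create p1 c1t0d0
    zpool create p2 c1t1d0
    zpool create p3 c1t2d0
    zfs create -V 450G p1/lun
    zfs create -V 450G p2/lun
    zfs create -V 450G p3/lun
    # ...export p1/lun, p2/lun and p3/lun over iSCSI...

    # on the aggregating server, once the initiator sees all nine LUNs
    # (device names again invented): three RAIDZ1 groups, one LUN per server in each
    zpool create tank \
        raidz1 c2t1d0 c3t1d0 c4t1d0 \
        raidz1 c2t2d0 c3t2d0 c4t2d0 \
        raidz1 c2t3d0 c3t3d0 c4t3d0

With today's code, a lower-level pool that loses its only disk simply faults (or panics, depending on its failmode setting); the point of the proposed coordination is that it could instead learn from the upper level that the data is reconstructable there.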

A key to making this happen is an interface to request a block and its related checksum (or if speaking of CIFS, to request a file, its related blocks, and their checksums.)
----




The ability to grow/shrink RAIDZ by adding and removing devices is still more important, and so is the ability to rebalance pools when a pool is grown.





Caching is also a huge issue, since ZFS isn't known for being memory-slim, and as caching is done (currently) on a per-pool level, nested pools will consume significantly more RAM.
This tells me that nesting itself isn't a cause for additional RAM consumption. The number of pools is the cause. Minimize the number of pools to minimize RAM consumption.

Without caching the inner pool, performance is going to suck (even if some blocks are cached in the outer pool, that pool has no way to do look-ahead or take other actions). The nature of delayed writes can also wreak havoc with caching at both pool levels.
What about not caching the outer pool? Then we could view the inner pool as using a (now larger) cache to make up for a 'big, slow storage' device. The inner pool knows which files are being used, so it can do the look-ahead.
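
That split is roughly expressible with the existing per-dataset cache properties, at least at dataset granularity (the zvol name is the illustrative one from the sketch earlier in the thread):

    # let the outer pool keep only metadata for the zvol backing the inner pool,
    # so data blocks are cached once, by the inner pool that actually knows the files
    zfs set primarycache=metadata testpool/nestedvol
    zfs set secondarycache=none testpool/nestedvol

Whether the inner pool's own prefetch then behaves well on top of an uncached zvol is exactly the open question in this thread.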

Stupid filesystems have no issues with nesting, as they're not doing anything besides (essentially) direct I/O to the underlying devices. UFS doesn't have its own I/O subsystem, nor do things like ext* or xfs. However, I've yet to see any "modern" filesystem do well with nesting itself - there's simply too much going on under the hood, and without being "nested-aware" (i.e. specifically coding the filesystem to understand when it's being nested), many of these back-end optimizations are a recipe for conflict.


Sounds like tunneling TCP over TCP, vs. TCP over UDP. In the former case, optimizations and retries on errors can quickly degrade performance. In the latter, the lower layer doesn't try to maintain integrity and instead leaves that job to the application.

TCP over TCP:  ZFS over ZFS
TCP over UDP:  ZFS over UFS
UDP over UDP:  UFS over UFS



