Steve McKinty wrote:
> 1) First issue relates to the überblock. Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity. That could in theory result in a corrupt überblock at the secondary. Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently?

Yes, ZFS uberblocks are self-checksummed with SHA-256, and when opening the pool ZFS uses the latest valid uberblock it can find. So that is not a problem.

> 2) Assuming that the replication maintains write-ordering, the secondary site will always have valid and self-consistent data, although it may be out-of-date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc.
>
> Normally most replication systems do maintain write ordering, *except* for one specific scenario. If the replication is interrupted, for example because the secondary site is down or unreachable due to a comms problem, the primary site will keep a list of changed blocks. When contact between the sites is re-established there will be a period of 'catch-up' resynchronization. In most, if not all, cases this is done on a simple block-order basis. Write-ordering is lost until the two sites are once again in sync and routine replication restarts.
>
> I can see this as having a major ZFS impact. It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied. This breaks the assumption that the on-disk format is always self-consistent.
>
> If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?

I believe your understanding is correct. If you expect such a double failure, you cannot rely on being able to recover your pool at the secondary site. The newest uberblocks would be among the first blocks to be replicated (2 of the uberblock arrays are located at the start of the vdev), and your whole block tree might be inaccessible if the latest Meta Object Set blocks were not also replicated. You might be lucky and be able to mount your filesystems, because ZFS keeps 3 separate copies of the most important metadata and tries to keep the copies about 1/8th of the disk apart, but even then I wouldn't count on it.

If ZFS can't open the pool due to this kind of corruption, you would get the following message:

  status: The pool metadata is corrupted and the pool cannot be opened.
  action: Destroy and re-create the pool from a backup source.

At this point, you could try zeroing out the first 2 uberblock arrays so that ZFS tries using an older uberblock from the last 2 arrays, but this might not work. As the message says, the only reliable way to recover from this is restoring your pool from backups.
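To make the uberblock-selection point concrete, here is a rough Python sketch (not an official tool) that scans the four vdev labels of a single device and reports the newest uberblock it can parse. The offsets follow my reading of the on-disk format (256K labels, two at the front of the vdev and two at the end, each with a 128K array of 1K uberblock slots starting 128K into the label); real ZFS also verifies each uberblock's SHA-256 self-checksum and the vdev guid sum before trusting it, which this sketch skips, and the device path below is just an example.

#!/usr/bin/env python3
# Rough sketch: scan the four ZFS vdev labels for uberblocks and report the
# newest one, to illustrate "use the latest valid uberblock you can find".
# Simplified: no SHA-256 self-checksum or guid-sum verification is done here.
import os
import struct
import sys

LABEL_SIZE = 256 * 1024          # each of the 4 labels is 256 KiB
UB_ARRAY_OFFSET = 128 * 1024     # uberblock array starts 128 KiB into a label
UB_SLOT_SIZE = 1024              # 128 slots of 1 KiB each
UB_MAGIC = 0x00BAB10C            # uberblock magic number

def label_offsets(dev_size):
    # Labels 0 and 1 sit at the start of the vdev, labels 2 and 3 at the end.
    return [0, LABEL_SIZE, dev_size - 2 * LABEL_SIZE, dev_size - LABEL_SIZE]

def scan(path):
    best = None
    with open(path, 'rb') as dev:
        dev_size = dev.seek(0, os.SEEK_END)
        for lbl, lbl_off in enumerate(label_offsets(dev_size)):
            for slot in range(128):
                dev.seek(lbl_off + UB_ARRAY_OFFSET + slot * UB_SLOT_SIZE)
                buf = dev.read(UB_SLOT_SIZE)
                # An uberblock starts with: magic, version, txg, guid_sum,
                # timestamp (all 64-bit), written in the host's byte order.
                for endian in ('<', '>'):
                    magic, version, txg, guid_sum, timestamp = \
                        struct.unpack(endian + '5Q', buf[:40])
                    if magic == UB_MAGIC:
                        if best is None or txg > best[0]:
                            best = (txg, timestamp, version, lbl, slot)
                        break
    return best

if __name__ == '__main__':
    # e.g. a file-backed vdev, or something like /dev/rdsk/c1t0d0s0
    found = scan(sys.argv[1])
    if found:
        txg, ts, ver, lbl, slot = found
        print("newest uberblock: txg=%d version=%d timestamp=%d "
              "(label %d, slot %d)" % (txg, ver, ts, lbl, slot))
    else:
        print("no uberblocks found")

In the block-order resync scenario above, the two labels at the start of the vdev can end up holding newer uberblocks than the two at the end, pointing at Meta Object Set blocks that haven't been copied yet; that is why zeroing the first two arrays, so the selection falls back to the older uberblocks in the last two labels, is worth a try but not guaranteed to work.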
> There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting resync, as a fallback. This isn't always easy to do, especially since the resync is usually automatic; there is no clear trigger to use for the snapshot. It may also be difficult to synchronize the snapshots of all LUNs in a pool.
>
> I'd like to better understand the risks/behaviour of ZFS before starting to work on mitigation strategies.

If the replication process was interrupted for a sufficiently long time and disaster strikes at the primary site *during resync*, I don't think snapshots would save you even if you had taken them at the right time. Snapshots might increase your chances of recovery (by making ZFS not free and reuse blocks), but AFAIK there wouldn't be any guarantee that you'd be able to recover anything whatsoever, since the most important pool metadata is not part of the snapshots.

Regards,
Ricardo
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss