Jonathan Wheeler wrote:
> Thanks for the information, I'm learning quite a lot from all this.
>
> It seems to me that zfs send *should* be doing some kind of verification,
> since some work has clearly been put into zfs so that ZFS filesystems can
> be dumped into files/pipes. It's a great feature to have, and I can't
> believe that this was purely for zfs send | zfs receive scenarios.
>
zfs send/receive is not a backup solution because it does not have the
features generally expected in a backup solution. It is a very low-level
method of replicating dataset structure. If you find documentation to the
contrary that was created after CR 6399918 was integrated, then please file
a new bug.
http://bugs.opensolaris.org/view_bug.do?bug_id=6399918

> A common example used all over the place is zfs send | ssh $host. In these
> examples is ssh guaranteeing the data delivery somehow? If not, there need
> to be some serious asterisks in these guides!
>
In this case, the receive does checks and will fail when the checks do not
pass; the send can then be restarted. ssh encrypts the stream, and encrypted
transports tend to be more robust because corruption will usually cause the
decryption (including the surrounding checksum checks) to fail. If you save
the contents of the pipe somewhere instead, then you are at the mercy of the
robustness of wherever the stream is stored. However, there is more that can
be done here, both inside and outside of ZFS. For inside ZFS, I have filed
an RFE: CR 6736837, improve send/receive fault tolerance. To be effective,
though, we really need a better understanding of the failures we expect to
encounter. As an interim step, know that a send will create the same stream
every time because it is sending a stable set of data. You can therefore
send to files twice, on diverse storage, and compare the resulting files.
In other words, the flexibility of UNIX pipes is exposed by zfs
send/receive.
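For example, something along these lines (just a sketch -- the pool,
dataset, snapshot, and file names below are placeholders, not anything from
your setup):

  # write the same snapshot's stream to two files on different storage
  zfs send tank/home@2008-09-12 > /mnt/diskA/home-2008-09-12.zfs
  zfs send tank/home@2008-09-12 > /mnt/diskB/home-2008-09-12.zfs

  # the two streams should be byte-for-byte identical; any difference
  # means at least one copy was damaged on its way to disk
  cmp /mnt/diskA/home-2008-09-12.zfs /mnt/diskB/home-2008-09-12.zfs \
      && echo "streams match"

You can also record a checksum of the stream at the time you write it and
verify the file against that checksum before you ever try to receive it.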
> Looking at this at a level that I do understand, it's going via TCP, which
> checksums packets..... then again, I was using nfs over TCP, and look
> where I am today. So much for that!
>
I do not think you will be able to identify the root cause of your
corruption -- there are far too many moving parts and you do not have a
known-good reference :-(.

> As I google these subjects more and more, I fear that I'm hitting the
> conceptual mental block that many before me have hit also. zfs send is not
> zfsdump, even though it sure looks the same, and it's not clearly stated
> that you may end up in a situation like the one I'm in today if you don't
> somehow test your backups.
>
Correct, though this applies to every backup method, in general. One backup
method I use (I use several ;-) is send/receive to a removable disk,
usually a USB disk. I can then set up compression and redundancy policies
for the USB disk and also periodically scrub it to test retention. This
also offers the ability to go back to any snapshot in a matter of minutes,
even though I store the USB disk in a fire safe. Another benefit of this
method is that I can easily verify the media -- I was once a user of 8mm
tape drives, so I've got several scars from the inability to recover data
from tapes (they had a nasty habit of writing tapes that couldn't be read
by other 8mm drives, so if you had to repair your drive (likely), then you
might not be able to read your tapes).
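Roughly, that setup looks like the following (again just a sketch -- the
device name, pool name, dataset, and snapshot names are placeholders, and
compression=on / copies=2 are my own policy choices, not requirements):

  # one-time setup: a single-disk pool on the USB drive, with compression
  # and two copies of every block for some on-media redundancy
  zpool create usbbackup c5t0d0
  zfs set compression=on usbbackup
  zfs set copies=2 usbbackup

  # first backup: full send of a snapshot into the USB pool
  zfs snapshot tank/home@2008-09-12
  zfs send tank/home@2008-09-12 | zfs receive usbbackup/home

  # later backups: incremental sends between successive snapshots
  # (-F rolls the target back in case it changed since the last receive)
  zfs snapshot tank/home@2008-09-19
  zfs send -i @2008-09-12 tank/home@2008-09-19 | zfs receive -F usbbackup/home

  # periodically: scrub the USB pool to verify every block on the media
  zpool scrub usbbackup
  zpool status usbbackup

A scrub reads back every allocated block and verifies its checksum, so bad
media shows up long before the backup is actually needed.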
> As you've rightly pointed out, it's done now, and even if I did manage to
> reproduce this again, that won't help my data locked away in these 2 .zfs
> files. So, focusing on the hopeful: is there anything I can do to recover
> my data from these zfs dumps? Anything at all :)
>
I filed RFE CR 6736794, option for partial zfs receives. But I'm not
confident that it can be implemented easily or quickly.

> If the problem is "just" that "zfs receive" is checksumming the data on
> the way in, can I disable this somehow within zfs? Can I globally disable
> checksumming in the kernel module? mdb something or other?
>
> I read this thread where someone did successfully manage to recover data
> from a damaged zfs, which fills me with some hope:
> http://www.opensolaris.org/jive/thread.jspa?messageID=220125
>
> It's way over my head, but if anyone can tell me the mdb commands I'm
> happy to try them, even if they do kill my cat. I don't really have
> anything to lose with a copy of the data, and I'll do it all in a VM
> anyway.
>
With mdb and the source, all things are possible. But I'll have to defer to
someone who uses mdb more frequently than I do.
 -- richard