On May 9, 2016, at 12:46 PM, Valeri Galtsev <galt...@kicp.uchicago.edu> wrote:
> 
> On Mon, May 9, 2016 1:14 pm, Gordon Messmer wrote:
>> On 05/09/2016 11:01 AM, Valeri Galtsev wrote:
>>> Thanks Gordon! Yes, I know, ZFS, of course. I hear it as you definitely
>>> will use zfs for "bricks" of distributed file system, right?
>> 
>> I don't think its use case is limited to
>> that.  There aren't many spaces where I think you *shouldn't* plan to
>> use reliable filesystems (ZFS, btrfs, ReFS).
> 
> For distributed file system "brick" boxes ZFS
> (btrfs,...) may be a must, but only if distributed filesystem doesn't have
> its own mechanism ensuring file integrity, right?

No.  ZFS is superior to RAID in many respects, which makes it valuable for any 
situation where you care about data integrity, even on a desktop PC.

ObWarStory:

I have a ZFS pool on my desktop PC at home.  That pool is composed of two 
2-disk mirrors, which makes it kind of like RAID 10.  Each mirror is in a 
separate external Thunderbolt disk enclosure.  
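
(In zpool terms, that layout is just two mirror vdevs in one pool.  A rough
sketch of how such a pool gets built, with a made-up pool name and made-up
device names:

    zpool create tank mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde

ZFS stripes writes across the two mirrors, which is why it ends up behaving
like RAID 10.)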

One of those enclosures started to fail, so I removed both of the raw drives 
and put them into some cheap USB single-drive enclosures I had on hand.

That’s lesson #1: ZFS doesn’t care which bus or controller your drives are on.  
It doesn’t even care about the OS type or CPU type.  As long as the target 
system supports the feature flags enabled on the pool the drives came from, ZFS 
will happily import the pool from them.
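
(The move itself is anticlimactic.  Roughly, with "tank" standing in for the
real pool name:

    zpool export tank
    # ...physically move the drives to the new enclosures...
    zpool import tank

ZFS finds its member disks by the labels it wrote on them, not by the device
path or controller they happen to show up behind.)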

Because the failing enclosure made me nervous about the state of the data on 
the raw drives, I ran a ZFS scrub operation.  This is similar to the “verify” 
feature of a good hardware RAID controller, except that because ZFS adds a 
cryptographically strong hash to every stripe it writes, it can detect every 
problem a hardware RAID controller can, plus several others.
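
(Kicking one off is a one-liner; again, "tank" is a placeholder for your pool
name:

    zpool scrub tank
    zpool status tank     # reports scrub progress and any errors found so far

The scrub runs in the background, so the pool stays usable while it churns.)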

That’s lesson #2: ZFS scrub beats the pants off RAID verify.  This has nothing 
to do with distributed storage or bricks or anything else.  It is purely about 
data integrity.  Of all the storage you manage, for what percentage of it do 
you not care about data integrity?

Hours into that scrub operation, I started to see errors!  Glad I scrubbed the 
pool, right?  But not to worry: ZFS fixed each error as it found it, and thanks 
to the cryptographically strong hash on every block, I didn’t have to wonder 
whether other errors were slipping past undetected.  Not only that, ZFS told me 
which files the errors affected, so I 
could test those files at the userspace level, to make sure ZFS’s repairs did 
the right thing.

That’s lesson #3: integrating your disk redundancy and checksumming with the 
filesystem has tangible benefits.  Hardware RAID controllers can’t tell you 
which files a given stripe belongs to.
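
(The file list comes from the verbose status output; "tank" is a placeholder
as before:

    zpool status -v tank

The -v flag prints verbose data-error information, including the paths of
affected files where ZFS can resolve them, which is the sort of output that
lets you go re-check those files from userspace.)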

A few hours further along, the scrub operation’s error counts started spiking.  
A lot.  Like millions of errors.  Was the hard drive dying?  No, it turned out 
to be one of the USB disk enclosures.  (Yes, it was a bad week at Warren Young 
Galactic HQ.  A certain number of heart palpitations occurred.)  Was the scrub 
operation scribbling all over my disks?

No, and that’s lesson #4: A hardware RAID controller will refuse to return bad 
blocks in the middle of a file, but if ZFS encounters an unrecoverable error in 
the middle of a file, that file simply disappears from the filesystem.  (To 
those who think half a file is better than no file, that’s what backups are 
for.)  If ZFS lets you open a file it has reported errors on, it’s fixed the 
problem already.  You don’t have to verify the file byte-by-byte because ZFS 
scrub already did that.

After all this agita, I bought a new 2-disk enclosure with new disks and added 
those disks to the failing mirror, temporarily turning the 2-way mirror into a 
4-way mirror.  This let me replicate the failing disks onto the fresh disks 
safely: I knew that if ZFS finished resilvering that mirror, I could drop the 
original pair of drives knowing, in a cryptographically strong way, that the 
new disks held an *exact* copy of the original disks’ data, even if one old 
drive or the other failed to return correct data.
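
(Growing the mirror is just one attach per new disk.  A sketch with
placeholder pool and device names:

    zpool attach tank old-disk1 new-disk1
    zpool attach tank old-disk2 new-disk2
    zpool status tank    # watch the resilver progress

Each attach adds another side to the existing mirror vdev and starts
resilvering onto the new disk, so redundancy only ever goes up.)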

That’s lesson #5: A typical hardware RAID version of that scheme would use a 
2-disk controller, which means you’d have to swap out one of the disks for a 
fresh one, temporarily dropping to zero redundancy.  The flexibility to add 
disks to a pool independent of their physical connection meant I never lost 
any redundancy.  Even in the worst possible case, with half the stripes on 
each old disk bad, as long as the bad stripes on the two disks didn’t overlap 
(that is, the good halves zippered together), I could still recover every 
stripe during the resilver operation.

After resilvering the problem mirror, I dropped the two original disks out of 
the pool, returning the vdev to a 2-way mirror.  A subsequent scrub turned up 
*zero* problems.
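
(The teardown is the mirror image of the attach, again with placeholder
names:

    zpool detach tank old-disk1
    zpool detach tank old-disk2
    zpool scrub tank

Detach only ever removes one side of a mirror, so at no point during the
swap did the pool lose its last good copy of anything.)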

And that’s lesson #6: even in the face of failing hardware, ZFS will often keep 
your data safe long enough for you to migrate the data.  It doesn’t kick whole 
drives out of the pool at the first hint of a problem.  It will keep trying and 
trying.