Re: Fedora 33 System-Wide Change proposal: Make btrfs the default file system for desktop variants

Josef Bacik Fri, 03 Jul 2020 08:15:00 -0700

On 7/3/20 9:37 AM, Eric Sandeen wrote:

On 7/1/20 2:50 PM, Josef Bacik wrote:

On 7/1/20 2:24 PM, Matthew Miller wrote:

On Wed, Jul 01, 2020 at 06:54:02AM +0000, Zbigniew Jędrzejewski-Szmek wrote:

Making btrfs opt-in for F33 and (assuming the result go well) opt-out for F34
could be good option. I know technically it is already opt-in, but it's not
very visible or popular. We could make the btrfs option more prominent and
ask people to pick it if they are ready to handle potential fallout.


I'm leaning towards recommending this as well. I feel like we don't have
good data to make a decision on -- the work that Red Hat did previously when
making a decision was 1) years ago and 2) server-focused, and the Facebook
production usage is encouraging but also not the same use case. I'm
particularly concerned about metadata corruption fragility as noted in the
Usenix paper. (It'd be nice if we could do something about that!)


There's only so much we can do about this.  I've sent up patches to ignore 
failed global trees to allow users to more easily recover data in case of 
corruption in the case of global trees, but as they say if only 1 bit is off in 
a node, we throw the whole node away.  And throwing a node away means you lose 
access to any of its children, which could be a large chunk of the file system.

This sounds like a "wtf, why are you doing this btrfs?" sort of thing, but this 
is just the reality of using checksums.  It's a checksum, not ECC.  We don't know _which_ 
bits are fucked, we just know somethings fucked, so we throw it all away.  If you have 
RAID or DUP then we go read the other copy, and fix the broken copy if we find a good 
copy.  If we don't, well then there's nothing really we can do.


There is often a path forward when a bad metadata checksum is detected.
i.e. e2fsck:

scan_extent_node() {
...

         /* Failed csum but passes checks?  Ask to fix checksum. */
         if (failed_csum &&
             fix_problem(ctx, PR_1_EXTENT_ONLY_CSUM_INVALID, pctx)) {
                 pb->inode_modified = 1;
                 pctx->errcode = ext2fs_extent_replace(ehandle, 0, &extent);
                 if (pctx->errcode)
                         return;
         }

it does similarly for many types of metadata:

/* inode passes checks, but checksum does not match inode */
#define PR_1_INODE_ONLY_CSUM_INVALID            0x010068
--
/* Inode extent block passes checks, but checksum does not match extent */
#define PR_1_EXTENT_ONLY_CSUM_INVALID           0x01006A
--
/* Inode extended attribute block passes checks, but checksum does not
  * match block. */
#define PR_1_EA_BLOCK_ONLY_CSUM_INVALID         0x01006C
--
/* dir leaf node passes checks, but fails checksum */
#define PR_2_LEAF_NODE_ONLY_CSUM_INVALID        0x02004D

Does btrfsck really never attempt to salvage a metadata block with a bad CRC by
validating its fields?

No, I suppose we could, I'll add it to the list. Generally speaking if there'sa bad checksum detected we just attempt to recover based on what we couldn't getaccess to. However that's difficult if it's a node. If it's a leaf thenusually you just lose some metadata that can be inferred from other data. Forexample if you lose a leaf in the extent tree, well we can add all thatinformation back once we've scanned the rest of the file system and know whatextents are missing in the extent tree.

Same goes for directory items, we detect that we are missing directory items,but we have references for them and so we add the missing directory items thatwere lost from that corrupt block.

But again, if you lose a node you lose access to many leaves, which makes itmore likely we'll lose somehting because we'll lose the other information we canuse to recover what was lost. The extent tree and checksum trees are exceptionsto this, since they can be rebuilt from scratch, provided everything else is fine.

And then if we did decide to validate nodes, we _might_ be ok, but we might endup with old versions of leaves because it happens to point at something thatappears to be correct, but isn't really. Our metadata changes all the time, soit's not outside the realm of possiblities that the corruption points at aseemlingly valid piece of metadata, but isn't and thus makes us do something_really_ wrong. Thanks,


Josef
_______________________________________________
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org

Re: Fedora 33 System-Wide Change proposal: Make btrfs the default file system for desktop variants

Reply via email to