On Jan 20, 2018, at 5:06 PM, Mark H Weaver <m...@netris.org> wrote:
>
> Andreas Dilger <adil...@dilger.ca> writes:
>
>> On Jan 10, 2018, at 4:50 AM, Pavel Raiskup <prais...@redhat.com> wrote:
>>>
>>> On Wednesday, January 10, 2018 3:42:52 AM CET Mark H Weaver wrote:
>>>> From da922703282b0d3b8837a99a9c7fdd32f1d20d49 Mon Sep 17 00:00:00 2001
>>>> From: Mark H Weaver <m...@netris.org>
>>>> Date: Tue, 9 Jan 2018 20:16:14 -0500
>>>> Subject: [PATCH] Remove nonportable check for files containing only zeroes.
>>>>
>>>> This check benefitted only one unlikely case (large files containing
>>>> only zeroes, on systems that do not support SEEK_HOLE)
>>>
>>> It drops the optimization even for situations when SEEK_HOLE is not
>>> available, which is not 100% necessary.  I'm not proposing doing otherwise
>>> (I actually proposed this in [1]), but I'm rather CCing Andreas once more,
>>> as that's the original requester; the use cases with Lustre were
>>> definitely not unlikely, and the question is whether SEEK_HOLE covers
>>> them nowadays.
>>>
>>> [1] https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00017.html
>>
>> Sorry for the late reply on this thread.
>>
>> It should be noted that the real problem was not related to backing
>> up files at the Lustre level, but rather backing up files directly from
>> the backing ext4 filesystem of the metadata target, for example when
>> migrating to new hardware, or for backup/restore of only that target.
>> The MDT stored the actual file size in the metadata inode (which could
>> be GB or TB in size), but the file data was stored on data servers on
>> other nodes.
>>
>> This meant that using the old tar versions to do the MDT backup might
>> take days at 100% CPU just to write sparse files to the tarball.
>> If tar is now using SEEK_HOLE to determine sparseness, then the
>> ext4-level backups with newer systems should work OK (SEEK_HOLE was
>> added to ext4 in the 3.0 kernel, and was improved in 3.7, though a data
>> consistency bug with unwritten data was just fixed in 4.12).
>>
>> That means SEEK_HOLE is NOT available in RHEL 6.x kernels, which are
>> still in fairly widespread (though declining) use.
>
> Ah, I see, thank you for this explanation.  So this corner case is not
> as unlikely as I supposed.
>
>> I'd prefer that the heuristic for sparse files without SEEK_HOLE not
>> be removed completely, but I do think that it needs to be fixed for
>> the small inline file and file-in-cache cases.
>
> I appreciate your concerns.  Unfortunately, as I explain below, this
> heuristic is based on a false assumption, even for very large files, and
> leads to data loss on Btrfs.
>
> If we are to continue using this heuristic at all, we will need a way to
> decide when it can be used safely.  I don't know of a good way to do
> this, but I'm open to suggestions.
>
> It makes me wonder how many RHEL 6.x users will use GNU tar 1.31 on
> their otherwise very old enterprise systems to back up their disks.
> I would expect those users to use their enterprise-grade old 'tar'.
>
>>>> and was based on an assumption about file system behavior that is not
>>>> mandated by POSIX and no longer holds in practice, namely that for
>>>> sufficiently large files, (st_blocks == 0) implies that the file
>>>> contains only zeroes.  Examples of file systems that violate this
>>>> assumption include Linux's /proc file system and Btrfs.
>>
>> Is that comment correct, namely that Btrfs has "large" files that
>> report st_blocks == 0 even though they contain data?
>
> Yes, on Btrfs I reliably see (st_blocks == 0) on a recently written,
> mostly sparse file with size > 8G, using linux-libre-4.14.14.
> More specifically, the "storing sparse files > 8G" test in tar's test
> suite reliably fails on my system:
>
>   140: storing sparse files > 8G                   FAILED (sparse03.at:29)
I'd consider this a bug in Btrfs.  As mentioned previously, we had the same
problem with ext4 (twice) and Lustre, and in both cases fixed this by adding
in the (dirty page cache pages / 512) if the current block count is zero:

int ext4_file_getattr(const struct path *path, struct kstat *stat,
		      u32 request_mask, unsigned int query_flags)
{
	struct inode *inode = d_inode(path->dentry);
	u64 delalloc_blocks;

	ext4_getattr(path, stat, request_mask, query_flags);

	/*
	 * If there is inline data in the inode, the inode will normally not
	 * have data blocks allocated (it may have an external xattr block).
	 * Report at least one sector for such files, so tools like tar, rsync,
	 * others don't incorrectly think the file is completely sparse.
	 */
	if (unlikely(ext4_has_inline_data(inode)))
		stat->blocks += (stat->size + 511) >> 9;

	/*
	 * We can't update i_blocks if the block allocation is delayed
	 * otherwise in the case of system crash before the real block
	 * allocation is done, we will have i_blocks inconsistent with
	 * on-disk file blocks.
	 * We always keep i_blocks updated together with real
	 * allocation.  But to not confuse with user, stat
	 * will return the blocks that include the delayed allocation
	 * blocks for this file.
	 */
	delalloc_blocks = EXT4_C2B(EXT4_SB(inode->i_sb),
				   EXT4_I(inode)->i_reserved_data_blocks);
	stat->blocks += delalloc_blocks << (inode->i_sb->s_blocksize_bits - 9);
	return 0;
}

static int vvp_object_glimpse(const struct lu_env *env,
			      const struct cl_object *obj, struct ost_lvb *lvb)
{
	struct inode *inode = vvp_object_inode(obj);

	ENTRY;
	lvb->lvb_mtime = LTIME_S(inode->i_mtime);
	lvb->lvb_atime = LTIME_S(inode->i_atime);
	lvb->lvb_ctime = LTIME_S(inode->i_ctime);

	/*
	 * LU-417: Add dirty pages block count lest i_blocks reports 0, some
	 * "cp" or "tar" on remote node may think it's a completely sparse file
	 * and skip it.
	 */
	if (lvb->lvb_size > 0 && lvb->lvb_blocks == 0)
		lvb->lvb_blocks = dirty_cnt(inode);

	RETURN(0);
}

Cheers, Andreas