On 21/07/16 13:41, Dave Chinner wrote:
> On Wed, Jul 20, 2016 at 09:40:06AM -0400, Paolo Bonzini wrote:
>>>> 1) is it expected that SEEK_HOLE skips unwritten extents?
>>>
>>> There are multiple answers to this, all of which are correct depending
>>> on current context and state:
>>>
>>> 1. No - some filesystems will report clean unwritten extents as holes.
>>>
>>> 2. Yes - some filesystems will report clean unwritten extents as data.
>>>
>>> 3. Maybe - if there is written data in memory over the unwritten
>>> extent on disk (i.e. hasn't been flushed to disk), it will be
>>> considered a data region with non-zero data. (FIEMAP will still
>>> report it as unwritten)
>>
>> Ok, I thought it would return FIEMAP_EXTENT_UNKNOWN|FIEMAP_EXTENT_DELALLOC
>> in this case (not FIEMAP_EXTENT_UNWRITTEN).
>
> No. FIEMAP only returns the known extent state at the given file
> offset. "delalloc" extents exist in memory, indicating the space
> has already been accounted for over that offset, but the extent has
> not been physically allocated. Like all other types of extents,
> there may or may not be valid data over a delayed allocation extent.
>
> IOWs, fiemap only gives you a snapshot of extent state, not the
> ranges of valid data in the file.
>
>>>> If not, would
>>>> it be acceptable to introduce Linux-specific SEEK_ZERO/SEEK_NONZERO, which
>>>> would be similar to what SEEK_HOLE/SEEK_DATA do now?
>>>
>>> To solve what problem? You haven't explained what problem you are
>>> trying to solve yet.
>>>
>>>> 2) for FIEMAP do we really need FIEMAP_FLAG_SYNC? And if not, for what
>>>> filesystems and kernel releases is it really not needed?
>>>
>>> I can't answer this question, either, because I don't know what
>>> you want the fiemap information for.
>>
>> The answer is the same no matter if we use both lseek and FIEMAP, so
>> I'll answer just once. We want to do two things:
>>
>> 1) avoid copying zero data, to keep the copy process efficient. For this,
>> SEEK_HOLE/SEEK_DATA are enough.
>>
>> 2) copy file contents while preserving the allocation state of the file's
>> extents.
>
> Which is /very difficult/ to do safely and reliably.
>
> We do actually do reliable, safe, exact hole and preallocation
> layout duplication with xfs_fsr, but that uses kernel provided
> cookies (from XFS_IOC_BULKSTAT) to detect that data in the source
> file has not changed while it was being copied before executing the
> final defrag operation in the kernel (XFS_IOC_SWAPEXT) that makes
> the new copy of the data user visible.
>
> i.e. the use of fiemap to duplicate the exact layout of a file
> from userspace is only possible if you can /guarantee/ the source
> file has not changed in any way during the copy operation at the
> point in time you finalise the destination data copy.
>
>> There can be various reasons why the user has preallocated the file
>> (because they don't want an ENOSPC to happen while the VM runs; on some
>> filesystems, to minimize cases where io_submit is very un-asynchronous;
>> or just because someone had a reason to do a BLKZEROOUT ioctl on the
>> virtual disk). We want to preserve these while converting or otherwise
>> moving the file around.
>
> Sure, there's many reasons for using prealloc/punch/zero. The real
> difference to other file operations is that they interface with low
> level filesystem structure, not the data contained within the
> extents. That's what makes them problematic for duplication -
> userspace cannot serialise against low level filesystem structure
> modifications.
>
> Optimising file copies safely is one of the reasons the
> copy_file_range() syscall has been introduced (in 4.5). While we
> haven't implemented anything special in XFS yet, it will internally
> use splice to do a zero-copy data transfer from source to
> destination file. Optimising for exact layout copies is precisely
> the sort of thing this syscall is intended for.
>
> It's also intended to enable applications to take advantage of
> hardware acceleration of data copying (e.g. server side copies to
> avoid round trips as has been implemented for NFS, or storage array
> offload of data copying) when such support is provided by the kernel.
>
> IOWs, I think you should be looking to optimise file copies by using
> copy_file_range() and getting filesystems to do exactly what you
> need. Using FIEMAP, fallocate and moving data through userspace
> won't ever be reliable without special filesystem help (that only
> exists for XFS right now), nor will it enable the application to
> transparently use smart storage protocols and hardware when it is
> present on user systems....
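
For reference, a minimal sketch of what a copy loop built on
copy_file_range() might look like. It is written in Python, whose
os.copy_file_range() wraps the syscall (Python 3.8+ on Linux 4.5+).
The name copy_file() and the set of errno values that trigger the
read()/write() fallback are assumptions made for illustration; they are
not taken from cp or from anything in this thread. Whether holes or
preallocation survive the copy is left entirely to the kernel and the
filesystem.

```python
import errno
import os

# Errors after which this sketch gives up on copy_file_range() and uses
# plain read()/write() instead (an assumed, not exhaustive, list).
FALLBACK_ERRNOS = {errno.ENOSYS, errno.EXDEV, errno.EINVAL, errno.EOPNOTSUPP}


def copy_file(src_path, dst_path, chunk=1 << 20):
    """Copy src_path to dst_path; return the number of bytes copied."""
    in_fd = os.open(src_path, os.O_RDONLY)
    try:
        out_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            use_cfr = hasattr(os, "copy_file_range")
            total = 0
            while True:
                if use_cfr:
                    try:
                        # Offsets omitted, so both file positions advance.
                        n = os.copy_file_range(in_fd, out_fd, chunk)
                    except OSError as e:
                        if e.errno not in FALLBACK_ERRNOS:
                            raise
                        use_cfr = False
                        continue
                else:
                    buf = os.read(in_fd, chunk)
                    n = len(buf)
                    view = memoryview(buf)
                    while view:  # os.write() may write less than asked
                        view = view[os.write(out_fd, view):]
                if n == 0:  # end of source file
                    return total
                total += n
        finally:
            os.close(out_fd)
    finally:
        os.close(in_fd)
```

Calling copy_file("disk.img", "disk-copy.img") (hypothetical file names)
returns the byte count. The fallback path expands holes, since it reads
and writes every byte.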
Yes, higher level calls are useful here, and we'll consider using them
in cp etc. When I previously looked at this I noticed that some
implementations would fall back to do_splice_direct(), which is
essentially sendfile(), and that expands holes, which wouldn't be a
good default. So there may be some need for control flags for
copy_file_range() to make it generally useful.

thanks for the info,
Pádraig.
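
To illustrate the hole-expansion concern, here is a rough sketch of a
userspace copy that skips holes with SEEK_DATA/SEEK_HOLE instead of
writing zeros. It is in Python, using os.lseek() with os.SEEK_DATA and
os.SEEK_HOLE; the names data_ranges() and sparse_copy() are hypothetical.
It only covers the "avoid copying zero data" case: it does not preserve
unwritten or preallocated extents, and it is not safe if the source file
changes during the copy.

```python
import errno
import os


def data_ranges(fd):
    """Yield (start, end) for each region that SEEK_DATA reports as data."""
    size = os.fstat(fd).st_size
    offset = 0
    while offset < size:
        try:
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:  # nothing but a hole up to EOF
                return
            raise
        # There is always an implicit hole at EOF, so this finds an end.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        yield start, end
        offset = end


def sparse_copy(src_path, dst_path, chunk=1 << 20):
    """Copy only the data regions of src_path, leaving holes in dst_path."""
    src = os.open(src_path, os.O_RDONLY)
    try:
        dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            for start, end in data_ranges(src):
                pos = start
                while pos < end:
                    buf = os.pread(src, min(chunk, end - pos), pos)
                    if not buf:  # source shrank underneath us
                        break
                    # Advance by what was actually written, so a short
                    # write is retried from the right offset.
                    pos += os.pwrite(dst, buf, pos)
            # Extend to the source size so a trailing hole is kept.
            os.ftruncate(dst, os.fstat(src).st_size)
        finally:
            os.close(dst)
    finally:
        os.close(src)
```

On filesystems without real SEEK_DATA support the kernel reports the
whole file as one data region, so this degrades to a plain full copy.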