On Sun, Jun 30, 2024 at 5:31 PM Nir Soffer <[email protected]> wrote:
>
> I found a strange behavior in qemu-img map - zero/data status depends on page
> cache content. It looks like a kernel issue since qemu-img map is using
> SEEK_HOLE/DATA (block/file-posix.c line 3111).
>
> Tested with latest qemu on kernel 6.9.6-100.fc39.x86_64. I see similar
> behavior in xfs and ext4 filesystems.
>
> After creating an allocated image:
>
>     # qemu-img create -f raw -o preallocation=falloc falloc.img 1g
>     Formatting 'falloc.img', fmt=raw size=1073741824 preallocation=falloc
>
> qemu-img map reports the image as sparse (except the first block, which we
> fully allocate):
>
>     # qemu-img map --output json falloc.img
>     [{ "start": 0, "length": 4096, "depth": 0, "present": true, "zero": false, "data": true, "offset": 0},
>     { "start": 4096, "length": 1073737728, "depth": 0, "present": true, "zero": true, "data": false, "offset": 4096}]
>
> This is good for copy or read performance, since we can skip reading the areas
> with data=false, but it is bad for correctness: we cannot preserve the
> allocation of the entire image, since it looks like a sparse image:
>
>     # qemu-img create -f raw sparse.img 1g
>     Formatting 'sparse.img', fmt=raw size=1073741824
>
>     # qemu-img map --output json sparse.img
>     [{ "start": 0, "length": 4096, "depth": 0, "present": true, "zero": false, "data": true, "offset": 0},
>     { "start": 4096, "length": 1073737728, "depth": 0, "present": true, "zero": true, "data": false, "offset": 4096}]
>
> But look what happens when we get some of the image into the page cache:
>
>     # dd if=falloc.img bs=1M count=512 of=/dev/null
>
>     # qemu-img map --output json falloc.img
>     [{ "start": 0, "length": 544210944, "depth": 0, "present": true, "zero": false, "data": true, "offset": 0},
>     { "start": 544210944, "length": 529530880, "depth": 0, "present": true, "zero": true, "data": false, "offset": 544210944}]
>
> Now half of the image is reported as data=true and half as data=false. If we
> read the entire image, all of it is reported as data=true:
>
>     # dd if=falloc.img bs=1M count=1024 of=/dev/null
>
>     # qemu-img map --output json falloc.img
>     [{ "start": 0, "length": 1073741824, "depth": 0, "present": true, "zero": false, "data": true, "offset": 0}]
>
> If we drop caches, the image goes back to the initial state (almost):
>
>     # sync; echo 1 > /proc/sys/vm/drop_caches
>
>     # qemu-img map --output json falloc.img
>     [{ "start": 0, "length": 16384, "depth": 0, "present": true, "zero": false, "data": true, "offset": 0},
>     { "start": 16384, "length": 1073725440, "depth": 0, "present": true, "zero": true, "data": false, "offset": 16384}]
>
> Based on lseek(2), the file system can do anything, but the page cache is not
> mentioned as something that may affect the result of the call:
>
>     Seeking file data and holes
>     Since Linux 3.1, Linux supports the following additional values for
>     whence:
>
>     SEEK_DATA
>         Adjust the file offset to the next location in the file greater
>         than or equal to offset containing data. If offset points to
>         data, then the file offset is set to offset.
>
>     SEEK_HOLE
>         Adjust the file offset to the next hole in the file greater than
>         or equal to offset. If offset points into the middle of a hole,
>         then the file offset is set to offset. If there is no hole past
>         offset, then the file offset is adjusted to the end of the file
>         (i.e., there is an implicit hole at the end of any file).
>
>     In both of the above cases, lseek() fails if offset points past the end
>     of the file.
>
>     These operations allow applications to map holes in a sparsely
>     allocated file. This can be useful for applications such as file backup
>     tools, which can save space when creating backups and preserve holes,
>     if they have a mechanism for discovering holes.
>
>     For the purposes of these operations, a hole is a sequence of zeros
>     that (normally) has not been allocated in the underlying file storage.
>     However, a filesystem is not obliged to report holes, so these
>     operations are not a guaranteed mechanism for mapping the storage space
>     actually allocated to a file. (Furthermore, a sequence of zeros that
>     actually has been written to the underlying storage may not be reported
>     as a hole.) In the simplest implementation, a filesystem can support
>     the operations by making SEEK_HOLE always return the offset of the end
>     of the file, and making SEEK_DATA always return offset (i.e., even if
>     the location referred to by offset is a hole, it can be considered to
>     consist of data that is a sequence of zeros).
>
> On xfs filesystem we can inspect the actual allocation:
>
>     $ xfs_bmap -v falloc.img
>     falloc.img:
>     EXT: FILE-OFFSET       BLOCK-RANGE     AG AG-OFFSET        TOTAL
>       0: [0..7]:           192..199         0 (192..199)           8
>       1: [8..2097151]:     200..2097343     0 (200..2097343) 2097144
>
>     $ xfs_bmap -v sparse.img
>     sparse.img:
>     EXT: FILE-OFFSET       BLOCK-RANGE       AG AG-OFFSET            TOTAL
>       0: [0..7]:           2097344..2097351   0 (2097344..2097351)       8
>       1: [8..2047]:        2097352..2099391   0 (2097352..2099391)    2040
>       2: [2048..2097151]:  hole                                    2095104
>
> Maybe qemu-img should use file system specific APIs like ioctl_xfs_getbmap(2)
> to get more correct and consistent allocation info?

Maybe some kernel filesystem mailing list is a better place to discuss this?
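
If this goes to a filesystem list, a reproducer that does not involve qemu would probably help. Below is a minimal sketch in Python that walks a file with lseek(2) using SEEK_DATA and SEEK_HOLE, which is the mechanism the quoted message points to. It is not the qemu code, and map_extents is a hypothetical helper name used only for illustration. It assumes Linux and a Python build that exposes os.SEEK_DATA and os.SEEK_HOLE.

```python
import errno
import os


def map_extents(path):
    """Return a list of (start, length, is_data) tuples for path.

    Hypothetical helper: walks the file with lseek(2) SEEK_DATA/SEEK_HOLE
    and reports whatever the kernel says, with no filesystem-specific calls.
    """
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                data_start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as e:
                if e.errno != errno.ENXIO:
                    raise
                # ENXIO: no data at or after offset, the rest is a hole.
                extents.append((offset, size - offset, False))
                break
            if data_start > offset:
                # Hole between the current offset and the next data.
                extents.append((offset, data_start - offset, False))
            # End of this data run (EOF counts as an implicit hole).
            hole_start = os.lseek(fd, data_start, os.SEEK_HOLE)
            extents.append((data_start, hole_start - data_start, True))
            offset = hole_start
    finally:
        os.close(fd)
    return extents
```

Calling map_extents("falloc.img") before and after the dd commands from the quoted message, and again after dropping caches, should show whether the reported extents change with qemu out of the picture.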
