file-posix used to assume that existing holes satisfy the requested
alignment, which equals to the estimated direct I/O alignment
requirement if direct I/O is requested, and assert the assumption
unless it is at EOF.

However, the estimation of direct I/O alignment requirement is sometimes
inexact and can be overly strict. For example, I observed that QEMU
estimated the alignment requirement as 16K while the real requirement
is 4K when Btrfs is used on Linux 6.14.6 and the host page size is 16K.

For direct I/O alignment, open(2) sugguests as follows:
> Since Linux 6.1, O_DIRECT support and alignment restrictions for a
> file can be queried using statx(2), using the STATX_DIOALIGN flag.
> Support for STATX_DIOALIGN varies by filesystem; see statx(2).
>
> Some filesystems provide their own interfaces for querying O_DIRECT
> alignment restrictions, for example the XFS_IOC_DIOINFO operation in
> xfsctl(3). STATX_DIOALIGN should be used instead when it is available.
>
> If none of the above is available, then direct I/O support and
> alignment restrictions can only be assumed from known characteristics
> of the filesystem, the individual file, the underlying storage
> device(s), and the kernel version. In Linux 2.4, most filesystems
> based on block devices require that the file offset and the length and
> memory address of all I/O segments be multiples of the filesystem
> block size (typically 4096 bytes). In Linux 2.6.0, this was relaxed to
> the logical block size of the block device (typically 512 bytes). A
> block device's logical block size can be determined using the ioctl(2)
> BLKSSZGET operation or from the shell using the command:

Apparently Btrfs doesn't support STATX_DIOALIGN nor provide its own
interface for querying the requirement. Using BLKSSZGET brings another
problem to determine the underlying block device, which also involves
heuristics.

Moreover, even if we could figure out the direct I/O alignment
requirement, I could not find a documentation saying it will exactly
match with the alignment of holes.

So stop asserting the assumption on the holes and handle unaligned holes
properly.

Signed-off-by: Akihiko Odaki <akihiko.od...@daynix.com>
---
Changes in v2:
- Changed to round the number also when the specified offset in a hole.
- Changed to iterate until finding an aligned location.
- Link to v1: 
https://lore.kernel.org/qemu-devel/20250528-dio-v1-1-633066a71...@daynix.com
---
 block/file-posix.c | 83 ++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 53 insertions(+), 30 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index ec95b748696b..d3c598d96895 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -3280,6 +3280,7 @@ static int coroutine_fn 
raw_co_block_status(BlockDriverState *bs,
                                             BlockDriverState **file)
 {
     off_t data = 0, hole = 0;
+    bool has_data = false;
     int ret;
 
     assert(QEMU_IS_ALIGNED(offset | bytes, bs->bl.request_alignment));
@@ -3297,40 +3298,62 @@ static int coroutine_fn 
raw_co_block_status(BlockDriverState *bs,
         return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
     }
 
-    ret = find_allocation(bs, offset, &data, &hole);
-    if (ret == -ENXIO) {
-        /* Trailing hole */
-        *pnum = bytes;
-        ret = BDRV_BLOCK_ZERO;
-    } else if (ret < 0) {
-        /* No info available, so pretend there are no holes */
-        *pnum = bytes;
-        ret = BDRV_BLOCK_DATA;
-    } else if (data == offset) {
-        /* On a data extent, compute bytes to the end of the extent,
-         * possibly including a partial sector at EOF. */
-        *pnum = hole - offset;
+    /*
+     * We may have allocation unaligned with the requested alignment
+     * due to the following reaons:
+     * - unaligned file size
+     * - inexact direct I/O alignment requirement estimation
+     * - mismatch between the allocation size and
+     *   direct I/O alignment requirement
+     *
+     * We are not allowed to return partial sectors, though, so iterate
+     * until finding an aligned location in or at a border of a hole.
+     */
+    *pnum = 0;
+    do {
+        ret = find_allocation(bs, offset + *pnum, &data, &hole);
+        if (ret == -ENXIO) {
+            /* Trailing hole */
+            if (!has_data) {
+                *pnum = bytes;
+            }
+            break;
+        }
 
-        /*
-         * We are not allowed to return partial sectors, though, so
-         * round up if necessary.
-         */
-        if (!QEMU_IS_ALIGNED(*pnum, bs->bl.request_alignment)) {
-            int64_t file_length = raw_getlength(bs);
-            if (file_length > 0) {
-                /* Ignore errors, this is just a safeguard */
-                assert(hole == file_length);
+        if (ret < 0) {
+            /* No info available, so pretend there are no holes */
+            *pnum = bytes;
+            has_data = true;
+            break;
+        }
+
+        data -= offset;
+        hole -= offset;
+
+        if (data == *pnum) {
+            /* Return the end of a data extent if aligned. */
+            has_data = true;
+            *pnum = ROUND_UP(hole, bs->bl.request_alignment);
+            if (*pnum == hole) {
+                break;
+            }
+        } else {
+            /* Round down the end of a hole. */
+            assert(hole == *pnum);
+
+            if (data - *pnum >= bs->bl.request_alignment) {
+                if (!has_data) {
+                    *pnum = ROUND_DOWN(data, bs->bl.request_alignment);
+                }
+                break;
             }
-            *pnum = ROUND_UP(*pnum, bs->bl.request_alignment);
+
+            has_data = true;
+            *pnum += bs->bl.request_alignment;
         }
 
-        ret = BDRV_BLOCK_DATA;
-    } else {
-        /* On a hole, compute bytes to the beginning of the next extent.  */
-        assert(hole == offset);
-        *pnum = data - offset;
-        ret = BDRV_BLOCK_ZERO;
-    }
+    } while (*pnum < bytes);
+    ret = has_data ? BDRV_BLOCK_DATA : BDRV_BLOCK_ZERO;
     *map = offset;
     *file = bs;
     return ret | BDRV_BLOCK_OFFSET_VALID;

---
base-commit: f0737158b483e7ec2b2512145aeab888b85cc1f7
change-id: 20250528-dio-db04a66a7848

Best regards,
-- 
Akihiko Odaki <akihiko.od...@daynix.com>


Reply via email to