There is a block I/O corner case that I don't fully understand. I'd appreciate thoughts on the expected behavior.
At one point during a Windows Server 2008 install to an IDE disk, the guest sends a read request with overlapping sglist buffers. It looks like this:

[0] addr=A len=4k
[1] addr=B len=4k
[2] addr=C len=4k
[3] addr=B len=4k

Buffers 1 and 3 refer to the same guest memory; their addresses match. If I understand correctly, IDE will perform each operation in turn and DMA the result back to the buffers in order. Therefore the disk contents at +12k should end up at address B.

Unfortunately QEMU does not guarantee this today. Sometimes address B holds the disk contents at +4k (buffer 1) and other times the disk contents at +12k (buffer 3).

QEMU can be taken out of the picture and replaced by a simple test program that calls preadv(2) directly with the same overlapping buffer pattern. There doesn't appear to be a guarantee that the disk contents at +12k (buffer 3) will be read instead of +4k (buffer 1).

When the page cache is active, preadv(2) produces consistent results. When the page cache is bypassed (O_DIRECT), preadv(2) produces consistent results against a physical disk:

a-22904 [001] 3042.186790: block_bio_queue: 8,0 R 2048 + 32 [a]
a-22904 [001] 3042.186807: block_getrq: 8,0 R 2048 + 32 [a]
a-22904 [001] 3042.186812: block_plug: [a]
a-22904 [001] 3042.186816: block_rq_insert: 8,0 R 0 () 2048 + 32 [a]
a-22904 [001] 3042.186822: block_unplug_io: [a] 1
a-22904 [001] 3042.186829: block_rq_issue: 8,0 R 0 () 2048 + 32 [a]
pam-foreground--22912 [001] 3042.187066: block_rq_complete: 8,0 R () 2048 + 32 [0]

Notice that a single 32-sector read is issued on /dev/sda (8,0). This makes sense under the assumption that the disk honors DMA buffer ordering within a request.
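For concreteness, here is a minimal sketch of the kind of test program I mean (the command-line argument, the fixed 4k buffer sizes, and the alignment choices are mine, purely for illustration):

/* Overlapping-buffer preadv(2) test, a sketch: read 16k from offset 0
 * into four 4k buffers where elements 1 and 3 share the same memory,
 * mirroring the guest sglist above. Buffers are page-aligned so the
 * test also works with O_DIRECT.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    void *a, *b, *c;
    struct iovec iov[4];
    ssize_t n;
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <file-or-device>\n", argv[0]);
        return 1;
    }
    /* Page-aligned buffers to satisfy O_DIRECT alignment requirements */
    if (posix_memalign(&a, 4096, 4096) ||
        posix_memalign(&b, 4096, 4096) ||
        posix_memalign(&c, 4096, 4096)) {
        perror("posix_memalign");
        return 1;
    }

    iov[0].iov_base = a; iov[0].iov_len = 4096;
    iov[1].iov_base = b; iov[1].iov_len = 4096;
    iov[2].iov_base = c; iov[2].iov_len = 4096;
    iov[3].iov_base = b; iov[3].iov_len = 4096; /* same memory as iov[1] */

    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    n = preadv(fd, iov, 4, 0);
    if (n != 16384) {
        fprintf(stderr, "preadv returned %zd\n", n);
        return 1;
    }

    /* If buffers are filled in sglist order, B now holds the contents
     * at +12k. Compare these bytes across runs.
     */
    printf("B[0..3] = %02x %02x %02x %02x\n",
           ((unsigned char *)b)[0], ((unsigned char *)b)[1],
           ((unsigned char *)b)[2], ((unsigned char *)b)[3]);
    close(fd);
    return 0;
}

Running this repeatedly against the same data and comparing what ends up in B across runs shows whether the last overlapping buffer reliably wins.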
However, when the page cache is bypassed, preadv(2) produces inconsistent results against a file on ext3 -> LVM -> dm-crypt -> /dev/sda:

a-22834 [001] 3038.425802: block_bio_queue: 254,3 R 32616672 + 8 [a]
a-22834 [001] 3038.425812: block_remap: 254,0 R 58544736 + 8 <- (254,3) 32616672
a-22834 [001] 3038.425813: block_bio_queue: 254,0 R 58544736 + 8 [a]
kcryptd_io-379 [001] 3038.425832: block_remap: 8,0 R 59044807 + 8 <- (8,2) 58546792
kcryptd_io-379 [001] 3038.425833: block_bio_queue: 8,0 R 59044807 + 8 [kcryptd_io]
kcryptd_io-379 [001] 3038.425841: block_getrq: 8,0 R 59044807 + 8 [kcryptd_io]
kcryptd_io-379 [001] 3038.425845: block_plug: [kcryptd_io]
kcryptd_io-379 [001] 3038.425848: block_rq_insert: 8,0 R 0 () 59044807 + 8 [kcryptd_io]
kcryptd_io-379 [001] 3038.425859: block_rq_issue: 8,0 R 0 () 59044807 + 8 [kcryptd_io]
a-22834 [001] 3038.425894: block_bio_queue: 254,3 R 32616792 + 16 [a]
a-22834 [001] 3038.425898: block_remap: 254,0 R 58544856 + 16 <- (254,3) 32616792
a-22834 [001] 3038.425899: block_bio_queue: 254,0 R 58544856 + 16 [a]
kcryptd_io-379 [001] 3038.425908: block_remap: 8,0 R 59044927 + 16 <- (8,2) 58546912
kcryptd_io-379 [001] 3038.425909: block_bio_queue: 8,0 R 59044927 + 16 [kcryptd_io]
kcryptd_io-379 [001] 3038.425911: block_getrq: 8,0 R 59044927 + 16 [kcryptd_io]
kcryptd_io-379 [001] 3038.425913: block_plug: [kcryptd_io]
kcryptd_io-379 [001] 3038.425914: block_rq_insert: 8,0 R 0 () 59044927 + 16 [kcryptd_io]
a-22834 [001] 3038.425920: block_bio_queue: 254,3 R 32616992 + 8 [a]
a-22834 [001] 3038.425922: block_remap: 254,0 R 58545056 + 8 <- (254,3) 32616992
a-22834 [001] 3038.425923: block_bio_queue: 254,0 R 58545056 + 8 [a]
a-22834 [001] 3038.425929: block_unplug_io: [a] 0
a-22834 [001] 3038.425930: block_unplug_io: [a] 0
a-22834 [001] 3038.425931: block_unplug_io: [a] 2
a-22834 [001] 3038.425934: block_rq_issue: 8,0 R 0 () 59044927 + 16 [a]
kcryptd_io-379 [001] 3038.425948: block_remap: 8,0 R 59045127 + 8 <- (8,2) 58547112
kcryptd_io-379 [001] 3038.425949: block_bio_queue: 8,0 R 59045127 + 8 [kcryptd_io]
kcryptd_io-379 [001] 3038.425951: block_getrq: 8,0 R 59045127 + 8 [kcryptd_io]
kcryptd_io-379 [001] 3038.425953: block_plug: [kcryptd_io]
kcryptd_io-379 [001] 3038.425954: block_rq_insert: 8,0 R 0 () 59045127 + 8 [kcryptd_io]
<idle>-0 [001] 3038.427414: block_unplug_timer: [swapper] 3
kblockd/1-21 [001] 3038.427437: block_unplug_io: [kblockd/1] 3
kblockd/1-21 [001] 3038.427440: block_rq_issue: 8,0 R 0 () 59045127 + 8 [kblockd/1]
<idle>-0 [000] 3038.436786: block_rq_complete: 8,0 R () 59044807 + 8 [0]
kcryptd-380 [001] 3038.436960: block_bio_complete: 254,0 R 58544736 + 8 [0]
kcryptd-380 [001] 3038.436963: block_bio_complete: 254,3 R 32616672 + 8 [0]
<idle>-0 [001] 3038.437070: block_rq_complete: 8,0 R () 59044927 + 16 [0]
kcryptd-380 [000] 3038.437343: block_bio_complete: 254,0 R 58544856 + 16 [611733513]
kcryptd-380 [000] 3038.437346: block_bio_complete: 254,3 R 32616792 + 16 [-815025730]
<idle>-0 [000] 3038.437428: block_rq_complete: 8,0 R () 59045127 + 8 [0]
kcryptd-380 [000] 3038.437569: block_bio_complete: 254,0 R 58545056 + 8 [-2107963545]
kcryptd-380 [000] 3038.437571: block_bio_complete: 254,3 R 32616992 + 8 [176593183]

The 32 sectors are broken up into 8, 8, and 16 sector requests. I believe the filesystem is doing this before LVM is reached, which makes sense: a file may not be contiguous on disk, so several extents need to be read. These three independent requests can complete in any order, and the order will affect what contents are visible at address B when the read completes.

So now my question: Is QEMU risking data corruption when buffers overlap?
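If it is, one conceivable way out (a hypothetical sketch of mine, not anything in QEMU today) would be to detect overlapping iovecs before submission and fall back to issuing one pread(2) per element, preserving in-order filling at the cost of extra syscalls:

/* Hypothetical fallback, not existing QEMU code: if any element of the
 * vector overlaps an earlier one, issue the reads one element at a time
 * so the last overlapping buffer is filled last, matching in-order DMA.
 */
#define _GNU_SOURCE
#include <stdbool.h>
#include <sys/uio.h>
#include <unistd.h>

static bool iov_has_overlap(const struct iovec *iov, int cnt)
{
    for (int i = 0; i < cnt; i++) {
        const char *s1 = iov[i].iov_base;
        const char *e1 = s1 + iov[i].iov_len;
        for (int j = i + 1; j < cnt; j++) {
            const char *s2 = iov[j].iov_base;
            const char *e2 = s2 + iov[j].iov_len;
            if (s1 < e2 && s2 < e1) { /* [s1,e1) intersects [s2,e2) */
                return true;
            }
        }
    }
    return false;
}

static ssize_t read_ordered(int fd, const struct iovec *iov, int cnt,
                            off_t offset)
{
    if (!iov_has_overlap(iov, cnt)) {
        return preadv(fd, iov, cnt, offset); /* common case: one syscall */
    }

    /* Slow path: sequential pread(2) calls in sglist order */
    ssize_t total = 0;
    for (int i = 0; i < cnt; i++) {
        ssize_t n = pread(fd, iov[i].iov_base, iov[i].iov_len,
                          offset + total);
        if (n < 0) {
            return n;
        }
        total += n;
        if ((size_t)n < iov[i].iov_len) {
            break; /* short read, stop here */
        }
    }
    return total;
}

The fast path stays a single vectored syscall; only requests whose elements genuinely overlap would pay for the sequential fallback.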
If IDE guarantees that buffers are filled in order then we are doing it wrong (at least when O_DIRECT is used). Perhaps there is no ordering guarantee in IDE, Windows is doing something crazy, and QEMU is within its rights to use preadv(2) like this.

Stefan