Package: linux-image-3.9-0.bpo.1-amd64
Version: 3.9.6-1~bpo70+1

Other possibly relevant packages: Xen, lvm (2.02.95-7), fio (2.0.8-2), libaio1:amd64 (0.3.109-3).
I've been seeing "fio" (the flexible I/O tester) fail to read back the expected data from a storage device when using 512B blocks under a Wheezy domU under Xen (with an Ubuntu 12.04 dom0, or on Amazon EC2). The problem doesn't seem to show up if the guest is running Squeeze, or is not under Xen, or if the block size is at least 4KB.

  fio --bs=512 --rw=randwrite --filename=/dev/scratch --name=foo --direct=1 \
      --iodepth=1024 --iodepth_batch_submit=1 --ioengine=libaio --size=2M \
      --do_verify=1 --verify=meta --verify_dump=1 --verify_fatal=1 \
      --verify_pattern=0 -dmem

Translation: write each 512B block in the first 2MB of /dev/scratch in randomized order, writing a recognizable pattern that includes a magic number and the block's offset. Use libaio with O_DIRECT, submit one block at a time, and keep up to 1024 I/O operations in flight at any given time. Then read the blocks back and verify the stored fields. If any blocks fail verification, write two files into the current directory with the expected and received values. Enable the debugging option that prints the location of each buffer in memory.

Most of the time, fio reports that verification of some of the written data fails because the offset is wrong; examining the saved "received" block, I find the correct magic number but an offset that belongs elsewhere in the file. Sometimes it's just a couple of blocks; sometimes it's dozens. In very rare cases, the magic number appears to be incorrect.

This happens with the Debian-distributed fio binary and with locally built fio 2.0.7 binaries compiled on squeeze. On the Xen guests we're actually running the Debian kernel sources, rebuilt with the configuration modified only to select gzip compression. The dom0 for most of my tests is Ubuntu 12.04 (Xen 4.1.2-2ubuntu2.8), but we're also seeing the problem on Amazon EC2.

The problem does not occur with the squeeze-built fio 2.0.7 binaries on:

  - Wheezy on real hardware (direct device access or via LVM)
  - Wheezy on VMware (direct device access)
  - squeeze (3.2.0-0.bpo.3-amd64)
  - CentOS 6.3 (2.6.32-358.18.1.el6.x86_64)
  - SLES 11 SP3 (3.0.76-0.11-default)

It does fail on a Wheezy domU running a 3.10 kernel with a couple of local patches. We don't have any other systems handy with post-3.2 kernels.

The block device /dev/scratch is a logical volume on /dev/xvda2. The rest of xvda2 is taken up by another logical volume with a mounted file system, so it's hard to test directly against xvda2 on these domUs. In /sys/block/dm-1/queue, hw_sector_size, logical_block_size, minimum_io_size, and physical_block_size are all 512, and max_segment_size is 4096. The Xen device xvda2 is, in turn, a logical volume defined on the host. (The host is running Ubuntu 12.04, but we've seen the same problem on Amazon EC2 guests also running Wheezy.)

The test also fails with bs=1024 and bs=2048, though generally with fewer verification failures reported. At bs=4096, it passes. Changing the ioengine setting to psync (which uses pwrite and pread instead of libaio) or posixaio (glibc's aio implementation) makes the problem go away. Sequential write tests ("--rw=write") show fewer errors but still fail. If I use "dd iflag=direct bs=512 count=..." to read the first 2MB from the device and examine the offsets myself, it all looks fine.
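In case it helps, here's a rough standalone sketch of what I believe the interesting part of the I/O pattern boils down to: 512-byte O_DIRECT reads issued through libaio with many in flight at once, into buffers packed eight to a 4KB page. This is my own simplified stand-in (sequential writes, one big io_submit instead of fio's one-at-a-time submission, and a bare offset stamp instead of fio's verify=meta header), and I haven't confirmed that this minimal version trips the bug on its own; it's only meant to show the shape of the access pattern.

/*
 * subpage-aio.c -- my own simplified illustration, not fio code.
 * Stamp each 512B block in the first 2MB of a scratch device with its
 * byte offset, then read everything back via libaio + O_DIRECT with all
 * reads in flight at once, into 512B buffers packed eight to a 4KB page.
 * Build: gcc -O2 -std=gnu99 -o subpage-aio subpage-aio.c -laio
 * WARNING: overwrites the first 2MB of the device you point it at.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BS    512                        /* block size under test   */
#define NBLKS (2 * 1024 * 1024 / BS)     /* first 2MB of the device */

static struct iocb     iocbs[NBLKS];
static struct iocb    *iocbp[NBLKS];
static struct io_event events[NBLKS];

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <scratch-device>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDWR | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* One page-aligned pool; buffers are BS apart, so eight share a page. */
    unsigned char *buf;
    if (posix_memalign((void **)&buf, 4096, (size_t)NBLKS * BS)) {
        perror("posix_memalign"); return 1;
    }

    /* Phase 1: stamp each block with its own byte offset and write it.
     * (fio writes in randomized order; sequential is enough for a sketch.) */
    for (int i = 0; i < NBLKS; i++) {
        unsigned char *b = buf + (size_t)i * BS;
        uint64_t off = (uint64_t)i * BS;
        memset(b, 0, BS);
        memcpy(b, &off, sizeof(off));
        if (pwrite(fd, b, BS, (off_t)off) != BS) { perror("pwrite"); return 1; }
    }

    /* Phase 2: queue one 512B read per block and submit them all at once. */
    io_context_t ctx = 0;
    int rc = io_setup(NBLKS, &ctx);
    if (rc < 0) { fprintf(stderr, "io_setup: %s\n", strerror(-rc)); return 1; }

    for (int i = 0; i < NBLKS; i++) {
        memset(buf + (size_t)i * BS, 0xff, BS);        /* poison old data */
        io_prep_pread(&iocbs[i], fd, buf + (size_t)i * BS, BS, (long long)i * BS);
        iocbp[i] = &iocbs[i];
    }
    int done = 0;
    while (done < NBLKS) {
        rc = io_submit(ctx, NBLKS - done, iocbp + done);
        if (rc <= 0) { fprintf(stderr, "io_submit: %s\n", strerror(-rc)); return 1; }
        done += rc;
    }
    rc = io_getevents(ctx, NBLKS, NBLKS, events, NULL);
    if (rc != NBLKS) { fprintf(stderr, "io_getevents: %d\n", rc); return 1; }

    /* Verify: each buffer should hold the stamp for the offset it was read from. */
    int bad = 0;
    for (int i = 0; i < NBLKS; i++) {
        uint64_t want = (uint64_t)i * BS, got;
        memcpy(&got, buf + (size_t)i * BS, sizeof(got));
        if (got != want) {
            printf("offset %llu: found stamp for offset %llu\n",
                   (unsigned long long)want, (unsigned long long)got);
            bad++;
        }
    }
    printf("%d of %d blocks mismatched\n", bad, NBLKS);
    io_destroy(ctx);
    close(fd);
    free(buf);
    return bad != 0;
}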
If I use a recent patch off the fio mailing list, which implements a "verify-only" option that doesn't write the data but just verifies previously written data, verification still fails well after the initial writing. So it doesn't seem to be tied to being the process that wrote the data, or to doing the verify immediately after issuing the writes. It looks like the random-access small reads are the problem; they seem to sometimes receive the wrong blocks. Assuming the underlying storage device is a modern advanced-format hard drive with 4KB sectors emulating 512B sectors, this testing involves not just partial-sector I/O on the device but also, frequently, concurrent I/Os to a single page of process memory.

So I tried the upstream development version of fio from its git repository. (The test at the top still fails.) It has a new option to specify block sizes not just for read and write but also for trim (--bs=X,Y,Z; the version in Debian doesn't complain if you pass three values either, but it seems to set read=X and write=Z). If I use "--bs=512,512,4096", this keeps the 512B block size for both reads and writes and specifies a trim size of 4KB. It also alters the buffer allocation to use an array of 4KB blocks, with only the first 512B of each actually used for I/O, as can be verified from the output when the "-dmem" option is given. The data written and the pattern of offsets selected should be unchanged, but each data buffer now starts on its own page. In this case, verification passes.

With --bs=512:

[...]
mem 23064 io_u alloc 0x1272520, index 1018
mem 23064 io_u 0x1272520, mem 0x7f5a7a41f400
mem 23064 io_u alloc 0x1272800, index 1019
mem 23064 io_u 0x1272800, mem 0x7f5a7a41f600
[...]
meta: verify failed at file /dev/scratch offset 1512960, length 512
      received data dumped as scratch.1512960.received
      expected data dumped as scratch.1512960.expected
[...]

With --bs=512,512,4096:

[...]
mem 23068 io_u alloc 0x1f6f100, index 1011
mem 23068 io_u 0x1f6f100, mem 0x7f67093e1000
mem 23068 io_u alloc 0x1f6f420, index 1012
mem 23068 io_u 0x1f6f420, mem 0x7f67093e2000
[...]

... and no failures. Running btrace confirms that fio is still issuing I/O operations of one sector; the trim block size doesn't affect the I/O directly.

So at this point I'm thinking there's some issue in 3.9 (and maybe earlier) kernels relating to multiple outstanding direct-I/O reads, under Xen, targeting the same page of domU process memory. Why dd with direct I/O doesn't see the problem too, I don't know; perhaps the sequential nature of its I/Os makes the problem go away, maybe by merging I/O requests.

A bit more experimentation suggests that the verify phase uses the "write" block size to issue its reads; presumably the "read" block size is only used for tests mixing reads and writes (--rw=rw or --rw=randrw), not for the post-write verification phase. But changing the "read" block size does seem to affect the I/O pattern during verification: with "--bs=512,1024" the test succeeds, even though the memory buffers are allocated at 1KB intervals, just as they are with "--bs=1024" (which fails). So that may poke a hole in my "multiple I/Os per page" hypothesis.
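To make the layout comparison concrete, here's a tiny program (again mine, not fio's allocator) that just prints where the first few data buffers land for each stride seen in the -dmem output, annotated with the results I'm getting for the corresponding block-size settings:

/* layout.c -- illustration only: as far as the buffers are concerned, the
 * block-size settings change nothing but the stride at which per-io_u data
 * buffers are carved out of one page-aligned pool.  The pass/fail notes
 * are the results described above.
 * Build: gcc -O2 -std=gnu99 -o layout layout.c
 */
#include <stdio.h>
#include <stdlib.h>

static void show(const char *settings, size_t stride, const char *result)
{
    void *pool;
    if (posix_memalign(&pool, 4096, 1024 * stride)) {
        perror("posix_memalign");
        exit(1);
    }
    printf("%-18s %s\n", settings, result);
    for (int i = 0; i < 3; i++)   /* mirrors the "-dmem" io_u lines */
        printf("  io_u %d mem %p\n", i,
               (void *)((char *)pool + (size_t)i * stride));
    free(pool);
}

int main(void)
{
    show("--bs=512",          512,  "(verify fails)");   /* 8 buffers per page */
    show("--bs=1024",         1024, "(verify fails)");   /* 4 buffers per page */
    show("--bs=512,1024",     1024, "(verify passes)");  /* same layout!       */
    show("--bs=512,512,4096", 4096, "(verify passes)");  /* 1 buffer per page  */
    return 0;
}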
Btrace, however, shows quite different behavior for the --bs=512,1024 and --bs=1024 cases. I'm seeing reads of two sectors at a time in both, but "--bs=512,1024" alternates short bursts of writes with short bursts of reads, whereas "--bs=1024" issues lots of writes and then lots of reads; the latter may give more chances for I/Os to different parts of a given page to run concurrently.

If I copy the libaio.so.1 from squeeze (where the test passes) and load it into the wheezy fio test via LD_PRELOAD, the test still fails on wheezy.

At this point I could use help from someone more familiar with the Xen I/O code, but it seems pretty clear there's a bug here, and my best guess is that it's not in fio or libaio but rather somewhere in the kernel or Xen. I'm happy to run more experiments if they'll help.

Ken