[Qemu-devel] [PATCH 3/3] block/parallels: fix access to not initialized memory in catalog_bitmap
found by valgrind. Command: ./qemu-img convert -f parallels -O qcow2 1.hds 1.img Invalid read of size 4 at 0x17D0EF: parallels_co_read (parallels.c:357) by 0x11FEE4: bdrv_aio_rw_vector (block.c:4640) by 0x11FFBF: bdrv_aio_readv_em (block.c:4652) by 0x11F55F: bdrv_co_readv_em (block.c:4862) by 0x123428: bdrv_aligned_preadv (block.c:3056) by 0x1239FA: bdrv_co_do_preadv (block.c:3162) by 0x125424: bdrv_rw_co_entry (block.c:2706) by 0x155DD9: coroutine_trampoline (coroutine-ucontext.c:118) by 0x6975B6F: ??? (in /lib/x86_64-linux-gnu/libc-2.19.so) The problem is that s->catalog_bitmap is allocated/filled as gmalloc(s->catalog_size) thus index validity check must be inclusive, i.e. index >= s->catalog_size is invalid. Signed-off-by: Denis V. Lunev CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/parallels.c b/block/parallels.c index 2a814f3..4f9cd8d 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -155,7 +155,7 @@ static int64_t seek_to_sector(BlockDriverState *bs, int64_t sector_num) offset = sector_num % s->tracks; /* not allocated */ -if ((index > s->catalog_size) || (s->catalog_bitmap[index] == 0)) +if ((index >= s->catalog_size) || (s->catalog_bitmap[index] == 0)) return -1; return ((uint64_t)s->catalog_bitmap[index] * s->off_multiplier + offset) * 512; -- 1.9.1
[Qemu-devel] [PATCH 0/3] parallels: additional iotests and a minor bugfix
Pls find here test authentic test material, i.e. parallels images with "WithoutFreeSpace" and "WithouFreSpacExt" signatures created in authentic way + a minor bug fix for access to non-initialized memory found by valgrind. Signed-off-by: Denis V. Lunev CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi
[Qemu-devel] [PATCH 2/3] iotests: add v2 parallels sample image and simple test for it
This is simple test image for the following commit made by me. commit d25d59802021a747812472780d80a0e792078f40 Author: Denis V. Lunev Date: Mon Jul 28 20:23:55 2014 +0400 parallels: 2TB+ parallels images support Signed-off-by: Denis V. Lunev CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- tests/qemu-iotests/076| 5 + tests/qemu-iotests/076.out| 4 tests/qemu-iotests/sample_images/parallels-v2.bz2 | Bin 0 -> 150 bytes 3 files changed, 9 insertions(+) create mode 100644 tests/qemu-iotests/sample_images/parallels-v2.bz2 diff --git a/tests/qemu-iotests/076 b/tests/qemu-iotests/076 index a300ee2..98d67b3 100755 --- a/tests/qemu-iotests/076 +++ b/tests/qemu-iotests/076 @@ -70,6 +70,11 @@ _use_sample_img parallels-v1.bz2 poke_file "$TEST_IMG" "$tracks_offset" "\x00\x00\x00\x00" { $QEMU_IO -c "read 0 512" $TEST_IMG; } 2>&1 | _filter_qemu_io | _filter_testdir +echo +echo "== Read from a valid v2 image ==" +_use_sample_img parallels-v2.bz2 +{ $QEMU_IO -c "read -P 0x11 0 64k" $TEST_IMG; } 2>&1 | _filter_qemu_io | _filter_testdir + # success, all done echo "*** done" rm -f $seq.full diff --git a/tests/qemu-iotests/076.out b/tests/qemu-iotests/076.out index fd26f3c..32ade08 100644 --- a/tests/qemu-iotests/076.out +++ b/tests/qemu-iotests/076.out @@ -15,4 +15,8 @@ no file open, try 'help open' == Zero sectors per track == qemu-io: can't open device TEST_DIR/parallels-v1: Invalid image: Zero sectors per track no file open, try 'help open' + +== Read from a valid v2 image == +read 65536/65536 bytes at offset 0 +64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) *** done diff --git a/tests/qemu-iotests/sample_images/parallels-v2.bz2 b/tests/qemu-iotests/sample_images/parallels-v2.bz2 new file mode 100644 index ..fd8614d061172faae50a993ac7acd3173de9aa98 GIT binary patch literal 150 zcmV;H0BQe1T4*^jL0KkKS<`z1ga8U8f6)C%U;t162ml8F2!JYJ)<8f2004jpFaWp= zU;vl^35GBLOaKJHsVaJ=X*QsGnW)758mCWPt$+@EvggMtTP~K8Aeq(bOlAS_fGVI1 zR{#%`0aSoc2hsqlKqv!50aXBg@8a8e-{1Q8zBk4(ssXG4tO2Y6qyhde
[Qemu-devel] [PATCH 1/3] iotests: replace fake parallels image with authentic one
The image was generated using http://openvz.org/Ploop utility and properly filled with the same content as original one. Signed-off-by: Denis V. Lunev CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- tests/qemu-iotests/076 | 10 +- tests/qemu-iotests/076.out | 8 tests/qemu-iotests/sample_images/fake.parallels.bz2 | Bin 141 -> 0 bytes tests/qemu-iotests/sample_images/parallels-v1.bz2 | Bin 0 -> 147 bytes 4 files changed, 9 insertions(+), 9 deletions(-) delete mode 100644 tests/qemu-iotests/sample_images/fake.parallels.bz2 create mode 100644 tests/qemu-iotests/sample_images/parallels-v1.bz2 diff --git a/tests/qemu-iotests/076 b/tests/qemu-iotests/076 index b614a7d..a300ee2 100755 --- a/tests/qemu-iotests/076 +++ b/tests/qemu-iotests/076 @@ -47,26 +47,26 @@ catalog_entries_offset=$((0x20)) nb_sectors_offset=$((0x24)) echo -echo "== Read from a valid (enough) image ==" -_use_sample_img fake.parallels.bz2 +echo "== Read from a valid v1 image ==" +_use_sample_img parallels-v1.bz2 { $QEMU_IO -c "read -P 0x11 0 64k" $TEST_IMG; } 2>&1 | _filter_qemu_io | _filter_testdir echo echo "== Negative catalog size ==" -_use_sample_img fake.parallels.bz2 +_use_sample_img parallels-v1.bz2 poke_file "$TEST_IMG" "$catalog_entries_offset" "\xff\xff\xff\xff" { $QEMU_IO -c "read 0 512" $TEST_IMG; } 2>&1 | _filter_qemu_io | _filter_testdir echo echo "== Overflow in catalog allocation ==" -_use_sample_img fake.parallels.bz2 +_use_sample_img parallels-v1.bz2 poke_file "$TEST_IMG" "$nb_sectors_offset" "\xff\xff\xff\xff" poke_file "$TEST_IMG" "$catalog_entries_offset" "\x01\x00\x00\x40" { $QEMU_IO -c "read 64M 64M" $TEST_IMG; } 2>&1 | _filter_qemu_io | _filter_testdir echo echo "== Zero sectors per track ==" -_use_sample_img fake.parallels.bz2 +_use_sample_img parallels-v1.bz2 poke_file "$TEST_IMG" "$tracks_offset" "\x00\x00\x00\x00" { $QEMU_IO -c "read 0 512" $TEST_IMG; } 2>&1 | _filter_qemu_io | _filter_testdir diff --git a/tests/qemu-iotests/076.out b/tests/qemu-iotests/076.out index f7745d8..fd26f3c 100644 --- a/tests/qemu-iotests/076.out +++ b/tests/qemu-iotests/076.out @@ -1,18 +1,18 @@ QA output created by 076 -== Read from a valid (enough) image == +== Read from a valid v1 image == read 65536/65536 bytes at offset 0 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) == Negative catalog size == -qemu-io: can't open device TEST_DIR/fake.parallels: Catalog too large +qemu-io: can't open device TEST_DIR/parallels-v1: Catalog too large no file open, try 'help open' == Overflow in catalog allocation == -qemu-io: can't open device TEST_DIR/fake.parallels: Catalog too large +qemu-io: can't open device TEST_DIR/parallels-v1: Catalog too large no file open, try 'help open' == Zero sectors per track == -qemu-io: can't open device TEST_DIR/fake.parallels: Invalid image: Zero sectors per track +qemu-io: can't open device TEST_DIR/parallels-v1: Invalid image: Zero sectors per track no file open, try 'help open' *** done diff --git a/tests/qemu-iotests/sample_images/fake.parallels.bz2 b/tests/qemu-iotests/sample_images/fake.parallels.bz2 deleted file mode 100644 index ffb5f13bac31bc9ab6e1ea5c0cfa26786f2c4cc6.. GIT binary patch literal 0 HcmV?d1 literal 141 zcmV;80CN9AT4*^jL0KkKS*i&LJ^%_Hf6(xNVE_;S2ml2D2!JYJ)&M{N00969FaWp; z000b`1pojBOn|7QnnOSv)YEF7cgIVO0ByGSdk7e?fW`f$x`2Bi3t$bd06owJs09G{ vKo+1B1LXi)0CVe)J@eC^zBuEJbFFJA24D=p8Gt*$AL8yvrwS4kK_LggA5<|C diff --git a/tests/qemu-iotests/sample_images/parallels-v1.bz2 b/tests/qemu-iotests/sample_images/parallels-v1.bz2 new file mode 100644 index ..d1ef14205401a8e010d1be68bffeee92ce137d39 GIT binary patch literal 147 zcmZ>Y%CIzaj8qGbbEBnV0~X zZfy+=3|x~mCkJ!L1skp^{Ftks!;qfyS}R^xuR}6WLEksaZ~=P@V<pD?kL6rKj5mT56#x^qEyDl+ literal 0 HcmV?d1 -- 1.9.1
Re: [Qemu-devel] [PATCH 0/3] parallels: additional iotests and a minor bugfix
On 08/10/14 13:13, Denis V. Lunev wrote: Pls find here test authentic test material, i.e. parallels images with "WithoutFreeSpace" and "WithouFreSpacExt" signatures created in authentic way + a minor bug fix for access to non-initialized memory found by valgrind. Signed-off-by: Denis V. Lunev CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi ping
[Qemu-devel] [PATCH 7/9] block/parallels: support padded Parallels images
Unfortunately, old guest OSes do not align partitions to page size by default. This is true for Windows 2003 and Windows XP. For the time being Parallels was created an optimization for such OSes in its desktop product. Desktop users are not qualified enough to create properly aligned installations. Thus Parallels makes a blind guess on a customer behalf and creates so-called "padded" images if guest OS type is specified as WinXP, Win2k and Win2k3. "Padding" is a value which should be added to guest LBA to obtain sector number inside the image. This results in a shifted images. 0123offset inside image (in 512 byte sectors) +--- +.012guest data (512 byte sectors) +--- The information about this is available in DiskDescriptor.xml ONLY. There is no such data in the image header. There share of such images could be evaluated as 6-8% according to the statistics in my hands. This patch obtains proper value from XML and applies it on reading. Signed-off-by: Denis V. Lunev Acked-by: Roman Kagan CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/block/parallels.c b/block/parallels.c index 201c0f1..d07e4f7 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -66,6 +66,7 @@ typedef struct BDRVParallelsState { unsigned int tracks; unsigned int off_multiplier; +unsigned int padding; } BDRVParallelsState; static int parallels_open_image(BlockDriverState *bs, Error **errp); @@ -144,6 +145,7 @@ static int parallels_open_xml(BlockDriverState *bs, int flags, Error **errp) const char *data; char image_path[PATH_MAX]; Error *local_err = NULL; +BDRVParallelsState *s = bs->opaque; ret = size = bdrv_getlength(bs->file); if (ret < 0) { @@ -173,6 +175,19 @@ static int parallels_open_xml(BlockDriverState *bs, int flags, Error **errp) if (root == NULL) { goto fail; } + +data = xml_get_text(root, "/Disk_Parameters/Padding"); +if (data != NULL) { +char *endptr; +unsigned long pad; + +pad = strtoul(data, &endptr, 0); +if ((endptr != NULL && *endptr != '\0') || pad > UINT_MAX) { +goto fail; +} +s->padding = (uint32_t)pad; +} + image = xml_seek(root, "/StorageData/Storage/Image"); data = ""; /* make gcc happy */ for (size = 0; image != NULL; image = image->next) { @@ -385,6 +400,10 @@ static int64_t seek_to_sector(BlockDriverState *bs, int64_t sector_num) static int parallels_read(BlockDriverState *bs, int64_t sector_num, uint8_t *buf, int nb_sectors) { +BDRVParallelsState *s = bs->opaque; + +sector_num += s->padding; + while (nb_sectors > 0) { int64_t position = seek_to_sector(bs, sector_num); if (position >= 0) { -- 1.9.1
[Qemu-devel] [PATCH 1/9] iotests: replace fake parallels image with authentic one
The image was generated using http://openvz.org/Ploop utility and properly filled with the same content as original one. Signed-off-by: Denis V. Lunev Reviewed-by: Paolo Bonzini CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- tests/qemu-iotests/076 | 10 +- tests/qemu-iotests/076.out | 8 tests/qemu-iotests/sample_images/fake.parallels.bz2 | Bin 141 -> 0 bytes tests/qemu-iotests/sample_images/parallels-v1.bz2 | Bin 0 -> 147 bytes 4 files changed, 9 insertions(+), 9 deletions(-) delete mode 100644 tests/qemu-iotests/sample_images/fake.parallels.bz2 create mode 100644 tests/qemu-iotests/sample_images/parallels-v1.bz2 diff --git a/tests/qemu-iotests/076 b/tests/qemu-iotests/076 index b614a7d..a300ee2 100755 --- a/tests/qemu-iotests/076 +++ b/tests/qemu-iotests/076 @@ -47,26 +47,26 @@ catalog_entries_offset=$((0x20)) nb_sectors_offset=$((0x24)) echo -echo "== Read from a valid (enough) image ==" -_use_sample_img fake.parallels.bz2 +echo "== Read from a valid v1 image ==" +_use_sample_img parallels-v1.bz2 { $QEMU_IO -c "read -P 0x11 0 64k" $TEST_IMG; } 2>&1 | _filter_qemu_io | _filter_testdir echo echo "== Negative catalog size ==" -_use_sample_img fake.parallels.bz2 +_use_sample_img parallels-v1.bz2 poke_file "$TEST_IMG" "$catalog_entries_offset" "\xff\xff\xff\xff" { $QEMU_IO -c "read 0 512" $TEST_IMG; } 2>&1 | _filter_qemu_io | _filter_testdir echo echo "== Overflow in catalog allocation ==" -_use_sample_img fake.parallels.bz2 +_use_sample_img parallels-v1.bz2 poke_file "$TEST_IMG" "$nb_sectors_offset" "\xff\xff\xff\xff" poke_file "$TEST_IMG" "$catalog_entries_offset" "\x01\x00\x00\x40" { $QEMU_IO -c "read 64M 64M" $TEST_IMG; } 2>&1 | _filter_qemu_io | _filter_testdir echo echo "== Zero sectors per track ==" -_use_sample_img fake.parallels.bz2 +_use_sample_img parallels-v1.bz2 poke_file "$TEST_IMG" "$tracks_offset" "\x00\x00\x00\x00" { $QEMU_IO -c "read 0 512" $TEST_IMG; } 2>&1 | _filter_qemu_io | _filter_testdir diff --git a/tests/qemu-iotests/076.out b/tests/qemu-iotests/076.out index f7745d8..fd26f3c 100644 --- a/tests/qemu-iotests/076.out +++ b/tests/qemu-iotests/076.out @@ -1,18 +1,18 @@ QA output created by 076 -== Read from a valid (enough) image == +== Read from a valid v1 image == read 65536/65536 bytes at offset 0 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) == Negative catalog size == -qemu-io: can't open device TEST_DIR/fake.parallels: Catalog too large +qemu-io: can't open device TEST_DIR/parallels-v1: Catalog too large no file open, try 'help open' == Overflow in catalog allocation == -qemu-io: can't open device TEST_DIR/fake.parallels: Catalog too large +qemu-io: can't open device TEST_DIR/parallels-v1: Catalog too large no file open, try 'help open' == Zero sectors per track == -qemu-io: can't open device TEST_DIR/fake.parallels: Invalid image: Zero sectors per track +qemu-io: can't open device TEST_DIR/parallels-v1: Invalid image: Zero sectors per track no file open, try 'help open' *** done diff --git a/tests/qemu-iotests/sample_images/fake.parallels.bz2 b/tests/qemu-iotests/sample_images/fake.parallels.bz2 deleted file mode 100644 index ffb5f13bac31bc9ab6e1ea5c0cfa26786f2c4cc6.. GIT binary patch literal 0 HcmV?d1 literal 141 zcmV;80CN9AT4*^jL0KkKS*i&LJ^%_Hf6(xNVE_;S2ml2D2!JYJ)&M{N00969FaWp; z000b`1pojBOn|7QnnOSv)YEF7cgIVO0ByGSdk7e?fW`f$x`2Bi3t$bd06owJs09G{ vKo+1B1LXi)0CVe)J@eC^zBuEJbFFJA24D=p8Gt*$AL8yvrwS4kK_LggA5<|C diff --git a/tests/qemu-iotests/sample_images/parallels-v1.bz2 b/tests/qemu-iotests/sample_images/parallels-v1.bz2 new file mode 100644 index ..d1ef14205401a8e010d1be68bffeee92ce137d39 GIT binary patch literal 147 zcmZ>Y%CIzaj8qGbbEBnV0~X zZfy+=3|x~mCkJ!L1skp^{Ftks!;qfyS}R^xuR}6WLEksaZ~=P@V<pD?kL6rKj5mT56#x^qEyDl+ literal 0 HcmV?d1 -- 1.9.1
[Qemu-devel] [PATCH 4/9] configure: add dependency from libxml2
This dependency is required for adequate Parallels images support. Typically the disk consists of several images which are glued by XML disk descriptor. Also XML hides inside several important parameters which are not available in the image header. The patch also adds clause to checkpatch.pl to understand libxml2 types. In the other case there is the following false positive: ERROR: need consistent spacing around '*' (ctx:WxB) #210: FILE: block/parallels.c:232: + !xmlStrcmp(root->name, (const xmlChar *)"Parallels_disk_image")) Signed-off-by: Denis V. Lunev Acked-by: Roman Kagan CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi CC: Paolo Bonzini --- configure | 29 + scripts/checkpatch.pl | 1 + 2 files changed, 30 insertions(+) diff --git a/configure b/configure index a9e4d49..5426752 100755 --- a/configure +++ b/configure @@ -335,6 +335,7 @@ libssh2="" vhdx="" quorum="" numa="" +libxml2="" # parse CC options first for opt do @@ -1129,6 +1130,10 @@ for opt do ;; --enable-numa) numa="yes" ;; + --disable-libxml2) libxml2="no" + ;; + --enable-libxml2) libxml2="yes" + ;; *) echo "ERROR: unknown option $opt" echo "Try '$0 --help' for more information" @@ -1400,6 +1405,8 @@ Advanced options (experts only): --enable-quorum enable quorum block filter support --disable-numa disable libnuma support --enable-numaenable libnuma support + --disable-libxml2disable libxml2 (for Parallels image format) + --enable-libxml2 enable libxml2 (for Parallels image format) NOTE: The object files are built at the place where configure is launched EOF @@ -4097,6 +4104,23 @@ if test -z "$zero_malloc" ; then fi fi +# check for libxml2 +if test "$libxml2" != "no" ; then +if $pkg_config --exists libxml-2.0; then +libxml2="yes" +libxml2_cflags=$($pkg_config --cflags libxml-2.0) +libxml2_libs=$($pkg_config --libs libxml-2.0) +QEMU_CFLAGS="$QEMU_CFLAGS $libxml2_cflags" +libs_softmmu="$libs_softmmu $libxml2_libs" +libs_tools="$libs_tools $libxml2_libs" +else +if test "$libxml2" = "yes"; then +feature_not_found "libxml2" "Install libxml2 devel" +fi +libxml2="no" +fi +fi + # Now we've finished running tests it's OK to add -Werror to the compiler flags if test "$werror" = "yes"; then QEMU_CFLAGS="-Werror $QEMU_CFLAGS" @@ -4340,6 +4364,7 @@ echo "Quorum$quorum" echo "lzo support $lzo" echo "snappy support$snappy" echo "NUMA host support $numa" +echo "libxml2 $libxml2" if test "$sdl_too_old" = "yes"; then echo "-> Your SDL version is too old - please upgrade to have SDL support" @@ -4846,6 +4871,10 @@ if test "$rdma" = "yes" ; then echo "CONFIG_RDMA=y" >> $config_host_mak fi +if test "$libxml2" = "yes" ; then + echo "CONFIG_LIBXML2=y" >> $config_host_mak +fi + # Hold two types of flag: # CONFIG_THREAD_SETNAME_BYTHREAD - we've got a way of setting the name on # a thread we have a handle to diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 053e432..3ddb097 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -242,6 +242,7 @@ our @typeList = ( qr{${Ident}_t}, qr{${Ident}_handler}, qr{${Ident}_handler_fn}, + qr{xml${Ident}}, ); our @modifierList = ( qr{fastcall}, -- 1.9.1
[Qemu-devel] [PATCH 6/9] iotests: simple parallels XML disk descriptor file test added
Signed-off-by: Denis V. Lunev Acked-by: Roman Kagan CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- tests/qemu-iotests/076| 6 ++ tests/qemu-iotests/076.out| 4 tests/qemu-iotests/sample_images/parallels-simple.xml.bz2 | Bin 0 -> 374 bytes 3 files changed, 10 insertions(+) create mode 100644 tests/qemu-iotests/sample_images/parallels-simple.xml.bz2 diff --git a/tests/qemu-iotests/076 b/tests/qemu-iotests/076 index 98d67b3..8474fdd 100755 --- a/tests/qemu-iotests/076 +++ b/tests/qemu-iotests/076 @@ -75,6 +75,12 @@ echo "== Read from a valid v2 image ==" _use_sample_img parallels-v2.bz2 { $QEMU_IO -c "read -P 0x11 0 64k" $TEST_IMG; } 2>&1 | _filter_qemu_io | _filter_testdir +echo +echo "== Read from a valid v2 image opened through xml ==" +_use_sample_img parallels-v2.bz2 +_use_sample_img parallels-simple.xml.bz2 +{ $QEMU_IO -c "read -P 0x11 0 64k" $TEST_IMG; } 2>&1 | _filter_qemu_io | _filter_testdir + # success, all done echo "*** done" rm -f $seq.full diff --git a/tests/qemu-iotests/076.out b/tests/qemu-iotests/076.out index 32ade08..628d9bf 100644 --- a/tests/qemu-iotests/076.out +++ b/tests/qemu-iotests/076.out @@ -19,4 +19,8 @@ no file open, try 'help open' == Read from a valid v2 image == read 65536/65536 bytes at offset 0 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) + +== Read from a valid v2 image opened through xml == +read 65536/65536 bytes at offset 0 +64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) *** done diff --git a/tests/qemu-iotests/sample_images/parallels-simple.xml.bz2 b/tests/qemu-iotests/sample_images/parallels-simple.xml.bz2 new file mode 100644 index ..13760bc9f30246ba01f43cdca3d3d90780430f61 GIT binary patch literal 374 zcmV-+0g3)XT4*^jL0KkKSx`RPoB#ls-+)w5Py_$xpWsdazwh08Faddp1^@s613&-( z00UFVg-_KD8fY4NVup`X5ukcZBTOVwCXAT_OhRY?0}wJ9sXCPa+IAr*0!sLV#k$+r zFRIp6TJ>A%(u7Qkn2MJtX_(TWtce9REA{g&qLoa&l#2??q)aKM7Qx|d6glC9JSZw~ z-{Dy+9yLxWKg_2Xm02klNIFQqySl|-EJ?WkLWzOBm`Xui7_);kgut`D^IdcIOOiAVM)WoC{=C8L4k%Ybsug63k@+f`frkh@_Ij10W;xr->@K!oLkGw3Lv$ zS&-sh3*_e*@-+(JnDfSD#7GS;nh3VYlxtRDgM
[Qemu-devel] [PATCH 2/9] iotests: add v2 parallels sample image and simple test for it
This is simple test image for the following commit made by me. commit d25d59802021a747812472780d80a0e792078f40 Author: Denis V. Lunev Date: Mon Jul 28 20:23:55 2014 +0400 parallels: 2TB+ parallels images support Signed-off-by: Denis V. Lunev Reviewed-by: Paolo Bonzini CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- tests/qemu-iotests/076| 5 + tests/qemu-iotests/076.out| 4 tests/qemu-iotests/sample_images/parallels-v2.bz2 | Bin 0 -> 150 bytes 3 files changed, 9 insertions(+) create mode 100644 tests/qemu-iotests/sample_images/parallels-v2.bz2 diff --git a/tests/qemu-iotests/076 b/tests/qemu-iotests/076 index a300ee2..98d67b3 100755 --- a/tests/qemu-iotests/076 +++ b/tests/qemu-iotests/076 @@ -70,6 +70,11 @@ _use_sample_img parallels-v1.bz2 poke_file "$TEST_IMG" "$tracks_offset" "\x00\x00\x00\x00" { $QEMU_IO -c "read 0 512" $TEST_IMG; } 2>&1 | _filter_qemu_io | _filter_testdir +echo +echo "== Read from a valid v2 image ==" +_use_sample_img parallels-v2.bz2 +{ $QEMU_IO -c "read -P 0x11 0 64k" $TEST_IMG; } 2>&1 | _filter_qemu_io | _filter_testdir + # success, all done echo "*** done" rm -f $seq.full diff --git a/tests/qemu-iotests/076.out b/tests/qemu-iotests/076.out index fd26f3c..32ade08 100644 --- a/tests/qemu-iotests/076.out +++ b/tests/qemu-iotests/076.out @@ -15,4 +15,8 @@ no file open, try 'help open' == Zero sectors per track == qemu-io: can't open device TEST_DIR/parallels-v1: Invalid image: Zero sectors per track no file open, try 'help open' + +== Read from a valid v2 image == +read 65536/65536 bytes at offset 0 +64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) *** done diff --git a/tests/qemu-iotests/sample_images/parallels-v2.bz2 b/tests/qemu-iotests/sample_images/parallels-v2.bz2 new file mode 100644 index ..fd8614d061172faae50a993ac7acd3173de9aa98 GIT binary patch literal 150 zcmV;H0BQe1T4*^jL0KkKS<`z1ga8U8f6)C%U;t162ml8F2!JYJ)<8f2004jpFaWp= zU;vl^35GBLOaKJHsVaJ=X*QsGnW)758mCWPt$+@EvggMtTP~K8Aeq(bOlAS_fGVI1 zR{#%`0aSoc2hsqlKqv!50aXBg@8a8e-{1Q8zBk4(ssXG4tO2Y6qyhde
[Qemu-devel] [PATCH 5/9] block/parallels: allow to specify DiskDescriptor.xml instead of image file
Typically Parallels disk bundle consists of several images which are glued by XML disk descriptor. Also XML hides inside several important parameters which are not available in the image header. This patch allows to specify this XML file as a filename for an image to open. It is allowed only to open Compressed images with the only snapshot inside. No additional options are parsed at the moment. The code itself is dumb enough for a while. If XML file is specified, the file is parsed and the image is reopened as bs->file to keep the rest of the driver untouched. This would be changed later with more features added. Signed-off-by: Denis V. Lunev Acked-by: Roman Kagan CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 231 -- 1 file changed, 226 insertions(+), 5 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index 4f9cd8d..201c0f1 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -27,6 +27,11 @@ #include "block/block_int.h" #include "qemu/module.h" +#if CONFIG_LIBXML2 +#include +#include +#endif + /**/ #define HEADER_MAGIC "WithoutFreeSpace" @@ -34,6 +39,10 @@ #define HEADER_VERSION 2 #define HEADER_SIZE 64 +#define PARALLELS_XML 100 +#define PARALLELS_IMAGE 101 + + // always little-endian struct parallels_header { char magic[16]; // "WithoutFreeSpace" @@ -59,6 +68,194 @@ typedef struct BDRVParallelsState { unsigned int off_multiplier; } BDRVParallelsState; +static int parallels_open_image(BlockDriverState *bs, Error **errp); + +#if CONFIG_LIBXML2 +static xmlNodePtr xml_find(xmlNode *node, const char *elem) +{ +xmlNode *child; + +for (child = node->xmlChildrenNode; child != NULL; child = child->next) { +if (!xmlStrcmp(child->name, (const xmlChar *)elem) && +child->type == XML_ELEMENT_NODE) { +return child; +} +} +return NULL; +} + +static xmlNodePtr xml_seek(xmlNode *root, const char *elem) +{ +xmlNode *child = root; +const char *path; +char nodename[128]; +int last = 0; + +path = elem; +if (path[0] == '/') { +path++; +} +if (path[0] == 0) { +return NULL; +} +while (!last) { +const char *p = strchr(path, '/'); +int length; +if (p == NULL) { +length = strlen(path); +last = 1; +} else { +length = p - path; +} +memcpy(nodename, path, length); +nodename[length] = 0; +child = xml_find(child, nodename); +if (child == NULL) { +return NULL; +} +path = ++p; +} +return child; +} + +static const char *xml_get_text(xmlNode *node, const char *name) +{ +xmlNode *child; + +node = xml_seek(node, name); +if (node == NULL) { +return NULL; +} + +for (child = node->xmlChildrenNode; child; child = child->next) { +if (child->type == XML_TEXT_NODE) { +return (const char *)child->content; +} +} +return NULL; +} + +static int parallels_open_xml(BlockDriverState *bs, int flags, Error **errp) +{ +int size, ret; +xmlDoc *doc = NULL; +xmlNode *root, *image; +char *xml = NULL; +const char *data; +char image_path[PATH_MAX]; +Error *local_err = NULL; + +ret = size = bdrv_getlength(bs->file); +if (ret < 0) { +goto fail; +} +/* XML file size should be resonable */ +ret = -EFBIG; +if (size > 65536) { +goto fail; +} + +xml = g_malloc(size + 1); + +ret = bdrv_pread(bs->file, 0, xml, size); +if (ret != size) { +goto fail; +} +xml[size] = 0; + +ret = -EINVAL; +doc = xmlReadMemory(xml, size, NULL, NULL, +XML_PARSE_NOERROR | XML_PARSE_NOWARNING); +if (doc == NULL) { +goto fail; +} +root = xmlDocGetRootElement(doc); +if (root == NULL) { +goto fail; +} +image = xml_seek(root, "/StorageData/Storage/Image"); +data = ""; /* make gcc happy */ +for (size = 0; image != NULL; image = image->next) { +if (image->type != XML_ELEMENT_NODE) { +continue; +} + +size++; +data = xml_get_text(image, "Type"); +if (data != NULL && strcmp(data, "Compressed")) { +error_setg(errp, "Only compressed Parallels images are supported"); +goto done; +} + +data = xml_get_text(image, "File"); +if (data == NULL) { +goto fail; +} +} +/* Images with more than 1 snapshots are not supported at the moment */ +if (size != 1) { +error_setg(errp, "Parallels images with sn
[Qemu-devel] [PATCH 0/9] parallels format support improvements
This patchset contains my previous patchset sent originally 08.10.2014 with minor Valgrind fix and iotests update as a base. The patchset implements additional compatibility bits for Parallels format: - initial support of parsing of Parallels DiskDeskriptor.xml Typically Parallels disk bundle consists of several images which are glued by XML disk descriptor. Also XML hides inside several important parameters which are not available in the image header. - support for padded Parallels images. For the time being Parallels was created an optimization for such OSes in its desktop product. Desktop users are not qualified enough to create properly aligned installations. Thus Parallels makes a blind guess on a customer behalf and creates so-called "padded" images if guest OS type is specified as WinXP, Win2k and Win2k3. The code uses approach from VMDK support, either image or XML descriptor could be used. Though there is temporary hack in the opening code: BlockDriverState->file is being reopened inside parallels_open. I prefer to keep this code in this state till proper Parallels snapshots support in order to minimize current changes. CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi CC: Roman Kagan
[Qemu-devel] [PATCH 3/9] block/parallels: fix access to not initialized memory in catalog_bitmap
found by valgrind. Command: ./qemu-img convert -f parallels -O qcow2 1.hds 1.img Invalid read of size 4 at 0x17D0EF: parallels_co_read (parallels.c:357) by 0x11FEE4: bdrv_aio_rw_vector (block.c:4640) by 0x11FFBF: bdrv_aio_readv_em (block.c:4652) by 0x11F55F: bdrv_co_readv_em (block.c:4862) by 0x123428: bdrv_aligned_preadv (block.c:3056) by 0x1239FA: bdrv_co_do_preadv (block.c:3162) by 0x125424: bdrv_rw_co_entry (block.c:2706) by 0x155DD9: coroutine_trampoline (coroutine-ucontext.c:118) by 0x6975B6F: ??? (in /lib/x86_64-linux-gnu/libc-2.19.so) The problem is that s->catalog_bitmap is allocated/filled as gmalloc(s->catalog_size) thus index validity check must be inclusive, i.e. index >= s->catalog_size is invalid. Signed-off-by: Denis V. Lunev Reviewed-by: Paolo Bonzini CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/parallels.c b/block/parallels.c index 2a814f3..4f9cd8d 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -155,7 +155,7 @@ static int64_t seek_to_sector(BlockDriverState *bs, int64_t sector_num) offset = sector_num % s->tracks; /* not allocated */ -if ((index > s->catalog_size) || (s->catalog_bitmap[index] == 0)) +if ((index >= s->catalog_size) || (s->catalog_bitmap[index] == 0)) return -1; return ((uint64_t)s->catalog_bitmap[index] * s->off_multiplier + offset) * 512; -- 1.9.1
[Qemu-devel] [PATCH 9/9] parallels: change copyright information in the image header
Signed-off-by: Denis V. Lunev Acked-by: Roman Kagan CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/block/parallels.c b/block/parallels.c index d07e4f7..1491211 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -2,8 +2,10 @@ * Block driver for Parallels disk image format * * Copyright (c) 2007 Alex Beregszaszi + * Copyright (c) 2014 Denis V. Lunev * - * This code is based on comparing different disk images created by Parallels. + * This code was originally based on comparing different disk images created + * by Parallels. * * Permission is hereby granted, free of charge, to any person obtaining a copy * of this software and associated documentation files (the "Software"), to deal -- 1.9.1
[Qemu-devel] [PATCH 8/9] iotests: padded parallels image test
Unfortunately, old guest OSes do not align partitions to page size by default. This is true for Windows 2003 and Windows XP. "Padding" is a value which should be added to guest LBA to obtain sector number inside the image. This results in a shifted images. 0123offset inside image (in 512 byte sectors) +--- +.012guest data (512 byte sectors) +--- The information about this is available in DiskDescriptor.xml ONLY. There is no such data in the image header. This patch contains very simple image with padding and corresponding XML disk descriptor created in authentic way. Signed-off-by: Denis V. Lunev Acked-by: Roman Kagan CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- tests/qemu-iotests/076| 6 ++ tests/qemu-iotests/076.out| 4 tests/qemu-iotests/sample_images/parallels-padded.xml.bz2 | Bin 0 -> 377 bytes tests/qemu-iotests/sample_images/parallels-v2-padded.bz2 | Bin 0 -> 139 bytes 4 files changed, 10 insertions(+) create mode 100644 tests/qemu-iotests/sample_images/parallels-padded.xml.bz2 create mode 100644 tests/qemu-iotests/sample_images/parallels-v2-padded.bz2 diff --git a/tests/qemu-iotests/076 b/tests/qemu-iotests/076 index 8474fdd..fb2b41d 100755 --- a/tests/qemu-iotests/076 +++ b/tests/qemu-iotests/076 @@ -81,6 +81,12 @@ _use_sample_img parallels-v2.bz2 _use_sample_img parallels-simple.xml.bz2 { $QEMU_IO -c "read -P 0x11 0 64k" $TEST_IMG; } 2>&1 | _filter_qemu_io | _filter_testdir +echo +echo "== Read from a valid v2 image opened through xml with padding ==" +_use_sample_img parallels-v2-padded.bz2 +_use_sample_img parallels-padded.xml.bz2 +{ $QEMU_IO -c "read -P 0x11 0 64k" $TEST_IMG; } 2>&1 | _filter_qemu_io | _filter_testdir + # success, all done echo "*** done" rm -f $seq.full diff --git a/tests/qemu-iotests/076.out b/tests/qemu-iotests/076.out index 628d9bf..46680d8 100644 --- a/tests/qemu-iotests/076.out +++ b/tests/qemu-iotests/076.out @@ -23,4 +23,8 @@ read 65536/65536 bytes at offset 0 == Read from a valid v2 image opened through xml == read 65536/65536 bytes at offset 0 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) + +== Read from a valid v2 image opened through xml with padding == +read 65536/65536 bytes at offset 0 +64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) *** done diff --git a/tests/qemu-iotests/sample_images/parallels-padded.xml.bz2 b/tests/qemu-iotests/sample_images/parallels-padded.xml.bz2 new file mode 100644 index ..90e481b88e1f4d65256d0511a8509114c6358052 GIT binary patch literal 377 zcmV-<0fzoUT4*^jL0KkKS+JP`&Hw4nhyVtdfq>MRN`P%Vs7e5mydg1cvP<`s zYo%7Ss&>_C5i%-bDqP)#HmEBiP+TU)8WyoiurHMw!l_ZF1>8_3m3N^^QwEd8=8hkN zLO3;VtVoT0C3B518KN#oIV4@)7DfOSbkchYBQVO$9`K;#(G?m(FfikRxR&q(!eB`z zPfQaAWzYbTAj%3PZH#S=y0Cm|7Ih97TD6vNg%Klm)F8;i7;OxS)eka zB8A?w-LbwRX=_?uTcBnnP&Yyj78F)siB literal 0 HcmV?d1 -- 1.9.1
[Qemu-devel] [PATCH 3/6] iotests: simple parallels XML disk descriptor file test added
Signed-off-by: Denis V. Lunev Acked-by: Roman Kagan CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- tests/qemu-iotests/076| 6 ++ tests/qemu-iotests/076.out| 4 tests/qemu-iotests/sample_images/parallels-simple.xml.bz2 | Bin 0 -> 374 bytes 3 files changed, 10 insertions(+) create mode 100644 tests/qemu-iotests/sample_images/parallels-simple.xml.bz2 diff --git a/tests/qemu-iotests/076 b/tests/qemu-iotests/076 index ed2be35..242d0c3 100755 --- a/tests/qemu-iotests/076 +++ b/tests/qemu-iotests/076 @@ -75,6 +75,12 @@ echo "== Read from a valid v2 image ==" _use_sample_img parallels-v2.bz2 { $QEMU_IO -c "read -P 0x11 0 64k" $TEST_IMG; } 2>&1 | _filter_qemu_io | _filter_testdir +echo +echo "== Read from a valid v2 image opened through xml ==" +_use_sample_img parallels-v2.bz2 +_use_sample_img parallels-simple.xml.bz2 +{ $QEMU_IO -c "read -P 0x11 0 64k" $TEST_IMG; } 2>&1 | _filter_qemu_io | _filter_testdir + # success, all done echo "*** done" rm -f $seq.full diff --git a/tests/qemu-iotests/076.out b/tests/qemu-iotests/076.out index 32ade08..628d9bf 100644 --- a/tests/qemu-iotests/076.out +++ b/tests/qemu-iotests/076.out @@ -19,4 +19,8 @@ no file open, try 'help open' == Read from a valid v2 image == read 65536/65536 bytes at offset 0 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) + +== Read from a valid v2 image opened through xml == +read 65536/65536 bytes at offset 0 +64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) *** done diff --git a/tests/qemu-iotests/sample_images/parallels-simple.xml.bz2 b/tests/qemu-iotests/sample_images/parallels-simple.xml.bz2 new file mode 100644 index ..13760bc9f30246ba01f43cdca3d3d90780430f61 GIT binary patch literal 374 zcmV-+0g3)XT4*^jL0KkKSx`RPoB#ls-+)w5Py_$xpWsdazwh08Faddp1^@s613&-( z00UFVg-_KD8fY4NVup`X5ukcZBTOVwCXAT_OhRY?0}wJ9sXCPa+IAr*0!sLV#k$+r zFRIp6TJ>A%(u7Qkn2MJtX_(TWtce9REA{g&qLoa&l#2??q)aKM7Qx|d6glC9JSZw~ z-{Dy+9yLxWKg_2Xm02klNIFQqySl|-EJ?WkLWzOBm`Xui7_);kgut`D^IdcIOOiAVM)WoC{=C8L4k%Ybsug63k@+f`frkh@_Ij10W;xr->@K!oLkGw3Lv$ zS&-sh3*_e*@-+(JnDfSD#7GS;nh3VYlxtRDgM
[Qemu-devel] [PATCH 1/6] configure: add dependency from libxml2
This dependency is required for adequate Parallels images support. Typically the disk consists of several images which are glued by XML disk descriptor. Also XML hides inside several important parameters which are not available in the image header. The patch also adds clause to checkpatch.pl to understand libxml2 types. In the other case there is the following false positive: ERROR: need consistent spacing around '*' (ctx:WxB) #210: FILE: block/parallels.c:232: + !xmlStrcmp(root->name, (const xmlChar *)"Parallels_disk_image")) Signed-off-by: Denis V. Lunev Acked-by: Roman Kagan CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi CC: Paolo Bonzini --- configure | 29 + scripts/checkpatch.pl | 1 + 2 files changed, 30 insertions(+) diff --git a/configure b/configure index a9e4d49..5426752 100755 --- a/configure +++ b/configure @@ -335,6 +335,7 @@ libssh2="" vhdx="" quorum="" numa="" +libxml2="" # parse CC options first for opt do @@ -1129,6 +1130,10 @@ for opt do ;; --enable-numa) numa="yes" ;; + --disable-libxml2) libxml2="no" + ;; + --enable-libxml2) libxml2="yes" + ;; *) echo "ERROR: unknown option $opt" echo "Try '$0 --help' for more information" @@ -1400,6 +1405,8 @@ Advanced options (experts only): --enable-quorum enable quorum block filter support --disable-numa disable libnuma support --enable-numaenable libnuma support + --disable-libxml2disable libxml2 (for Parallels image format) + --enable-libxml2 enable libxml2 (for Parallels image format) NOTE: The object files are built at the place where configure is launched EOF @@ -4097,6 +4104,23 @@ if test -z "$zero_malloc" ; then fi fi +# check for libxml2 +if test "$libxml2" != "no" ; then +if $pkg_config --exists libxml-2.0; then +libxml2="yes" +libxml2_cflags=$($pkg_config --cflags libxml-2.0) +libxml2_libs=$($pkg_config --libs libxml-2.0) +QEMU_CFLAGS="$QEMU_CFLAGS $libxml2_cflags" +libs_softmmu="$libs_softmmu $libxml2_libs" +libs_tools="$libs_tools $libxml2_libs" +else +if test "$libxml2" = "yes"; then +feature_not_found "libxml2" "Install libxml2 devel" +fi +libxml2="no" +fi +fi + # Now we've finished running tests it's OK to add -Werror to the compiler flags if test "$werror" = "yes"; then QEMU_CFLAGS="-Werror $QEMU_CFLAGS" @@ -4340,6 +4364,7 @@ echo "Quorum$quorum" echo "lzo support $lzo" echo "snappy support$snappy" echo "NUMA host support $numa" +echo "libxml2 $libxml2" if test "$sdl_too_old" = "yes"; then echo "-> Your SDL version is too old - please upgrade to have SDL support" @@ -4846,6 +4871,10 @@ if test "$rdma" = "yes" ; then echo "CONFIG_RDMA=y" >> $config_host_mak fi +if test "$libxml2" = "yes" ; then + echo "CONFIG_LIBXML2=y" >> $config_host_mak +fi + # Hold two types of flag: # CONFIG_THREAD_SETNAME_BYTHREAD - we've got a way of setting the name on # a thread we have a handle to diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 053e432..3ddb097 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -242,6 +242,7 @@ our @typeList = ( qr{${Ident}_t}, qr{${Ident}_handler}, qr{${Ident}_handler_fn}, + qr{xml${Ident}}, ); our @modifierList = ( qr{fastcall}, -- 1.9.1
[Qemu-devel] [PATCH 2/6] block/parallels: allow to specify DiskDescriptor.xml instead of image file
Typically Parallels disk bundle consists of several images which are glued by XML disk descriptor. Also XML hides inside several important parameters which are not available in the image header. This patch allows to specify this XML file as a filename for an image to open. It is allowed only to open Compressed images with the only snapshot inside. No additional options are parsed at the moment. The code itself is dumb enough for a while. If XML file is specified, the file is parsed and the image is reopened as bs->file to keep the rest of the driver untouched. This would be changed later with more features added. Signed-off-by: Denis V. Lunev Acked-by: Roman Kagan CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 231 -- 1 file changed, 226 insertions(+), 5 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index 4f9cd8d..201c0f1 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -27,6 +27,11 @@ #include "block/block_int.h" #include "qemu/module.h" +#if CONFIG_LIBXML2 +#include +#include +#endif + /**/ #define HEADER_MAGIC "WithoutFreeSpace" @@ -34,6 +39,10 @@ #define HEADER_VERSION 2 #define HEADER_SIZE 64 +#define PARALLELS_XML 100 +#define PARALLELS_IMAGE 101 + + // always little-endian struct parallels_header { char magic[16]; // "WithoutFreeSpace" @@ -59,6 +68,194 @@ typedef struct BDRVParallelsState { unsigned int off_multiplier; } BDRVParallelsState; +static int parallels_open_image(BlockDriverState *bs, Error **errp); + +#if CONFIG_LIBXML2 +static xmlNodePtr xml_find(xmlNode *node, const char *elem) +{ +xmlNode *child; + +for (child = node->xmlChildrenNode; child != NULL; child = child->next) { +if (!xmlStrcmp(child->name, (const xmlChar *)elem) && +child->type == XML_ELEMENT_NODE) { +return child; +} +} +return NULL; +} + +static xmlNodePtr xml_seek(xmlNode *root, const char *elem) +{ +xmlNode *child = root; +const char *path; +char nodename[128]; +int last = 0; + +path = elem; +if (path[0] == '/') { +path++; +} +if (path[0] == 0) { +return NULL; +} +while (!last) { +const char *p = strchr(path, '/'); +int length; +if (p == NULL) { +length = strlen(path); +last = 1; +} else { +length = p - path; +} +memcpy(nodename, path, length); +nodename[length] = 0; +child = xml_find(child, nodename); +if (child == NULL) { +return NULL; +} +path = ++p; +} +return child; +} + +static const char *xml_get_text(xmlNode *node, const char *name) +{ +xmlNode *child; + +node = xml_seek(node, name); +if (node == NULL) { +return NULL; +} + +for (child = node->xmlChildrenNode; child; child = child->next) { +if (child->type == XML_TEXT_NODE) { +return (const char *)child->content; +} +} +return NULL; +} + +static int parallels_open_xml(BlockDriverState *bs, int flags, Error **errp) +{ +int size, ret; +xmlDoc *doc = NULL; +xmlNode *root, *image; +char *xml = NULL; +const char *data; +char image_path[PATH_MAX]; +Error *local_err = NULL; + +ret = size = bdrv_getlength(bs->file); +if (ret < 0) { +goto fail; +} +/* XML file size should be resonable */ +ret = -EFBIG; +if (size > 65536) { +goto fail; +} + +xml = g_malloc(size + 1); + +ret = bdrv_pread(bs->file, 0, xml, size); +if (ret != size) { +goto fail; +} +xml[size] = 0; + +ret = -EINVAL; +doc = xmlReadMemory(xml, size, NULL, NULL, +XML_PARSE_NOERROR | XML_PARSE_NOWARNING); +if (doc == NULL) { +goto fail; +} +root = xmlDocGetRootElement(doc); +if (root == NULL) { +goto fail; +} +image = xml_seek(root, "/StorageData/Storage/Image"); +data = ""; /* make gcc happy */ +for (size = 0; image != NULL; image = image->next) { +if (image->type != XML_ELEMENT_NODE) { +continue; +} + +size++; +data = xml_get_text(image, "Type"); +if (data != NULL && strcmp(data, "Compressed")) { +error_setg(errp, "Only compressed Parallels images are supported"); +goto done; +} + +data = xml_get_text(image, "File"); +if (data == NULL) { +goto fail; +} +} +/* Images with more than 1 snapshots are not supported at the moment */ +if (size != 1) { +error_setg(errp, "Parallels images with sn
[Qemu-devel] [PATCH 6/6] parallels: change copyright information in the image header
Signed-off-by: Denis V. Lunev Acked-by: Roman Kagan CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/block/parallels.c b/block/parallels.c index d07e4f7..1491211 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -2,8 +2,10 @@ * Block driver for Parallels disk image format * * Copyright (c) 2007 Alex Beregszaszi + * Copyright (c) 2014 Denis V. Lunev * - * This code is based on comparing different disk images created by Parallels. + * This code was originally based on comparing different disk images created + * by Parallels. * * Permission is hereby granted, free of charge, to any person obtaining a copy * of this software and associated documentation files (the "Software"), to deal -- 1.9.1
[Qemu-devel] [PATCH 4/6] block/parallels: support padded Parallels images
Unfortunately, old guest OSes do not align partitions to page size by default. This is true for Windows 2003 and Windows XP. For the time being Parallels was created an optimization for such OSes in its desktop product. Desktop users are not qualified enough to create properly aligned installations. Thus Parallels makes a blind guess on a customer behalf and creates so-called "padded" images if guest OS type is specified as WinXP, Win2k and Win2k3. "Padding" is a value which should be added to guest LBA to obtain sector number inside the image. This results in a shifted images. 0123offset inside image (in 512 byte sectors) +--- +.012guest data (512 byte sectors) +--- The information about this is available in DiskDescriptor.xml ONLY. There is no such data in the image header. There share of such images could be evaluated as 6-8% according to the statistics in my hands. This patch obtains proper value from XML and applies it on reading. Signed-off-by: Denis V. Lunev Acked-by: Roman Kagan CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/block/parallels.c b/block/parallels.c index 201c0f1..d07e4f7 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -66,6 +66,7 @@ typedef struct BDRVParallelsState { unsigned int tracks; unsigned int off_multiplier; +unsigned int padding; } BDRVParallelsState; static int parallels_open_image(BlockDriverState *bs, Error **errp); @@ -144,6 +145,7 @@ static int parallels_open_xml(BlockDriverState *bs, int flags, Error **errp) const char *data; char image_path[PATH_MAX]; Error *local_err = NULL; +BDRVParallelsState *s = bs->opaque; ret = size = bdrv_getlength(bs->file); if (ret < 0) { @@ -173,6 +175,19 @@ static int parallels_open_xml(BlockDriverState *bs, int flags, Error **errp) if (root == NULL) { goto fail; } + +data = xml_get_text(root, "/Disk_Parameters/Padding"); +if (data != NULL) { +char *endptr; +unsigned long pad; + +pad = strtoul(data, &endptr, 0); +if ((endptr != NULL && *endptr != '\0') || pad > UINT_MAX) { +goto fail; +} +s->padding = (uint32_t)pad; +} + image = xml_seek(root, "/StorageData/Storage/Image"); data = ""; /* make gcc happy */ for (size = 0; image != NULL; image = image->next) { @@ -385,6 +400,10 @@ static int64_t seek_to_sector(BlockDriverState *bs, int64_t sector_num) static int parallels_read(BlockDriverState *bs, int64_t sector_num, uint8_t *buf, int nb_sectors) { +BDRVParallelsState *s = bs->opaque; + +sector_num += s->padding; + while (nb_sectors > 0) { int64_t position = seek_to_sector(bs, sector_num); if (position >= 0) { -- 1.9.1
[Qemu-devel] [PATCH v2 0/6] parallels format support improvements
The patchset implements additional compatibility bits for Parallels format: - initial support of parsing of Parallels DiskDeskriptor.xml Typically Parallels disk bundle consists of several images which are glued by XML disk descriptor. Also XML hides inside several important parameters which are not available in the image header. - support for padded Parallels images. For the time being Parallels was created an optimization for such OSes in its desktop product. Desktop users are not qualified enough to create properly aligned installations. Thus Parallels makes a blind guess on a customer behalf and creates so-called "padded" images if guest OS type is specified as WinXP, Win2k and Win2k3. The code uses approach from VMDK support, either image or XML descriptor could be used. Though there is temporary hack in the opening code: BlockDriverState->file is being reopened inside parallels_open. I prefer to keep this code in this state till proper Parallels snapshots support in order to minimize current changes. Changes from v1: - dropped already merged part (original patches 1-3) CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi CC: Roman Kagan
[Qemu-devel] [PATCH 5/6] iotests: padded parallels image test
Unfortunately, old guest OSes do not align partitions to page size by default. This is true for Windows 2003 and Windows XP. "Padding" is a value which should be added to guest LBA to obtain sector number inside the image. This results in a shifted images. 0123offset inside image (in 512 byte sectors) +--- +.012guest data (512 byte sectors) +--- The information about this is available in DiskDescriptor.xml ONLY. There is no such data in the image header. This patch contains very simple image with padding and corresponding XML disk descriptor created in authentic way. Signed-off-by: Denis V. Lunev Acked-by: Roman Kagan CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- tests/qemu-iotests/076| 6 ++ tests/qemu-iotests/076.out| 4 tests/qemu-iotests/sample_images/parallels-padded.xml.bz2 | Bin 0 -> 377 bytes tests/qemu-iotests/sample_images/parallels-v2-padded.bz2 | Bin 0 -> 139 bytes 4 files changed, 10 insertions(+) create mode 100644 tests/qemu-iotests/sample_images/parallels-padded.xml.bz2 create mode 100644 tests/qemu-iotests/sample_images/parallels-v2-padded.bz2 diff --git a/tests/qemu-iotests/076 b/tests/qemu-iotests/076 index 242d0c3..c83b953 100755 --- a/tests/qemu-iotests/076 +++ b/tests/qemu-iotests/076 @@ -81,6 +81,12 @@ _use_sample_img parallels-v2.bz2 _use_sample_img parallels-simple.xml.bz2 { $QEMU_IO -c "read -P 0x11 0 64k" $TEST_IMG; } 2>&1 | _filter_qemu_io | _filter_testdir +echo +echo "== Read from a valid v2 image opened through xml with padding ==" +_use_sample_img parallels-v2-padded.bz2 +_use_sample_img parallels-padded.xml.bz2 +{ $QEMU_IO -c "read -P 0x11 0 64k" $TEST_IMG; } 2>&1 | _filter_qemu_io | _filter_testdir + # success, all done echo "*** done" rm -f $seq.full diff --git a/tests/qemu-iotests/076.out b/tests/qemu-iotests/076.out index 628d9bf..46680d8 100644 --- a/tests/qemu-iotests/076.out +++ b/tests/qemu-iotests/076.out @@ -23,4 +23,8 @@ read 65536/65536 bytes at offset 0 == Read from a valid v2 image opened through xml == read 65536/65536 bytes at offset 0 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) + +== Read from a valid v2 image opened through xml with padding == +read 65536/65536 bytes at offset 0 +64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) *** done diff --git a/tests/qemu-iotests/sample_images/parallels-padded.xml.bz2 b/tests/qemu-iotests/sample_images/parallels-padded.xml.bz2 new file mode 100644 index ..90e481b88e1f4d65256d0511a8509114c6358052 GIT binary patch literal 377 zcmV-<0fzoUT4*^jL0KkKS+JP`&Hw4nhyVtdfq>MRN`P%Vs7e5mydg1cvP<`s zYo%7Ss&>_C5i%-bDqP)#HmEBiP+TU)8WyoiurHMw!l_ZF1>8_3m3N^^QwEd8=8hkN zLO3;VtVoT0C3B518KN#oIV4@)7DfOSbkchYBQVO$9`K;#(G?m(FfikRxR&q(!eB`z zPfQaAWzYbTAj%3PZH#S=y0Cm|7Ih97TD6vNg%Klm)F8;i7;OxS)eka zB8A?w-LbwRX=_?uTcBnnP&Yyj78F)siB literal 0 HcmV?d1 -- 1.9.1
[Qemu-devel] [PATCH 3/4] block/parallels: split check for parallels format in parallels_open
and rework error path a bit. There is no difference at the moment, but the code will be definitely shorter when additional processing will be required for WithouFreSpacExt Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index 8f9ec8a..02739cf 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -85,12 +85,10 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, goto fail; } -if (memcmp(ph.magic, HEADER_MAGIC, 16) || -(le32_to_cpu(ph.version) != HEADER_VERSION)) { -error_setg(errp, "Image not in Parallels format"); -ret = -EINVAL; -goto fail; -} +if (le32_to_cpu(ph.version) != HEADER_VERSION) +goto fail_format; +if (memcmp(ph.magic, HEADER_MAGIC, 16)) +goto fail_format; bs->total_sectors = (uint32_t)le64_to_cpu(ph.nb_sectors); @@ -120,6 +118,9 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, qemu_co_mutex_init(&s->lock); return 0; +fail_format: +error_setg(errp, "Image not in Parallels format"); +ret = -EINVAL; fail: g_free(s->catalog_bitmap); return ret; -- 1.9.1
[Qemu-devel] [PATCH 0/4] block/parallels: 2TB+ parallels images support
Parallels has released in the recent updates of Parallels Server 5/6 new addition to his image format. Images with signature WithouFreSpacExt have offsets in the catalog coded not as offsets in sectors (multiple of 512 bytes) but offsets coded in blocks (i.e. header->tracks * 512) In this case to code the virtual disk size for such images nb_sectors field is extended to 64 bits. The reader of older images with signature WithoutFreeSpace must manually zero most valuable bits of nb_sectors on open. Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi
[Qemu-devel] [PATCH 2/4] block/parallels: replace tabs with spaces in block/parallels.c
Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index c44df87..8f9ec8a 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -61,11 +61,11 @@ static int parallels_probe(const uint8_t *buf, int buf_size, const char *filenam const struct parallels_header *ph = (const void *)buf; if (buf_size < HEADER_SIZE) - return 0; +return 0; if (!memcmp(ph->magic, HEADER_MAGIC, 16) && - (le32_to_cpu(ph->version) == HEADER_VERSION)) - return 100; +(le32_to_cpu(ph->version) == HEADER_VERSION)) +return 100; return 0; } @@ -115,7 +115,7 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, } for (i = 0; i < s->catalog_size; i++) - le32_to_cpus(&s->catalog_bitmap[i]); +le32_to_cpus(&s->catalog_bitmap[i]); qemu_co_mutex_init(&s->lock); return 0; @@ -135,7 +135,7 @@ static int64_t seek_to_sector(BlockDriverState *bs, int64_t sector_num) /* not allocated */ if ((index > s->catalog_size) || (s->catalog_bitmap[index] == 0)) - return -1; +return -1; return (uint64_t)(s->catalog_bitmap[index] + offset) * 512; } -- 1.9.1
[Qemu-devel] [PATCH 1/4] block/parallels: extend parallels format header with actual data values
Parallels image format has several additional fields inside: - nb_sectors is actually 64 bit wide. Upper 32bits are not used for images with signature "WithoutFreeSpace" and must be explicitely zeroed according to Parallels. They will be used for images with signature "WithouFreSpacExt" - inuse is magic which means that the image is currently opened for read/write or was not closed correctly, the magic is 0x746f6e59 - data_off is the location of the first data block. It can be zero and in this case This patch adds these values to struct parallels_header and adds proper handling of nb_sectors for currently supported WithoutFreeSpace images. WithouFreSpacExt will be covered in the next patch. Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index 1a5bd35..c44df87 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -41,8 +41,10 @@ struct parallels_header { uint32_t cylinders; uint32_t tracks; uint32_t catalog_entries; -uint32_t nb_sectors; -char padding[24]; +uint64_t nb_sectors; +uint32_t inuse; +uint32_t data_off; +char padding[12]; } QEMU_PACKED; typedef struct BDRVParallelsState { @@ -90,7 +92,7 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, goto fail; } -bs->total_sectors = le32_to_cpu(ph.nb_sectors); +bs->total_sectors = (uint32_t)le64_to_cpu(ph.nb_sectors); s->tracks = le32_to_cpu(ph.tracks); if (s->tracks == 0) { -- 1.9.1
[Qemu-devel] [PATCH 4/4] block/parallels: 2TB+ parallels images support
Parallels has released in the recent updates of Parallels Server 5/6 new addition to his image format. Images with signature WithouFreSpacExt have offsets in the catalog coded not as offsets in sectors (multiple of 512 bytes) but offsets coded in blocks (i.e. header->tracks * 512) In this case all 64 bits of header->nb_sectors are used for image size. This patch implements support of this for qemu-img. Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 20 +++- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index 02739cf..d9cb04f 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -30,6 +30,7 @@ /**/ #define HEADER_MAGIC "WithoutFreeSpace" +#define HEADER_MAGIC2 "WithouFreSpacExt" #define HEADER_VERSION 2 #define HEADER_SIZE 64 @@ -54,6 +55,8 @@ typedef struct BDRVParallelsState { unsigned int catalog_size; unsigned int tracks; + +unsigned int off_multiplier; } BDRVParallelsState; static int parallels_probe(const uint8_t *buf, int buf_size, const char *filename) @@ -63,7 +66,8 @@ static int parallels_probe(const uint8_t *buf, int buf_size, const char *filenam if (buf_size < HEADER_SIZE) return 0; -if (!memcmp(ph->magic, HEADER_MAGIC, 16) && +if ((!memcmp(ph->magic, HEADER_MAGIC, 16) || +!memcmp(ph->magic, HEADER_MAGIC2, 16)) && (le32_to_cpu(ph->version) == HEADER_VERSION)) return 100; @@ -85,13 +89,18 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, goto fail; } +bs->total_sectors = le64_to_cpu(ph.nb_sectors); + if (le32_to_cpu(ph.version) != HEADER_VERSION) goto fail_format; -if (memcmp(ph.magic, HEADER_MAGIC, 16)) +if (!memcmp(ph.magic, HEADER_MAGIC, 16)) { +s->off_multiplier = 1; +bs->total_sectors = (uint32_t)bs->total_sectors; +} else if (!memcmp(ph.magic, HEADER_MAGIC2, 16)) +s->off_multiplier = le32_to_cpu(ph.tracks); +else goto fail_format; -bs->total_sectors = (uint32_t)le64_to_cpu(ph.nb_sectors); - s->tracks = le32_to_cpu(ph.tracks); if (s->tracks == 0) { error_setg(errp, "Invalid image: Zero sectors per track"); @@ -137,7 +146,8 @@ static int64_t seek_to_sector(BlockDriverState *bs, int64_t sector_num) /* not allocated */ if ((index > s->catalog_size) || (s->catalog_bitmap[index] == 0)) return -1; -return (uint64_t)(s->catalog_bitmap[index] + offset) * 512; +return +((uint64_t)s->catalog_bitmap[index] * s->off_multiplier + offset) * 512; } static int parallels_read(BlockDriverState *bs, int64_t sector_num, -- 1.9.1
Re: [Qemu-devel] [PATCH 1/4] block/parallels: extend parallels format header with actual data values
On 24/07/14 22:34, Jeff Cody wrote: On Tue, Jul 22, 2014 at 05:19:34PM +0400, Denis V. Lunev wrote: Parallels image format has several additional fields inside: - nb_sectors is actually 64 bit wide. Upper 32bits are not used for images with signature "WithoutFreeSpace" and must be explicitely s/explicitely/explicitly zeroed according to Parallels. They will be used for images with signature "WithouFreSpacExt" - inuse is magic which means that the image is currently opened for read/write or was not closed correctly, the magic is 0x746f6e59 - data_off is the location of the first data block. It can be zero and in this case I think you may have forgotten to finish this sentence :) ok This patch adds these values to struct parallels_header and adds proper handling of nb_sectors for currently supported WithoutFreeSpace images. WithouFreSpacExt will be covered in the next patch. Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index 1a5bd35..c44df87 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -41,8 +41,10 @@ struct parallels_header { uint32_t cylinders; uint32_t tracks; uint32_t catalog_entries; -uint32_t nb_sectors; -char padding[24]; +uint64_t nb_sectors; +uint32_t inuse; +uint32_t data_off; +char padding[12]; } QEMU_PACKED; typedef struct BDRVParallelsState { @@ -90,7 +92,7 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, goto fail; } -bs->total_sectors = le32_to_cpu(ph.nb_sectors); +bs->total_sectors = (uint32_t)le64_to_cpu(ph.nb_sectors); I think an explicit bit mask on the upper 32 bits would fit better here than a cast, especially since neither 'bs->total_sectors' nor 'ph.nb_sectors' is a uint32_t. E.g.: bs->total_sectors = 0x & le64_to_cpu(ph.nb_sectors); ok, will do
Re: [Qemu-devel] [PATCH 4/4] block/parallels: 2TB+ parallels images support
On 24/07/14 23:25, Jeff Cody wrote: On Tue, Jul 22, 2014 at 05:19:37PM +0400, Denis V. Lunev wrote: Parallels has released in the recent updates of Parallels Server 5/6 new addition to his image format. Images with signature WithouFreSpacExt have offsets in the catalog coded not as offsets in sectors (multiple of 512 bytes) but offsets coded in blocks (i.e. header->tracks * 512) In this case all 64 bits of header->nb_sectors are used for image size. This patch implements support of this for qemu-img. Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 20 +++- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index 02739cf..d9cb04f 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -30,6 +30,7 @@ /**/ #define HEADER_MAGIC "WithoutFreeSpace" +#define HEADER_MAGIC2 "WithouFreSpacExt" #define HEADER_VERSION 2 #define HEADER_SIZE 64 @@ -54,6 +55,8 @@ typedef struct BDRVParallelsState { unsigned int catalog_size; unsigned int tracks; + +unsigned int off_multiplier; } BDRVParallelsState; static int parallels_probe(const uint8_t *buf, int buf_size, const char *filename) @@ -63,7 +66,8 @@ static int parallels_probe(const uint8_t *buf, int buf_size, const char *filenam if (buf_size < HEADER_SIZE) return 0; -if (!memcmp(ph->magic, HEADER_MAGIC, 16) && +if ((!memcmp(ph->magic, HEADER_MAGIC, 16) || +!memcmp(ph->magic, HEADER_MAGIC2, 16)) && (le32_to_cpu(ph->version) == HEADER_VERSION)) return 100; @@ -85,13 +89,18 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, goto fail; } +bs->total_sectors = le64_to_cpu(ph.nb_sectors); + bs->total_sectors is a int64_t, and ph.nb_sectors is a uint64_t. Should we do a sanity check here on the max number of sectors? in reality such image will not be usable as we will later have to multiply this count with 512 to calculate offset in the file if (le32_to_cpu(ph.version) != HEADER_VERSION) goto fail_format; -if (memcmp(ph.magic, HEADER_MAGIC, 16)) +if (!memcmp(ph.magic, HEADER_MAGIC, 16)) { +s->off_multiplier = 1; +bs->total_sectors = (uint32_t)bs->total_sectors; (same comment as in patch 1, w.r.t. cast vs. bitmask) ok +} else if (!memcmp(ph.magic, HEADER_MAGIC2, 16)) +s->off_multiplier = le32_to_cpu(ph.tracks); Is there a maximum size in the specification for ph.tracks? no, there is no limitation. Though in reality I have seen the following images only: 252 sectors, 63 sectors, 256 sectors, 2048 sectors and only 2048 was used with this new header. This is just an observation. +else goto fail_format; (same comment as the last patch, w.r.t. braces) -bs->total_sectors = (uint32_t)le64_to_cpu(ph.nb_sectors); - s->tracks = le32_to_cpu(ph.tracks); if (s->tracks == 0) { error_setg(errp, "Invalid image: Zero sectors per track"); @@ -137,7 +146,8 @@ static int64_t seek_to_sector(BlockDriverState *bs, int64_t sector_num) /* not allocated */ if ((index > s->catalog_size) || (s->catalog_bitmap[index] == 0)) return -1; -return (uint64_t)(s->catalog_bitmap[index] + offset) * 512; +return +((uint64_t)s->catalog_bitmap[index] * s->off_multiplier + offset) * 512; Do we need to worry about overflow here, depending on the value of off_multiplier? Also, (and this existed prior to this patch), - we are casting to uint64_t, although the function returns int64_t. good catch. I'll change the declaration to uint64_t and will return 0 from previous if aka if ((index > s->catalog_size) || (s->catalog_bitmap[index] == 0)) It is not possible to obtain 0 as a real offset of any sector as this is an offset of the header. Regarding overflow check. Do you think that we should return error to the upper layer or we could silently fill result data with zeroes as was done originally for the following case 'index > s->catalog_size' ? } static int parallels_read(BlockDriverState *bs, int64_t sector_num, -- 1.9.1
Re: [Qemu-devel] [PATCH 3/4] block/parallels: split check for parallels format in parallels_open
On 24/07/14 22:50, Jeff Cody wrote: On Tue, Jul 22, 2014 at 05:19:36PM +0400, Denis V. Lunev wrote: and rework error path a bit. There is no difference at the moment, but the code will be definitely shorter when additional processing will be required for WithouFreSpacExt Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index 8f9ec8a..02739cf 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -85,12 +85,10 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, goto fail; } -if (memcmp(ph.magic, HEADER_MAGIC, 16) || -(le32_to_cpu(ph.version) != HEADER_VERSION)) { -error_setg(errp, "Image not in Parallels format"); -ret = -EINVAL; -goto fail; -} +if (le32_to_cpu(ph.version) != HEADER_VERSION) +goto fail_format; +if (memcmp(ph.magic, HEADER_MAGIC, 16)) +goto fail_format; QEMU coding style dictates these statements have curly braces, even though they are just one liners. (If you run your patches against scripts/checkpatch.pl, it should catch most style issues). ok, I just used a kernel convention. Will redo this here and in the next path. This is not a problem :) Thank you for pointing this out. I think this patch could also just be squashed into patch 4, if desired. I'd prefer not to do this to have behavior changing and behavior non-changing stuff separated. bs->total_sectors = (uint32_t)le64_to_cpu(ph.nb_sectors); @@ -120,6 +118,9 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, qemu_co_mutex_init(&s->lock); return 0; +fail_format: +error_setg(errp, "Image not in Parallels format"); +ret = -EINVAL; fail: g_free(s->catalog_bitmap); return ret; -- 1.9.1
Re: [Qemu-devel] [PATCH 4/4] block/parallels: 2TB+ parallels images support
On 25/07/14 17:08, Jeff Cody wrote: On Fri, Jul 25, 2014 at 07:51:47AM +0400, Denis V. Lunev wrote: On 24/07/14 23:25, Jeff Cody wrote: On Tue, Jul 22, 2014 at 05:19:37PM +0400, Denis V. Lunev wrote: Parallels has released in the recent updates of Parallels Server 5/6 new addition to his image format. Images with signature WithouFreSpacExt have offsets in the catalog coded not as offsets in sectors (multiple of 512 bytes) but offsets coded in blocks (i.e. header->tracks * 512) In this case all 64 bits of header->nb_sectors are used for image size. This patch implements support of this for qemu-img. Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 20 +++- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index 02739cf..d9cb04f 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -30,6 +30,7 @@ /**/ #define HEADER_MAGIC "WithoutFreeSpace" +#define HEADER_MAGIC2 "WithouFreSpacExt" #define HEADER_VERSION 2 #define HEADER_SIZE 64 @@ -54,6 +55,8 @@ typedef struct BDRVParallelsState { unsigned int catalog_size; unsigned int tracks; + +unsigned int off_multiplier; } BDRVParallelsState; static int parallels_probe(const uint8_t *buf, int buf_size, const char *filename) @@ -63,7 +66,8 @@ static int parallels_probe(const uint8_t *buf, int buf_size, const char *filenam if (buf_size < HEADER_SIZE) return 0; -if (!memcmp(ph->magic, HEADER_MAGIC, 16) && +if ((!memcmp(ph->magic, HEADER_MAGIC, 16) || +!memcmp(ph->magic, HEADER_MAGIC2, 16)) && (le32_to_cpu(ph->version) == HEADER_VERSION)) return 100; @@ -85,13 +89,18 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, goto fail; } +bs->total_sectors = le64_to_cpu(ph.nb_sectors); + bs->total_sectors is a int64_t, and ph.nb_sectors is a uint64_t. Should we do a sanity check here on the max number of sectors? in reality such image will not be usable as we will later have to multiply this count with 512 to calculate offset in the file if (le32_to_cpu(ph.version) != HEADER_VERSION) goto fail_format; -if (memcmp(ph.magic, HEADER_MAGIC, 16)) +if (!memcmp(ph.magic, HEADER_MAGIC, 16)) { +s->off_multiplier = 1; +bs->total_sectors = (uint32_t)bs->total_sectors; (same comment as in patch 1, w.r.t. cast vs. bitmask) ok +} else if (!memcmp(ph.magic, HEADER_MAGIC2, 16)) +s->off_multiplier = le32_to_cpu(ph.tracks); Is there a maximum size in the specification for ph.tracks? no, there is no limitation. Though in reality I have seen the following images only: 252 sectors, 63 sectors, 256 sectors, 2048 sectors and only 2048 was used with this new header. This is just an observation. +else goto fail_format; (same comment as the last patch, w.r.t. braces) -bs->total_sectors = (uint32_t)le64_to_cpu(ph.nb_sectors); - s->tracks = le32_to_cpu(ph.tracks); if (s->tracks == 0) { error_setg(errp, "Invalid image: Zero sectors per track"); @@ -137,7 +146,8 @@ static int64_t seek_to_sector(BlockDriverState *bs, int64_t sector_num) /* not allocated */ if ((index > s->catalog_size) || (s->catalog_bitmap[index] == 0)) return -1; -return (uint64_t)(s->catalog_bitmap[index] + offset) * 512; +return +((uint64_t)s->catalog_bitmap[index] * s->off_multiplier + offset) * 512; Do we need to worry about overflow here, depending on the value of off_multiplier? Also, (and this existed prior to this patch), - we are casting to uint64_t, although the function returns int64_t. good catch. I'll change the declaration to uint64_t and will return 0 from previous if aka Thanks. I'm not sure that is the best option though - the only place the return value from this function is used is in parallels_read(), which passes the output into bdrv_pread() as the offset. The offset parameter for bdrv_pread/pwrite is an int64_t, not a uint64_t. if ((index > s->catalog_size) || (s->catalog_bitmap[index] == 0)) It is not possible to obtain 0 as a real offset of any sector as this is an offset of the header. Regarding overflow check. Do you think that we should return error to the upper layer or we could silently fill result data with zeroes as was done originally for the following case 'index > s->catalog_size' ? I think it would be best to return error (anything < 0 will also cause bdrv_pread to return -EIO, and it is also checked for in parallels_read). I also don't mean to over-complicate things here for you, which is why I asked about the max value of ph.tracks. If that has a reasonable bounds check, t
[Qemu-devel] [PATCH v2 1/4] parallels: extend parallels format header with actual data values
Parallels image format has several additional fields inside: - nb_sectors is actually 64 bit wide. Upper 32bits are not used for images with signature "WithoutFreeSpace" and must be explicitly zeroed according to Parallels. They will be used for images with signature "WithouFreSpacExt" - inuse is magic which means that the image is currently opened for read/write or was not closed correctly, the magic is 0x746f6e59 - data_off is the location of the first data block. It can be zero and in this case data starts just beyond the header aligned to 512 bytes. Though this field does not matter for read-only driver This patch adds these values to struct parallels_header and adds proper handling of nb_sectors for currently supported WithoutFreeSpace images. WithouFreSpacExt will be covered in next patches. Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi CC: Jeff Cody --- block/parallels.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index 1a5bd35..e39 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -41,8 +41,10 @@ struct parallels_header { uint32_t cylinders; uint32_t tracks; uint32_t catalog_entries; -uint32_t nb_sectors; -char padding[24]; +uint64_t nb_sectors; +uint32_t inuse; +uint32_t data_off; +char padding[12]; } QEMU_PACKED; typedef struct BDRVParallelsState { @@ -90,7 +92,7 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, goto fail; } -bs->total_sectors = le32_to_cpu(ph.nb_sectors); +bs->total_sectors = 0x & le64_to_cpu(ph.nb_sectors); s->tracks = le32_to_cpu(ph.tracks); if (s->tracks == 0) { -- 1.9.1
[Qemu-devel] [PATCH v2 2/4] parallels: replace tabs with spaces in block/parallels.c
Signed-off-by: Denis V. Lunev Reviewed-by: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index e39..16d14ad 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -61,11 +61,11 @@ static int parallels_probe(const uint8_t *buf, int buf_size, const char *filenam const struct parallels_header *ph = (const void *)buf; if (buf_size < HEADER_SIZE) - return 0; +return 0; if (!memcmp(ph->magic, HEADER_MAGIC, 16) && - (le32_to_cpu(ph->version) == HEADER_VERSION)) - return 100; +(le32_to_cpu(ph->version) == HEADER_VERSION)) +return 100; return 0; } @@ -115,7 +115,7 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, } for (i = 0; i < s->catalog_size; i++) - le32_to_cpus(&s->catalog_bitmap[i]); +le32_to_cpus(&s->catalog_bitmap[i]); qemu_co_mutex_init(&s->lock); return 0; @@ -135,7 +135,7 @@ static int64_t seek_to_sector(BlockDriverState *bs, int64_t sector_num) /* not allocated */ if ((index > s->catalog_size) || (s->catalog_bitmap[index] == 0)) - return -1; +return -1; return (uint64_t)(s->catalog_bitmap[index] + offset) * 512; } -- 1.9.1
[Qemu-devel] [PATCH v2 3/4] parallels: split check for parallels format in parallels_open
and rework error path a bit. There is no difference at the moment, but the code will be definitely shorter when additional processing will be required for WithouFreSpacExt Signed-off-by: Denis V. Lunev CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index 16d14ad..466705e 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -85,11 +85,11 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, goto fail; } -if (memcmp(ph.magic, HEADER_MAGIC, 16) || -(le32_to_cpu(ph.version) != HEADER_VERSION)) { -error_setg(errp, "Image not in Parallels format"); -ret = -EINVAL; -goto fail; +if (le32_to_cpu(ph.version) != HEADER_VERSION) { +goto fail_format; +} +if (memcmp(ph.magic, HEADER_MAGIC, 16)) { +goto fail_format; } bs->total_sectors = 0x & le64_to_cpu(ph.nb_sectors); @@ -120,6 +120,9 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, qemu_co_mutex_init(&s->lock); return 0; +fail_format: +error_setg(errp, "Image not in Parallels format"); +ret = -EINVAL; fail: g_free(s->catalog_bitmap); return ret; -- 1.9.1
[Qemu-devel] [PATCH v2 0/4] block/parallels: 2TB+ parallels images support
Parallels has released in the recent updates of Parallels Server 5/6 new addition to his image format. Images with signature WithouFreSpacExt have offsets in the catalog coded not as offsets in sectors (multiple of 512 bytes) but offsets coded in blocks (i.e. header->tracks * 512) In this case to code the virtual disk size for such images nb_sectors field is extended to 64 bits. The reader of older images with signature WithoutFreeSpace must manually zero most valuable bits of nb_sectors on open. Changes from v1: - fixed message in patch 1 - added braces to conform qemu coding style in patches 3 & 4 - added check for ph.tracks in patch 4 to avoid offset overflow as suggested by Jeff Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi CC: Jeff Cody
[Qemu-devel] [PATCH v2 4/4] parallels: 2TB+ parallels images support
Parallels has released in the recent updates of Parallels Server 5/6 new addition to his image format. Images with signature WithouFreSpacExt have offsets in the catalog coded not as offsets in sectors (multiple of 512 bytes) but offsets coded in blocks (i.e. header->tracks * 512) In this case all 64 bits of header->nb_sectors are used for image size. This patch implements support of this for qemu-img and also adds specific check for an incorrect image. Images with block size greater than INT_MAX/513 are not supported. The biggest available Parallels image cluster size in the field is 1 Mb. Thus this limit will not hurt anyone. Signed-off-by: Denis V. Lunev CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 25 - 1 file changed, 20 insertions(+), 5 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index 466705e..4414a9d 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -30,6 +30,7 @@ /**/ #define HEADER_MAGIC "WithoutFreeSpace" +#define HEADER_MAGIC2 "WithouFreSpacExt" #define HEADER_VERSION 2 #define HEADER_SIZE 64 @@ -54,6 +55,8 @@ typedef struct BDRVParallelsState { unsigned int catalog_size; unsigned int tracks; + +unsigned int off_multiplier; } BDRVParallelsState; static int parallels_probe(const uint8_t *buf, int buf_size, const char *filename) @@ -63,7 +66,8 @@ static int parallels_probe(const uint8_t *buf, int buf_size, const char *filenam if (buf_size < HEADER_SIZE) return 0; -if (!memcmp(ph->magic, HEADER_MAGIC, 16) && +if ((!memcmp(ph->magic, HEADER_MAGIC, 16) || +!memcmp(ph->magic, HEADER_MAGIC2, 16)) && (le32_to_cpu(ph->version) == HEADER_VERSION)) return 100; @@ -85,21 +89,31 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, goto fail; } +bs->total_sectors = le64_to_cpu(ph.nb_sectors); + if (le32_to_cpu(ph.version) != HEADER_VERSION) { goto fail_format; } -if (memcmp(ph.magic, HEADER_MAGIC, 16)) { +if (!memcmp(ph.magic, HEADER_MAGIC, 16)) { +s->off_multiplier = 1; +bs->total_sectors = 0x & bs->total_sectors; +} else if (!memcmp(ph.magic, HEADER_MAGIC2, 16)) { +s->off_multiplier = le32_to_cpu(ph.tracks); +} else { goto fail_format; } -bs->total_sectors = 0x & le64_to_cpu(ph.nb_sectors); - s->tracks = le32_to_cpu(ph.tracks); if (s->tracks == 0) { error_setg(errp, "Invalid image: Zero sectors per track"); ret = -EINVAL; goto fail; } +if (s->tracks > INT32_MAX/513) { +error_setg(errp, "Invalid image: Too big cluster"); +ret = -EFBIG; +goto fail; +} s->catalog_size = le32_to_cpu(ph.catalog_entries); if (s->catalog_size > INT_MAX / 4) { @@ -139,7 +153,8 @@ static int64_t seek_to_sector(BlockDriverState *bs, int64_t sector_num) /* not allocated */ if ((index > s->catalog_size) || (s->catalog_bitmap[index] == 0)) return -1; -return (uint64_t)(s->catalog_bitmap[index] + offset) * 512; +return +((uint64_t)s->catalog_bitmap[index] * s->off_multiplier + offset) * 512; } static int parallels_read(BlockDriverState *bs, int64_t sector_num, -- 1.9.1
Re: [Qemu-devel] [PATCH v2 0/4] block/parallels: 2TB+ parallels images support
On 28/07/14 20:23, Denis V. Lunev wrote: Parallels has released in the recent updates of Parallels Server 5/6 new addition to his image format. Images with signature WithouFreSpacExt have offsets in the catalog coded not as offsets in sectors (multiple of 512 bytes) but offsets coded in blocks (i.e. header->tracks * 512) In this case to code the virtual disk size for such images nb_sectors field is extended to 64 bits. The reader of older images with signature WithoutFreeSpace must manually zero most valuable bits of nb_sectors on open. Changes from v1: - fixed message in patch 1 - added braces to conform qemu coding style in patches 3 & 4 - added check for ph.tracks in patch 4 to avoid offset overflow as suggested by Jeff Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi CC: Jeff Cody can you pls look/commit. I have some other changes on top of this
Re: [Qemu-devel] [PATCH v2 4/4] parallels: 2TB+ parallels images support
On 07/08/14 18:39, Jeff Cody wrote: On Mon, Jul 28, 2014 at 08:23:55PM +0400, Denis V. Lunev wrote: Parallels has released in the recent updates of Parallels Server 5/6 new addition to his image format. Images with signature WithouFreSpacExt have offsets in the catalog coded not as offsets in sectors (multiple of 512 bytes) but offsets coded in blocks (i.e. header->tracks * 512) In this case all 64 bits of header->nb_sectors are used for image size. This patch implements support of this for qemu-img and also adds specific check for an incorrect image. Images with block size greater than INT_MAX/513 are not supported. The biggest available Parallels image cluster size in the field is 1 Mb. Thus this limit will not hurt anyone. Signed-off-by: Denis V. Lunev CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 25 - 1 file changed, 20 insertions(+), 5 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index 466705e..4414a9d 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -30,6 +30,7 @@ /**/ #define HEADER_MAGIC "WithoutFreeSpace" +#define HEADER_MAGIC2 "WithouFreSpacExt" #define HEADER_VERSION 2 #define HEADER_SIZE 64 @@ -54,6 +55,8 @@ typedef struct BDRVParallelsState { unsigned int catalog_size; unsigned int tracks; + +unsigned int off_multiplier; } BDRVParallelsState; static int parallels_probe(const uint8_t *buf, int buf_size, const char *filename) @@ -63,7 +66,8 @@ static int parallels_probe(const uint8_t *buf, int buf_size, const char *filenam if (buf_size < HEADER_SIZE) return 0; -if (!memcmp(ph->magic, HEADER_MAGIC, 16) && +if ((!memcmp(ph->magic, HEADER_MAGIC, 16) || +!memcmp(ph->magic, HEADER_MAGIC2, 16)) && (le32_to_cpu(ph->version) == HEADER_VERSION)) return 100; @@ -85,21 +89,31 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, goto fail; } +bs->total_sectors = le64_to_cpu(ph.nb_sectors); + if (le32_to_cpu(ph.version) != HEADER_VERSION) { goto fail_format; } -if (memcmp(ph.magic, HEADER_MAGIC, 16)) { +if (!memcmp(ph.magic, HEADER_MAGIC, 16)) { +s->off_multiplier = 1; +bs->total_sectors = 0x & bs->total_sectors; +} else if (!memcmp(ph.magic, HEADER_MAGIC2, 16)) { +s->off_multiplier = le32_to_cpu(ph.tracks); +} else { goto fail_format; } -bs->total_sectors = 0x & le64_to_cpu(ph.nb_sectors); - s->tracks = le32_to_cpu(ph.tracks); if (s->tracks == 0) { error_setg(errp, "Invalid image: Zero sectors per track"); ret = -EINVAL; goto fail; } +if (s->tracks > INT32_MAX/513) { +error_setg(errp, "Invalid image: Too big cluster"); +ret = -EFBIG; +goto fail; +} s->catalog_size = le32_to_cpu(ph.catalog_entries); if (s->catalog_size > INT_MAX / 4) { @@ -139,7 +153,8 @@ static int64_t seek_to_sector(BlockDriverState *bs, int64_t sector_num) /* not allocated */ if ((index > s->catalog_size) || (s->catalog_bitmap[index] == 0)) return -1; -return (uint64_t)(s->catalog_bitmap[index] + offset) * 512; +return +((uint64_t)s->catalog_bitmap[index] * s->off_multiplier + offset) * 512; This still does a cast to uint_64_t, instead of int64_t; not sure it really matters in practice, as we should be safe now from exceeding an int64_t value in the end result. this is safe due to above check for s->tracks > INT32_MAX/513 Actually, original code has exactly the same cast and the situation is exactly the same before the patch (uint32_t value * 1) and after the patch (uint32_t * (something < INT32_MAX/513)) Though I can change the cast to int64_t, I do not see much difference. Should I do this? } static int parallels_read(BlockDriverState *bs, int64_t sector_num, -- 1.9.1
Re: [Qemu-devel] [PATCH v2 4/4] parallels: 2TB+ parallels images support
On 07/08/14 19:14, Jeff Cody wrote: On Thu, Aug 07, 2014 at 07:03:12PM +0400, Denis V. Lunev wrote: On 07/08/14 18:39, Jeff Cody wrote: On Mon, Jul 28, 2014 at 08:23:55PM +0400, Denis V. Lunev wrote: Parallels has released in the recent updates of Parallels Server 5/6 new addition to his image format. Images with signature WithouFreSpacExt have offsets in the catalog coded not as offsets in sectors (multiple of 512 bytes) but offsets coded in blocks (i.e. header->tracks * 512) In this case all 64 bits of header->nb_sectors are used for image size. This patch implements support of this for qemu-img and also adds specific check for an incorrect image. Images with block size greater than INT_MAX/513 are not supported. The biggest available Parallels image cluster size in the field is 1 Mb. Thus this limit will not hurt anyone. Signed-off-by: Denis V. Lunev CC: Jeff Cody CC: Kevin Wolf CC: Stefan Hajnoczi --- block/parallels.c | 25 - 1 file changed, 20 insertions(+), 5 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index 466705e..4414a9d 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -30,6 +30,7 @@ /**/ #define HEADER_MAGIC "WithoutFreeSpace" +#define HEADER_MAGIC2 "WithouFreSpacExt" #define HEADER_VERSION 2 #define HEADER_SIZE 64 @@ -54,6 +55,8 @@ typedef struct BDRVParallelsState { unsigned int catalog_size; unsigned int tracks; + +unsigned int off_multiplier; } BDRVParallelsState; static int parallels_probe(const uint8_t *buf, int buf_size, const char *filename) @@ -63,7 +66,8 @@ static int parallels_probe(const uint8_t *buf, int buf_size, const char *filenam if (buf_size < HEADER_SIZE) return 0; -if (!memcmp(ph->magic, HEADER_MAGIC, 16) && +if ((!memcmp(ph->magic, HEADER_MAGIC, 16) || +!memcmp(ph->magic, HEADER_MAGIC2, 16)) && (le32_to_cpu(ph->version) == HEADER_VERSION)) return 100; @@ -85,21 +89,31 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, goto fail; } +bs->total_sectors = le64_to_cpu(ph.nb_sectors); + if (le32_to_cpu(ph.version) != HEADER_VERSION) { goto fail_format; } -if (memcmp(ph.magic, HEADER_MAGIC, 16)) { +if (!memcmp(ph.magic, HEADER_MAGIC, 16)) { +s->off_multiplier = 1; +bs->total_sectors = 0x & bs->total_sectors; +} else if (!memcmp(ph.magic, HEADER_MAGIC2, 16)) { +s->off_multiplier = le32_to_cpu(ph.tracks); +} else { goto fail_format; } -bs->total_sectors = 0x & le64_to_cpu(ph.nb_sectors); - s->tracks = le32_to_cpu(ph.tracks); if (s->tracks == 0) { error_setg(errp, "Invalid image: Zero sectors per track"); ret = -EINVAL; goto fail; } +if (s->tracks > INT32_MAX/513) { +error_setg(errp, "Invalid image: Too big cluster"); +ret = -EFBIG; +goto fail; +} s->catalog_size = le32_to_cpu(ph.catalog_entries); if (s->catalog_size > INT_MAX / 4) { @@ -139,7 +153,8 @@ static int64_t seek_to_sector(BlockDriverState *bs, int64_t sector_num) /* not allocated */ if ((index > s->catalog_size) || (s->catalog_bitmap[index] == 0)) return -1; -return (uint64_t)(s->catalog_bitmap[index] + offset) * 512; +return +((uint64_t)s->catalog_bitmap[index] * s->off_multiplier + offset) * 512; This still does a cast to uint_64_t, instead of int64_t; not sure it really matters in practice, as we should be safe now from exceeding an int64_t value in the end result. this is safe due to above check for s->tracks > INT32_MAX/513 Actually, original code has exactly the same cast and the situation is exactly the same before the patch (uint32_t value * 1) and after the patch (uint32_t * (something < INT32_MAX/513)) Though I can change the cast to int64_t, I do not see much difference. Should I do this? Right, as I said in practice it should now be safe from exceeding an int64_t (due to the bounds check on s->tracks). I think it is worth changing if someone else requests a respin for another reason, but probably not to do a respin for this on its own. Another question - do you have any sample images? If they compress well (bzip2 does a good job with most blank images) it would be nice to have a couple of parallels images (e.g. one "WithouFreSpacExt" and one "WithoutFreeSpace") in the tests sample_images directory. If you can provide both types of images, I'll amend test 076 to include them. I can go ahead and give my R-b for this patch: no prob, I have access to them. I'll provide a collection within next series. There is some other incompatible stuff add... Reviewed-by: Jeff Cody Thank you :) } static int parallels_read(BlockDriverState *bs, int64_t sector_num, -- 1.9.1
Re: [Qemu-devel] [PATCH v2 4/4] parallels: 2TB+ parallels images support
On 07/08/14 19:22, Denis V. Lunev wrote: Another question - do you have any sample images? If they compress well (bzip2 does a good job with most blank images) it would be nice to have a couple of parallels images (e.g. one "WithouFreSpacExt" and one "WithoutFreeSpace") in the tests sample_images directory. If you can provide both types of images, I'll amend test 076 to include them. I can go ahead and give my R-b for this patch: no prob, I have access to them. I'll provide a collection within next series. There is some other incompatible stuff add... /stuff add.../stuff to add.../
Re: [Qemu-devel] Wierd hack to sound/pci/intel8x0.c
On 11/6/11 6:51 PM, Avi Kivity wrote: The recently merged 228cf79376f1 ("ALSA: intel8x0: Improve performance in virtual environment") is hacky and somewhat wrong. First, the detection code + if (inside_vm< 0) { + /* detect KVM and Parallels virtual environments */ + inside_vm = kvm_para_available(); +#if defined(__i386__) || defined(__x86_64__) + inside_vm = inside_vm || boot_cpu_has(X86_FEATURE_HYPERVISOR); +#endif + } + is incorrect. It detects that you're running in a guest, but that doesn't imply that the device you're accessing is emulated. It may be a host device assigned to the guest; presumably the optimization you apply doesn't work for real devices. Second, the optimization itself looks fishy: spin_lock(&chip->reg_lock); do { civ = igetbyte(chip, ichdev->reg_offset + ICH_REG_OFF_CIV); ptr1 = igetword(chip, ichdev->reg_offset + ichdev->roff_picb); position = ichdev->position; if (ptr1 == 0) { udelay(10); continue; } - if (civ == igetbyte(chip, ichdev->reg_offset + ICH_REG_OFF_CIV)&& - ptr1 == igetword(chip, ichdev->reg_offset + ichdev->roff_picb)) + if (civ != igetbyte(chip, ichdev->reg_offset + ICH_REG_OFF_CIV)) + continue; + if (chip->inside_vm) + break; + if (ptr1 == igetword(chip, ichdev->reg_offset + ichdev->roff_picb)) break; } while (timeout--); Why is the emulated device timing out? Can't the emulation be fixed to behave like real hardware? Last, please copy k...@vger.kernel.org on such issues. The problem is that emulation can not be fixed. How this is working for real hardware? You get data from real sound card register. The scheduling is off at the moment thus you can not be re-scheduled. In the virtual environment the situation is different. Any IO emulation is expensive. The processor is switched from guest to hypervisor and further to emulation process takes a lot of time. This time is enough to obtain different value on next register read. That's why this code is really timed out. Please also note that host scheduler also plays his games and could schedule out VCPU thread. The problem could be potentially fixed reducing precision of PICB emulation, but this results in lower sound quality. This kludge has been written this way in order not to break legacy card for which we do not have an access. The code reading PICB/CIV registers second time was added to fix issues on unknown for now platform and it looks not possible how to find/test against this platform now. We have checked Windows drivers written by Intel/AMD (32/64 bit) and MacOS ones. There is no second reading of CIV/PICB inside. We hope that this is relay needed only for some rare hadware devices. The only thing we can is to improve detection code. Suggestions are welcome. Regards, Den
Re: [Qemu-devel] Wierd hack to sound/pci/intel8x0.c
On 11/6/11 8:31 PM, Avi Kivity wrote: On 11/06/2011 06:15 PM, Denis V. Lunev wrote: On 11/6/11 6:51 PM, Avi Kivity wrote: The recently merged 228cf79376f1 ("ALSA: intel8x0: Improve performance in virtual environment") is hacky and somewhat wrong. First, the detection code + if (inside_vm< 0) { + /* detect KVM and Parallels virtual environments */ + inside_vm = kvm_para_available(); +#if defined(__i386__) || defined(__x86_64__) + inside_vm = inside_vm || boot_cpu_has(X86_FEATURE_HYPERVISOR); +#endif + } + is incorrect. It detects that you're running in a guest, but that doesn't imply that the device you're accessing is emulated. It may be a host device assigned to the guest; presumably the optimization you apply doesn't work for real devices. Second, the optimization itself looks fishy: spin_lock(&chip->reg_lock); do { civ = igetbyte(chip, ichdev->reg_offset + ICH_REG_OFF_CIV); ptr1 = igetword(chip, ichdev->reg_offset + ichdev->roff_picb); position = ichdev->position; if (ptr1 == 0) { udelay(10); continue; } - if (civ == igetbyte(chip, ichdev->reg_offset + ICH_REG_OFF_CIV)&& - ptr1 == igetword(chip, ichdev->reg_offset + ichdev->roff_picb)) + if (civ != igetbyte(chip, ichdev->reg_offset + ICH_REG_OFF_CIV)) + continue; + if (chip->inside_vm) + break; + if (ptr1 == igetword(chip, ichdev->reg_offset + ichdev->roff_picb)) break; } while (timeout--); Why is the emulated device timing out? Can't the emulation be fixed to behave like real hardware? Last, please copy k...@vger.kernel.org on such issues. The problem is that emulation can not be fixed. How this is working for real hardware? You get data from real sound card register. The scheduling is off at the moment thus you can not be re-scheduled. In the virtual environment the situation is different. Any IO emulation is expensive. The processor is switched from guest to hypervisor and further to emulation process takes a lot of time. This time is enough to obtain different value on next register read. That's why this code is really timed out. Please also note that host scheduler also plays his games and could schedule out VCPU thread. Note on kvm this is rare, since the guest thread and the emulator thread are the same. It is the same for Parallels too, but it can be scheduled out anyway. The most important part here is context switches cost, which is actually high and enough to result in different PICB value. The problem could be potentially fixed reducing precision of PICB emulation, but this results in lower sound quality. This kludge has been written this way in order not to break legacy card for which we do not have an access. The code reading PICB/CIV registers second time was added to fix issues on unknown for now platform and it looks not possible how to find/test against this platform now. We have checked Windows drivers written by Intel/AMD (32/64 bit) and MacOS ones. There is no second reading of CIV/PICB inside. We hope that this is relay needed only for some rare hadware devices. Ok, so if I understand correctly, this loop is a hack for broken hardware, and this patch basically unhacks it back, assuming that the emulated (or assigned) hardware is not broken. The only thing we can is to improve detection code. Suggestions are welcome. I think it's fine to assume that you're not assigning a 2004 era sound card to a guest. So I think the code is fine as it is, and can only suggest to add a comment explaining the mess. Thanks for explaining. no prob Regard, Den
Re: [Qemu-devel] Wierd hack to sound/pci/intel8x0.c
On 11/6/11 8:47 PM, Takashi Iwai wrote: At Sun, 06 Nov 2011 18:31:42 +0200, Avi Kivity wrote: On 11/06/2011 06:15 PM, Denis V. Lunev wrote: On 11/6/11 6:51 PM, Avi Kivity wrote: The recently merged 228cf79376f1 ("ALSA: intel8x0: Improve performance in virtual environment") is hacky and somewhat wrong. First, the detection code + if (inside_vm< 0) { + /* detect KVM and Parallels virtual environments */ + inside_vm = kvm_para_available(); +#if defined(__i386__) || defined(__x86_64__) + inside_vm = inside_vm || boot_cpu_has(X86_FEATURE_HYPERVISOR); +#endif + } + is incorrect. It detects that you're running in a guest, but that doesn't imply that the device you're accessing is emulated. It may be a host device assigned to the guest; presumably the optimization you apply doesn't work for real devices. Second, the optimization itself looks fishy: spin_lock(&chip->reg_lock); do { civ = igetbyte(chip, ichdev->reg_offset + ICH_REG_OFF_CIV); ptr1 = igetword(chip, ichdev->reg_offset + ichdev->roff_picb); position = ichdev->position; if (ptr1 == 0) { udelay(10); continue; } - if (civ == igetbyte(chip, ichdev->reg_offset + ICH_REG_OFF_CIV)&& - ptr1 == igetword(chip, ichdev->reg_offset + ichdev->roff_picb)) + if (civ != igetbyte(chip, ichdev->reg_offset + ICH_REG_OFF_CIV)) + continue; + if (chip->inside_vm) + break; + if (ptr1 == igetword(chip, ichdev->reg_offset + ichdev->roff_picb)) break; } while (timeout--); Why is the emulated device timing out? Can't the emulation be fixed to behave like real hardware? Last, please copy k...@vger.kernel.org on such issues. The problem is that emulation can not be fixed. How this is working for real hardware? You get data from real sound card register. The scheduling is off at the moment thus you can not be re-scheduled. In the virtual environment the situation is different. Any IO emulation is expensive. The processor is switched from guest to hypervisor and further to emulation process takes a lot of time. This time is enough to obtain different value on next register read. That's why this code is really timed out. Please also note that host scheduler also plays his games and could schedule out VCPU thread. Note on kvm this is rare, since the guest thread and the emulator thread are the same. The problem could be potentially fixed reducing precision of PICB emulation, but this results in lower sound quality. This kludge has been written this way in order not to break legacy card for which we do not have an access. The code reading PICB/CIV registers second time was added to fix issues on unknown for now platform and it looks not possible how to find/test against this platform now. We have checked Windows drivers written by Intel/AMD (32/64 bit) and MacOS ones. There is no second reading of CIV/PICB inside. We hope that this is relay needed only for some rare hadware devices. Ok, so if I understand correctly, this loop is a hack for broken hardware, and this patch basically unhacks it back, assuming that the emulated (or assigned) hardware is not broken. Exactly. The loop itself is still needed, but the check can be less restrictive. The only thing we can is to improve detection code. Suggestions are welcome. I think it's fine to assume that you're not assigning a 2004 era sound card to a guest. So I think the code is fine as it is, and can only suggest to add a comment explaining the mess. True. Denis, care to submit a patch? sure! I'll do that tomorrow
Re: [Qemu-devel] [PULL 05/42] block/raw-posix: refactor handle_aiocb_write_zeroes a bit
On 12/02/15 05:29, Peter Maydell wrote: On 6 February 2015 at 16:40, Kevin Wolf wrote: From: "Denis V. Lunev" move code dealing with a block device to a separate function. This will allow to implement additional processing for ordinary files. +static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) +{ +BDRVRawState *s = aiocb->bs->opaque; + +if (aiocb->aio_type & QEMU_AIO_BLKDEV) { +return handle_aiocb_write_zeroes_block(aiocb); +} + +#ifdef CONFIG_XFS +if (s->is_xfs) { +return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes); +} +#endif + +return -ENOTSUP; +} Hi. This patch has introduced a new compiler warning on OSX: block/raw-posix.c:947:19: warning: unused variable 's' [-Wunused-variable] BDRVRawState *s = aiocb->bs->opaque; ^ and indeed on any host which doesn't define any of CONFIG_XFS, CONFIG_FALLOCATE, CONFIG_FALLOCATE_PUNCH_HOLE or CONFIG_FALLOCATE_ZERO_RANGE. What's your preferred fix? Surrounding the variable declaration with #if defined(CONFIG_XFS) || defined(CONFIG_FALLOCATE) would work but I dunno if some less ugly approach is possible. [I don't yet build OSX with -Werror because we haven't fixed some of the deprecation warnings about the audio APIs, but I would like to eventually...] thanks -- PMM we can just add (void)s; immediately after the declaration. On the other hand, CONFIG_FALLOCATE_PUNCH_HOLE and CONFIG_FALLOCATE_ZERO_RANGE are completely impossible without CONFIG_FALLOCATE thus #if defined(CONFIG_FALLOCATE) or defined(CONFIG_XFS) would be enough. Regards, Den
[Qemu-devel] [PATCH 1/1] block/raw-posix: fix compilation warning on OSX
block/raw-posix.c:947:19: warning: unused variable 's' [-Wunused-variable] BDRVRawState *s = aiocb->bs->opaque; This variable is used only when on of the following macros are defined CONFIG_XFS, CONFIG_FALLOCATE, CONFIG_FALLOCATE_PUNCH_HOLE or CONFIG_FALLOCATE_ZERO_RANGE. Fortunately, CONFIG_FALLOCATE_PUNCH_HOLE and CONFIG_FALLOCATE_ZERO_RANGE could be defined only along with CONFIG_FALLOCATE. Therefore checking for CONFIG_XFS or CONFIG_FALLOCATE would be enough. Signed-off-by: Denis V. Lunev CC: Peter Maydell CC: Kevin Wolf --- block/raw-posix.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/block/raw-posix.c b/block/raw-posix.c index e474c17..bf7d6f7 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -944,7 +944,9 @@ static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb) static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) { +#if defined(CONFIG_FALLOCATE) || defined(CONFIG_XFS) BDRVRawState *s = aiocb->bs->opaque; +#endif if (aiocb->aio_type & QEMU_AIO_BLKDEV) { return handle_aiocb_write_zeroes_block(aiocb); -- 1.9.1
Re: [Qemu-devel] [PATCH 0/3] qemu guest agent: support guest-file-* command for Windows
On 06/02/15 20:59, Denis V. Lunev wrote: This was a part of patchset implemented guest-exec command. It was suggested to submit it separately by Michael. The set contains small refactoring to fix mingw 4.9.1 compilation and safe part of the rework of posix file interface plus main Windows file commands implementation. Changes from previous version: - compilation is fixed by gropping strtok_r at all - error_setg_win32 is used for error decoding Signed-off-by: Olga Krishtal Signed-off-by: Denis V. Lunev CC: Michael Roth ping
Re: [Qemu-devel] [PATCH v4 0/1] block: enforce minimal 4096 alignment in qemu_blockalign
On 06/02/15 20:37, Denis V. Lunev wrote: The following sequence int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644); for (i = 0; i < 10; i++) write(fd, buf, 4096); iperforms 5% better if buf is aligned to 4096 bytes rather then to 512 bytes. I have used the following program to test #define _GNU_SOURCE #include #include #include #include #include #include int main(int argc, char *argv[]) { int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644); void *buf; int i = 0, align = atoi(argv[2]); do { buf = memalign(align, 4096); if (align >= 4096) break; if ((unsigned long)buf & 4095) break; i++; } while (1); printf("%d %p\n", i, buf); memset(buf, 0x11, 4096); for (i = 0; i < 10; i++) { lseek(fd, SEEK_CUR, 4096); write(fd, buf, 4096); } close(fd); return 0; } for in in `seq 1 30` ; do a.out aa ; done The file was placed into 8 GB partition on HDD below to avoid speed change due to different offset on disk. Results are reliable: - 189 vs 180 seconds on Linux 3.16 The following setups have been tested: 1) ext4 with block size equals to 1024 over 512/512 physical/logical sector size SSD disk 2) ext4 with block size equals to 4096 over 512/512 physical/logical sector size SSD disk 3) ext4 with block size equals to 4096 over 512/4096 physical/logical sector size rotational disk (WDC WD20EZRX) 4) xfs with block size equals to 4096 over 512/512 physical/logical sector size SSD disk The difference is quite reliable and the same 5%. qemu-io -n -c 'write -P 0xaa 0 1G' 1.img for image in qcow2 format is 1% faster. Changes from v3: - portable way to calculate system page size used - 512/4096 values are replaced with proper macros/values Changes from v2: - opt_mem_alignment is split to opt_mem_alignment for bounce buffering and min_mem_alignment to check buffers coming from guest. Changes from v1: - enforces 4096 alignment in qemu_(try_)blockalign, avoid touching of bdrv_qiov_is_aligned path not to enforce additional bounce buffering as suggested by Paolo - reduces 10% to 5% in patch description to better fit 180 vs 189 difference Signed-off-by: Denis V. Lunev CC: Paolo Bonzini CC: Kevin Wolf ping
Re: [Qemu-devel] [PATCH 1/2] block, raw-posix: replace 512/4096 constants with proper macros/values
On 16/02/15 13:32, Kevin Wolf wrote: Am 06.02.2015 um 18:37 hat Denis V. Lunev geschrieben: Signed-off-by: Denis V. Lunev CC: Paolo Bonzini CC: Kevin Wolf --- block.c | 10 +- block/raw-posix.c | 16 2 files changed, 13 insertions(+), 13 deletions(-) diff --git a/block.c b/block.c index d45e4dd..e98d651 100644 --- a/block.c +++ b/block.c @@ -225,8 +225,8 @@ static void bdrv_io_limits_intercept(BlockDriverState *bs, size_t bdrv_opt_mem_align(BlockDriverState *bs) { if (!bs || !bs->drv) { -/* 4k should be on the safe side */ -return 4096; +/* page size should be on the safe side */ +return getpagesize(); Can we make this MAX(4096, getpagesize())? The 4k aren't chosen because of buffer alignment in RAM, but because of a disk sector size of 4k is the highest common value. } return bs->bl.opt_mem_alignment; @@ -543,7 +543,7 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp) bs->bl.max_transfer_length = bs->file->bl.max_transfer_length; bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment; } else { -bs->bl.opt_mem_alignment = 512; +bs->bl.opt_mem_alignment = BDRV_SECTOR_SIZE; I wouldn't make this conversion. The value happens to be the same, but BDRV_SECTOR_SIZE is just the arbitrary unit that the block layer uses internally for things like sector_num/nb_sectors. The 512 in this place, however, is the minimum alignment that hardware could require, and logically independent of BDRV_SECTOR_SIZE. Similar considerations apply to the other conversions of 512 in this patch. They would all be better left as literal 512 (unless you want to introduce new constants for their specific purpose). } if (bs->backing_hd) { @@ -965,8 +965,8 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file, } bs->open_flags = flags; -bs->guest_block_size = 512; -bs->request_alignment = 512; +bs->guest_block_size = BDRV_SECTOR_SIZE; +bs->request_alignment = BDRV_SECTOR_SIZE; bs->zero_beyond_eof = true; open_flags = bdrv_open_flags(bs, flags); bs->read_only = !(open_flags & BDRV_O_RDWR); diff --git a/block/raw-posix.c b/block/raw-posix.c index 853ffa6..9848f83 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -122,7 +122,6 @@ reopen it to see if the disk has been changed */ #define FD_OPEN_TIMEOUT (10) -#define MAX_BLOCKSIZE 4096 typedef struct BDRVRawState { int fd; @@ -222,6 +221,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) BDRVRawState *s = bs->opaque; char *buf; unsigned int sector_size; +size_t page_size = getpagesize(); /* For /dev/sg devices the alignment is not really used. With buffered I/O, we don't have any restrictions. */ @@ -264,9 +264,9 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) /* If we could not get the sizes so far, we can only guess them */ if (!s->buf_align) { size_t align; -buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE); -for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) { -if (pread(fd, buf + align, MAX_BLOCKSIZE, 0) >= 0) { +buf = qemu_memalign(page_size, 2 * page_size); +for (align = BDRV_SECTOR_SIZE; align <= page_size; align <<= 1) { +if (pread(fd, buf + align, page_size, 0) >= 0) { s->buf_align = align; break; } Here, I'd prefer MAX(getpagesize(), MAX_BLOCKSIZE) as well. Kevin this makes sense. OK
Re: [Qemu-devel] [PATCH 2/2] block: align bounce buffers to page
On 16/02/15 13:59, Kevin Wolf wrote: Am 06.02.2015 um 18:37 hat Denis V. Lunev geschrieben: The following sequence int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644); for (i = 0; i < 10; i++) write(fd, buf, 4096); performs 5% better if buf is aligned to 4096 bytes. The difference is quite reliable. On the other hand we do not want at the moment to enforce bounce buffering if guest request is aligned to 512 bytes. The patch introduces new concept: minimal memory alignment for bounce buffers. Original so called "optimal" value is actually minimal required value for aligment. Optimal should be set to page size by default. There is no driver which should change this default at the moment. Signed-off-by: Denis V. Lunev CC: Paolo Bonzini CC: Kevin Wolf I don't think the subject lines describes the patch correctly. IIUC, patch 1 introduces a hard enforcement of page size alignment, and this patch only loosens it again. Perhaps the order of patches should be changed so that min_mem_alignment is introduced first (without changes in the behaviour), and only then opt_mem_alignment is increased. Kevin ok :)
Re: [Qemu-devel] [PATCH 03/10] guest agent: guest-file-open: refactoring
On 17/02/15 18:27, Eric Blake wrote: On 02/16/2015 08:14 PM, Michael Roth wrote: From: Simon Zolin Moved the code that sets non-blocking flag on fd into a separate function. Signed-off-by: Simon Zolin Reviewed-by: Roman Kagan Signed-off-by: Denis V. Lunev CC: Michael Roth CC: Eric Blake Signed-off-by: Michael Roth --- qga/commands-posix.c | 31 +++ 1 file changed, 23 insertions(+), 8 deletions(-) diff --git a/qga/commands-posix.c b/qga/commands-posix.c index 57409d0..ed527a3 100644 --- a/qga/commands-posix.c +++ b/qga/commands-posix.c @@ -376,13 +376,33 @@ safe_open_or_create(const char *path, const char *mode, Error **errp) return NULL; } +static int guest_file_toggle_flags(int fd, int flags, bool set, Error **err) +{ Why are you reinventing qemu_set_nonblock()? because we are uneducated :) Anyway, qemu_set_nonblock() does not handle error and resides in a strange header aka "include/qemu/sockets.h" Technically I can switch to it immediately. Though error check condition will be lost. What is better at your opinion? a) return error from qemu_set_nonblock()/qemu_set_block() b) drop error check here. The descriptor is just opened and we know that it is valid. I could not imagine real error other than broken descriptor for this exact fcntl. Regards, Den
Re: [Qemu-devel] [PATCH 03/10] guest agent: guest-file-open: refactoring
On 17/02/15 20:56, Michael Roth wrote: Quoting Denis V. Lunev (2015-02-17 10:06:49) On 17/02/15 18:27, Eric Blake wrote: On 02/16/2015 08:14 PM, Michael Roth wrote: From: Simon Zolin Moved the code that sets non-blocking flag on fd into a separate function. Signed-off-by: Simon Zolin Reviewed-by: Roman Kagan Signed-off-by: Denis V. Lunev CC: Michael Roth CC: Eric Blake Signed-off-by: Michael Roth --- qga/commands-posix.c | 31 +++ 1 file changed, 23 insertions(+), 8 deletions(-) diff --git a/qga/commands-posix.c b/qga/commands-posix.c index 57409d0..ed527a3 100644 --- a/qga/commands-posix.c +++ b/qga/commands-posix.c @@ -376,13 +376,33 @@ safe_open_or_create(const char *path, const char *mode, Error **errp) return NULL; } +static int guest_file_toggle_flags(int fd, int flags, bool set, Error **err) +{ Why are you reinventing qemu_set_nonblock()? because we are uneducated :) Anyway, qemu_set_nonblock() does not handle error and resides in a strange header aka "include/qemu/sockets.h" Technically I can switch to it immediately. Though error check condition will be lost. What is better at your opinion? a) return error from qemu_set_nonblock()/qemu_set_block() I think making the existing functions a non-error-checking wrapper around qemu_set_{block,nonblock}_error or something would be best. These are meant to be os-agnostic utility functions though, but in the case of qemu-ga the win32 variant won't work as expected, so I'm not sure it's a good idea to rely on them. If the lack of it's usage in net/tap* compared to other parts of QEMU that build on w32 is any indication, that seems to be the pattern followed by other users. In any case, since I was actually the one who re-invented it, and this code just moves it to another function, I think we can address it as a seperate patch and leave the PULL intact (unless there are other objections). OK to me. I'll check Win32 version with Olga to be somewhat sure with upcoming cleanup.
Re: [Qemu-devel] [RESEND PATCH v3 0/2] balloon: add a feature bit to let Guest OS deflate virtio_balloon on OOM
On 29/01/15 17:24, Denis V. Lunev wrote: Excessive virtio_balloon inflation can cause invocation of OOM-killer, when Linux is under severe memory pressure. Various mechanisms are responsible for correct virtio_balloon memory management. Nevertheless it is often the case that these control tools does not have enough time to react on fast changing memory load. As a result OS runs out of memory and invokes OOM-killer. The balancing of memory by use of the virtio balloon should not cause the termination of processes while there are pages in the balloon. Now there is no way for virtio balloon driver to free memory at the last moment before some process get killed by OOM-killer. This does not provide a security breach as balloon itself is running inside Guest OS and is working in the cooperation with the host. Thus some improvements from Guest side should be considered as normal. To solve the problem, introduce a virtio_balloon callback which is expected to be called from the oom notifier call chain in out_of_memory() function. If virtio balloon could release some memory, it will make the system to return and retry the allocation that forced the out of memory killer to run. This behavior should be enabled if and only if appropriate feature bit is set on the device. It is off by default. This functionality was recently merged into vanilla Linux (actually in linux-next at the moment) commit 5a10b7dbf904bfe01bb9fcc6298f7df09eed77d5 Author: Raushaniya Maksudova Date: Mon Nov 10 09:36:29 2014 +1030 This patch adds respective control bits into QEMU. It introduces deflate-on-oom option for baloon device which do the trick. Changes from v2: - fixed mistake with bit number in virtio_balloon_get_features Changes from v1: - From: in patch 1 according to the original ownership - feature processing in patch 2 as suggested by Michael. It could be done without additional field, but this will require to move the property level up, i.e. to PCI & CCW level. Signed-off-by: Raushaniya Maksudova Signed-off-by: Denis V. Lunev CC: Anthony Liguori CC: Michael S. Tsirkin ping any opinion on this? At least patch 1 should be committed.
Re: [Qemu-devel] [RESEND PATCH 1/2] balloon: call qdev_alias_all_properties for proxy dev in balloon class init
On 19/02/15 12:25, Michael S. Tsirkin wrote: On Thu, Jan 29, 2015 at 05:24:41PM +0300, Denis V. Lunev wrote: The idea is that all other virtio devices are calling this helper to merge properties of the proxy device. This is the only difference in between this helper and code in inside virtio_instance_init_common. The patch should not cause any harm as property list in generic balloon code is empty. This also allows to avoid some dummy errors like fixed by this commit 91ba21208839643603e7f7fa5864723c3f371ebe Author: Gonglei Date: Tue Sep 30 14:10:35 2014 +0800 virtio-balloon: fix virtio-balloon child refcount in transports Signed-off-by: Denis V. Lunev Signed-off-by: Raushaniya Maksudova Revieved-by: Cornelia Huck CC: Christian Borntraeger CC: Anthony Liguori CC: Michael S. Tsirkin --- hw/s390x/virtio-ccw.c | 5 ++--- hw/virtio/virtio-pci.c | 5 ++--- 2 files changed, 4 insertions(+), 6 deletions(-) diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c index ea236c9..82da894 100644 --- a/hw/s390x/virtio-ccw.c +++ b/hw/s390x/virtio-ccw.c @@ -899,9 +899,8 @@ static void balloon_ccw_stats_set_poll_interval(Object *obj, struct Visitor *v, static void virtio_ccw_balloon_instance_init(Object *obj) { VirtIOBalloonCcw *dev = VIRTIO_BALLOON_CCW(obj); -object_initialize(&dev->vdev, sizeof(dev->vdev), TYPE_VIRTIO_BALLOON); -object_property_add_child(obj, "virtio-backend", OBJECT(&dev->vdev), NULL); -object_unref(OBJECT(&dev->vdev)); +virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev), +TYPE_VIRTIO_BALLOON); object_property_add(obj, "guest-stats", "guest statistics", balloon_ccw_stats_get_all, NULL, NULL, dev, NULL); diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c index dde1d73..745324b 100644 --- a/hw/virtio/virtio-pci.c +++ b/hw/virtio/virtio-pci.c @@ -1316,9 +1316,8 @@ static void virtio_balloon_pci_class_init(ObjectClass *klass, void *data) static void virtio_balloon_pci_instance_init(Object *obj) { VirtIOBalloonPCI *dev = VIRTIO_BALLOON_PCI(obj); -object_initialize(&dev->vdev, sizeof(dev->vdev), TYPE_VIRTIO_BALLOON); -object_property_add_child(obj, "virtio-backend", OBJECT(&dev->vdev), NULL); -object_unref(OBJECT(&dev->vdev)); +virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev), +TYPE_VIRTIO_BALLOON); object_property_add(obj, "guest-stats", "guest statistics", balloon_pci_stats_get_all, NULL, NULL, dev, NULL); OK, but what about this guest-stats property? Should it get the same treatment? -- 1.9.1 hmm, IMHO no. init_common is actually do the following void virtio_instance_init_common(Object *proxy_obj, void *data, size_t vdev_size, const char *vdev_name) { DeviceState *vdev = data; object_initialize(vdev, vdev_size, vdev_name); object_property_add_child(proxy_obj, "virtio-backend", OBJECT(vdev), NULL); object_unref(OBJECT(vdev)); qdev_alias_all_properties(vdev, proxy_obj); } on the other hand there is the following code in s390 static void s390_virtio_net_instance_init(Object *obj) { VirtIONetS390 *dev = VIRTIO_NET_S390(obj); virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev), TYPE_VIRTIO_NET); object_property_add_alias(obj, "bootindex", OBJECT(&dev->vdev), "bootindex", &error_abort); } which does not contain guest-stats property.
Re: [Qemu-devel] [RESEND PATCH 1/2] balloon: call qdev_alias_all_properties for proxy dev in balloon class init
On 19/02/15 12:39, Michael S. Tsirkin wrote: On Thu, Feb 19, 2015 at 12:36:37PM +0300, Denis V. Lunev wrote: On 19/02/15 12:25, Michael S. Tsirkin wrote: On Thu, Jan 29, 2015 at 05:24:41PM +0300, Denis V. Lunev wrote: The idea is that all other virtio devices are calling this helper to merge properties of the proxy device. This is the only difference in between this helper and code in inside virtio_instance_init_common. The patch should not cause any harm as property list in generic balloon code is empty. This also allows to avoid some dummy errors like fixed by this commit 91ba21208839643603e7f7fa5864723c3f371ebe Author: Gonglei Date: Tue Sep 30 14:10:35 2014 +0800 virtio-balloon: fix virtio-balloon child refcount in transports Signed-off-by: Denis V. Lunev Signed-off-by: Raushaniya Maksudova Revieved-by: Cornelia Huck CC: Christian Borntraeger CC: Anthony Liguori CC: Michael S. Tsirkin --- hw/s390x/virtio-ccw.c | 5 ++--- hw/virtio/virtio-pci.c | 5 ++--- 2 files changed, 4 insertions(+), 6 deletions(-) diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c index ea236c9..82da894 100644 --- a/hw/s390x/virtio-ccw.c +++ b/hw/s390x/virtio-ccw.c @@ -899,9 +899,8 @@ static void balloon_ccw_stats_set_poll_interval(Object *obj, struct Visitor *v, static void virtio_ccw_balloon_instance_init(Object *obj) { VirtIOBalloonCcw *dev = VIRTIO_BALLOON_CCW(obj); -object_initialize(&dev->vdev, sizeof(dev->vdev), TYPE_VIRTIO_BALLOON); -object_property_add_child(obj, "virtio-backend", OBJECT(&dev->vdev), NULL); -object_unref(OBJECT(&dev->vdev)); +virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev), +TYPE_VIRTIO_BALLOON); object_property_add(obj, "guest-stats", "guest statistics", balloon_ccw_stats_get_all, NULL, NULL, dev, NULL); diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c index dde1d73..745324b 100644 --- a/hw/virtio/virtio-pci.c +++ b/hw/virtio/virtio-pci.c @@ -1316,9 +1316,8 @@ static void virtio_balloon_pci_class_init(ObjectClass *klass, void *data) static void virtio_balloon_pci_instance_init(Object *obj) { VirtIOBalloonPCI *dev = VIRTIO_BALLOON_PCI(obj); -object_initialize(&dev->vdev, sizeof(dev->vdev), TYPE_VIRTIO_BALLOON); -object_property_add_child(obj, "virtio-backend", OBJECT(&dev->vdev), NULL); -object_unref(OBJECT(&dev->vdev)); +virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev), +TYPE_VIRTIO_BALLOON); object_property_add(obj, "guest-stats", "guest statistics", balloon_pci_stats_get_all, NULL, NULL, dev, NULL); OK, but what about this guest-stats property? Should it get the same treatment? -- 1.9.1 hmm, IMHO no. init_common is actually do the following void virtio_instance_init_common(Object *proxy_obj, void *data, size_t vdev_size, const char *vdev_name) { DeviceState *vdev = data; object_initialize(vdev, vdev_size, vdev_name); object_property_add_child(proxy_obj, "virtio-backend", OBJECT(vdev), NULL); object_unref(OBJECT(vdev)); qdev_alias_all_properties(vdev, proxy_obj); } on the other hand there is the following code in s390 static void s390_virtio_net_instance_init(Object *obj) { VirtIONetS390 *dev = VIRTIO_NET_S390(obj); virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev), TYPE_VIRTIO_NET); object_property_add_alias(obj, "bootindex", OBJECT(&dev->vdev), "bootindex", &error_abort); } which does not contain guest-stats property. But why doesn't it? Seems like an obvious omission? no it is not cfind . | xargs fgrep "guest-stats" ./hw/virtio/virtio-pci.c: object_property_get(OBJECT(&dev->vdev), v, "guest-stats", errp); ./hw/virtio/virtio-pci.c: object_property_get(OBJECT(&dev->vdev), v, "guest-stats-polling-interval", ./hw/virtio/virtio-pci.c: object_property_set(OBJECT(&dev->vdev), v, "guest-stats-polling-interval", ./hw/virtio/virtio-pci.c:object_property_add(obj, "guest-stats", "guest statistics", ./hw/virtio/virtio-pci.c:object_property_add(obj, "guest-stats-polling-interval", "int", ./hw/virtio/virtio-balloon.c:visit_start_struct(v, NULL, "guest-stats", name, 0, &err); ./hw/virtio/virtio-balloon.c:object_property_add(OBJECT(dev), "guest-stats", "guest statistics", ./hw/virtio/virtio-balloon.c:object_property_add(OBJECT(dev), "guest-stats-polling-interval", "int", ./hw/s390x/virtio-ccw.c: obje
Re: [Qemu-devel] [RESEND PATCH 1/2] balloon: call qdev_alias_all_properties for proxy dev in balloon class init
On 19/02/15 13:17, Michael S. Tsirkin wrote: On Thu, Feb 19, 2015 at 12:46:34PM +0300, Denis V. Lunev wrote: On 19/02/15 12:39, Michael S. Tsirkin wrote: On Thu, Feb 19, 2015 at 12:36:37PM +0300, Denis V. Lunev wrote: On 19/02/15 12:25, Michael S. Tsirkin wrote: On Thu, Jan 29, 2015 at 05:24:41PM +0300, Denis V. Lunev wrote: The idea is that all other virtio devices are calling this helper to merge properties of the proxy device. This is the only difference in between this helper and code in inside virtio_instance_init_common. The patch should not cause any harm as property list in generic balloon code is empty. This also allows to avoid some dummy errors like fixed by this commit 91ba21208839643603e7f7fa5864723c3f371ebe Author: Gonglei Date: Tue Sep 30 14:10:35 2014 +0800 virtio-balloon: fix virtio-balloon child refcount in transports Signed-off-by: Denis V. Lunev Signed-off-by: Raushaniya Maksudova Revieved-by: Cornelia Huck CC: Christian Borntraeger CC: Anthony Liguori CC: Michael S. Tsirkin --- hw/s390x/virtio-ccw.c | 5 ++--- hw/virtio/virtio-pci.c | 5 ++--- 2 files changed, 4 insertions(+), 6 deletions(-) diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c index ea236c9..82da894 100644 --- a/hw/s390x/virtio-ccw.c +++ b/hw/s390x/virtio-ccw.c @@ -899,9 +899,8 @@ static void balloon_ccw_stats_set_poll_interval(Object *obj, struct Visitor *v, static void virtio_ccw_balloon_instance_init(Object *obj) { VirtIOBalloonCcw *dev = VIRTIO_BALLOON_CCW(obj); -object_initialize(&dev->vdev, sizeof(dev->vdev), TYPE_VIRTIO_BALLOON); -object_property_add_child(obj, "virtio-backend", OBJECT(&dev->vdev), NULL); -object_unref(OBJECT(&dev->vdev)); +virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev), +TYPE_VIRTIO_BALLOON); object_property_add(obj, "guest-stats", "guest statistics", balloon_ccw_stats_get_all, NULL, NULL, dev, NULL); diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c index dde1d73..745324b 100644 --- a/hw/virtio/virtio-pci.c +++ b/hw/virtio/virtio-pci.c @@ -1316,9 +1316,8 @@ static void virtio_balloon_pci_class_init(ObjectClass *klass, void *data) static void virtio_balloon_pci_instance_init(Object *obj) { VirtIOBalloonPCI *dev = VIRTIO_BALLOON_PCI(obj); -object_initialize(&dev->vdev, sizeof(dev->vdev), TYPE_VIRTIO_BALLOON); -object_property_add_child(obj, "virtio-backend", OBJECT(&dev->vdev), NULL); -object_unref(OBJECT(&dev->vdev)); +virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev), +TYPE_VIRTIO_BALLOON); object_property_add(obj, "guest-stats", "guest statistics", balloon_pci_stats_get_all, NULL, NULL, dev, NULL); OK, but what about this guest-stats property? Should it get the same treatment? -- 1.9.1 hmm, IMHO no. init_common is actually do the following void virtio_instance_init_common(Object *proxy_obj, void *data, size_t vdev_size, const char *vdev_name) { DeviceState *vdev = data; object_initialize(vdev, vdev_size, vdev_name); object_property_add_child(proxy_obj, "virtio-backend", OBJECT(vdev), NULL); object_unref(OBJECT(vdev)); qdev_alias_all_properties(vdev, proxy_obj); } on the other hand there is the following code in s390 static void s390_virtio_net_instance_init(Object *obj) { VirtIONetS390 *dev = VIRTIO_NET_S390(obj); virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev), TYPE_VIRTIO_NET); object_property_add_alias(obj, "bootindex", OBJECT(&dev->vdev), "bootindex", &error_abort); } which does not contain guest-stats property. But why doesn't it? Seems like an obvious omission? no it is not cfind . | xargs fgrep "guest-stats" ./hw/virtio/virtio-pci.c: object_property_get(OBJECT(&dev->vdev), v, "guest-stats", errp); ./hw/virtio/virtio-pci.c: object_property_get(OBJECT(&dev->vdev), v, "guest-stats-polling-interval", ./hw/virtio/virtio-pci.c: object_property_set(OBJECT(&dev->vdev), v, "guest-stats-polling-interval", ./hw/virtio/virtio-pci.c:object_property_add(obj, "guest-stats", "guest statistics", ./hw/virtio/virtio-pci.c:object_property_add(obj, "guest-stats-polling-interval", "int", ./hw/virtio/virtio-balloon.c:visit_start_struct(v, NULL, "guest-stats", name, 0, &err); ./hw/virtio/virtio-balloon.c:object_property_add(OBJECT(dev), "guest-stats", "guest statistics", ./hw/virtio/virtio-balloon.c:object_property_a
Re: [Qemu-devel] [PATCH 03/10] guest agent: guest-file-open: refactoring
On 17/02/15 20:59, Eric Blake wrote: On 02/17/2015 10:56 AM, Michael Roth wrote: In any case, since I was actually the one who re-invented it, and this code just moves it to another function, I think we can address it as a seperate patch and leave the PULL intact (unless there are other objections). Agree that a separate cleanup patch would be okay. OK. There is a portable way to detect whether the descriptor is a pipe or socket on Windows. Thus generic helper is possible. We'll provide you with one. ETA is something around next week.
[Qemu-devel] [PATCH 1/2] balloon: call qdev_alias_all_properties for proxy dev in balloon class init
The idea is that all other virtio devices are calling this helper to merge properties of the proxy device. This is the only difference in between this helper and code in inside virtio_instance_init_common. The patch should not cause any harm as property list in generic balloon code is empty. This also allows to avoid some dummy errors like fixed by this commit 91ba21208839643603e7f7fa5864723c3f371ebe Author: Gonglei Date: Tue Sep 30 14:10:35 2014 +0800 virtio-balloon: fix virtio-balloon child refcount in transports Signed-off-by: Denis V. Lunev Signed-off-by: Raushaniya Maksudova Revieved-by: Cornelia Huck CC: Christian Borntraeger CC: Anthony Liguori CC: Michael S. Tsirkin Signed-off-by: Denis V. Lunev --- hw/s390x/virtio-ccw.c | 5 ++--- hw/virtio/virtio-pci.c | 5 ++--- 2 files changed, 4 insertions(+), 6 deletions(-) diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c index 3fee4aa..ffbb9c2 100644 --- a/hw/s390x/virtio-ccw.c +++ b/hw/s390x/virtio-ccw.c @@ -898,9 +898,8 @@ static void balloon_ccw_stats_set_poll_interval(Object *obj, struct Visitor *v, static void virtio_ccw_balloon_instance_init(Object *obj) { VirtIOBalloonCcw *dev = VIRTIO_BALLOON_CCW(obj); -object_initialize(&dev->vdev, sizeof(dev->vdev), TYPE_VIRTIO_BALLOON); -object_property_add_child(obj, "virtio-backend", OBJECT(&dev->vdev), NULL); -object_unref(OBJECT(&dev->vdev)); +virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev), +TYPE_VIRTIO_BALLOON); object_property_add(obj, "guest-stats", "guest statistics", balloon_ccw_stats_get_all, NULL, NULL, dev, NULL); diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c index 6dd41b9..e7baf7b 100644 --- a/hw/virtio/virtio-pci.c +++ b/hw/virtio/virtio-pci.c @@ -1274,9 +1274,8 @@ static void virtio_balloon_pci_class_init(ObjectClass *klass, void *data) static void virtio_balloon_pci_instance_init(Object *obj) { VirtIOBalloonPCI *dev = VIRTIO_BALLOON_PCI(obj); -object_initialize(&dev->vdev, sizeof(dev->vdev), TYPE_VIRTIO_BALLOON); -object_property_add_child(obj, "virtio-backend", OBJECT(&dev->vdev), NULL); -object_unref(OBJECT(&dev->vdev)); +virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev), +TYPE_VIRTIO_BALLOON); object_property_add(obj, "guest-stats", "guest statistics", balloon_pci_stats_get_all, NULL, NULL, dev, NULL); -- 1.9.1
[Qemu-devel] [PATCH 2/2] balloon: add a feature bit to let Guest OS deflate balloon on oom
Excessive virtio_balloon inflation can cause invocation of OOM-killer, when Linux is under severe memory pressure. Various mechanisms are responsible for correct virtio_balloon memory management. Nevertheless it is often the case that these control tools does not have enough time to react on fast changing memory load. As a result OS runs out of memory and invokes OOM-killer. The balancing of memory by use of the virtio balloon should not cause the termination of processes while there are pages in the balloon. Now there is no way for virtio balloon driver to free memory at the last moment before some process get killed by OOM-killer. This does not provide a security breach as balloon itself is running inside Guest OS and is working in the cooperation with the host. Thus some improvements from Guest side should be considered as normal. To solve the problem, introduce a virtio_balloon callback which is expected to be called from the oom notifier call chain in out_of_memory() function. If virtio balloon could release some memory, it will make the system return and retry the allocation that forced the out of memory killer to run. This behavior should be enabled if and only if appropriate feature bit is set on the device. It is off by default. This functionality was recently merged into vanilla Linux. commit 5a10b7dbf904bfe01bb9fcc6298f7df09eed77d5 Author: Raushaniya Maksudova Date: Mon Nov 10 09:36:29 2014 +1030 This patch adds respective control bits into QEMU. It introduces deflate-on-oom option for balloon device which does the trick. Signed-off-by: Denis V. Lunev CC: Raushaniya Maksudova CC: Anthony Liguori CC: Michael S. Tsirkin --- hw/virtio/virtio-balloon.c | 6 -- include/hw/virtio/virtio-balloon.h | 1 + 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c index 21e449a..cbc5f7f 100644 --- a/hw/virtio/virtio-balloon.c +++ b/hw/virtio/virtio-balloon.c @@ -305,8 +305,8 @@ static void virtio_balloon_set_config(VirtIODevice *vdev, static uint32_t virtio_balloon_get_features(VirtIODevice *vdev, uint32_t f) { -f |= (1 << VIRTIO_BALLOON_F_STATS_VQ); -return f; +VirtIOBalloon *dev = VIRTIO_BALLOON(vdev); +return f | (1u << VIRTIO_BALLOON_F_STATS_VQ) | dev->host_features; } static void virtio_balloon_stat(void *opaque, BalloonInfo *info) @@ -409,6 +409,8 @@ static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp) } static Property virtio_balloon_properties[] = { +DEFINE_PROP_BIT("deflate-on-oom", VirtIOBalloon, host_features, +VIRTIO_BALLOON_F_DEFLATE_ON_OOM, false), DEFINE_PROP_END_OF_LIST(), }; diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h index 4ab8f54..7f49b1f 100644 --- a/include/hw/virtio/virtio-balloon.h +++ b/include/hw/virtio/virtio-balloon.h @@ -36,6 +36,7 @@ typedef struct VirtIOBalloon { QEMUTimer *stats_timer; int64_t stats_last_update; int64_t stats_poll_interval; +uint32_t host_features; } VirtIOBalloon; #endif -- 1.9.1
Re: [Qemu-devel] [PATCH v2 0/8] qemu: guest agent: implement guest-exec command for Linux
On 31/12/14 16:06, Denis V. Lunev wrote: hese patches for guest-agent add the functionality to execute commands on a guest UNIX machine. These patches add the following interfaces: guest-pipe-open guest-exec guest-exec-status With these interfaces it's possible to: * Open an anonymous pipe and work with it as with a file using already implemented interfaces guest-file-{read,write,flush,close}. * Execute a binary or a script on a guest machine. We can pass arbitrary arguments and environment to a new child process. * Pass redirected IO from/to stdin, stdout, stderr from a child process to a local file or an anonymous pipe. * Check the status of a child process, get its exit status if it exited or get signal number if it was killed. We plan to add support for Windows in the near future using the same interfaces we introduce here. Example of usage: {"execute": "guest-pipe-open", "arguments":{"mode": "r"}} {'return': 1000} {"execute":"guest-exec", "arguments":{ "path": "id", "params": ["user"], "env": ["MYENV=myvalue"], "handle_stdout": 1000 }}' {"return": 2636} {"execute": "guest-exec-status", "arguments": {"pid": 2636}} {"return":{"exit":0,"handle_stderr":-1,"handle_stdin":-1,"handle_stdout":1000,"signal":-1}} {"execute": "guest-file-read", "arguments": {"handle": 1000, "count":128}} {"return":{"count":58,"buf-b64":"dWlk...","eof":true}} {"execute": "guest-file-close", "arguments": {"handle": 1000}} {"return":{}} These patches are based on the patches proposed by Michael Roth in 2011: http://lists.gnu.org/archive/html/qemu-devel/2011-12/msg00722.html We made several modifications to the interfaces proposed by Michael to simplify their usage and we also added several new functions: * Arguments to an executable file are passed as an array of strings. Before that we had to pass parameters like this: 'params': [ {'param': '-b'}, {'param': '-n1'} ] Now we may pass them just as an array of strings: 'params': [ '-b', '-n1' ] * Environment can be passed to a child process. "env": ["MYENV=myvalue"] * Removed "detach" argument from "guest-exec" - it never waits for a child process to signal. With this change it's possible to return just PID from "guest-exec", a user must call "guest-exec-status" to get the status of a child. * Removed "wait" argument from "guest-exec-status" - waiting for a child process to signal is dangerous, because it may block the whole qemu-ga. Instead, the command polls just once a status of a child. * "guest-exec-status" returns exit status of a child process or a signal number, in case a process was killed. * Simplified the command "guest-pipe-open" - the way how it stores the pipe fd's. No additional logic is needed about which end of pipe should be closed with "guest-file-close". This way we avoid modifying the interface of the existing "guest-file-close". In the conversation about the original patches there was a suggestion to merge "path" and "params" into one single list. But we didn't do so, because having separate "path" is similar to exec*() family functions in UNIX. That way it looks more consistent. Changes from v1: - Windows version of the patchset is added - SIGPIPE processing is added for Unix version of the patchset - added guest_exec_file_busy() to prevent GuestFileHandle* object from being deleted in case it's used in guest-exec command. Signed-off-by: Semen Zolin Signed-off-by: Olga Krishtal Signed-off-by: Denis V. Lunev ping
[Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files
fallocate() works fine and could handle properly with arbitrary size requests. There is no sense to reduce the amount of space to fallocate. The bigger is the size, the better is the performance as the amount of journal updates is reduced. Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 17 + 1 file changed, 17 insertions(+) diff --git a/block/raw-posix.c b/block/raw-posix.c index fa05239..e0b35c9 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -292,6 +292,20 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) } } +static void raw_probe_max_write_zeroes(BlockDriverState *bs) +{ +BDRVRawState *s = bs->opaque; +struct stat st; + +if (fstat(s->fd, &st) < 0) { +return; /* no problem, keep default value */ +} +if (!S_ISREG(st.st_mode) || !s->discard_zeroes) { +return; +} +bs->bl.max_write_zeroes = INT_MAX; +} + static void raw_parse_flags(int bdrv_flags, int *open_flags) { assert(open_flags != NULL); @@ -598,6 +612,7 @@ static int raw_reopen_prepare(BDRVReopenState *state, /* Fail already reopen_prepare() if we can't get a working O_DIRECT * alignment with the new fd. */ if (raw_s->fd != -1) { +raw_probe_max_write_zeroes(state->bs); raw_probe_alignment(state->bs, raw_s->fd, &local_err); if (local_err) { qemu_close(raw_s->fd); @@ -651,6 +666,8 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp) raw_probe_alignment(bs, s->fd, errp); bs->bl.opt_mem_alignment = s->buf_align; + +raw_probe_max_write_zeroes(bs); } static ssize_t handle_aiocb_ioctl(RawPosixAIOData *aiocb) -- 1.9.1
[Qemu-devel] [PATCH 3/7] block/raw-posix: refactor handle_aiocb_write_zeroes a bit
move code dealing with a block device to a separate function. This will allow to implement additional processing for ordinary files. Pls note, that xfs_code has been moved before checking for s->has_write_zeroes as xfs_write_zeroes does not touch this flag inside. This makes code a bit more consistent. Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 48 +--- 1 file changed, 29 insertions(+), 19 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index 2aa268a..24e1fab 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -914,41 +914,51 @@ static int do_fallocate(int fd, int mode, off_t offset, off_t len) } #endif -static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) +static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb) { -int ret = -EOPNOTSUPP; +int ret = -ENOTSUP; BDRVRawState *s = aiocb->bs->opaque; -if (s->has_write_zeroes == 0) { +if (!s->has_write_zeroes) { return -ENOTSUP; } -if (aiocb->aio_type & QEMU_AIO_BLKDEV) { #ifdef BLKZEROOUT -do { -uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes }; -if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) { -return 0; -} -} while (errno == EINTR); - -ret = -errno; -#endif -} else { -#ifdef CONFIG_XFS -if (s->is_xfs) { -return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes); +do { +uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes }; +if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) { +return 0; } +} while (errno == EINTR); + +ret = translate_err(-errno); #endif -} -ret = translate_err(ret); if (ret == -ENOTSUP) { s->has_write_zeroes = false; } return ret; } +static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) +{ +int ret = -ENOTSUP; +BDRVRawState *s = aiocb->bs->opaque; + +if (aiocb->aio_type & QEMU_AIO_BLKDEV) { +return handle_aiocb_write_zeroes_block(aiocb); +} + +#ifdef CONFIG_XFS +if (s->is_xfs) { +return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes); +} +#endif + +s->has_write_zeroes = false; +return ret; +} + static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb) { int ret = -EOPNOTSUPP; -- 1.9.1
[Qemu-devel] [PATCH 1/7] block/raw-posix: create translate_err helper to merge errno values
actually the code if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP || ret == -ENOTTY) { ret = -ENOTSUP; } is present twice and will be added a couple more times. Create helper for this. Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index e51293a..24300d0 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -893,6 +893,15 @@ static int xfs_discard(BDRVRawState *s, int64_t offset, uint64_t bytes) } #endif +static int translate_err(int err) +{ +if (err == -ENODEV || err == -ENOSYS || err == -EOPNOTSUPP || +err == -ENOTTY) { +err = -ENOTSUP; +} +return err; +} + static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) { int ret = -EOPNOTSUPP; @@ -921,10 +930,9 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) #endif } -if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP || -ret == -ENOTTY) { +ret = translate_err(ret); +if (ret == -ENOTSUP) { s->has_write_zeroes = false; -ret = -ENOTSUP; } return ret; } @@ -968,10 +976,9 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb) #endif } -if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP || -ret == -ENOTTY) { +ret = translate_err(ret); +if (ret == -ENOTSUP) { s->has_discard = false; -ret = -ENOTSUP; } return ret; } -- 1.9.1
[Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes
There is a possibility that we are extending our image and thus writing zeroes beyond the end of the file. In this case we do not need to care about the hole to make sure that there is no data in the file under this offset (pre-condition to fallocate(0) to work). We could simply call fallocate(0). This improves the performance of writing zeroes even on really old platforms which do not have even FALLOC_FL_PUNCH_HOLE. Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index c039bef..fa05239 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -60,7 +60,7 @@ #define FS_NOCOW_FL 0x0080 /* Do not cow file */ #endif #endif -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) +#ifdef CONFIG_FALLOCATE #include #endif #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__) @@ -902,7 +902,7 @@ static int translate_err(int err) return err; } -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) +#ifdef CONFIG_FALLOCATE static int do_fallocate(int fd, int mode, off_t offset, off_t len) { do { @@ -981,6 +981,12 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) } #endif +#ifdef CONFIG_FALLOCATE +if (aiocb->aio_offset >= aiocb->bs->total_sectors << BDRV_SECTOR_BITS) { +return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes); +} +#endif + s->has_write_zeroes = false; return ret; } -- 1.9.1
[Qemu-devel] [PATCH 4/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes
This efficiently writes zeroes on Linux if the kernel is capable enough. FALLOC_FL_ZERO_RANGE correctly handles all cases, including and not including file expansion. Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 16 ++-- configure | 19 +++ 2 files changed, 33 insertions(+), 2 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index 24e1fab..3c35b2f 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -60,7 +60,7 @@ #define FS_NOCOW_FL 0x0080 /* Do not cow file */ #endif #endif -#ifdef CONFIG_FALLOCATE_PUNCH_HOLE +#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) #include #endif #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__) @@ -902,7 +902,7 @@ static int translate_err(int err) return err; } -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) +#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) static int do_fallocate(int fd, int mode, off_t offset, off_t len) { do { @@ -955,6 +955,18 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) } #endif +#ifdef CONFIG_FALLOCATE_ZERO_RANGE +if (s->has_write_zeroes) { +ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE, + aiocb->aio_offset, aiocb->aio_nbytes); +if (ret == 0 || ret != -ENOTSUP) { +return ret; +} +s->has_write_zeroes = false; +return ret; +} +#endif + s->has_write_zeroes = false; return ret; } diff --git a/configure b/configure index f185dd0..e00e03a 100755 --- a/configure +++ b/configure @@ -3335,6 +3335,22 @@ if compile_prog "" "" ; then fallocate_punch_hole=yes fi +# check that fallocate supports range zeroing inside the file +fallocate_zero_range=no +cat > $TMPC << EOF +#include +#include + +int main(void) +{ +fallocate(0, FALLOC_FL_ZERO_RANGE, 0, 0); +return 0; +} +EOF +if compile_prog "" "" ; then + fallocate_zero_range=yes +fi + # check for posix_fallocate posix_fallocate=no cat > $TMPC << EOF @@ -4567,6 +4583,9 @@ fi if test "$fallocate_punch_hole" = "yes" ; then echo "CONFIG_FALLOCATE_PUNCH_HOLE=y" >> $config_host_mak fi +if test "$fallocate_zero_range" = "yes" ; then + echo "CONFIG_FALLOCATE_ZERO_RANGE=y" >> $config_host_mak +fi if test "$posix_fallocate" = "yes" ; then echo "CONFIG_POSIX_FALLOCATE=y" >> $config_host_mak fi -- 1.9.1
[Qemu-devel] [PATCH v4 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c
These patches eliminate data writes completely on Linux if fallocate FALLOC_FL_ZERO_RANGE or FALLOC_FL_PUNCH_HOLE are supported on underlying filesystem. I have performed several tests with non-aligned fallocate calls and in all cases (with non-aligned fallocates) Linux performs fine, i.e. areas are zeroed correctly. Checks were made on Linux 3.16.0-28-generic #38-Ubuntu SMP This should seriously increase performance of bdrv_write_zeroes Changes from v3: - dropped original patch 1, equivalent stuff was merged already - reordered patches as suggested by Fam - fixes spelling errors as suggested by Fam - fixed not initialized value in handle_aiocb_write_zeroes - fixed wrong error processing from do_fallocate(FALLOC_FL_ZERO_RANGE) Changes from v2: - added Peter Lieven to CC - added CONFIG_FALLOCATE check to call do_fallocate in patch 7 - dropped patch 1 as NACK-ed - added processing of very large data areas in bdrv_co_write_zeroes (new patch 1) - set bl.max_write_zeroes to INT_MAX in raw-posix.c for regular files (new patch 8) Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng
[Qemu-devel] [PATCH 2/7] block/raw-posix: create do_fallocate helper
The pattern do { if (fallocate(s->fd, mode, offset, len) == 0) { return 0; } } while (errno == EINTR); ret = translate_err(-errno); will be commonly useful in next patches. Create helper for it. Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 22 ++ 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index 24300d0..2aa268a 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -902,6 +902,18 @@ static int translate_err(int err) return err; } +#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) +static int do_fallocate(int fd, int mode, off_t offset, off_t len) +{ +do { +if (fallocate(fd, mode, offset, len) == 0) { +return 0; +} +} while (errno == EINTR); +return translate_err(-errno); +} +#endif + static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) { int ret = -EOPNOTSUPP; @@ -965,14 +977,8 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb) #endif #ifdef CONFIG_FALLOCATE_PUNCH_HOLE -do { -if (fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, - aiocb->aio_offset, aiocb->aio_nbytes) == 0) { -return 0; -} -} while (errno == EINTR); - -ret = -errno; +ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + aiocb->aio_offset, aiocb->aio_nbytes); #endif } -- 1.9.1
[Qemu-devel] [PATCH 5/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes
This sequence works efficiently if FALLOC_FL_ZERO_RANGE is not supported. Unfortunately, FALLOC_FL_ZERO_RANGE is supported on really modern systems and only for a couple of filesystems. FALLOC_FL_PUNCH_HOLE is much more mature. The sequence of 2 operations FALLOC_FL_PUNCH_HOLE and 0 is necessary due to the following reasons: - FALLOC_FL_PUNCH_HOLE creates a hole in the file, the file becomes sparse. In order to retain original functionality we must allocate disk space afterwards. This is done using fallocate(0) call - fallocate(0) without preceeding FALLOC_FL_PUNCH_HOLE will do nothing if called above already allocated areas of the file, i.e. the content will not be zeroed This should increase the performance a bit for not-so-modern kernels. Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/block/raw-posix.c b/block/raw-posix.c index 3c35b2f..c039bef 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -967,6 +967,20 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) } #endif +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE +if (s->has_discard) { +ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + aiocb->aio_offset, aiocb->aio_nbytes); +if (ret < 0) { +if (ret == -ENOTSUP) { +s->has_discard = false; +} +return ret; +} +return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes); +} +#endif + s->has_write_zeroes = false; return ret; } -- 1.9.1
Re: [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files
On 27/01/15 21:05, Max Reitz wrote: On 2015-01-27 at 08:51, Denis V. Lunev wrote: fallocate() works fine and could handle properly with arbitrary size requests. Maybe "could properly handle arbitrary size requests" (or "...arbitrarily sized requests")? There is no sense to reduce the amount of space to fallocate. The bigger is the size, the better is the performance as the amount of journal updates is reduced. True for fallocate(). But is it true for xfs_write_zeroes(), too? I guess so, but I don't know. If it does, the patch looks good to me. Max Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 17 + 1 file changed, 17 insertions(+) thank you very much for a review. I will proceed with these findings, they look quite reasonable.
Re: [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes
On 27/01/15 20:57, Max Reitz wrote: On 2015-01-27 at 08:51, Denis V. Lunev wrote: There is a possibility that we are extending our image and thus writing zeroes beyond the end of the file. In this case we do not need to care about the hole to make sure that there is no data in the file under this offset (pre-condition to fallocate(0) to work). We could simply call fallocate(0). This improves the performance of writing zeroes even on really old platforms which do not have even FALLOC_FL_PUNCH_HOLE. Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index c039bef..fa05239 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -60,7 +60,7 @@ #define FS_NOCOW_FL 0x0080 /* Do not cow file */ #endif #endif -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) +#ifdef CONFIG_FALLOCATE This change doesn't seem right; CONFIG_FALLOCATE is set if posix_fallocate() is available, not for the Linux-specific fallocate() from linux/falloc.h. here is a check for fallocate and posix_fallocate in configure script # check for fallocate fallocate=no cat > $TMPC << EOF #include int main(void) { fallocate(0, 0, 0, 0); return 0; } EOF if compile_prog "" "" ; then fallocate=yes fi ... # check for posix_fallocate posix_fallocate=no cat > $TMPC << EOF #include int main(void) { posix_fallocate(0, 0, 0); return 0; } EOF if compile_prog "" "" ; then posix_fallocate=yes fi ... if test "$fallocate" = "yes" ; then echo "CONFIG_FALLOCATE=y" >> $config_host_mak fi ... if test "$posix_fallocate" = "yes" ; then echo "CONFIG_POSIX_FALLOCATE=y" >> $config_host_mak fi Thus my check looks correct to me. #include #endif #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__) @@ -902,7 +902,7 @@ static int translate_err(int err) return err; } -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) +#ifdef CONFIG_FALLOCATE Same here. static int do_fallocate(int fd, int mode, off_t offset, off_t len) { do { @@ -981,6 +981,12 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) } #endif +#ifdef CONFIG_FALLOCATE +if (aiocb->aio_offset >= aiocb->bs->total_sectors << BDRV_SECTOR_BITS) { +return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes); +} +#endif + This seems fine though, but as I've asked in patch 5: Do we want to have a "has_fallocate"? Other than that, this is the first usage of bs->total_sectors in this file; raw_co_get_block_status() does a similar check, but it uses bdrv_getlength() instead. If bs->total_sectors is correct, bdrv_getlength() will actually do nothing but return bs->total_sectors * BDRV_SECTOR_SIZE; it will only do more (that is, update bs->total_sectors) if it is not correct to use bs->total_sectors (and I feel like it may not be correct because BlockDriver.has_variable_length is true). Max ok, will do
Re: [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes
On 27/01/15 21:24, Max Reitz wrote: On 2015-01-27 at 13:19, Denis V. Lunev wrote: On 27/01/15 20:57, Max Reitz wrote: On 2015-01-27 at 08:51, Denis V. Lunev wrote: There is a possibility that we are extending our image and thus writing zeroes beyond the end of the file. In this case we do not need to care about the hole to make sure that there is no data in the file under this offset (pre-condition to fallocate(0) to work). We could simply call fallocate(0). This improves the performance of writing zeroes even on really old platforms which do not have even FALLOC_FL_PUNCH_HOLE. Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index c039bef..fa05239 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -60,7 +60,7 @@ #define FS_NOCOW_FL 0x0080 /* Do not cow file */ #endif #endif -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) +#ifdef CONFIG_FALLOCATE This change doesn't seem right; CONFIG_FALLOCATE is set if posix_fallocate() is available, not for the Linux-specific fallocate() from linux/falloc.h. here is a check for fallocate and posix_fallocate in configure script # check for fallocate fallocate=no cat > $TMPC << EOF #include int main(void) { fallocate(0, 0, 0, 0); return 0; } EOF if compile_prog "" "" ; then fallocate=yes fi ... # check for posix_fallocate posix_fallocate=no cat > $TMPC << EOF #include int main(void) { posix_fallocate(0, 0, 0); return 0; } EOF if compile_prog "" "" ; then posix_fallocate=yes fi ... if test "$fallocate" = "yes" ; then echo "CONFIG_FALLOCATE=y" >> $config_host_mak fi ... if test "$posix_fallocate" = "yes" ; then echo "CONFIG_POSIX_FALLOCATE=y" >> $config_host_mak fi Thus my check looks correct to me. Oh, sorry, I somehow mixed those checks. You're right. Very well then; maybe you want to mention this change in the commit message, though? Max no prob
Re: [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files
On 27/01/15 21:05, Max Reitz wrote: On 2015-01-27 at 08:51, Denis V. Lunev wrote: fallocate() works fine and could handle properly with arbitrary size requests. Maybe "could properly handle arbitrary size requests" (or "...arbitrarily sized requests")? There is no sense to reduce the amount of space to fallocate. The bigger is the size, the better is the performance as the amount of journal updates is reduced. True for fallocate(). But is it true for xfs_write_zeroes(), too? I guess so, but I don't know. If it does, the patch looks good to me. Max checked. xfs_ioc_space (ioctl handler) calls exactly the same xfs_zero_file_space as performed by xfs_file_fallocate on fallocate path. I will reflect this in the description Regards, Den
[Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files
fallocate() works fine and could handle properly with arbitrary size requests. There is no sense to reduce the amount of space to fallocate. The bigger is the size, the better is the performance as the amount of journal updates is reduced. The patch changes behavior for both generic filesystem and XFS codepaths, which are different in handle_aiocb_write_zeroes. The implementation of falocate and xfsctl(XFS_IOC_ZERO_RANGE) for XFS are exactly the same thus the change is fine for both ways. Signed-off-by: Denis V. Lunev CC: Max Reitz CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 17 + 1 file changed, 17 insertions(+) diff --git a/block/raw-posix.c b/block/raw-posix.c index 3db911a..ec38fee 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -293,6 +293,20 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) } } +static void raw_probe_max_write_zeroes(BlockDriverState *bs) +{ +BDRVRawState *s = bs->opaque; +struct stat st; + +if (fstat(s->fd, &st) < 0) { +return; /* no problem, keep default value */ +} +if (!S_ISREG(st.st_mode) || !s->discard_zeroes) { +return; +} +bs->bl.max_write_zeroes = INT_MAX; +} + static void raw_parse_flags(int bdrv_flags, int *open_flags) { assert(open_flags != NULL); @@ -600,6 +614,7 @@ static int raw_reopen_prepare(BDRVReopenState *state, /* Fail already reopen_prepare() if we can't get a working O_DIRECT * alignment with the new fd. */ if (raw_s->fd != -1) { +raw_probe_max_write_zeroes(state->bs); raw_probe_alignment(state->bs, raw_s->fd, &local_err); if (local_err) { qemu_close(raw_s->fd); @@ -653,6 +668,8 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp) raw_probe_alignment(bs, s->fd, errp); bs->bl.opt_mem_alignment = s->buf_align; + +raw_probe_max_write_zeroes(bs); } static ssize_t handle_aiocb_ioctl(RawPosixAIOData *aiocb) -- 1.9.1
[Qemu-devel] [PATCH 3/7] block/raw-posix: refactor handle_aiocb_write_zeroes a bit
move code dealing with a block device to a separate function. This will allow to implement additional processing for ordinary files. Please note, that xfs_code has been moved before checking for s->has_write_zeroes as xfs_write_zeroes does not touch this flag inside. This makes code a bit more consistent. Signed-off-by: Denis V. Lunev Reviewed-by: Max Reitz CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 46 +++--- 1 file changed, 27 insertions(+), 19 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index 2aa268a..d3910fd 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -914,41 +914,49 @@ static int do_fallocate(int fd, int mode, off_t offset, off_t len) } #endif -static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) +static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb) { -int ret = -EOPNOTSUPP; +int ret = -ENOTSUP; BDRVRawState *s = aiocb->bs->opaque; -if (s->has_write_zeroes == 0) { +if (!s->has_write_zeroes) { return -ENOTSUP; } -if (aiocb->aio_type & QEMU_AIO_BLKDEV) { #ifdef BLKZEROOUT -do { -uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes }; -if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) { -return 0; -} -} while (errno == EINTR); - -ret = -errno; -#endif -} else { -#ifdef CONFIG_XFS -if (s->is_xfs) { -return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes); +do { +uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes }; +if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) { +return 0; } +} while (errno == EINTR); + +ret = translate_err(-errno); #endif -} -ret = translate_err(ret); if (ret == -ENOTSUP) { s->has_write_zeroes = false; } return ret; } +static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) +{ +BDRVRawState *s = aiocb->bs->opaque; + +if (aiocb->aio_type & QEMU_AIO_BLKDEV) { +return handle_aiocb_write_zeroes_block(aiocb); +} + +#ifdef CONFIG_XFS +if (s->is_xfs) { +return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes); +} +#endif + +return -ENOTSUP; +} + static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb) { int ret = -EOPNOTSUPP; -- 1.9.1
[Qemu-devel] [PATCH 2/7] block/raw-posix: create do_fallocate helper
The pattern do { if (fallocate(s->fd, mode, offset, len) == 0) { return 0; } } while (errno == EINTR); ret = translate_err(-errno); will be commonly useful in next patches. Create helper for it. Signed-off-by: Denis V. Lunev Reviewed-by: Max Reitz CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 22 ++ 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index 24300d0..2aa268a 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -902,6 +902,18 @@ static int translate_err(int err) return err; } +#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) +static int do_fallocate(int fd, int mode, off_t offset, off_t len) +{ +do { +if (fallocate(fd, mode, offset, len) == 0) { +return 0; +} +} while (errno == EINTR); +return translate_err(-errno); +} +#endif + static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) { int ret = -EOPNOTSUPP; @@ -965,14 +977,8 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb) #endif #ifdef CONFIG_FALLOCATE_PUNCH_HOLE -do { -if (fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, - aiocb->aio_offset, aiocb->aio_nbytes) == 0) { -return 0; -} -} while (errno == EINTR); - -ret = -errno; +ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + aiocb->aio_offset, aiocb->aio_nbytes); #endif } -- 1.9.1
[Qemu-devel] [PATCH v5 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c
These patches eliminate data writes completely on Linux if fallocate FALLOC_FL_ZERO_RANGE or FALLOC_FL_PUNCH_HOLE are supported on underlying filesystem. I have performed several tests with non-aligned fallocate calls and in all cases (with non-aligned fallocates) Linux performs fine, i.e. areas are zeroed correctly. Checks were made on Linux 3.16.0-28-generic #38-Ubuntu SMP This should seriously increase performance of bdrv_write_zeroes Changes from v4: - comments to patches are improved by Max Reitz suggestions - replaced 'return ret' with 'return -ENOTSUP' handle_aiocb_write_zeroes The idea is that we can reach that point only if original ret was equal to -ENOTSUP - added has_fallocate flag to indicate that fallocate is working for given BDS - checked file length with bdrv_getlength Changes from v3: - dropped original patch 1, equivalent stuff was merged already - reordered patches as suggested by Fam - fixes spelling errors as suggested by Fam - fixed not initialized value in handle_aiocb_write_zeroes - fixed wrong error processing from do_fallocate(FALLOC_FL_ZERO_RANGE) Changes from v2: - added Peter Lieven to CC - added CONFIG_FALLOCATE check to call do_fallocate in patch 7 - dropped patch 1 as NACK-ed - added processing of very large data areas in bdrv_co_write_zeroes (new patch 1) - set bl.max_write_zeroes to INT_MAX in raw-posix.c for regular files (new patch 8) Signed-off-by: Denis V. Lunev CC: Max Reitz CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng
[Qemu-devel] [PATCH 5/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes
This sequence works efficiently if FALLOC_FL_ZERO_RANGE is not supported. Unfortunately, FALLOC_FL_ZERO_RANGE is supported on really modern systems and only for a couple of filesystems. FALLOC_FL_PUNCH_HOLE is much more mature. The sequence of 2 operations FALLOC_FL_PUNCH_HOLE and 0 is necessary due to the following reasons: - FALLOC_FL_PUNCH_HOLE creates a hole in the file, the file becomes sparse. In order to retain original functionality we must allocate disk space afterwards. This is done using fallocate(0) call - fallocate(0) without preceeding FALLOC_FL_PUNCH_HOLE will do nothing if called above already allocated areas of the file, i.e. the content will not be zeroed This should increase the performance a bit for not-so-modern kernels. Signed-off-by: Denis V. Lunev CC: Max Reitz CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/block/raw-posix.c b/block/raw-posix.c index 5a777e7..2e24829 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -965,6 +965,21 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) } #endif +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE +if (s->has_discard) { +int ret = do_fallocate(s->fd, + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + aiocb->aio_offset, aiocb->aio_nbytes); +if (ret == 0) { +ret = do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes); +} +if (ret != -ENOTSUP) { +return ret; +} +s->has_discard = false; +} +#endif + return -ENOTSUP; } -- 1.9.1
[Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes
There is a possibility that we are extending our image and thus writing zeroes beyond the end of the file. In this case we do not need to care about the hole to make sure that there is no data in the file under this offset (pre-condition to fallocate(0) to work). We could simply call fallocate(0). This improves the performance of writing zeroes even on really old platforms which do not have even FALLOC_FL_PUNCH_HOLE. Before the patch do_fallocate was used when either CONFIG_FALLOCATE_PUNCH_HOLE or CONFIG_FALLOCATE_ZERO_RANGE are defined. Now the story is different. CONFIG_FALLOCATE is defined when Linux fallocate is defined, posix_fallocate is completely different story (CONFIG_POSIX_FALLOCATE). CONFIG_FALLOCATE is mandatory prerequite for both CONFIG_FALLOCATE_PUNCH_HOLE and CONFIG_FALLOCATE_ZERO_RANGE thus we are on the safe side. Signed-off-by: Denis V. Lunev CC: Max Reitz CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index 2e24829..3db911a 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -147,6 +147,7 @@ typedef struct BDRVRawState { bool has_discard:1; bool has_write_zeroes:1; bool discard_zeroes:1; +bool has_fallocate; bool needs_alignment; } BDRVRawState; @@ -452,6 +453,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options, } if (S_ISREG(st.st_mode)) { s->discard_zeroes = true; +s->has_fallocate = true; } if (S_ISBLK(st.st_mode)) { #ifdef BLKDISCARDZEROES @@ -902,7 +904,7 @@ static int translate_err(int err) return err; } -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) +#ifdef CONFIG_FALLOCATE static int do_fallocate(int fd, int mode, off_t offset, off_t len) { do { @@ -980,6 +982,16 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) } #endif +#ifdef CONFIG_FALLOCATE +if (s->has_fallocate && aiocb->aio_offset >= bdrv_getlength(aiocb->bs)) { +int ret = do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes); +if (ret == 0 || ret != -ENOTSUP) { +return ret; +} +s->has_fallocate = false; +} +#endif + return -ENOTSUP; } -- 1.9.1
[Qemu-devel] [PATCH 1/7] block/raw-posix: create translate_err helper to merge errno values
actually the code if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP || ret == -ENOTTY) { ret = -ENOTSUP; } is present twice and will be added a couple more times. Create helper for this. Signed-off-by: Denis V. Lunev Reviewed-by: Max Reitz CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index e51293a..24300d0 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -893,6 +893,15 @@ static int xfs_discard(BDRVRawState *s, int64_t offset, uint64_t bytes) } #endif +static int translate_err(int err) +{ +if (err == -ENODEV || err == -ENOSYS || err == -EOPNOTSUPP || +err == -ENOTTY) { +err = -ENOTSUP; +} +return err; +} + static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) { int ret = -EOPNOTSUPP; @@ -921,10 +930,9 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) #endif } -if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP || -ret == -ENOTTY) { +ret = translate_err(ret); +if (ret == -ENOTSUP) { s->has_write_zeroes = false; -ret = -ENOTSUP; } return ret; } @@ -968,10 +976,9 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb) #endif } -if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP || -ret == -ENOTTY) { +ret = translate_err(ret); +if (ret == -ENOTSUP) { s->has_discard = false; -ret = -ENOTSUP; } return ret; } -- 1.9.1
[Qemu-devel] [PATCH 4/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes
This efficiently writes zeroes on Linux if the kernel is capable enough. FALLOC_FL_ZERO_RANGE correctly handles all cases, including and not including file expansion. Signed-off-by: Denis V. Lunev CC: Max Reitz CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 15 +-- configure | 19 +++ 2 files changed, 32 insertions(+), 2 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index d3910fd..5a777e7 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -60,7 +60,7 @@ #define FS_NOCOW_FL 0x0080 /* Do not cow file */ #endif #endif -#ifdef CONFIG_FALLOCATE_PUNCH_HOLE +#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) #include #endif #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__) @@ -902,7 +902,7 @@ static int translate_err(int err) return err; } -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) +#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) static int do_fallocate(int fd, int mode, off_t offset, off_t len) { do { @@ -954,6 +954,17 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) } #endif +#ifdef CONFIG_FALLOCATE_ZERO_RANGE +if (s->has_write_zeroes) { +int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE, + aiocb->aio_offset, aiocb->aio_nbytes); +if (ret == 0 || ret != -ENOTSUP) { +return ret; +} +s->has_write_zeroes = false; +} +#endif + return -ENOTSUP; } diff --git a/configure b/configure index f185dd0..e00e03a 100755 --- a/configure +++ b/configure @@ -3335,6 +3335,22 @@ if compile_prog "" "" ; then fallocate_punch_hole=yes fi +# check that fallocate supports range zeroing inside the file +fallocate_zero_range=no +cat > $TMPC << EOF +#include +#include + +int main(void) +{ +fallocate(0, FALLOC_FL_ZERO_RANGE, 0, 0); +return 0; +} +EOF +if compile_prog "" "" ; then + fallocate_zero_range=yes +fi + # check for posix_fallocate posix_fallocate=no cat > $TMPC << EOF @@ -4567,6 +4583,9 @@ fi if test "$fallocate_punch_hole" = "yes" ; then echo "CONFIG_FALLOCATE_PUNCH_HOLE=y" >> $config_host_mak fi +if test "$fallocate_zero_range" = "yes" ; then + echo "CONFIG_FALLOCATE_ZERO_RANGE=y" >> $config_host_mak +fi if test "$posix_fallocate" = "yes" ; then echo "CONFIG_POSIX_FALLOCATE=y" >> $config_host_mak fi -- 1.9.1
[Qemu-devel] [PATCH 1/1] block: change default memory alignment for block requests to 4096
The following sequence int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644); for (i = 0; i < 10; i++) write(fd, buf, 4096); performs 10% better if buf is aligned to 4096 bytes rather then to 512 bytes on HDD with 512/4096 logical/physical sector size. The difference is quite reliable. Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi --- block.c | 4 ++-- block/raw-posix.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/block.c b/block.c index d45e4dd..bc5d1e7 100644 --- a/block.c +++ b/block.c @@ -543,7 +543,7 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp) bs->bl.max_transfer_length = bs->file->bl.max_transfer_length; bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment; } else { -bs->bl.opt_mem_alignment = 512; +bs->bl.opt_mem_alignment = 4096; } if (bs->backing_hd) { @@ -966,7 +966,7 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file, bs->open_flags = flags; bs->guest_block_size = 512; -bs->request_alignment = 512; +bs->request_alignment = 4096; bs->zero_beyond_eof = true; open_flags = bdrv_open_flags(bs, flags); bs->read_only = !(open_flags & BDRV_O_RDWR); diff --git a/block/raw-posix.c b/block/raw-posix.c index ec38fee..d1b3388 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -266,7 +266,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) if (!s->buf_align) { size_t align; buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE); -for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) { +for (align = 4096; align <= MAX_BLOCKSIZE; align <<= 1) { if (pread(fd, buf + align, MAX_BLOCKSIZE, 0) >= 0) { s->buf_align = align; break; @@ -278,7 +278,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) if (!bs->request_alignment) { size_t align; buf = qemu_memalign(s->buf_align, MAX_BLOCKSIZE); -for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) { +for (align = 4096; align <= MAX_BLOCKSIZE; align <<= 1) { if (pread(fd, buf, align, 0) >= 0) { bs->request_alignment = align; break; -- 1.9.1
[Qemu-devel] [PATCH 0/1] block: change default memory alignment for block requests
The following sequence int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644); for (i = 0; i < 10; i++) write(fd, buf, 4096); performs 10% better if buf is aligned to 4096 bytes rather then to 512 bytes on HDD with 512/4096 logical/physical sector size. The difference is quite reliable. I have used the following program to test #define _GNU_SOURCE #include #include #include #include #include #include int main(int argc, char *argv[]) { int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644); void *buf; int i = 0; do { buf = memalign(512, 4096); <--- replace 512 with 4096 if ((unsigned long)buf & 4095) break; i++; } while (1); printf("%d\n", i); memset(buf, 0x11, 4096); for (i = 0; i < 10; i++) write(fd, buf, 4096); close(fd); return 0; } time for in in `seq 1 30` ; do a.out aa ; done The file was placed into 8 GB partition on HDD below to avoid speed change due to different offset on disk. Results are reliable: - 189 vs 180 seconds on Linux 3.16 Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi hades ~/src/qemu # hdparm -I /dev/sdg /dev/sdg: ATA device, with non-removable media Model Number: WDC WD20EZRX-07D8PB0 Serial Number: WD-WCC4M5LVSAEP Firmware Revision: 80.00A80 Transport: Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0 Standards: Supported: 9 8 7 6 5 Likely used: 9 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBAuser addressable sectors: 268435455 LBA48 user addressable sectors: 3907029168 Logical Sector size: 512 bytes Physical Sector size: 4096 bytes device size with M = 1024*1024: 1907729 MBytes device size with M = 1000*1000: 2000398 MBytes (2000 GB) cache/buffer size = unknown Nominal Media Rotation Rate: 5400 Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Standard, with device specific minimum R/W multiple sector transfer: Max = 16 Current = 16 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns Commands/features: Enabled Supported: *SMART feature set Security Mode feature set *Power Management feature set *Write cache *Look-ahead *Host Protected Area feature set *WRITE_BUFFER command *READ_BUFFER command *NOP cmd *DOWNLOAD_MICROCODE Power-Up In Standby feature set *SET_FEATURES required to spinup after power up SET_MAX security extension *48-bit Address feature set *Device Configuration Overlay feature set *Mandatory FLUSH_CACHE *FLUSH_CACHE_EXT *SMART error logging *SMART self-test *General Purpose Logging feature set *64-bit World wide name *WRITE_UNCORRECTABLE_EXT command *{READ,WRITE}_DMA_EXT_GPL commands *Segmented DOWNLOAD_MICROCODE *Gen1 signaling speed (1.5Gb/s) *Gen2 signaling speed (3.0Gb/s) *Gen3 signaling speed (6.0Gb/s) *Native Command Queueing (NCQ) *Host-initiated interface power management *Phy event counters *NCQ priority information *READ_LOG_DMA_EXT equivalent to READ_LOG_EXT *DMA Setup Auto-Activate optimization Device-initiated interface power management *Software settings preservation *SMART Command Transport (SCT) feature set *SCT Write Same (AC2) *SCT Features Control (AC4) *SCT Data Tables (AC5) unknown 206[12] (vendor specific) unknown 206[13] (vendor specific) unknown 206[14] (vendor specific) Security: Master password revision code = 65534 supported not enabled not locked frozen not expired: security count supported: enhanced erase 276min for SECURITY ERASE UNIT. 276min for ENHANCED SECURITY ERASE UNIT. Logical Unit WWN Device Identifier: 50014ee2b5da838c NAA : 5 IEEE OUI: 0014ee Unique ID : 2b5da838c Checksum: correct hades ~/src/qemu #
Re: [Qemu-devel] [PATCH 1/1] block: change default memory alignment for block requests to 4096
On 28/01/15 21:49, Denis V. Lunev wrote: The following sequence int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644); for (i = 0; i < 10; i++) write(fd, buf, 4096); performs 10% better if buf is aligned to 4096 bytes rather then to 512 bytes on HDD with 512/4096 logical/physical sector size. The difference is quite reliable. Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi --- block.c | 4 ++-- block/raw-posix.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/block.c b/block.c index d45e4dd..bc5d1e7 100644 --- a/block.c +++ b/block.c @@ -543,7 +543,7 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp) bs->bl.max_transfer_length = bs->file->bl.max_transfer_length; bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment; } else { -bs->bl.opt_mem_alignment = 512; +bs->bl.opt_mem_alignment = 4096; } if (bs->backing_hd) { @@ -966,7 +966,7 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file, bs->open_flags = flags; bs->guest_block_size = 512; -bs->request_alignment = 512; +bs->request_alignment = 4096; bs->zero_beyond_eof = true; open_flags = bdrv_open_flags(bs, flags); bs->read_only = !(open_flags & BDRV_O_RDWR); diff --git a/block/raw-posix.c b/block/raw-posix.c index ec38fee..d1b3388 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -266,7 +266,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) if (!s->buf_align) { size_t align; buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE); -for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) { +for (align = 4096; align <= MAX_BLOCKSIZE; align <<= 1) { if (pread(fd, buf + align, MAX_BLOCKSIZE, 0) >= 0) { s->buf_align = align; break; @@ -278,7 +278,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) if (!bs->request_alignment) { size_t align; buf = qemu_memalign(s->buf_align, MAX_BLOCKSIZE); -for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) { +for (align = 4096; align <= MAX_BLOCKSIZE; align <<= 1) { if (pread(fd, buf, align, 0) >= 0) { bs->request_alignment = align; break; sorry, the patch is wrong. It breaks 'make check-block'. I will redo it and perform more testing. request-alignment related changes are wrong :( I have run tests without them but added them as a obvious last minute addition.
Re: [Qemu-devel] [PATCH 1/1] block: change default memory alignment for block requests to 4096
On 28/01/15 23:07, Paolo Bonzini wrote: On 28/01/2015 19:49, Denis V. Lunev wrote: The following sequence int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644); for (i = 0; i < 10; i++) write(fd, buf, 4096); performs 10% better if buf is aligned to 4096 bytes rather then to 512 bytes on HDD with 512/4096 logical/physical sector size. The difference is quite reliable. The 10% difference, however, is probably not enough to cover the cost of providing a bounce buffer if a guest is (rightfully) using a 512-byte aligned buffer: bs->bl.opt_mem_alignment is in fact badly named and it should be bs->bl.min_mem_alignment instead. Instead, you probably should patch bdrv_opt_mem_align to return at least 4096, and leave the detection logic intact. This will let qemu_blockalign return a properly aligned buffer to qemu-img and other in-process allocations, without negatively affecting the guest. Thanks, Paolo ok, this looks good to me :) Signed-off-by: Denis V. Lunev CC: Kevin Wolf CC: Stefan Hajnoczi --- block.c | 4 ++-- block/raw-posix.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/block.c b/block.c index d45e4dd..bc5d1e7 100644 --- a/block.c +++ b/block.c @@ -543,7 +543,7 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp) bs->bl.max_transfer_length = bs->file->bl.max_transfer_length; bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment; } else { -bs->bl.opt_mem_alignment = 512; +bs->bl.opt_mem_alignment = 4096; } if (bs->backing_hd) { @@ -966,7 +966,7 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file, bs->open_flags = flags; bs->guest_block_size = 512; -bs->request_alignment = 512; +bs->request_alignment = 4096; bs->zero_beyond_eof = true; open_flags = bdrv_open_flags(bs, flags); bs->read_only = !(open_flags & BDRV_O_RDWR); diff --git a/block/raw-posix.c b/block/raw-posix.c index ec38fee..d1b3388 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -266,7 +266,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) if (!s->buf_align) { size_t align; buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE); -for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) { +for (align = 4096; align <= MAX_BLOCKSIZE; align <<= 1) { if (pread(fd, buf + align, MAX_BLOCKSIZE, 0) >= 0) { s->buf_align = align; break; @@ -278,7 +278,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) if (!bs->request_alignment) { size_t align; buf = qemu_memalign(s->buf_align, MAX_BLOCKSIZE); -for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) { +for (align = 4096; align <= MAX_BLOCKSIZE; align <<= 1) { if (pread(fd, buf, align, 0) >= 0) { bs->request_alignment = align; break;
[Qemu-devel] [PATCH v2 0/1] block: enforce minimal 4096 alignment in qemu_blockalign
The following sequence int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644); for (i = 0; i < 10; i++) write(fd, buf, 4096); performs 5% better if buf is aligned to 4096 bytes rather then to 512 bytes on HDD with 512/4096 logical/physical sector size. The difference is quite reliable. I have used the following program to test #define _GNU_SOURCE #include #include #include #include #include #include int main(int argc, char *argv[]) { int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644); void *buf; int i = 0; do { buf = memalign(512, 4096); <--- replace 512 with 4096 if ((unsigned long)buf & 4095) break; i++; } while (1); printf("%d\n", i); memset(buf, 0x11, 4096); for (i = 0; i < 10; i++) write(fd, buf, 4096); close(fd); return 0; } time for in in `seq 1 30` ; do a.out aa ; done The file was placed into 8 GB partition on HDD below to avoid speed change due to different offset on disk. Results are reliable: - 189 vs 180 seconds on Linux 3.16 Changes from v1: - enforces 4096 alignment in qemu_(try_)blockalign, avoid touching of bdrv_qiov_is_aligned path not to enforce additional bounce buffering as suggested by Paolo - reduces 10% to 5% in patch description to better fit 180 vs 189 difference Signed-off-by: Denis V. Lunev CC: Paolo Bonzini CC: Kevin Wolf CC: Stefan Hajnoczi hades ~/src/qemu # hdparm -I /dev/sdg /dev/sdg: ATA device, with non-removable media Model Number: WDC WD20EZRX-07D8PB0 Serial Number: WD-WCC4M5LVSAEP Firmware Revision: 80.00A80 Transport: Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0 Standards: Supported: 9 8 7 6 5 Likely used: 9 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBAuser addressable sectors: 268435455 LBA48 user addressable sectors: 3907029168 Logical Sector size: 512 bytes Physical Sector size: 4096 bytes device size with M = 1024*1024: 1907729 MBytes device size with M = 1000*1000: 2000398 MBytes (2000 GB) cache/buffer size = unknown Nominal Media Rotation Rate: 5400 Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Standard, with device specific minimum R/W multiple sector transfer: Max = 16 Current = 16 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns Commands/features: Enabled Supported: *SMART feature set Security Mode feature set *Power Management feature set *Write cache *Look-ahead *Host Protected Area feature set *WRITE_BUFFER command *READ_BUFFER command *NOP cmd *DOWNLOAD_MICROCODE Power-Up In Standby feature set *SET_FEATURES required to spinup after power up SET_MAX security extension *48-bit Address feature set *Device Configuration Overlay feature set *Mandatory FLUSH_CACHE *FLUSH_CACHE_EXT *SMART error logging *SMART self-test *General Purpose Logging feature set *64-bit World wide name *WRITE_UNCORRECTABLE_EXT command *{READ,WRITE}_DMA_EXT_GPL commands *Segmented DOWNLOAD_MICROCODE *Gen1 signaling speed (1.5Gb/s) *Gen2 signaling speed (3.0Gb/s) *Gen3 signaling speed (6.0Gb/s) *Native Command Queueing (NCQ) *Host-initiated interface power management *Phy event counters *NCQ priority information *READ_LOG_DMA_EXT equivalent to READ_LOG_EXT *DMA Setup Auto-Activate optimization Device-initiated interface power management *Software settings preservation *SMART Command Transport (SCT) feature set *SCT Write Same (AC2) *SCT Features Control (AC4) *SCT Data Tables (AC5) unknown 206[12] (vendor specific) unknown 206[13] (vendor specific) unknown 206[14] (vendor specific) Security: Master password revision code = 65534 supported not enabled not locked frozen not expired: security count supported: enhanced erase 276min for SECURITY ERASE UNIT. 276min for ENHANCED SECURITY ERASE UNIT. Logical Unit WWN Device Identifier: 50014ee2b5da838c NAA : 5 IEEE OUI: 0014ee Unique ID : 2b5da838c Checksum: correct hades ~/src/qemu #
[Qemu-devel] [PATCH 1/1] block: enforce minimal 4096 alignment in qemu_blockalign
The following sequence int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644); for (i = 0; i < 10; i++) write(fd, buf, 4096); performs 5% better if buf is aligned to 4096 bytes rather then to 512 bytes on HDD with 512/4096 logical/physical sector size. The difference is quite reliable. On the other hand we do not want at the moment to enforce bounce buffering if guest request is aligned to 512 bytes. This patch forces page alignment when we really forced to perform memory allocation. Signed-off-by: Denis V. Lunev CC: Paolo Bonzini CC: Kevin Wolf CC: Stefan Hajnoczi --- block.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/block.c b/block.c index d45e4dd..38cf73f 100644 --- a/block.c +++ b/block.c @@ -5293,7 +5293,11 @@ void bdrv_set_guest_block_size(BlockDriverState *bs, int align) void *qemu_blockalign(BlockDriverState *bs, size_t size) { -return qemu_memalign(bdrv_opt_mem_align(bs), size); +size_t align = bdrv_opt_mem_align(bs); +if (align < 4096) { +align = 4096; +} +return qemu_memalign(align, size); } void *qemu_blockalign0(BlockDriverState *bs, size_t size) @@ -5307,6 +5311,9 @@ void *qemu_try_blockalign(BlockDriverState *bs, size_t size) /* Ensure that NULL is never returned on success */ assert(align > 0); +if (align < 4096) { +align = 4096; +} if (size == 0) { size = align; } -- 1.9.1
Re: [Qemu-devel] [PATCH 1/1] block: enforce minimal 4096 alignment in qemu_blockalign
On 29/01/15 16:18, Kevin Wolf wrote: Am 29.01.2015 um 11:50 hat Denis V. Lunev geschrieben: The following sequence int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644); for (i = 0; i < 10; i++) write(fd, buf, 4096); performs 5% better if buf is aligned to 4096 bytes rather then to 512 bytes on HDD with 512/4096 logical/physical sector size. The difference is quite reliable. On the other hand we do not want at the moment to enforce bounce buffering if guest request is aligned to 512 bytes. This patch forces page alignment when we really forced to perform memory allocation. Signed-off-by: Denis V. Lunev CC: Paolo Bonzini CC: Kevin Wolf CC: Stefan Hajnoczi --- block.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/block.c b/block.c index d45e4dd..38cf73f 100644 --- a/block.c +++ b/block.c @@ -5293,7 +5293,11 @@ void bdrv_set_guest_block_size(BlockDriverState *bs, int align) void *qemu_blockalign(BlockDriverState *bs, size_t size) { -return qemu_memalign(bdrv_opt_mem_align(bs), size); +size_t align = bdrv_opt_mem_align(bs); +if (align < 4096) { +align = 4096; +} +return qemu_memalign(align, size); } void *qemu_blockalign0(BlockDriverState *bs, size_t size) @@ -5307,6 +5311,9 @@ void *qemu_try_blockalign(BlockDriverState *bs, size_t size) /* Ensure that NULL is never returned on success */ assert(align > 0); +if (align < 4096) { +align = 4096; +} if (size == 0) { size = align; } This is the wrong place to make this change. First you're duplicating logic in the callers of bdrv_opt_mem_align() instead of making it return the right thing in the first place. This has been actually done in the first iteration. bdrv_opt_mem_align is called actually three times in: qemu_blockalign qemu_try_blockalign bdrv_qiov_is_aligned Paolo says that he does not want to have bdrv_qiov_is_aligned affected to avoid extra bounce buffering. From my point of view this extra bounce buffering is better than unaligned pointer during write to the disk as 512/4096 logical/physical sectors size disks are mainstream now. Though I don't want to specially argue here. Normal guest operations results in page aligned requests and this is not a problem at all. The amount of 512 aligned requests from guest side is quite negligible. Second, you're arguing with numbers from a simple test case for O_DIRECT on Linux, but you're changing the alignment for everyone instead of just the raw-posix driver which is responsible for accessing Linux files. This should not be a real problem. We are allocation memory for the buffer. A little bit stricter alignment is not a big overhead for any libc implementation thus this kludge will not produce any significant overhead. Also, what's the real reason for the performance improvement? Having page alignment? If so, actually querying the page size instead of assuming 4k might be worth a thought. Kevin Most likely the problem comes from the read-modify-write pattern either in kernel or in disk. Actually my experience says that it is a bad idea to supply 512 byte aligned buffer for O_DIRECT IO. ABI technically allows this but in general it is much less tested. Yes, this synthetic test shows some difference here. In terms of qemu-io the result is also visible, but less qemu-img create -f qcow2 ./1.img 64G qemu-io -n -c 'write -P 0xaa 0 1G' 1.img performs 1% better. There is also similar kludge here size_t bdrv_opt_mem_align(BlockDriverState *bs) { if (!bs || !bs->drv) { /* 4k should be on the safe side */ return 4096; } return bs->bl.opt_mem_alignment; } which just uses 4096 constant. Yes, I could agree that queering page size could be a good idea, but I do not know at the moment how to do that. Can you pls share your opinion if you have any. Regards, Den
[Qemu-devel] [RESEND PATCH 1/2] balloon: call qdev_alias_all_properties for proxy dev in balloon class init
The idea is that all other virtio devices are calling this helper to merge properties of the proxy device. This is the only difference in between this helper and code in inside virtio_instance_init_common. The patch should not cause any harm as property list in generic balloon code is empty. This also allows to avoid some dummy errors like fixed by this commit 91ba21208839643603e7f7fa5864723c3f371ebe Author: Gonglei Date: Tue Sep 30 14:10:35 2014 +0800 virtio-balloon: fix virtio-balloon child refcount in transports Signed-off-by: Denis V. Lunev Signed-off-by: Raushaniya Maksudova Revieved-by: Cornelia Huck CC: Christian Borntraeger CC: Anthony Liguori CC: Michael S. Tsirkin --- hw/s390x/virtio-ccw.c | 5 ++--- hw/virtio/virtio-pci.c | 5 ++--- 2 files changed, 4 insertions(+), 6 deletions(-) diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c index ea236c9..82da894 100644 --- a/hw/s390x/virtio-ccw.c +++ b/hw/s390x/virtio-ccw.c @@ -899,9 +899,8 @@ static void balloon_ccw_stats_set_poll_interval(Object *obj, struct Visitor *v, static void virtio_ccw_balloon_instance_init(Object *obj) { VirtIOBalloonCcw *dev = VIRTIO_BALLOON_CCW(obj); -object_initialize(&dev->vdev, sizeof(dev->vdev), TYPE_VIRTIO_BALLOON); -object_property_add_child(obj, "virtio-backend", OBJECT(&dev->vdev), NULL); -object_unref(OBJECT(&dev->vdev)); +virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev), +TYPE_VIRTIO_BALLOON); object_property_add(obj, "guest-stats", "guest statistics", balloon_ccw_stats_get_all, NULL, NULL, dev, NULL); diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c index dde1d73..745324b 100644 --- a/hw/virtio/virtio-pci.c +++ b/hw/virtio/virtio-pci.c @@ -1316,9 +1316,8 @@ static void virtio_balloon_pci_class_init(ObjectClass *klass, void *data) static void virtio_balloon_pci_instance_init(Object *obj) { VirtIOBalloonPCI *dev = VIRTIO_BALLOON_PCI(obj); -object_initialize(&dev->vdev, sizeof(dev->vdev), TYPE_VIRTIO_BALLOON); -object_property_add_child(obj, "virtio-backend", OBJECT(&dev->vdev), NULL); -object_unref(OBJECT(&dev->vdev)); +virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev), +TYPE_VIRTIO_BALLOON); object_property_add(obj, "guest-stats", "guest statistics", balloon_pci_stats_get_all, NULL, NULL, dev, NULL); -- 1.9.1
[Qemu-devel] [RESEND PATCH v3 0/2] balloon: add a feature bit to let Guest OS deflate virtio_balloon on OOM
Excessive virtio_balloon inflation can cause invocation of OOM-killer, when Linux is under severe memory pressure. Various mechanisms are responsible for correct virtio_balloon memory management. Nevertheless it is often the case that these control tools does not have enough time to react on fast changing memory load. As a result OS runs out of memory and invokes OOM-killer. The balancing of memory by use of the virtio balloon should not cause the termination of processes while there are pages in the balloon. Now there is no way for virtio balloon driver to free memory at the last moment before some process get killed by OOM-killer. This does not provide a security breach as balloon itself is running inside Guest OS and is working in the cooperation with the host. Thus some improvements from Guest side should be considered as normal. To solve the problem, introduce a virtio_balloon callback which is expected to be called from the oom notifier call chain in out_of_memory() function. If virtio balloon could release some memory, it will make the system to return and retry the allocation that forced the out of memory killer to run. This behavior should be enabled if and only if appropriate feature bit is set on the device. It is off by default. This functionality was recently merged into vanilla Linux (actually in linux-next at the moment) commit 5a10b7dbf904bfe01bb9fcc6298f7df09eed77d5 Author: Raushaniya Maksudova Date: Mon Nov 10 09:36:29 2014 +1030 This patch adds respective control bits into QEMU. It introduces deflate-on-oom option for baloon device which do the trick. Changes from v2: - fixed mistake with bit number in virtio_balloon_get_features Changes from v1: - From: in patch 1 according to the original ownership - feature processing in patch 2 as suggested by Michael. It could be done without additional field, but this will require to move the property level up, i.e. to PCI & CCW level. Signed-off-by: Raushaniya Maksudova Signed-off-by: Denis V. Lunev CC: Anthony Liguori CC: Michael S. Tsirkin
[Qemu-devel] [RESEND PATCH 2/2] balloon: add a feature bit to let Guest OS deflate balloon on oom
Excessive virtio_balloon inflation can cause invocation of OOM-killer, when Linux is under severe memory pressure. Various mechanisms are responsible for correct virtio_balloon memory management. Nevertheless it is often the case that these control tools does not have enough time to react on fast changing memory load. As a result OS runs out of memory and invokes OOM-killer. The balancing of memory by use of the virtio balloon should not cause the termination of processes while there are pages in the balloon. Now there is no way for virtio balloon driver to free memory at the last moment before some process get killed by OOM-killer. This does not provide a security breach as balloon itself is running inside Guest OS and is working in the cooperation with the host. Thus some improvements from Guest side should be considered as normal. To solve the problem, introduce a virtio_balloon callback which is expected to be called from the oom notifier call chain in out_of_memory() function. If virtio balloon could release some memory, it will make the system to return and retry the allocation that forced the out of memory killer to run. This behavior should be enabled if and only if appropriate feature bit is set on the device. It is off by default. This functionality was recently merged into vanilla Linux. commit 5a10b7dbf904bfe01bb9fcc6298f7df09eed77d5 Author: Raushaniya Maksudova Date: Mon Nov 10 09:36:29 2014 +1030 This patch adds respective control bits into QEMU. It introduces deflate-on-oom option for baloon device which do the trick. Signed-off-by: Denis V. Lunev CC: Raushaniya Maksudova CC: Anthony Liguori CC: Michael S. Tsirkin --- hw/virtio/virtio-balloon.c | 6 -- include/hw/virtio/virtio-balloon.h | 2 ++ 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c index 7bfbb75..a54d026 100644 --- a/hw/virtio/virtio-balloon.c +++ b/hw/virtio/virtio-balloon.c @@ -305,8 +305,8 @@ static void virtio_balloon_set_config(VirtIODevice *vdev, static uint32_t virtio_balloon_get_features(VirtIODevice *vdev, uint32_t f) { -f |= (1 << VIRTIO_BALLOON_F_STATS_VQ); -return f; +VirtIOBalloon *dev = VIRTIO_BALLOON(vdev); +return f | (1u << VIRTIO_BALLOON_F_STATS_VQ) | dev->host_features; } static void virtio_balloon_stat(void *opaque, BalloonInfo *info) @@ -409,6 +409,8 @@ static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp) } static Property virtio_balloon_properties[] = { +DEFINE_PROP_BIT("deflate-on-oom", VirtIOBalloon, host_features, +VIRTIO_BALLOON_F_DEFLATE_ON_OOM, false), DEFINE_PROP_END_OF_LIST(), }; diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h index f863bfe..2e1ccd9 100644 --- a/include/hw/virtio/virtio-balloon.h +++ b/include/hw/virtio/virtio-balloon.h @@ -30,6 +30,7 @@ /* The feature bitmap for virtio balloon */ #define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */ #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory stats virtqueue */ +#define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */ /* Size of a PFN in the balloon interface. */ #define VIRTIO_BALLOON_PFN_SHIFT 12 @@ -67,6 +68,7 @@ typedef struct VirtIOBalloon { QEMUTimer *stats_timer; int64_t stats_last_update; int64_t stats_poll_interval; +uint32_t host_features; } VirtIOBalloon; #endif -- 1.9.1
Re: [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes
On 30/01/15 01:50, Max Reitz wrote: On 2015-01-28 at 13:38, Denis V. Lunev wrote: There is a possibility that we are extending our image and thus writing zeroes beyond the end of the file. In this case we do not need to care about the hole to make sure that there is no data in the file under this offset (pre-condition to fallocate(0) to work). We could simply call fallocate(0). This improves the performance of writing zeroes even on really old platforms which do not have even FALLOC_FL_PUNCH_HOLE. Before the patch do_fallocate was used when either CONFIG_FALLOCATE_PUNCH_HOLE or CONFIG_FALLOCATE_ZERO_RANGE are defined. Now the story is different. CONFIG_FALLOCATE is defined when Linux fallocate is defined, posix_fallocate is completely different story (CONFIG_POSIX_FALLOCATE). CONFIG_FALLOCATE is mandatory prerequite for both CONFIG_FALLOCATE_PUNCH_HOLE and CONFIG_FALLOCATE_ZERO_RANGE thus we are on the safe side. Signed-off-by: Denis V. Lunev CC: Max Reitz CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index 2e24829..3db911a 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -147,6 +147,7 @@ typedef struct BDRVRawState { bool has_discard:1; bool has_write_zeroes:1; bool discard_zeroes:1; +bool has_fallocate; bool needs_alignment; } BDRVRawState; @@ -452,6 +453,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options, } if (S_ISREG(st.st_mode)) { s->discard_zeroes = true; +s->has_fallocate = true; } if (S_ISBLK(st.st_mode)) { #ifdef BLKDISCARDZEROES @@ -902,7 +904,7 @@ static int translate_err(int err) return err; } -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) +#ifdef CONFIG_FALLOCATE static int do_fallocate(int fd, int mode, off_t offset, off_t len) { do { @@ -980,6 +982,16 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) } #endif +#ifdef CONFIG_FALLOCATE +if (s->has_fallocate && aiocb->aio_offset >= bdrv_getlength(aiocb->bs)) { +int ret = do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes); +if (ret == 0 || ret != -ENOTSUP) { +return ret; +} +s->has_fallocate = false; +} +#endif + return -ENOTSUP; } Now that you do have has_fallocate, I think you should be using it in patch 5 as well. So I think you should either you make this patch add it in the area touched by patch 5, or you introduce has_fallocate in patch 5 already and use it there. Max OK. No problem. I do not think that it is ever possible, but why not? I will reorder these patches in the patchset to minimize changes.
[Qemu-devel] [PATCH 6/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes
This sequence works efficiently if FALLOC_FL_ZERO_RANGE is not supported. Unfortunately, FALLOC_FL_ZERO_RANGE is supported on really modern systems and only for a couple of filesystems. FALLOC_FL_PUNCH_HOLE is much more mature. The sequence of 2 operations FALLOC_FL_PUNCH_HOLE and 0 is necessary due to the following reasons: - FALLOC_FL_PUNCH_HOLE creates a hole in the file, the file becomes sparse. In order to retain original functionality we must allocate disk space afterwards. This is done using fallocate(0) call - fallocate(0) without preceeding FALLOC_FL_PUNCH_HOLE will do nothing if called above already allocated areas of the file, i.e. the content will not be zeroed This should increase the performance a bit for not-so-modern kernels. Signed-off-by: Denis V. Lunev CC: Max Reitz CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/block/raw-posix.c b/block/raw-posix.c index 1c88ad8..7b42f37 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -967,6 +967,25 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) } #endif +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE +if (s->has_discard && s->has_fallocate) { +int ret = do_fallocate(s->fd, + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + aiocb->aio_offset, aiocb->aio_nbytes); +if (ret == 0) { +ret = do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes); +if (ret == 0 || ret != -ENOTSUP) { +return ret; +} +s->has_fallocate = false; +} else if (ret != -ENOTSUP) { +return ret; +} else { +s->has_discard = false; +} +} +#endif + #ifdef CONFIG_FALLOCATE if (s->has_fallocate && aiocb->aio_offset >= bdrv_getlength(aiocb->bs)) { int ret = do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes); -- 1.9.1
[Qemu-devel] [PATCH 4/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes
This efficiently writes zeroes on Linux if the kernel is capable enough. FALLOC_FL_ZERO_RANGE correctly handles all cases, including and not including file expansion. Signed-off-by: Denis V. Lunev Reviewed-by: Max Reitz CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 15 +-- configure | 19 +++ 2 files changed, 32 insertions(+), 2 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index d3910fd..5a777e7 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -60,7 +60,7 @@ #define FS_NOCOW_FL 0x0080 /* Do not cow file */ #endif #endif -#ifdef CONFIG_FALLOCATE_PUNCH_HOLE +#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) #include #endif #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__) @@ -902,7 +902,7 @@ static int translate_err(int err) return err; } -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) +#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) static int do_fallocate(int fd, int mode, off_t offset, off_t len) { do { @@ -954,6 +954,17 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) } #endif +#ifdef CONFIG_FALLOCATE_ZERO_RANGE +if (s->has_write_zeroes) { +int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE, + aiocb->aio_offset, aiocb->aio_nbytes); +if (ret == 0 || ret != -ENOTSUP) { +return ret; +} +s->has_write_zeroes = false; +} +#endif + return -ENOTSUP; } diff --git a/configure b/configure index f185dd0..e00e03a 100755 --- a/configure +++ b/configure @@ -3335,6 +3335,22 @@ if compile_prog "" "" ; then fallocate_punch_hole=yes fi +# check that fallocate supports range zeroing inside the file +fallocate_zero_range=no +cat > $TMPC << EOF +#include +#include + +int main(void) +{ +fallocate(0, FALLOC_FL_ZERO_RANGE, 0, 0); +return 0; +} +EOF +if compile_prog "" "" ; then + fallocate_zero_range=yes +fi + # check for posix_fallocate posix_fallocate=no cat > $TMPC << EOF @@ -4567,6 +4583,9 @@ fi if test "$fallocate_punch_hole" = "yes" ; then echo "CONFIG_FALLOCATE_PUNCH_HOLE=y" >> $config_host_mak fi +if test "$fallocate_zero_range" = "yes" ; then + echo "CONFIG_FALLOCATE_ZERO_RANGE=y" >> $config_host_mak +fi if test "$posix_fallocate" = "yes" ; then echo "CONFIG_POSIX_FALLOCATE=y" >> $config_host_mak fi -- 1.9.1
[Qemu-devel] [PATCH 2/7] block/raw-posix: create do_fallocate helper
The pattern do { if (fallocate(s->fd, mode, offset, len) == 0) { return 0; } } while (errno == EINTR); ret = translate_err(-errno); will be commonly useful in next patches. Create helper for it. Signed-off-by: Denis V. Lunev Reviewed-by: Max Reitz CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 22 ++ 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index 24300d0..2aa268a 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -902,6 +902,18 @@ static int translate_err(int err) return err; } +#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) +static int do_fallocate(int fd, int mode, off_t offset, off_t len) +{ +do { +if (fallocate(fd, mode, offset, len) == 0) { +return 0; +} +} while (errno == EINTR); +return translate_err(-errno); +} +#endif + static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) { int ret = -EOPNOTSUPP; @@ -965,14 +977,8 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb) #endif #ifdef CONFIG_FALLOCATE_PUNCH_HOLE -do { -if (fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, - aiocb->aio_offset, aiocb->aio_nbytes) == 0) { -return 0; -} -} while (errno == EINTR); - -ret = -errno; +ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + aiocb->aio_offset, aiocb->aio_nbytes); #endif } -- 1.9.1
[Qemu-devel] [PATCH 5/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes
There is a possibility that we are extending our image and thus writing zeroes beyond the end of the file. In this case we do not need to care about the hole to make sure that there is no data in the file under this offset (pre-condition to fallocate(0) to work). We could simply call fallocate(0). This improves the performance of writing zeroes even on really old platforms which do not have even FALLOC_FL_PUNCH_HOLE. Before the patch do_fallocate was used when either CONFIG_FALLOCATE_PUNCH_HOLE or CONFIG_FALLOCATE_ZERO_RANGE are defined. Now the story is different. CONFIG_FALLOCATE is defined when Linux fallocate is defined, posix_fallocate is completely different story (CONFIG_POSIX_FALLOCATE). CONFIG_FALLOCATE is mandatory prerequite for both CONFIG_FALLOCATE_PUNCH_HOLE and CONFIG_FALLOCATE_ZERO_RANGE thus we are on the safe side. Signed-off-by: Denis V. Lunev CC: Max Reitz CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index 5a777e7..1c88ad8 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -147,6 +147,7 @@ typedef struct BDRVRawState { bool has_discard:1; bool has_write_zeroes:1; bool discard_zeroes:1; +bool has_fallocate; bool needs_alignment; } BDRVRawState; @@ -452,6 +453,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options, } if (S_ISREG(st.st_mode)) { s->discard_zeroes = true; +s->has_fallocate = true; } if (S_ISBLK(st.st_mode)) { #ifdef BLKDISCARDZEROES @@ -902,7 +904,7 @@ static int translate_err(int err) return err; } -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) +#ifdef CONFIG_FALLOCATE static int do_fallocate(int fd, int mode, off_t offset, off_t len) { do { @@ -965,6 +967,16 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) } #endif +#ifdef CONFIG_FALLOCATE +if (s->has_fallocate && aiocb->aio_offset >= bdrv_getlength(aiocb->bs)) { +int ret = do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes); +if (ret == 0 || ret != -ENOTSUP) { +return ret; +} +s->has_fallocate = false; +} +#endif + return -ENOTSUP; } -- 1.9.1
[Qemu-devel] [PATCH v6 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c
I have performed several tests with non-aligned fallocate calls and in all cases (with non-aligned fallocates) Linux performs fine, i.e. areas are zeroed correctly. Checks were made on Linux 3.16.0-28-generic #38-Ubuntu SMP This should seriously increase performance of bdrv_write_zeroes Changes from v5: - changed order between patches 5/6 - check has_fallocate in FALLOC_FL_PUNCH_HOLE branch of the code - minor comment tweaks as pointed out by Max Reitz Changes from v4: - comments to patches are improved by Max Reitz suggestions - replaced 'return ret' with 'return -ENOTSUP' handle_aiocb_write_zeroes The idea is that we can reach that point only if original ret was equal to -ENOTSUP - added has_fallocate flag to indicate that fallocate is working for given BDS - checked file length with bdrv_getlength Changes from v3: - dropped original patch 1, equivalent stuff was merged already - reordered patches as suggested by Fam - fixes spelling errors as suggested by Fam - fixed not initialized value in handle_aiocb_write_zeroes - fixed wrong error processing from do_fallocate(FALLOC_FL_ZERO_RANGE) Changes from v2: - added Peter Lieven to CC - added CONFIG_FALLOCATE check to call do_fallocate in patch 7 - dropped patch 1 as NACK-ed - added processing of very large data areas in bdrv_co_write_zeroes (new patch 1) - set bl.max_write_zeroes to INT_MAX in raw-posix.c for regular files (new patch 8) Signed-off-by: Denis V. Lunev CC: Max Reitz CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng
[Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files
fallocate() works fine and could handle properly with arbitrary size requests. There is no sense to reduce the amount of space to fallocate. The bigger is the size, the better is the performance as the amount of journal updates is reduced. The patch changes behavior for both generic filesystem and XFS codepaths, which are different in handle_aiocb_write_zeroes. The implementation of fallocate and xfsctl(XFS_IOC_ZERO_RANGE) for XFS are exactly the same thus the change is fine for both ways. Signed-off-by: Denis V. Lunev Reviewed-by: Max Reitz CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 17 + 1 file changed, 17 insertions(+) diff --git a/block/raw-posix.c b/block/raw-posix.c index 7b42f37..933c778 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -293,6 +293,20 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) } } +static void raw_probe_max_write_zeroes(BlockDriverState *bs) +{ +BDRVRawState *s = bs->opaque; +struct stat st; + +if (fstat(s->fd, &st) < 0) { +return; /* no problem, keep default value */ +} +if (!S_ISREG(st.st_mode) || !s->discard_zeroes) { +return; +} +bs->bl.max_write_zeroes = INT_MAX; +} + static void raw_parse_flags(int bdrv_flags, int *open_flags) { assert(open_flags != NULL); @@ -600,6 +614,7 @@ static int raw_reopen_prepare(BDRVReopenState *state, /* Fail already reopen_prepare() if we can't get a working O_DIRECT * alignment with the new fd. */ if (raw_s->fd != -1) { +raw_probe_max_write_zeroes(state->bs); raw_probe_alignment(state->bs, raw_s->fd, &local_err); if (local_err) { qemu_close(raw_s->fd); @@ -653,6 +668,8 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp) raw_probe_alignment(bs, s->fd, errp); bs->bl.opt_mem_alignment = s->buf_align; + +raw_probe_max_write_zeroes(bs); } static ssize_t handle_aiocb_ioctl(RawPosixAIOData *aiocb) -- 1.9.1
[Qemu-devel] [PATCH 1/7] block/raw-posix: create translate_err helper to merge errno values
actually the code if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP || ret == -ENOTTY) { ret = -ENOTSUP; } is present twice and will be added a couple more times. Create helper for this. Signed-off-by: Denis V. Lunev Reviewed-by: Max Reitz CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index e51293a..24300d0 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -893,6 +893,15 @@ static int xfs_discard(BDRVRawState *s, int64_t offset, uint64_t bytes) } #endif +static int translate_err(int err) +{ +if (err == -ENODEV || err == -ENOSYS || err == -EOPNOTSUPP || +err == -ENOTTY) { +err = -ENOTSUP; +} +return err; +} + static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) { int ret = -EOPNOTSUPP; @@ -921,10 +930,9 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) #endif } -if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP || -ret == -ENOTTY) { +ret = translate_err(ret); +if (ret == -ENOTSUP) { s->has_write_zeroes = false; -ret = -ENOTSUP; } return ret; } @@ -968,10 +976,9 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb) #endif } -if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP || -ret == -ENOTTY) { +ret = translate_err(ret); +if (ret == -ENOTSUP) { s->has_discard = false; -ret = -ENOTSUP; } return ret; } -- 1.9.1
[Qemu-devel] [PATCH 3/7] block/raw-posix: refactor handle_aiocb_write_zeroes a bit
move code dealing with a block device to a separate function. This will allow to implement additional processing for ordinary files. Please note, that xfs_code has been moved before checking for s->has_write_zeroes as xfs_write_zeroes does not touch this flag inside. This makes code a bit more consistent. Signed-off-by: Denis V. Lunev Reviewed-by: Max Reitz CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 46 +++--- 1 file changed, 27 insertions(+), 19 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index 2aa268a..d3910fd 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -914,41 +914,49 @@ static int do_fallocate(int fd, int mode, off_t offset, off_t len) } #endif -static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) +static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb) { -int ret = -EOPNOTSUPP; +int ret = -ENOTSUP; BDRVRawState *s = aiocb->bs->opaque; -if (s->has_write_zeroes == 0) { +if (!s->has_write_zeroes) { return -ENOTSUP; } -if (aiocb->aio_type & QEMU_AIO_BLKDEV) { #ifdef BLKZEROOUT -do { -uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes }; -if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) { -return 0; -} -} while (errno == EINTR); - -ret = -errno; -#endif -} else { -#ifdef CONFIG_XFS -if (s->is_xfs) { -return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes); +do { +uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes }; +if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) { +return 0; } +} while (errno == EINTR); + +ret = translate_err(-errno); #endif -} -ret = translate_err(ret); if (ret == -ENOTSUP) { s->has_write_zeroes = false; } return ret; } +static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) +{ +BDRVRawState *s = aiocb->bs->opaque; + +if (aiocb->aio_type & QEMU_AIO_BLKDEV) { +return handle_aiocb_write_zeroes_block(aiocb); +} + +#ifdef CONFIG_XFS +if (s->is_xfs) { +return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes); +} +#endif + +return -ENOTSUP; +} + static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb) { int ret = -EOPNOTSUPP; -- 1.9.1
Re: [Qemu-devel] [PATCH 2/7] block/raw-posix: create do_fallocate helper
On 30/01/15 11:47, Peter Lieven wrote: Am 30.01.2015 um 09:42 schrieb Denis V. Lunev: The pattern do { if (fallocate(s->fd, mode, offset, len) == 0) { return 0; } } while (errno == EINTR); ret = translate_err(-errno); will be commonly useful in next patches. Create helper for it. Signed-off-by: Denis V. Lunev Reviewed-by: Max Reitz CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 22 ++ 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index 24300d0..2aa268a 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -902,6 +902,18 @@ static int translate_err(int err) return err; } +#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) +static int do_fallocate(int fd, int mode, off_t offset, off_t len) +{ +do { +if (fallocate(fd, mode, offset, len) == 0) { +return 0; +} +} while (errno == EINTR); +return translate_err(-errno); +} +#endif + static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) { int ret = -EOPNOTSUPP; @@ -965,14 +977,8 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb) #endif #ifdef CONFIG_FALLOCATE_PUNCH_HOLE -do { -if (fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, - aiocb->aio_offset, aiocb->aio_nbytes) == 0) { -return 0; -} -} while (errno == EINTR); - -ret = -errno; +ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + aiocb->aio_offset, aiocb->aio_nbytes); You also introduce the conversion via translate_err here. Is that ok? Peter yes, this is redundant but harmless
Re: [Qemu-devel] [PATCH 5/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes
On 30/01/15 17:58, Max Reitz wrote: On 2015-01-30 at 03:42, Denis V. Lunev wrote: There is a possibility that we are extending our image and thus writing zeroes beyond the end of the file. In this case we do not need to care about the hole to make sure that there is no data in the file under this offset (pre-condition to fallocate(0) to work). We could simply call fallocate(0). This improves the performance of writing zeroes even on really old platforms which do not have even FALLOC_FL_PUNCH_HOLE. Before the patch do_fallocate was used when either CONFIG_FALLOCATE_PUNCH_HOLE or CONFIG_FALLOCATE_ZERO_RANGE are defined. Now the story is different. CONFIG_FALLOCATE is defined when Linux fallocate is defined, posix_fallocate is completely different story (CONFIG_POSIX_FALLOCATE). CONFIG_FALLOCATE is mandatory prerequite for both CONFIG_FALLOCATE_PUNCH_HOLE and CONFIG_FALLOCATE_ZERO_RANGE thus we are on the safe side. Signed-off-by: Denis V. Lunev CC: Max Reitz CC: Kevin Wolf CC: Stefan Hajnoczi CC: Peter Lieven CC: Fam Zheng --- block/raw-posix.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index 5a777e7..1c88ad8 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -147,6 +147,7 @@ typedef struct BDRVRawState { bool has_discard:1; bool has_write_zeroes:1; bool discard_zeroes:1; +bool has_fallocate; bool needs_alignment; } BDRVRawState; @@ -452,6 +453,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options, } if (S_ISREG(st.st_mode)) { s->discard_zeroes = true; +s->has_fallocate = true; This could be moved upwards where has_discard and has_write_zeroes are initialized; but it won't matter in practice, I hope. Thus: Reviewed-by: Max Reitz This does matter as has_discard and has_write_zeroes are bit fields thus I can not insert something useful into the middle of those fields.