[Qemu-devel] [Bug 1490611] Re: Using qemu >=2.2.1 to convert raw->VHD (fixed) adds extra padding to the result file, which Microsoft Azure rejects as invalid
I'm using this version on xenial:

  andy@bastion:~/temp$ qemu-img -h
  qemu-img version 2.5.0 (Debian 1:2.5+dfsg-5ubuntu10.31), Copyright (c) 2004-2008 Fabrice Bellard

  qemu-img convert -f raw -O vpc -o subformat=fixed,force_size /tmp/azure_config_disk_image20180901-22672-16zxelu papapa2.vhd

Unfortunately the papapa2.vhd size is 25166336 != 25165824, which means it is not MiB-aligned. Could you please help?

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1490611

Title:
  Using qemu >=2.2.1 to convert raw->VHD (fixed) adds extra padding to the result file, which Microsoft Azure rejects as invalid

Status in QEMU: Fix Released
Status in qemu package in Ubuntu: Fix Released
Status in qemu source package in Xenial: Fix Released

Bug description:

  [Impact]

  * Starting with a raw disk image, using "qemu-img convert" to convert from raw to VHD results in the output VHD file's virtual size being aligned to the nearest 516096 bytes (16 heads x 63 sectors per head x 512 bytes per sector), instead of preserving the input file's size as the output VHD's virtual disk size.

  * Microsoft Azure requires that disk images (VHDs) submitted for upload have virtual sizes aligned to a megabyte boundary. (E.g. 4096MB, 4097MB, 4098MB, etc. are OK; 4096.5MB is rejected with an error.) This is reflected in Microsoft's documentation: https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-create-upload-vhd-generic/

  * The fix for this bug is a backport from upstream: http://git.qemu.org/?p=qemu.git;a=commitdiff;h=fb9245c2610932d33ce14

  [Test Case]

  * This is reproducible with the following set of commands (including the Azure command line tools from https://github.com/Azure/azure-xplat-cli). For the following example, I used qemu version 2.2.1:

    $ dd if=/dev/zero of=source-disk.img bs=1M count=4096

    $ stat source-disk.img
      File: ‘source-disk.img’
      Size: 4294967296   Blocks: 798656   IO Block: 4096   regular file
      Device: fc01h/64513d   Inode: 13247963   Links: 1
      Access: (0644/-rw-r--r--)  Uid: ( 1000/ smkent)   Gid: ( 1000/ smkent)
      Access: 2015-08-18 09:48:02.613988480 -0700
      Modify: 2015-08-18 09:48:02.825985646 -0700
      Change: 2015-08-18 09:48:02.825985646 -0700
      Birth: -

    $ qemu-img convert -f raw -o subformat=fixed -O vpc source-disk.img dest-disk.vhd

    $ stat dest-disk.vhd
      File: ‘dest-disk.vhd’
      Size: 4296499712   Blocks: 535216   IO Block: 4096   regular file
      Device: fc01h/64513d   Inode: 13247964   Links: 1
      Access: (0644/-rw-r--r--)  Uid: ( 1000/ smkent)   Gid: ( 1000/ smkent)
      Access: 2015-08-18 09:50:22.252077624 -0700
      Modify: 2015-08-18 09:49:24.424868868 -0700
      Change: 2015-08-18 09:49:24.424868868 -0700
      Birth: -

    $ azure vm image create testimage1 dest-disk.vhd -o linux -l "West US"
      info: Executing command vm image create
      + Retrieving storage accounts
      info: VHD size : 4097 MB
      info: Uploading 4195800.5 KB
      Requested:100.0% Completed:100.0% Running: 0 Time: 1m 0s Speed: 6744 KB/s
      info: https://[redacted].blob.core.windows.net/vm-images/dest-disk.vhd was uploaded successfully
      error: The VHD https://[redacted].blob.core.windows.net/vm-images/dest-disk.vhd has an unsupported virtual size of 4296499200 bytes. The size must be a whole number (in MBs).
      info: Error information has been recorded to /home/smkent/.azure/azure.err
      error: vm image create command failed

  * A fixed qemu-img will not result in an error during Azure image creation. It will require passing -o force_size, which will leverage the backported functionality.

  [Regression Potential]

  * The upstream fix introduces a qemu-img option (-o force_size) which is unset by default. The regression potential is very low as a result.

  ...

  I also ran the above commands using qemu 2.4.0, which resulted in the same error, as the conversion behavior is the same. However, qemu 2.1.1 and earlier (including qemu 2.0.0 installed by Ubuntu 14.04) does not pad the virtual disk size during conversion. Using qemu-img convert from qemu versions <=2.1.1 results in a VHD that is exactly the size of the raw input file plus 512 bytes (for the VHD footer). Those qemu versions do not attempt to realign the disk. As a result, Azure accepts VHD files created using those versions of qemu-img convert for upload.

  Is there a reason why newer qemu realigns the converted VHD file? It would be useful if an option were added to disable this feature, as current versions of qemu cannot be used to create VHD files for Azure using Microsoft's official instructions.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1490611/+subscriptions
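For readers hitting this on a qemu-img without the backported force_size option, a common workaround (a sketch, not taken from this bug report; file names are placeholders) is to grow the raw image to the next whole MiB before converting, so the vpc driver has nothing to pad:

  # round the raw image up to the next 1 MiB boundary (no-op if already aligned)
  size=$(stat -c %s source-disk.img)
  aligned=$(( (size + 1048576 - 1) / 1048576 * 1048576 ))
  truncate -s "$aligned" source-disk.img

  # then convert; on builds that have force_size, the virtual size is kept as-is
  qemu-img convert -f raw -O vpc -o subformat=fixed,force_size source-disk.img dest-disk.vhd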
[Qemu-devel] [Bug 1790268] [NEW] the vhd generated by qemu-img not align with MiB again.
Public bug reported:

I'm using this version on xenial:

  andy@bastion:~/temp$ qemu-img -h
  qemu-img version 2.5.0 (Debian 1:2.5+dfsg-5ubuntu10.31), Copyright (c) 2004-2008 Fabrice Bellard

Steps to reproduce:

  dd if=/dev/zero of=/tmp/azure_config_disk_image20180901-22672-16zxelu bs=1048576 count=24
  mkfs.ext4 -F /tmp/azure_config_disk_image20180901-22672-16zxelu -L azure_cfg_dsk
  sudo -n mount -o loop /tmp/azure_config_disk_image20180901-22672-16zxelu /tmp/azure_config_disk_mount66c11d7a-5f2b-4ed5-b959-3b48dbc42a2a20180901-22672-1ejreat
  sudo -n chown andy /tmp/azure_config_disk_mount66c11d7a-5f2b-4ed5-b959-3b48dbc42a2a20180901-22672-1ejreat
  mkdir -p /tmp/azure_config_disk_mount66c11d7a-5f2b-4ed5-b959-3b48dbc42a2a20180901-22672-1ejreat/configs
  sudo -n umount /tmp/azure_config_disk_mount66c11d7a-5f2b-4ed5-b959-3b48dbc42a2a20180901-22672-1ejreat
  qemu-img convert -f raw -O vpc -o subformat=fixed,force_size /tmp/azure_config_disk_image20180901-22672-16zxelu papapa2.vhd

Unfortunately the papapa2.vhd size is 25166336 != 25165824, which means it is not MiB-aligned. Could you please help?

** Affects: qemu
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1790268
Title: the vhd generated by qemu-img not align with MiB again.
Status in QEMU: New

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1790268/+subscriptions
[Qemu-devel] [Bug 1790268] Re: the vhd generated by qemu-img not align with MiB again.
The previous report of this issue, which was fixed, is https://bugs.launchpad.net/qemu/+bug/1490611

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1790268
Title: the vhd generated by qemu-img not align with MiB again.
Status in QEMU: New
[Qemu-devel] [Bug 1790268] Re: the vhd generated by qemu-img not align with MiB again.
And qemu-img even reports the format as raw:

  andy@bastion:~/temp$ qemu-img info papapa2.vhd
  image: papapa2.vhd
  file format: raw
  virtual size: 24M (25166336 bytes)
  disk size: 152K

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1790268
Title: the vhd generated by qemu-img not align with MiB again.
Status in QEMU: New
[Qemu-devel] [Bug 1790268] Re: the vhd generated by qemu-img not align with MB again.
** Summary changed:

- the vhd generated by qemu-img not align with MiB again.
+ the vhd generated by qemu-img not align with MB again.

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1790268
Title: the vhd generated by qemu-img not align with MB again.
Status in QEMU: New
[Qemu-devel] [Bug 1790268] Re: the vhd generated by qemu-img not align with MB again.
** Description changed:

  + and also seems it's not a vhd file:
  + andy@bastion:~/temp$ qemu-img info papapa2.vhd
  + image: papapa2.vhd
  + file format: raw
  + virtual size: 24M (25166336 bytes)
  + disk size: 152K
  +
    could you please help?

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1790268
Title: the vhd generated by qemu-img not align with MB again.
Status in QEMU: New
[Qemu-devel] [Bug 1790268] Re: the vhd generated by qemu-img not align with MB again.
** Changed in: qemu
       Status: New => Invalid

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1790268
Title: the vhd generated by qemu-img not align with MB again.
Status in QEMU: Invalid
[Qemu-devel] [Bug 1790268] Re: the vhd generated by qemu-img not align with MB again.
** Changed in: qemu
     Assignee: (unassigned) => Andy (andyliuliming)

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1790268
Title: the vhd generated by qemu-img not align with MB again.
Status in QEMU: Invalid
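A quick check, based only on the numbers in this report and the 512-byte VHD footer mentioned in bug 1490611 (this is a sketch, not an official resolution note), suggests why the report was closed as Invalid: the extra 512 bytes are exactly one fixed-VHD footer, and format probing ignores a footer that sits at the end of the file unless the format is named explicitly:

  # 25166336 - 25165824 = 512, i.e. the size of a fixed-VHD footer
  expr 25166336 - 25165824

  # a fixed VHD is raw data plus a trailing 512-byte footer, so probing
  # reports "raw"; ask for the vpc driver explicitly to see the VHD view
  qemu-img info -f vpc papapa2.vhd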
[Qemu-devel] [Bug 1025244] Re: qcow2 image increasing disk size above the virtual limit
Any solution right now? I have a problem similar to Todor Andreev's. Our daily backup of some virtual machines (qcow2) looks like this (see the command sketch after this message):

1. Shut down the VM.
2. Create a snapshot via: "qemu-img snapshot -c nameofsnapshot..."
3. Boot the VM.
4. Back up the snapshot to another virtual disk via: "qemu-img convert -f qcow2 -O qcow2 -s nameofsnapshot..."
5. Delete the snapshot from the VM via: qemu-img snapshot -d nameofsnapshot...

But the problem is that our original VM image keeps growing steadily, although few changes were made?!

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1025244

Title: qcow2 image increasing disk size above the virtual limit
Status in QEMU: New
Status in “qemu-kvm” package in Ubuntu: Triaged

Bug description:

  Using qemu/kvm, qcow2 images, ext4 file systems on both guest and host.
  Host and Guest: Ubuntu server 12.04 64bit

  To create an image I did this:

    qemu-img create -f qcow2 -o preallocation=metadata ubuntu-pdc-vda.img 10737418240
    (not sure about the exact bytes, but around this)
    ls -l ubuntu-pdc-vda.img
    fallocate -l theSizeInBytesFromAbove ubuntu-pdc-vda.img

  The problem is that the image is growing progressively and has obviously no limit, although I gave it one. The root filesystem's image is the same case:

    qemu-img info ubuntu-pdc-vda.img
    image: ubuntu-pdc-vda.img
    file format: qcow2
    virtual size: 10G (10737418240 bytes)
    disk size: 14G
    cluster_size: 65536

  and for confirmation:

    du -sh ubuntu-pdc-vda.img
    15G ubuntu-pdc-vda.img

  I made a test and saw that when I delete something from the guest, the real size of the image is not decreasing (I read it is normal). OK, but when I write something again, it doesn't use the freed space, but instead grows the image. So for example:

  1. The initial physical size of the image is 1GB.
  2. I copy 1GB of data in the guest. Its physical size becomes 2GB.
  3. I delete this data (1GB). The physical size of the image remains 2GB.
  4. I copy another 1GB of data to the guest.
  5. The physical size of the image becomes 3GB.
  6. And so on with no limit. It doesn't care if the virtual size is less.

  Is this normal - the real/physical size of the image to be larger than the virtual limit???

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1025244/+subscriptions
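A minimal sketch of that backup cycle with concrete placeholder names (the VM name, snapshot name, and the use of virsh are assumptions, not from the report); the -s option to qemu-img convert selects which internal snapshot gets copied:

  # 1-2: stop the VM (however it is managed) and take an internal snapshot
  virsh shutdown myvm
  qemu-img snapshot -c nightly ubuntu-pdc-vda.img

  # 3: start the VM again
  virsh start myvm

  # 4: copy the snapshot's contents into a standalone backup image
  qemu-img convert -f qcow2 -O qcow2 -s nightly ubuntu-pdc-vda.img backup-$(date +%F).qcow2

  # 5: drop the internal snapshot (a later comment in this thread reports
  #    that doing this while the VM is shut down avoids the growth)
  qemu-img snapshot -d nightly ubuntu-pdc-vda.img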
[Qemu-devel] Which qemu version to use for CentOS6?
Folks, I have a project that needs to run on CentOS 6.3, but the qemu that comes with that is 0.12, and I'm sure that I should apply a newer version. I had backported the qemu 1.0-17 from Fedora 17, but I see that the upstream qemu-1.1 is stable. Are there kernel dependencies that would prevent running qemu-1.1 on the 2.6.32 CentOS 6 kernel? What would you recommend? Andy
Re: [Qemu-devel] Which qemu version to use for CentOS6?
Stefan,

That is what I needed to know. Thanks. It was somewhat surprising to see how many patches have been backported to the CentOS 6.3 source rpm.

Andy

-----Original Message-----
From: Stefan Hajnoczi [mailto:stefa...@gmail.com]
Sent: Monday, February 25, 2013 9:41 AM
To: Andy Cress
Cc: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] Which qemu version to use for CentOS6?

On Mon, Feb 25, 2013 at 02:24:01PM +, Andy Cress wrote:
> I have a project that needs to run on CentOS 6.3, but the qemu that comes with that is 0.12, and I'm sure that I should apply a newer version.
> I had backported the qemu 1.0-17 from Fedora 17, but I see that the upstream qemu-1.1 is stable. Are there kernel dependencies that would prevent running qemu-1.1 on the 2.6.32 CentOS 6 kernel?

The qemu-kvm in CentOS 6.3 is not a vanilla QEMU 0.12. Look at the source rpm and you'll see a lot of features and fixes have been backported. Is there a feature missing that forced you to build your own qemu-kvm? If not, don't worry about the version number and use the distro package.

Stefan
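One way to see those backports for yourself (a sketch; the package name is the CentOS 6 default and yumdownloader comes from yum-utils) is to read the package changelog or count the patches shipped in the source rpm:

  # list the fixes Red Hat has folded into the 0.12-based package
  rpm -q --changelog qemu-kvm | less

  # or fetch the source rpm and count its patch files
  yumdownloader --source qemu-kvm
  rpm -qpl qemu-kvm-*.src.rpm | grep -c '\.patch$'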
Re: [Qemu-devel] [PATCH 10/17] mm: rmap preparation for remap_anon_pages
On Tue, Oct 7, 2014 at 8:52 AM, Andrea Arcangeli wrote: > On Tue, Oct 07, 2014 at 04:19:13PM +0200, Andrea Arcangeli wrote: >> mremap like interface, or file+commands protocol interface. I tend to >> like mremap more, that's why I opted for a remap_anon_pages syscall >> kept orthogonal to the userfaultfd functionality (remap_anon_pages >> could be also used standalone as an accelerated mremap in some >> circumstances) but nothing prevents to just embed the same mechanism > > Sorry for the self followup, but something else comes to mind to > elaborate this further. > > In term of interfaces, the most efficient I could think of to minimize > the enter/exit kernel, would be to append the "source address" of the > data received from the network transport, to the userfaultfd_write() > command (by appending 8 bytes to the wakeup command). Said that, > mixing the mechanism to be notified about userfaults with the > mechanism to resolve an userfault to me looks a complication. I kind > of liked to keep the userfaultfd protocol is very simple and doing > just its thing. The userfaultfd doesn't need to know how the userfault > was resolved, even mremap would work theoretically (until we run out > of vmas). I thought it was simpler to keep it that way. However if we > want to resolve the fault with a "write()" syscall this may be the > most efficient way to do it, as we're already doing a write() into the > pseudofd to wakeup the page fault that contains the destination > address, I just need to append the source address to the wakeup command. > > I probably grossly overestimated the benefits of resolving the > userfault with a zerocopy page move, sorry. So if we entirely drop the > zerocopy behavior and the TLB flush of the old page like you > suggested, the way to keep the userfaultfd mechanism decoupled from > the userfault resolution mechanism would be to implement an > atomic-copy syscall. That would work for SIGBUS userfaults too without > requiring a pseudofd then. It would be enough then to call > mcopy_atomic(userfault_addr,tmp_addr,len) with the only constraints > that len must be a multiple of PAGE_SIZE. Of course mcopy_atomic > wouldn't page fault or call GUP into the destination address (it can't > otherwise the in-flight partial copy would be visible to the process, > breaking the atomicity of the copy), but it would fill in the > pte/trans_huge_pmd with the same strict behavior that remap_anon_pages > currently has (in turn it would by design bypass the VM_USERFAULT > check and be ideal for resolving userfaults). At the risk of asking a possibly useless question, would it make sense to splice data into a userfaultfd? --Andy > > mcopy_atomic could then be also extended to tmpfs and it would work > without requiring the source page to be a tmpfs page too without > having to convert page types on the fly. > > If I add mcopy_atomic, the patch in subject (10/17) can be dropped of > course so it'd be even less intrusive than the current > remap_anon_pages and it would require zero TLB flush during its > runtime (it would just require an atomic copy). > > So should I try to embed a mcopy_atomic inside userfault_write or can > I expose it to userland as a standalone new syscall? Or should I do > something different? Comments? > > Thanks, > Andrea -- Andy Lutomirski AMA Capital Management, LLC
Re: [Qemu-devel] [PATCH RFC 00/11] qemu: towards virtio-1 host support
On 10/07/2014 07:39 AM, Cornelia Huck wrote: > This patchset aims to get us some way to implement virtio-1 compliant > and transitional devices in qemu. Branch available at > > git://github.com/cohuck/qemu virtio-1 > > I've mainly focused on: > - endianness handling > - extended feature bits > - virtio-ccw new/changed commands At the risk of some distraction, would it be worth thinking about a solution to the IOMMU bypassing mess as part of this? --Andy
Re: [Qemu-devel] tcmu-runner and QEMU
On 08/29/2014 10:22 AM, Benoît Canet wrote:
> The truth is that QEMU block drivers don't know how to do much on their own so we probably must bring the whole QEMU block layer in a tcmu-runner handler plugin.

Woah! Really? ok...

> Another reason to do this is that the QEMU block layer brings features like taking snapshots or streaming snaphots that a cloud provider would want to keep while exporting QCOW2 as ISCSI or FCOE.
> Doing these operations is usually done by passing something like "--qmp tcp:localhost,,server,nowait" as a QEMU command line argument then connecting on this JSON processing socket then send orders to QEMU.

The LIO TCMU backend and tcmu-runner provide for a configstring that is associated with a given backstore. This is made available to the handler, and sounds like just what qmp needs.

> I made some patches to split this QMP machinery from the QEMU binary but still I don't know how a tcmu-runner plugin handler would be able to receive this command line configuration.

The flow would be (see the targetcli sketch after this message):

1) admin configures a LIO backstore of type "user", size 10G, and gives it a configstring like "qmp/tcp:localhost,,server,nowait"
2) admin exports the backstore via whatever LIO-supported fabric(s) (e.g. iSCSI)
3) tcmu-runner is notified of the new user backstore from step 1, finds the handler associated with "qmp", calls handler->open("tcp:localhost,,server,nowait")
4) qmp handler parses string and does whatever it needs to do
5) handler receives SCSI commands as they arrive

> The second problem is that the QEMU block layer is big and filled with scary stuff like threads and coroutines but I think only trying to write the tcmu-runner handler will tell if it's doable.

Yeah, could be tricky but would be pretty cool if it works.

Let me know how I can help, or with any questions.

Regards -- Andy
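A rough sketch of what the admin-side steps 1) and 2) above might look like with targetcli. Everything here is an assumption: the "qmp" user-backstore subtype, the create argument names, and the example IQN are placeholders, not existing tcmu-runner handlers or verified syntax:

  # 1) create a LIO userspace-passthrough backstore handled by a "qmp" plugin
  targetcli /backstores/user:qmp create name=disk1 size=10G cfgstring=tcp:localhost,,server,nowait

  # 2) export it over iSCSI
  targetcli /iscsi create iqn.2014-09.org.example:disk1
  targetcli /iscsi/iqn.2014-09.org.example:disk1/tpg1/luns create /backstores/user:qmp/disk1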
Re: [Qemu-devel] tcmu-runner and QEMU
On 08/29/2014 11:51 AM, Benoît Canet wrote: QMP is just a way to control QEMU via a socket: it is not particularly block related. On the other hand bringing the whole block layers into a tcmu-runner handler would mean that there would be _one_ QMP socket opened (by mean of wonderfull QEMU modules static variables :) to control multiple block devices exported. So I think the configuration passed must be done before an individual open occurs: being global to the .so implementing the tcmu-runner handler. But I don't see how to do it with the current API. This discussion leads me to think we need to step back and discuss our requirements. I am looking for flexible backstores for SCSI-based fabrics, with as little new code as possible. I think you are looking for a way to export QEMU block devices over iSCSI and other fabrics? I don't think making a LIO userspace handler into basically a full-fledged secondary QEMU server instance is the way to go. What I think better serves your requirements is to enable QEMU to configure LIO. In a previous email you wrote: Another reason to do this is that the QEMU block layer brings features like taking snapshots or streaming snaphots that a cloud provider would want to keep while exporting QCOW2 as ISCSI or FCOE. Whether a volume is exported over iSCSI or FCoE or not shouldn't affect how it is managed. QMP commands should go to the single QEMU server, which can then optionally configure LIO to export the volume. That leaves us with the issue that we'd need to arbitrate access to the backing file if taking a streaming snapshot (qemu and tcmu-runner processes both accessing the img), but that should be straightforward, or at least work that can be done in a second phase of development. Thoughts? Regards -- Andy p.s. offline Monday.
Re: [Qemu-devel] tcmu-runner and QEMU
On 08/30/2014 09:02 AM, Richard W.M. Jones wrote: On Sat, Aug 30, 2014 at 05:53:43PM +0200, Benoît Canet wrote: If the cloud provider want to be able to boot QCOW2 or QED images on bare metal machines he will need to export QCOW2 or QED images on the network. So far only qemu-nbd allows to do this and it is neither well performing nor really convenient to boot on a bare metal machine. So I think what you want is a `qemu-iscsi'? ie. the same as qemu-nbd, but with an iSCSI frontend (to replace the NBD server). You want qemu to be able to issue SCSI commands over iSCSI? I thought qemu used libiscsi for this, to be the initiator. What Benoit and I have been discussing is the other side, enabling qemu to configure LIO to handle requests from other initiators (either VMs or iron) over iSCSI or FCoE, but backed by qcow2 disk images. The problem being LIO doesn't speak qcow2 yet. I guess so. Are you planning to integrate bits of LIO into qemu, or bits of qemu into LIO? My current thinking is 1) enable qemu to configure the LIO kernel target (it's all straightforward via configfs, but add a nice library to qemu to hide the details) and 2) enable LIO to use qcow2 and other formats besides raw images to back exported LUNs. This is where the LIO userspace passthrough and tcmu-runner come in, because we want to do this in userspace, not as kernel code, so we have to pass SCSI commands up to a userspace helper daemon. The latter has been tried various times, without much success. See the many examples of people trying to make the qemu block driver code into a separate library, and failing. What's been the sticking point? Regards -- Andy
Re: [Qemu-devel] tcmu-runner and QEMU
On 09/02/2014 02:25 AM, Stefan Hajnoczi wrote: The easiest approach is to write a tool similar to qemu-nbd that speaks the userspace target protocol (i.e. mmap the shared memory). If the tcmu setup code is involved, maybe providing a libtcmu with the setup code would be useful. I suspect that other projects may want to integrate userspace target support too. It's easier to let people add it to their codebase rather than hope they bring their codebase into tcmu-runner. What other projects were you thinking of? From my perspective, QEMU is singular. QEMU's block support seems to cover just about everything, even ceph, gluster, and sheepdog! We certainly don't want to duplicate that code so a qemu-lio-tcmu in qemu.git like qemu-nbd, basically statically linking the BlockDriver object files, sounds like the first thing to try. We can make tcmu-runner a library (libtcmu) if it makes sense, but let's do some work to try the current way and see how it goes before "flipping" it. > The qemu-lio tool would live in the QEMU codebase and reuse all the > infrastructure. For example, it could include a QMP monitor just like > the one you are adding to qemu-nbd. Benoit and I talked a little about QMP on another part of the thread... I said I didn't think we needed a QMP monitor in qemu-lio-tcmu, but let me spin up on qemu a little more and I'll be able to speak more intelligently. -- Andy
Re: [Qemu-devel] tcmu-runner and QEMU
On 09/04/2014 06:24 AM, Benoît Canet wrote: There are other commands for snapshots and backup which are issued via QMP. It might even make sense to make the tcmu interface available at run-time in QEMU like the run-time NBD server. This allows you to get at read-only point-in-time snapshots while the guest is accessing the disk. See the nbd-server-start command in qapi/block.json. Stefan Andy: ping I hope we didn't scaried you with our monster block backend and it's associated QMP socket ;) Hi Benoît, No, I've gone off to work on a initial proof-of-concept implementation of a qemu-lio-tcmu.so module, hopefully it'll be ready to look at shortly and then we can shoot arrows at it. :) But in the meantime, do you have a use case or user story for the QMP support that might help me understand better how it might all fit together? Regards -- Andy
[Qemu-devel] [PATCH qemu] i386, linux-headers: Add support for kvm_get_rng_seed
This updates x86's kvm_para.h for the feature bit definition and target-i386/cpu.c for the feature name and default.

Signed-off-by: Andy Lutomirski
---
 linux-headers/asm-x86/kvm_para.h | 2 ++
 target-i386/cpu.c                | 5 +++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/linux-headers/asm-x86/kvm_para.h b/linux-headers/asm-x86/kvm_para.h
index e41c5c1..a9b27ce 100644
--- a/linux-headers/asm-x86/kvm_para.h
+++ b/linux-headers/asm-x86/kvm_para.h
@@ -24,6 +24,7 @@
 #define KVM_FEATURE_STEAL_TIME 5
 #define KVM_FEATURE_PV_EOI 6
 #define KVM_FEATURE_PV_UNHALT 7
+#define KVM_FEATURE_GET_RNG_SEED 8
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -40,6 +41,7 @@
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 #define MSR_KVM_STEAL_TIME 0x4b564d03
 #define MSR_KVM_PV_EOI_EN 0x4b564d04
+#define MSR_KVM_GET_RNG_SEED 0x4b564d05
 
 struct kvm_steal_time {
 	__u64 steal;
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 8fd1497..4ea7e6c 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -236,7 +236,7 @@ static const char *ext4_feature_name[] = {
 static const char *kvm_feature_name[] = {
     "kvmclock", "kvm_nopiodelay", "kvm_mmu", "kvmclock",
     "kvm_asyncpf", "kvm_steal_time", "kvm_pv_eoi", "kvm_pv_unhalt",
-    NULL, NULL, NULL, NULL,
+    "kvm_get_rng_seed", NULL, NULL, NULL,
     NULL, NULL, NULL, NULL,
     NULL, NULL, NULL, NULL,
     NULL, NULL, NULL, NULL,
@@ -368,7 +368,8 @@ static uint32_t kvm_default_features[FEATURE_WORDS] = {
         (1 << KVM_FEATURE_ASYNC_PF) |
         (1 << KVM_FEATURE_STEAL_TIME) |
         (1 << KVM_FEATURE_PV_EOI) |
-        (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT),
+        (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
+        (1 << KVM_FEATURE_GET_RNG_SEED),
     [FEAT_1_ECX] = CPUID_EXT_X2APIC,
 };
-- 
1.9.3
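Assuming this patch (and the matching KVM host support) is applied, the new bit should be toggleable like the other kvm_* feature names in the table above via the usual -cpu flag syntax; the command lines below are an illustration, not taken from the patch, and the trailing dots stand for the rest of the command line:

  # expose or mask the paravirt RNG-seed feature to the guest
  qemu-system-x86_64 -enable-kvm -cpu host,+kvm_get_rng_seed ...
  qemu-system-x86_64 -enable-kvm -cpu host,-kvm_get_rng_seed ...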
[Qemu-devel] [Bug 784977] [NEW] qemu-img convert fails to convert, generates a 512byte file output
Public bug reported:

I have a VMware image, so I have files like 'Ubuntu.vmdk', and want to convert it to VirtualBox .vdi format using qemu. The first stage of extracting the image with 'qemu-img convert Ubuntu.vmdk output.bin' just generates a 512-byte file:

{quote}
# Disk DescriptorFile
version=1
CID=36be9761
parentCID=
createType="twoGbMaxExtentSparse"

# Extent description
RW 4192256 SPARSE "Ubuntu-s001.vmdk"
RW 4192256 SPARSE "Ubuntu-s002.vmdk"
RW 4192256 SPARSE "Ubuntu-s003.vmdk"
RW 4192256 SPARSE "Ubuntu-s004.vmdk"
RW 4192256 SPARSE "Ubuntu-s005.vmdk"
RW 4192256 SPARSE "Ubuntu-s006.vmdk"
RW 4192256 SPARSE "Ubuntu-s007.vmdk"
RW 4192256 SPARSE "Ubuntu-s008.vmdk"
RW 4192256 SPARSE "Ubuntu-s009.vmdk"
RW 4192256 SPARSE "Ubuntu-s010.vmdk"
RW 20480 SPARSE "Ubunt
{quote}

Here is the input Ubuntu.vmdk file:

{quote}
# Disk DescriptorFile
version=1
CID=36be9761
parentCID=
createType="twoGbMaxExtentSparse"

# Extent description
RW 4192256 SPARSE "Ubuntu-s001.vmdk"
RW 4192256 SPARSE "Ubuntu-s002.vmdk"
RW 4192256 SPARSE "Ubuntu-s003.vmdk"
RW 4192256 SPARSE "Ubuntu-s004.vmdk"
RW 4192256 SPARSE "Ubuntu-s005.vmdk"
RW 4192256 SPARSE "Ubuntu-s006.vmdk"
RW 4192256 SPARSE "Ubuntu-s007.vmdk"
RW 4192256 SPARSE "Ubuntu-s008.vmdk"
RW 4192256 SPARSE "Ubuntu-s009.vmdk"
RW 4192256 SPARSE "Ubuntu-s010.vmdk"
RW 20480 SPARSE "Ubuntu-s011.vmdk"

# The Disk Data Base
#DDB

ddb.toolsVersion = "7240"
ddb.adapterType = "lsilogic"
ddb.geometry.sectors = "63"
ddb.geometry.heads = "255"
ddb.geometry.cylinders = "2610"
ddb.virtualHWVersion = "6"
{quote}

No stack trace or other output was found.

Is there anything I can add? (Other than the 20G VM image to reproduce, which I'll be happy to provide.)

** Affects: qemu
   Importance: Undecided
   Status: New

** Tags: convert quemu-img vmdk

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/784977
Title: qemu-img convert fails to convert, generates a 512byte file output
Status in QEMU: New
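A hedged suggestion rather than a confirmed fix: naming both formats explicitly rules out the descriptor file being mis-probed as raw, and requires the Ubuntu-s0*.vmdk extent files to sit next to the descriptor. File names follow the report; whether this qemu version handles twoGbMaxExtentSparse images, and whether this build has VDI write support, are exactly the open questions here:

  # run from the directory that also contains Ubuntu-s001.vmdk ... Ubuntu-s011.vmdk
  qemu-img info -f vmdk Ubuntu.vmdk        # should report the full virtual size, not ~512 bytes
  qemu-img convert -f vmdk -O vdi Ubuntu.vmdk output.vdi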
[Qemu-devel] Framebuffer corruption in QEMU or Linux's cirrus driver
Running: ./virtme-run --installed-kernel from this virtme commit: https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git/commit/?id=2b409a086d15b7a878c7d5204b1f44a6564a341f results in a bunch of missing lines of text once bootup finishes. Pressing enter a few times gradually fixes it. I don't know whether this is a qemu bug or a Linux bug. I'm seeing this on Fedora's 3.13.7 kernel and on a fairly recent 3.14-rc kernel. For the latter, cirrus is built-in (not a module), I'm running: virtme-run --kimg arch/x86/boot/bzImage and I see more profound corruption. --Andy
Re: [Qemu-devel] Framebuffer corruption in QEMU or Linux's cirrus driver
On Tue, Apr 1, 2014 at 3:09 PM, Andy Lutomirski wrote: > Running: > > ./virtme-run --installed-kernel > > from this virtme commit: > > https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git/commit/?id=2b409a086d15b7a878c7d5204b1f44a6564a341f > > results in a bunch of missing lines of text once bootup finishes. > Pressing enter a few times gradually fixes it. > > I don't know whether this is a qemu bug or a Linux bug. > > I'm seeing this on Fedora's 3.13.7 kernel and on a fairly recent > 3.14-rc kernel. For the latter, cirrus is built-in (not a module), > I'm running: > > virtme-run --kimg arch/x86/boot/bzImage > > and I see more profound corruption. I'm guessing this is a cirrus drm bug. bochs-drm (using virtme-run --installed-kernel --qemu-opts -vga std) does not appear to have the same issue. Neither does qxl. (qxl is painfully slow, though, and it doesn't seem to be using UC memory.) --Andy
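For anyone reproducing the comparison above, the three configurations map to qemu's -vga switch, passed through virtme's --qemu-opts (a sketch; cirrus was the default VGA device at the time):

  ./virtme-run --installed-kernel                          # default cirrus: corruption observed
  ./virtme-run --installed-kernel --qemu-opts -vga std     # bochs-drm: no corruption observed
  ./virtme-run --installed-kernel --qemu-opts -vga qxl     # qxl: no corruption, but slow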
[Qemu-devel] Turning off default storage devices?
Currently, -M q35 boots linux quite a bit slower than the default machine type. This seems to be because it takes a few hundred ms to determine that there's nothing attached to the AHCI controller. In virtio setups, there will probably never be anything attached to the AHCI controller. Would it be possible to add something like -machine default_storage=off to turn off default storage devices? This could include the AHCI on q35 and the cdrom and such on pc. There's precedent: -machine usb=off turns off the default USB controllers, which is great for setups that use xhci. Thanks, Andy
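To make the proposal concrete: the first line below shows the option being asked for (proposed only, it does not exist), while the second shows the usb=off precedent that does exist; the trailing dots stand for the rest of the command line:

  # proposed, not implemented: suppress the built-in AHCI/IDE devices
  qemu-system-x86_64 -M q35 -machine default_storage=off ...

  # existing precedent: suppress the default USB controllers
  qemu-system-x86_64 -M q35 -machine usb=off ...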
Re: [Qemu-devel] Turning off default storage devices?
On Wed, Apr 9, 2014 at 4:53 PM, Peter Crosthwaite wrote: > Hi Andy, > > On Thu, Apr 10, 2014 at 5:55 AM, Andy Lutomirski wrote: >> Currently, -M q35 boots linux quite a bit slower than the default >> machine type. This seems to be because it takes a few hundred ms to >> determine that there's nothing attached to the AHCI controller. >> >> In virtio setups, there will probably never be anything attached to >> the AHCI controller. Would it be possible to add something like >> -machine default_storage=off to turn off default storage devices? >> This could include the AHCI on q35 and the cdrom and such on pc. >> >> There's precedent: -machine usb=off turns off the default USB >> controllers, which is great for setups that use xhci. >> > > Is there a more generic solution to your problem? Can you implement > command line device removal in a non specific way and avoid having to > invent AHCI or even "storage" specific arguments. You could > considering bringing the xhci use case you mentioned under the same > umbrella. An option like -suppress-default-device foobar to turn off the device named foobar would work, but what happens if that device is a bus? Will this just cause QEMU to crash? Maybe the machine code would have to opt in to allowing this kind of suppression, and there could be a general error of you try to suppress a device that can't be suppressed. I can try to code this up, but I know nothing about QEMU internals. I'm just a user :) --Andy
Re: [Qemu-devel] Turning off default storage devices?
On Wed, Apr 9, 2014 at 8:13 PM, Peter Crosthwaite wrote:
> On Thu, Apr 10, 2014 at 9:57 AM, Andy Lutomirski wrote:
>> On Wed, Apr 9, 2014 at 4:53 PM, Peter Crosthwaite wrote:
>>> Hi Andy,
>>>
>>> On Thu, Apr 10, 2014 at 5:55 AM, Andy Lutomirski wrote:
>>>> Currently, -M q35 boots linux quite a bit slower than the default machine type. This seems to be because it takes a few hundred ms to determine that there's nothing attached to the AHCI controller.
>>>>
>>>> In virtio setups, there will probably never be anything attached to the AHCI controller. Would it be possible to add something like -machine default_storage=off to turn off default storage devices? This could include the AHCI on q35 and the cdrom and such on pc.
>>>>
>>>> There's precedent: -machine usb=off turns off the default USB controllers, which is great for setups that use xhci.
>>>>
>>> Is there a more generic solution to your problem? Can you implement command line device removal in a non specific way and avoid having to invent AHCI or even "storage" specific arguments. You could considering bringing the xhci use case you mentioned under the same umbrella.
>>
>> An option like -suppress-default-device foobar to turn off the device named foobar would work, but what happens if that device is a bus?
>
> Lets call that a misuse in the first instance. But in general, when attaching devices QEMU should be able to gracefully fail on unresolved deps. So it would be reasonable to work on that assumption given that every device should be able to handle a missing bus/gpio/interrupt etc. due to -device misuseability.
>
>> Will this just cause QEMU to crash? Maybe the machine code would have to opt in to allowing this kind of suppression, and there could be a general error of you try to suppress a device that can't be suppressed.
>
> I would argue that there is no such thing. You may end up with a useless machine but its still valid to supress something and then by extension all its dependants are non functional.

The q35 code is:

    /* ahci and SATA device, for q35 1 ahci controller is built-in */
    ahci = pci_create_simple_multifunction(host_bus,
                                           PCI_DEVFN(ICH9_SATA1_DEV,
                                                     ICH9_SATA1_FUNC),
                                           true, "ich9-ahci");
    idebus[0] = qdev_get_child_bus(&ahci->qdev, "ide.0");
    idebus[1] = qdev_get_child_bus(&ahci->qdev, "ide.1");

It looks like making pci_create_simple_multifunction return null will crash quite quickly. Even fixing the next two lines will just cause null pointer dereferences later on. Is there a different way to indicate that a device wasn't actually created?

--Andy
Re: [Qemu-devel] Turning off default storage devices?
On Mon, Apr 14, 2014 at 1:15 AM, Markus Armbruster wrote: > Peter Crosthwaite writes: > >> Hi Andy, >> >> On Thu, Apr 10, 2014 at 5:55 AM, Andy Lutomirski wrote: >>> Currently, -M q35 boots linux quite a bit slower than the default >>> machine type. This seems to be because it takes a few hundred ms to >>> determine that there's nothing attached to the AHCI controller. >>> >>> In virtio setups, there will probably never be anything attached to >>> the AHCI controller. Would it be possible to add something like >>> -machine default_storage=off to turn off default storage devices? >>> This could include the AHCI on q35 and the cdrom and such on pc. >>> >>> There's precedent: -machine usb=off turns off the default USB >>> controllers, which is great for setups that use xhci. >>> >> >> Is there a more generic solution to your problem? Can you implement >> command line device removal in a non specific way and avoid having to >> invent AHCI or even "storage" specific arguments. You could >> considering bringing the xhci use case you mentioned under the same >> umbrella. > > USB has always been off by default, at least for the boards I'm familiar > with, due to the USB emulation's non-trivial CPU use. > > There's no such thing as a Q35 board without USB in the physical world. > Can't stop us from making a virtual one, of course. > > Likewise, there's no such thing as a Q35 board without AHCI in the > physical world, and again that can't stop us from making a virtual one. > > The difference to USB is that our q35 machines have always had AHCI even > with -nodefaults. You seem to propose adding a switch to disable AHCI, > yet leave it enabled with -nodefaults. > > -nodefaults should give you a board with all the optional components > suppressed. Will this break libvirt, which may expect -nodefaults to still come with an IDE bus? > > On the one hand, I'd rather not add exceptions to -nodefaults "give me > the board with all its optional components suppressed" semantics. > > On the other hand, a few hundred ms are a long time. That's why I proposed a new option. Yes, it's ugly :/ --Andy
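A quick way to see the point being debated here (a sketch using the standard HMP monitor) is to start an otherwise empty q35 machine and list its device tree; as noted above, the built-in ICH9 AHCI controller is present even with -nodefaults:

  qemu-system-x86_64 -M q35 -nodefaults -display none -monitor stdio
  (qemu) info qtree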
[Qemu-devel] [Bug 1349277] Re: AArch64 emulation ignores SPSel=0 when taking (or returning from) an exception at EL1 or greater
** Also affects: qemu (Ubuntu)
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1349277

Title:
  AArch64 emulation ignores SPSel=0 when taking (or returning from) an exception at EL1 or greater

Status in QEMU: New
Status in qemu package in Ubuntu: New

Bug description:

  The AArch64 emulation ignores SPSel=0 when:

  (1) taking an interrupt from an exception level greater than EL0 (e.g., EL1t),
  (2) returning from an exception (via ERET) to an exception level greater than EL0 (e.g., EL1t), with SPSR_ELx[SPSel]=0.

  The attached patch fixes the problem in my application.

  Background: I'm running a standalone application (toy OS) that is performing preemptive multithreading between threads running at EL1t, with exception handling / context switching occurring at EL1h. This bug causes the stack pointer to be corrupted in the threads running at EL1t (they end up with a version of the EL1h stack pointer (SP_EL1)).

  Occurs in:
    qemu-2.1.0-rc1 (found in)
    commit c60a57ff497667780132a3fcdc1500c83af5d5c0 (current master)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1349277/+subscriptions
Re: [Qemu-devel] [PATCH v3 1/4] firmware: introduce sysfs driver for QEMU's fw_cfg device
On Sat, Oct 3, 2015 at 4:28 PM, Gabriel L. Somlo wrote: > From: Gabriel Somlo > > Make fw_cfg entries of type "file" available via sysfs. Entries > are listed under /sys/firmware/qemu_fw_cfg/by_key, in folders > named after each entry's selector key. Filename, selector value, > and size read-only attributes are included for each entry. Also, > a "raw" attribute allows retrieval of the full binary content of > each entry. > > This patch also provides a documentation file outlining the > guest-side "hardware" interface exposed by the QEMU fw_cfg device. > What's the status of "by_name"? There's a single (presumably incorrect) mention of it in a comment in this patch. I would prefer if the kernel populated by_name itself rather than deferring that to udev, since I'd like to use this facility in virtme, and I'd like to use fw_cfg very early on boot before I even start udev. --Andy
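For context, a sketch of how the sysfs layout described above would be browsed from inside a guest, assuming the per-entry attribute names match the patch description (name, size, raw under each selector-key directory); the key number 32 is an arbitrary example, and by_name is the part whose status is being questioned here:

  # list every fw_cfg entry the guest can see: selector key, size, file name
  for d in /sys/firmware/qemu_fw_cfg/by_key/*; do
      printf '%s\t%s\t%s\n' "${d##*/}" "$(cat "$d/size")" "$(cat "$d/name")"
  done

  # dump one entry's payload
  hexdump -C /sys/firmware/qemu_fw_cfg/by_key/32/raw | head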
[Qemu-devel] [Bug 1025244] Re: qcow2 image increasing disk size above the virtual limit
Thanks for your advice. I have no more problems with the VM size since I started deleting the snapshot while the VM is shut down. I reduced the overlarge qcow2 images by converting them to qcow2 again, which detects unused sectors and omits them (a command sketch follows this message).

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1025244
Title: qcow2 image increasing disk size above the virtual limit
Status in QEMU: New
Status in “qemu-kvm” package in Ubuntu: Triaged
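A minimal sketch of that compaction step, with placeholder file names; a qcow2-to-qcow2 convert rewrites only allocated, non-zero clusters, so the copy comes out smaller than the grown original:

  # run only while the VM is shut down
  qemu-img convert -f qcow2 -O qcow2 ubuntu-pdc-vda.img ubuntu-pdc-vda-compact.img
  mv ubuntu-pdc-vda-compact.img ubuntu-pdc-vda.img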
Re: [Qemu-devel] [PATCH 2/2] e1000: CTRL.RST emulation
On Tue, Sep 27, 2011 at 8:58 AM, Michael S. Tsirkin wrote:
> e1000 spec says CTRL.RST write should have the same effect
> as bus reset, except that is preserves PCI Config.
> Reset device registers and interrupts.
>
> Fix suggested by Andy Gospodarek
> Similar fix proposed by Anthony PERARD
>
> Signed-off-by: Michael S. Tsirkin
> ---
>  hw/e1000.c | 9 +++--
>  1 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/hw/e1000.c b/hw/e1000.c
> index 87a1104..b51e089 100644
> --- a/hw/e1000.c
> +++ b/hw/e1000.c
> @@ -241,8 +241,13 @@ static void e1000_reset(void *opaque)
>  static void
>  set_ctrl(E1000State *s, int index, uint32_t val)
>  {
> -    /* RST is self clearing */
> -    s->mac_reg[CTRL] = val & ~E1000_CTRL_RST;
> +    if (val & E1000_CTRL_RST) {
> +        e1000_reset(s);
> +        qemu_set_irq(s->dev.irq[0], 0);
> +        return;
> +    }
> +
> +    s->mac_reg[CTRL] = val;
>  }
>
>  static void
> --
> 1.7.5.53.gc233e
>

Looks good to me. Thanks for following up with this, Michael.
Re: [Qemu-devel] [PATCH 00/10] RFC: userfault
On 07/02/2014 09:50 AM, Andrea Arcangeli wrote: > Hello everyone, > > There's a large CC list for this RFC because this adds two new > syscalls (userfaultfd and remap_anon_pages) and > MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes to the API > or on a completely different API if somebody has better ideas are > welcome now. cc:linux-api -- this is certainly worthy of linux-api discussion. > > The combination of these features are what I would propose to > implement postcopy live migration in qemu, and in general demand > paging of remote memory, hosted in different cloud nodes. > > The MADV_USERFAULT feature should be generic enough that it can > provide the userfaults to the Android volatile range feature too, on > access of reclaimed volatile pages. > > If the access could ever happen in kernel context through syscalls > (not not just from userland context), then userfaultfd has to be used > to make the userfault unnoticeable to the syscall (no error will be > returned). This latter feature is more advanced than what volatile > ranges alone could do with SIGBUS so far (but it's optional, if the > process doesn't call userfaultfd, the regular SIGBUS will fire, if the > fd is closed SIGBUS will also fire for any blocked userfault that was > waiting a userfaultfd_write ack). > > userfaultfd is also a generic enough feature, that it allows KVM to > implement postcopy live migration without having to modify a single > line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all > other GUP features works just fine in combination with userfaults > (userfaults trigger async page faults in the guest scheduler so those > guest processes that aren't waiting for userfaults can keep running in > the guest vcpus). > > remap_anon_pages is the syscall to use to resolve the userfaults (it's > not mandatory, vmsplice will likely still be used in the case of local > postcopy live migration just to upgrade the qemu binary, but > remap_anon_pages is faster and ideal for transferring memory across > the network, it's zerocopy and doesn't touch the vma: it only holds > the mmap_sem for reading). > > The current behavior of remap_anon_pages is very strict to avoid any > chance of memory corruption going unnoticed. mremap is not strict like > that: if there's a synchronization bug it would drop the destination > range silently resulting in subtle memory corruption for > example. remap_anon_pages would return -EEXIST in that case. If there > are holes in the source range remap_anon_pages will return -ENOENT. > > If remap_anon_pages is used always with 2M naturally aligned > addresses, transparent hugepages will not be splitted. In there could > be 4k (or any size) holes in the 2M (or any size) source range, > remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to > relax some of its strict checks (-ENOENT won't be returned if > RAP_ALLOW_SRC_HOLES is set, remap_anon_pages then will just behave as > a noop on any hole in the source range). This flag is generally useful > when implementing userfaults with THP granularity, but it shouldn't be > set if doing the userfaults with PAGE_SIZE granularity if the > developer wants to benefit from the strict -ENOENT behavior. 
> > The remap_anon_pages syscall API is not vectored, as I expect it to be > used mainly for demand paging (where there can be just one faulting > range per userfault) or for large ranges (with the THP model as an > alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k > granularity before starting the guest in the destination node) where > vectoring isn't going to provide much performance advantages (thanks > to the THP coarser granularity). > > On the rmap side remap_anon_pages doesn't add much complexity: there's > no need of nonlinear anon vmas to support it because I added the > constraint that it will fail if the mapcount is more than 1. So in > general the source range of remap_anon_pages should be marked > MADV_DONTFORK to prevent any risk of failure if the process ever > forks (like qemu can in some case). > > One part that hasn't been tested is the poll() syscall on the > userfaultfd because the postcopy migration thread currently is more > efficient waiting on blocking read()s (I'll write some code to test > poll() too). I also appended below a patch to trinity to exercise > remap_anon_pages and userfaultfd and it completes trinity > successfully. > > The code can be found here: > > git clone --reference linux > git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault > > The branch is rebased so you can get updates for example with: > > git fetch && git checkout -f origin/userfault > > Comments welcome, thanks! > Andrea > > From cbe940e13b4cead41e0f862b3abfa3814f235ec3 Mon Sep 17 00:00:00 2001 > From: Andrea Arcangeli > Date: Wed, 2 Jul 2014 18:32:35 +0200 > Subject: [PATCH] add remap_anon_pages and userfaultfd > > Signed-off-by: Andrea Arcangeli > --- > include/syscall
Re: [Qemu-devel] [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization
On 07/02/2014 09:50 AM, Andrea Arcangeli wrote: > Once an userfaultfd is created MADV_USERFAULT regions talks through > the userfaultfd protocol with the thread responsible for doing the > memory externalization of the process. > > The protocol starts by userland writing the requested/preferred > USERFAULT_PROTOCOL version into the userfault fd (64bit write), if > kernel knows it, it will ack it by allowing userland to read 64bit > from the userfault fd that will contain the same 64bit > USERFAULT_PROTOCOL version that userland asked. Otherwise userfault > will read __u64 value -1ULL (aka USERFAULTFD_UNKNOWN_PROTOCOL) and it > will have to try again by writing an older protocol version if > suitable for its usage too, and read it back again until it stops > reading -1ULL. After that the userfaultfd protocol starts. > > The protocol consists in the userfault fd reads 64bit in size > providing userland the fault addresses. After a userfault address has > been read and the fault is resolved by userland, the application must > write back 128bits in the form of [ start, end ] range (64bit each) > that will tell the kernel such a range has been mapped. Multiple read > userfaults can be resolved in a single range write. poll() can be used > to know when there are new userfaults to read (POLLIN) and when there > are threads waiting a wakeup through a range write (POLLOUT). > > Signed-off-by: Andrea Arcangeli > +#ifdef CONFIG_PROC_FS > +static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f) > +{ > + struct userfaultfd_ctx *ctx = f->private_data; > + int ret; > + wait_queue_t *wq; > + struct userfaultfd_wait_queue *uwq; > + unsigned long pending = 0, total = 0; > + > + spin_lock(&ctx->fault_wqh.lock); > + list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) { > + uwq = container_of(wq, struct userfaultfd_wait_queue, wq); > + if (uwq->pending) > + pending++; > + total++; > + } > + spin_unlock(&ctx->fault_wqh.lock); > + > + ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total); This should show the protocol version, too. > + > +SYSCALL_DEFINE1(userfaultfd, int, flags) > +{ > + int fd, error; > + struct file *file; This looks like it can't be used more than once in a process. That will be unfortunate for libraries. Would it be feasible to either have userfaultfd claim a range of addresses or for a vma to be explicitly associated with a userfaultfd? (In the latter case, giant PROT_NONE MAP_NORESERVE mappings could be used.)
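For reference, a minimal user-space sketch of the handshake and fault loop described above. It follows this RFC's read/write protocol, not the ioctl-based API that was eventually merged; the USERFAULT_PROTOCOL value, the 4 KiB range handling, and the resolve_fault() helper are placeholders, and SYS_userfaultfd is assumed to be defined by the system headers.

#include <stdint.h>
#include <unistd.h>
#include <poll.h>
#include <sys/syscall.h>

#define USERFAULT_PROTOCOL 0x01        /* placeholder protocol version */

static void resolve_fault(uint64_t addr)
{
    /* Fetch the page (e.g. over the network) and map it in, for example
     * with the proposed remap_anon_pages(); details elided. */
    (void)addr;
}

int main(void)
{
    int ufd = syscall(SYS_userfaultfd, 0);
    uint64_t proto = USERFAULT_PROTOCOL, addr, range[2];

    /* Handshake: propose a protocol version, read back the ack or -1ULL. */
    write(ufd, &proto, sizeof(proto));
    read(ufd, &proto, sizeof(proto));
    if (proto == (uint64_t)-1)
        return 1;                      /* kernel does not speak this version */

    for (;;) {
        struct pollfd pfd = { .fd = ufd, .events = POLLIN };

        poll(&pfd, 1, -1);             /* POLLIN: userfaults pending */
        if (read(ufd, &addr, sizeof(addr)) != sizeof(addr))
            continue;                  /* each read returns one fault address */

        resolve_fault(addr);

        range[0] = addr & ~0xfffULL;   /* write back [ start, end ] to wake waiters */
        range[1] = range[0] + 0x1000;
        write(ufd, range, sizeof(range));
    }
}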
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse wrote: > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell > the truth, and even legacy kernels ought to cope with that. > FSVO 'ought to' where I suspect some of them will actually crash with a > NULL pointer dereference if there's no "catch-all" DMAR unit in the > tables, which puts it back into the same camp as ARM and Power. I think x86 may get a bit of a free pass here. AFAIK the QEMU IOMMU implementation on x86 has always been "experimental", so it just might be okay to change it in a way that causes some older kernels to OOPS. --Andy
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Tue, Apr 19, 2016 at 3:27 AM, Michael S. Tsirkin wrote: > On Mon, Apr 18, 2016 at 12:24:15PM -0700, Andy Lutomirski wrote: >> On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse >> wrote: >> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell >> > the truth, and even legacy kernels ought to cope with that. >> > FSVO 'ought to' where I suspect some of them will actually crash with a >> > NULL pointer dereference if there's no "catch-all" DMAR unit in the >> > tables, which puts it back into the same camp as ARM and Power. >> >> I think x86 may get a bit of a free pass here. AFAIK the QEMU IOMMU >> implementation on x86 has always been "experimental", so it just might >> be okay to change it in a way that causes some older kernels to OOPS. >> >> --Andy > > Since it's experimental, it might be OK to change *guest kernels* > such that they oops on old QEMU. > But guest kernels were not experimental - so we need a QEMU mode that > makes them work fine. The more functionality is available in this QEMU > mode, the betterm because it's going to be the default for a while. For > the same reason, it is preferable to also have new kernels not crash in > this mode. > People add QEMU features that need new guest kernels all time time. If you enable virtio-scsi and try to boot a guest that's too old, it won't work. So I don't see anything fundamentally wrong with saying that the non-experimental QEMU Q35 IOMMU mode won't boot if the guest kernel is too old. It might be annoying, since old kernels do work on actual Q35 hardware, but it at least seems to be that it might be okay. --Andy
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Apr 19, 2016 2:13 AM, "Michael S. Tsirkin" wrote: > > > I guess you are right in that we should split this part out. > What I wanted is really the combination > PASSTHROUGH && !PLATFORM so that we can say "ok we don't > need to guess, this device actually bypasses the IOMMU". What happens when you use a device like this on Xen or with a similar software translation layer? I think that a "please bypass IOMMU" feature would be better in the PCI, IOMMU, or platform code. For Xen, virtio would still want to use the DMA API, just without translating at the DMAR or hardware level. Doing it in virtio is awkward, because virtio is involved at the device level and the driver level, but the translation might be entirely in between. I think a nicer long-term approach would be to have a way to ask the guest to set up a full 1:1 mapping for performance, but to still handle the case where the guest refuses to do so or where there's more than one translation layer involved. But I agree that this part shouldn't delay the other part of your series. --Andy
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Tue, Apr 19, 2016 at 9:09 AM, Michael S. Tsirkin wrote: > On Tue, Apr 19, 2016 at 09:02:14AM -0700, Andy Lutomirski wrote: >> On Tue, Apr 19, 2016 at 3:27 AM, Michael S. Tsirkin wrote: >> > On Mon, Apr 18, 2016 at 12:24:15PM -0700, Andy Lutomirski wrote: >> >> On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse >> >> wrote: >> >> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell >> >> > the truth, and even legacy kernels ought to cope with that. >> >> > FSVO 'ought to' where I suspect some of them will actually crash with a >> >> > NULL pointer dereference if there's no "catch-all" DMAR unit in the >> >> > tables, which puts it back into the same camp as ARM and Power. >> >> >> >> I think x86 may get a bit of a free pass here. AFAIK the QEMU IOMMU >> >> implementation on x86 has always been "experimental", so it just might >> >> be okay to change it in a way that causes some older kernels to OOPS. >> >> >> >> --Andy >> > >> > Since it's experimental, it might be OK to change *guest kernels* >> > such that they oops on old QEMU. >> > But guest kernels were not experimental - so we need a QEMU mode that >> > makes them work fine. The more functionality is available in this QEMU >> > mode, the betterm because it's going to be the default for a while. For >> > the same reason, it is preferable to also have new kernels not crash in >> > this mode. >> > >> >> People add QEMU features that need new guest kernels all time time. >> If you enable virtio-scsi and try to boot a guest that's too old, it >> won't work. So I don't see anything fundamentally wrong with saying >> that the non-experimental QEMU Q35 IOMMU mode won't boot if the guest >> kernel is too old. It might be annoying, since old kernels do work on >> actual Q35 hardware, but it at least seems to be that it might be >> okay. >> >> --Andy > > Yes but we need a mode that makes both old and new kernels work, and > that should be the default for a while. this is what the > IOMMU_PASSTHROUGH flag was about: old kernels ignore it and bypass DMA > API, new kernels go "oh compatibility mode" and bypass the IOMMU > within DMA API. I thought that PLATFORM served that purpose. Woudn't the host advertise PLATFORM support and, if the guest doesn't ack it, the host device would skip translation? Or is that problematic for vfio? > > -- > MST -- Andy Lutomirski AMA Capital Management, LLC
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin wrote: > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote: >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote: >> > >> > > I thought that PLATFORM served that purpose. Woudn't the host >> > > advertise PLATFORM support and, if the guest doesn't ack it, the host >> > > device would skip translation? Or is that problematic for vfio? >> > >> > Exactly that's problematic for security. >> > You can't allow guest driver to decide whether device skips security. >> >> Right. Because fundamentally, this *isn't* a property of the endpoint >> device, and doesn't live in virtio itself. >> >> It's a property of the platform IOMMU, and lives there. > > It's a property of the hypervisor virtio implementation, and lives there. It is now, but QEMU could, in principle, change the way it thinks about it so that virtio devices would use the QEMU DMA API but ask QEMU to pass everything through 1:1. This would be entirely invisible to guests but would make it be a property of the IOMMU implementation. At that point, maybe QEMU could find a (platform dependent) way to tell the guest what's going on. FWIW, as far as I can tell, PPC and SPARC really could, in principle, set up 1:1 mappings in the guest so that the virtio devices would work regardless of whether QEMU is ignoring the IOMMU or not -- I think the only obstacle is that the PPC and SPARC 1:1 mappings are currectly set up with an offset. I don't know too much about those platforms, but presumably the layout could be changed so that 1:1 really was 1:1. --Andy
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin wrote: > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote: >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin wrote: >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote: >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote: >> >> > >> >> > > I thought that PLATFORM served that purpose. Woudn't the host >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the host >> >> > > device would skip translation? Or is that problematic for vfio? >> >> > >> >> > Exactly that's problematic for security. >> >> > You can't allow guest driver to decide whether device skips security. >> >> >> >> Right. Because fundamentally, this *isn't* a property of the endpoint >> >> device, and doesn't live in virtio itself. >> >> >> >> It's a property of the platform IOMMU, and lives there. >> > >> > It's a property of the hypervisor virtio implementation, and lives there. >> >> It is now, but QEMU could, in principle, change the way it thinks >> about it so that virtio devices would use the QEMU DMA API but ask >> QEMU to pass everything through 1:1. This would be entirely invisible >> to guests but would make it be a property of the IOMMU implementation. >> At that point, maybe QEMU could find a (platform dependent) way to >> tell the guest what's going on. >> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle, >> set up 1:1 mappings in the guest so that the virtio devices would work >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the >> only obstacle is that the PPC and SPARC 1:1 mappings are currectly set >> up with an offset. I don't know too much about those platforms, but >> presumably the layout could be changed so that 1:1 really was 1:1. >> >> --Andy > > Sure. Do you see any reason why the decision to do this can't be > keyed off the virtio feature bit? I can think of three types of virtio host: a) virtio always bypasses the IOMMU. b) virtio never bypasses the IOMMU (unless DMAR tables or similar say it does) -- i.e. virtio works like any other device. c) virtio may bypass the IOMMU depending on what the guest asks it to do. If this is keyed off a virtio feature bit and anyone tries to implement (c), the vfio is going to have a problem. And, if it's keyed off a virtio feature bit, then (a) won't work on Xen or similar setups unless the Xen hypervisor adds a giant and probably unreliable kludge to support it. Meanwhile, 4.6-rc works fine under Xen on a default x86 QEMU configuration, and I'd really like to keep it that way. What could plausibly work using a virtio feature bit is for a device to say "hey, I'm a new device and I support the platform-defined IOMMU mechanism". This bit would be *set* on default IOMMU-less QEMU configurations and on physical virtio PCI cards. The guest could operate accordingly. I'm not sure I see a good way for feature negotiation to work the other direction, though. PPC and SPARC could only set this bit on emulated devices if they know that new guest kernels are in use. --Andy
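As a rough illustration of the negotiation being discussed, the guest side could gate its DMA-API use on such a bit. This is a sketch only, assuming <linux/virtio_config.h>; VIRTIO_F_PLATFORM_IOMMU is a placeholder name for the bit under debate (comparable to what later became VIRTIO_F_IOMMU_PLATFORM), not an existing constant.

/* Sketch: should the virtio ring go through the DMA API for this device? */
static bool vring_use_dma_api(struct virtio_device *vdev)
{
    /* Device promises to honour the platform IOMMU: translate through the
     * DMA API so any IOMMU (or Xen-style translation layer) is respected. */
    if (virtio_has_feature(vdev, VIRTIO_F_PLATFORM_IOMMU))
        return true;

    /* Legacy behaviour: the device uses guest-physical addresses directly,
     * so the driver bypasses the DMA API entirely. */
    return false;
}

The Xen objection in this thread is precisely that the legacy branch above is wrong whenever a translation layer sits between the driver and the device, regardless of what the device advertises.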
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Tue, Apr 19, 2016 at 1:54 PM, Michael S. Tsirkin wrote: > On Tue, Apr 19, 2016 at 01:27:29PM -0700, Andy Lutomirski wrote: >> On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin wrote: >> > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote: >> >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin >> >> wrote: >> >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote: >> >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote: >> >> >> > >> >> >> > > I thought that PLATFORM served that purpose. Woudn't the host >> >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the >> >> >> > > host >> >> >> > > device would skip translation? Or is that problematic for vfio? >> >> >> > >> >> >> > Exactly that's problematic for security. >> >> >> > You can't allow guest driver to decide whether device skips security. >> >> >> >> >> >> Right. Because fundamentally, this *isn't* a property of the endpoint >> >> >> device, and doesn't live in virtio itself. >> >> >> >> >> >> It's a property of the platform IOMMU, and lives there. >> >> > >> >> > It's a property of the hypervisor virtio implementation, and lives >> >> > there. >> >> >> >> It is now, but QEMU could, in principle, change the way it thinks >> >> about it so that virtio devices would use the QEMU DMA API but ask >> >> QEMU to pass everything through 1:1. This would be entirely invisible >> >> to guests but would make it be a property of the IOMMU implementation. >> >> At that point, maybe QEMU could find a (platform dependent) way to >> >> tell the guest what's going on. >> >> >> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle, >> >> set up 1:1 mappings in the guest so that the virtio devices would work >> >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the >> >> only obstacle is that the PPC and SPARC 1:1 mappings are currectly set >> >> up with an offset. I don't know too much about those platforms, but >> >> presumably the layout could be changed so that 1:1 really was 1:1. >> >> >> >> --Andy >> > >> > Sure. Do you see any reason why the decision to do this can't be >> > keyed off the virtio feature bit? >> >> I can think of three types of virtio host: >> >> a) virtio always bypasses the IOMMU. >> >> b) virtio never bypasses the IOMMU (unless DMAR tables or similar say >> it does) -- i.e. virtio works like any other device. >> >> c) virtio may bypass the IOMMU depending on what the guest asks it to do. > > d) some virtio devices bypass the IOMMU and some don't, > e.g. it's harder to support IOMMU with vhost. > > >> If this is keyed off a virtio feature bit and anyone tries to >> implement (c), the vfio is going to have a problem. And, if it's >> keyed off a virtio feature bit, then (a) won't work on Xen or similar >> setups unless the Xen hypervisor adds a giant and probably unreliable >> kludge to support it. Meanwhile, 4.6-rc works fine under Xen on a >> default x86 QEMU configuration, and I'd really like to keep it that >> way. >> >> What could plausibly work using a virtio feature bit is for a device >> to say "hey, I'm a new device and I support the platform-defined IOMMU >> mechanism". This bit would be *set* on default IOMMU-less QEMU >> configurations and on physical virtio PCI cards. > > And clear on xen. How? QEMU has no idea that the guest is running Xen.
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Apr 20, 2016 6:14 AM, "Michael S. Tsirkin" wrote: > > On Tue, Apr 19, 2016 at 02:07:01PM -0700, Andy Lutomirski wrote: > > On Tue, Apr 19, 2016 at 1:54 PM, Michael S. Tsirkin wrote: > > > On Tue, Apr 19, 2016 at 01:27:29PM -0700, Andy Lutomirski wrote: > > >> On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin > > >> wrote: > > >> > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote: > > >> >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin > > >> >> wrote: > > >> >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote: > > >> >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote: > > >> >> >> > > > >> >> >> > > I thought that PLATFORM served that purpose. Woudn't the host > > >> >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, > > >> >> >> > > the host > > >> >> >> > > device would skip translation? Or is that problematic for > > >> >> >> > > vfio? > > >> >> >> > > > >> >> >> > Exactly that's problematic for security. > > >> >> >> > You can't allow guest driver to decide whether device skips > > >> >> >> > security. > > >> >> >> > > >> >> >> Right. Because fundamentally, this *isn't* a property of the > > >> >> >> endpoint > > >> >> >> device, and doesn't live in virtio itself. > > >> >> >> > > >> >> >> It's a property of the platform IOMMU, and lives there. > > >> >> > > > >> >> > It's a property of the hypervisor virtio implementation, and lives > > >> >> > there. > > >> >> > > >> >> It is now, but QEMU could, in principle, change the way it thinks > > >> >> about it so that virtio devices would use the QEMU DMA API but ask > > >> >> QEMU to pass everything through 1:1. This would be entirely invisible > > >> >> to guests but would make it be a property of the IOMMU implementation. > > >> >> At that point, maybe QEMU could find a (platform dependent) way to > > >> >> tell the guest what's going on. > > >> >> > > >> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle, > > >> >> set up 1:1 mappings in the guest so that the virtio devices would work > > >> >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the > > >> >> only obstacle is that the PPC and SPARC 1:1 mappings are currectly set > > >> >> up with an offset. I don't know too much about those platforms, but > > >> >> presumably the layout could be changed so that 1:1 really was 1:1. > > >> >> > > >> >> --Andy > > >> > > > >> > Sure. Do you see any reason why the decision to do this can't be > > >> > keyed off the virtio feature bit? > > >> > > >> I can think of three types of virtio host: > > >> > > >> a) virtio always bypasses the IOMMU. > > >> > > >> b) virtio never bypasses the IOMMU (unless DMAR tables or similar say > > >> it does) -- i.e. virtio works like any other device. > > >> > > >> c) virtio may bypass the IOMMU depending on what the guest asks it to do. > > > > > > d) some virtio devices bypass the IOMMU and some don't, > > > e.g. it's harder to support IOMMU with vhost. > > > > > > > > >> If this is keyed off a virtio feature bit and anyone tries to > > >> implement (c), the vfio is going to have a problem. And, if it's > > >> keyed off a virtio feature bit, then (a) won't work on Xen or similar > > >> setups unless the Xen hypervisor adds a giant and probably unreliable > > >> kludge to support it. Meanwhile, 4.6-rc works fine under Xen on a > > >> default x86 QEMU configuration, and I'd really like to keep it that > > >> way. 
> > >> > > >> What could plausibly work using a virtio feature bit is for a device > > >> to say "hey, I'm a new device and I support the platform-defined IOMMU > > >> mechanism". This bit would be *set* on default IOMMU-less QEMU > > >> configurations and on physical virtio PCI cards. > > > > > > And clear on xen. > > > > How? QEMU has no idea that the guest is running Xen. > > I was under impression xen_enabled() is true in QEMU. > Am I wrong? I'd be rather surprised, given that QEMU would have to inspect the guest kernel to figure it out. I'm talking about Xen under QEMU. For example, if you feed QEMU a guest disk image that contains Fedora with the xen packages installed, you can boot it and get a grub menu. If you ask grub to boot Xen, you get Xen. If you ask grub to boot Linux directly, you don't get Xen. I assume xen_enabled is for QEMU under Xen, i.e. QEMU, running under Xen, supplying emulated devices to a Xen domU guest. Since QEMU is seeing the guest address space directly, this should be much the same as QEMU !xen_enabled -- if you boot plain Linux, everything works, but if you do Xen -> QEMU -> HVM guest running Xen PV -> Linux, then virtio drivers in the Xen PV Linux guest need to translate addresses. --Andy > > -- > MST
[Qemu-devel] [Bug 1574346] [NEW] TCG: mov to segment register is incorrectly emulated for AMD CPUs
Public bug reported: In TCG mode, the effect of: xorl %eax, %eax movl %eax, %gs is to mark the GS segment unusable and set its base to zero. After doing this, reading MSR_GS_BASE will return zero and using a GS prefix in long mode will treat the GS base as zero. This is correct for Intel CPUs but is incorrect for AMD CPUs. On an AMD CPU, writing 0 to %gs using mov, pop, or (I think) lgs will leave the base unchanged. To make it easier to use TCG to validate behavior on different CPUs, please consider changing the TCG behavior to match actual CPU behavior when emulating an AMD CPU. ** Affects: qemu Importance: Undecided Status: New -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1574346 Title: TCG: mov to segment register is incorrectly emulated for AMD CPUs Status in QEMU: New Bug description: In TCG mode, the effect of: xorl %eax, %eax movl %eax, %gs is to mark the GS segment unusable and set its base to zero. After doing this, reading MSR_GS_BASE will return zero and using a GS prefix in long mode will treat the GS base as zero. This is correct for Intel CPUs but is incorrect for AMD CPUs. On an AMD CPU, writing 0 to %gs using mov, pop, or (I think) lgs will leave the base unchanged. To make it easier to use TCG to validate behavior on different CPUs, please consider changing the TCG behavior to match actual CPU behavior when emulating an AMD CPU. To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1574346/+subscriptions
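A small user-space test that exhibits the difference is sketched below. It assumes FSGSBASE is usable from ring 3 (a recent kernel and a CPU or TCG configuration exposing the feature); the 0x12340000 base value is arbitrary.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t base = 0x12340000;

    asm volatile ("wrgsbase %0" :: "r" (base));
    asm volatile ("xorl %%eax, %%eax\n\t"
                  "movl %%eax, %%gs" ::: "eax", "memory");
    asm volatile ("rdgsbase %0" : "=r" (base));

    /* Intel (and current TCG): prints 0.  AMD hardware: prints 0x12340000. */
    printf("GS base after null selector load: 0x%llx\n",
           (unsigned long long)base);
    return 0;
}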
Re: [Qemu-devel] [PATCH V2 RFC] fixup! virtio: convert to use DMA api
On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel wrote: > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote: >> One correction: it's a feature of the device in the system. >> There could be a mix of devices bypassing and not >> bypassing the IOMMU. > > No, it really is not. A device can't chose to bypass the IOMMU. But the > IOMMU can chose to let the device bypass. So any fix here belongs > into the platform/iommu code too and not into some driver. > >> Sounds good. And a way to detect appropriate devices could >> be by looking at the feature flag, perhaps? > > Again, no! The way to detect that is to look into the iommu description > structures provided by the firmware. They provide everything necessary > to tell the iommu code which devices are not translated. > Except on PPC and SPARC. As far as I know, those are the only problematic platforms. Is it too late to *disable* QEMU's q35-iommu thingy until it can be fixed to report correct data in the DMAR tables? --Andy
Re: [Qemu-devel] [PATCH V2 RFC] fixup! virtio: convert to use DMA api
On Wed, Apr 27, 2016 at 7:38 AM, Michael S. Tsirkin wrote: > On Wed, Apr 27, 2016 at 07:31:43AM -0700, Andy Lutomirski wrote: >> On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel wrote: >> > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote: >> >> One correction: it's a feature of the device in the system. >> >> There could be a mix of devices bypassing and not >> >> bypassing the IOMMU. >> > >> > No, it really is not. A device can't chose to bypass the IOMMU. But the >> > IOMMU can chose to let the device bypass. So any fix here belongs >> > into the platform/iommu code too and not into some driver. >> > >> >> Sounds good. And a way to detect appropriate devices could >> >> be by looking at the feature flag, perhaps? >> > >> > Again, no! The way to detect that is to look into the iommu description >> > structures provided by the firmware. They provide everything necessary >> > to tell the iommu code which devices are not translated. >> > >> >> Except on PPC and SPARC. As far as I know, those are the only >> problematic platforms. >> >> Is it too late to *disable* QEMU's q35-iommu thingy until it can be >> fixed to report correct data in the DMAR tables? >> >> --Andy > > Meaning virtio or assigned devices? > For virtio - it's way too late since these are working configurations. > For assigned devices - they don't work on x86 so it doesn't have > to be disabled, it's safe to ignore. I mean actually prevent QEMU from running in q35-iommu mode with any virtio devices attached or maybe even turn off q35-iommu mode entirely [1]. Doesn't it require that the user literally pass the word "experimental" into QEMU right now? It did at some point IIRC. The reason I'm asking is that, other than q35-iommu, QEMU's virtio devices *don't* bypass the IOMMU except on PPC and SPARC, simply because there is no other configuration AFAICT that has virtio and and IOMMU. So maybe the right solution is to fix q35-iommu to use DMAR correctly (thus breaking q35-iommu users with older guest kernels, which hopefully don't actually exist) and to come up with a PPC- and SPARC-specific solution, or maybe OpenFirmware-specific solution, to handle PPC and SPARC down the road. [1] I'm pretty sure I emailed the QEMU list before q35-iommu ever showed up in a release asking the QEMU team to please not do that until this issue was resolved. Sadly, that email was ignored :( --Andy
Re: [Qemu-devel] [PATCH V2 RFC] fixup! virtio: convert to use DMA api
On Wed, Apr 27, 2016 at 7:54 AM, Michael S. Tsirkin wrote: > On Wed, Apr 27, 2016 at 07:43:07AM -0700, Andy Lutomirski wrote: >> On Wed, Apr 27, 2016 at 7:38 AM, Michael S. Tsirkin wrote: >> > On Wed, Apr 27, 2016 at 07:31:43AM -0700, Andy Lutomirski wrote: >> >> On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel wrote: >> >> > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote: >> >> >> One correction: it's a feature of the device in the system. >> >> >> There could be a mix of devices bypassing and not >> >> >> bypassing the IOMMU. >> >> > >> >> > No, it really is not. A device can't chose to bypass the IOMMU. But the >> >> > IOMMU can chose to let the device bypass. So any fix here belongs >> >> > into the platform/iommu code too and not into some driver. >> >> > >> >> >> Sounds good. And a way to detect appropriate devices could >> >> >> be by looking at the feature flag, perhaps? >> >> > >> >> > Again, no! The way to detect that is to look into the iommu description >> >> > structures provided by the firmware. They provide everything necessary >> >> > to tell the iommu code which devices are not translated. >> >> > >> >> >> >> Except on PPC and SPARC. As far as I know, those are the only >> >> problematic platforms. >> >> >> >> Is it too late to *disable* QEMU's q35-iommu thingy until it can be >> >> fixed to report correct data in the DMAR tables? >> >> >> >> --Andy >> > >> > Meaning virtio or assigned devices? >> > For virtio - it's way too late since these are working configurations. >> > For assigned devices - they don't work on x86 so it doesn't have >> > to be disabled, it's safe to ignore. >> >> I mean actually prevent QEMU from running in q35-iommu mode with any >> virtio devices attached or maybe even turn off q35-iommu mode entirely >> [1]. Doesn't it require that the user literally pass the word >> "experimental" into QEMU right now? It did at some point IIRC. >> >> The reason I'm asking is that, other than q35-iommu, QEMU's virtio >> devices *don't* bypass the IOMMU except on PPC and SPARC, simply >> because there is no other configuration AFAICT that has virtio and and >> IOMMU. So maybe the right solution is to fix q35-iommu to use DMAR >> correctly (thus breaking q35-iommu users with older guest kernels, >> which hopefully don't actually exist) and to come up with a PPC- and >> SPARC-specific solution, or maybe OpenFirmware-specific solution, to >> handle PPC and SPARC down the road. >> >> [1] I'm pretty sure I emailed the QEMU list before q35-iommu ever >> showed up in a release asking the QEMU team to please not do that >> until this issue was resolved. Sadly, that email was ignored :( >> >> --Andy > > Sorry, I didn't make myself clear. > Point is, QEMU is not the only virtio implementation out there. > So we can't know no virtio implementations have an IOMMU as long as > linux supports this IOMMU. > virtio always used physical addresses since it was born and if it > changes that it must do this in a way that does not break existing > users. Is there any non-QEMU virtio implementation can provide an IOMMU-bypassing virtio device on a platform that has a nontrivial IOMMU? --Andy
[Qemu-devel] Re: [PATCH-V4 0/7] virtio-9p:Introducing security model for VirtFS
Venkateswararao Jujjuri (JV) wrote: This patch series introduces the security model for VirtFS. Brief description of this patch series: It introduces two type of security models for VirtFS. They are: mapped and passthrough. The following is common to both security models. * Client's VFS determines/enforces the access control. Largely server should never return EACCESS. * Client sends gid/mode-bit information as part of creation only. Changes from V3 --- o Return NULL instead of exit(1) on failure in virtio_9p_init() o Capitalized sm_passthrough, sm_mappe o Added handling for EINTR for read/write. o Corrected default permissions for mkdir in mapped mode. o Added additional error handling. Changes from V2 --- o Removed warnings resulting from chmod/chown. o Added code to fail normally if secuirty_model option is not specified. Changes from V1 --- o Added support for chmod and chown. o Used chmod/chown to set credentials instead of setuid/setgid. o Fixed a bug where uid used instated of uid. Security model: mapped -- VirtFS server(QEMU) intercepts and maps all the file object create requests. Files on the fileserver will be created with QEMU's user credentials and the client-user's credentials are stored in extended attributes. During getattr() server extracts the client-user's credentials from extended attributes and sends to the client. Given that only the user space extended attributes are available to regular files, special files are created as regular files on the fileserver and the appropriate mode bits are stored in xattrs and will be extracted during getattr. If the extended attributes are missing, server sends back the filesystem stat() unaltered. This provision will make the files created on the fileserver usable to client. Points to be considered * Filesystem will be VirtFS'ized. Meaning, other filesystems may not understand the credentials of the files created under this model. How hard would it be to make this compatible with rsync's --fake-super? (--fake-super already does almost what you're doing, and if you make the formats compatible, then rsync could be used to translate. OTOH, rsyncing a VirtFS-ified filesystem to a remote --fake-super system might have odd side-effects.) --Andy
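To make the mapped model concrete: the server-side bookkeeping amounts to creating the file with QEMU's own credentials and storing the client's identity in user extended attributes, roughly as in the sketch below. The attribute names are illustrative, not necessarily the exact ones the VirtFS code uses.

#include <sys/types.h>
#include <sys/xattr.h>

static int store_client_credentials(const char *path, uid_t uid, gid_t gid,
                                    mode_t mode, dev_t rdev)
{
    /* The file itself belongs to QEMU on the fileserver; the client-visible
     * owner, group, mode bits and device numbers live only in user xattrs,
     * and getattr() reconstructs them from here. */
    if (setxattr(path, "user.virtfs.uid", &uid, sizeof(uid), 0) ||
        setxattr(path, "user.virtfs.gid", &gid, sizeof(gid), 0) ||
        setxattr(path, "user.virtfs.mode", &mode, sizeof(mode), 0) ||
        setxattr(path, "user.virtfs.rdev", &rdev, sizeof(rdev), 0))
        return -1;
    return 0;
}

rsync's --fake-super stores equivalent information in its own user.* attribute, which is what makes the compatibility question raised above interesting.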
[Qemu-devel] [Bug 597351] Re: Slow UDP performance with virtio device
** Attachment added: "udp-pong.c" http://launchpadlibrarian.net/50751155/udp-pong.c -- Slow UDP performance with virtio device https://bugs.launchpad.net/bugs/597351 You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. Status in QEMU: New Bug description: I'm working on an app that is very sensitive to round-trip latency between the guest and host, and qemu/kvm seems to be significantly slower than it needs to be. The attached program is a ping/pong over UDP. Call it with a single argument to start a listener/echo server on that port. With three arguments it becomes a counted "pinger" that will exit after a specified number of round trips for performance measurements. For example: $ gcc -o udp-pong udp-pong.c $ ./udp-pong 12345 & # start a listener on port 12345 $ time ./udp-pong 127.0.0.1 12345 100 # time a million round trips When run on the loopback device on a single machine (true on the host or within a guest), I get about 100k/s. When run across a port forward using "user" networking on qemu (or kvm, the performance is the same) and the default rtl8139 driver (both the host and guest are Ubuntu Lucid), I get about 10k/s. This seems very slow, but perhaps unavoidably so? When run in the same configuration using the "virtio" driver, I get only 2k/s. This is almost certainly a bug in the virtio driver, given that it's a paravirtualized device that is 5x slower than the "slow" hardware emulation. I get no meaningful change in performance between kvm/qemu.
[Qemu-devel] [Bug 597351] [NEW] Slow UDP performance with virtio device
Public bug reported: I'm working on an app that is very sensitive to round-trip latency between the guest and host, and qemu/kvm seems to be significantly slower than it needs to be. The attached program is a ping/pong over UDP. Call it with a single argument to start a listener/echo server on that port. With three arguments it becomes a counted "pinger" that will exit after a specified number of round trips for performance measurements. For example: $ gcc -o udp-pong udp-pong.c $ ./udp-pong 12345 & # start a listener on port 12345 $ time ./udp-pong 127.0.0.1 12345 100 # time a million round trips When run on the loopback device on a single machine (true on the host or within a guest), I get about 100k/s. When run across a port forward using "user" networking on qemu (or kvm, the performance is the same) and the default rtl8139 driver (both the host and guest are Ubuntu Lucid), I get about 10k/s. This seems very slow, but perhaps unavoidably so? When run in the same configuration using the "virtio" driver, I get only 2k/s. This is almost certainly a bug in the virtio driver, given that it's a paravirtualized device that is 5x slower than the "slow" hardware emulation. I get no meaningful change in performance between kvm/qemu. ** Affects: qemu Importance: Undecided Status: New -- Slow UDP performance with virtio device https://bugs.launchpad.net/bugs/597351 You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. Status in QEMU: New Bug description: I'm working on an app that is very sensitive to round-trip latency between the guest and host, and qemu/kvm seems to be significantly slower than it needs to be. The attached program is a ping/pong over UDP. Call it with a single argument to start a listener/echo server on that port. With three arguments it becomes a counted "pinger" that will exit after a specified number of round trips for performance measurements. For example: $ gcc -o udp-pong udp-pong.c $ ./udp-pong 12345 & # start a listener on port 12345 $ time ./udp-pong 127.0.0.1 12345 100 # time a million round trips When run on the loopback device on a single machine (true on the host or within a guest), I get about 100k/s. When run across a port forward using "user" networking on qemu (or kvm, the performance is the same) and the default rtl8139 driver (both the host and guest are Ubuntu Lucid), I get about 10k/s. This seems very slow, but perhaps unavoidably so? When run in the same configuration using the "virtio" driver, I get only 2k/s. This is almost certainly a bug in the virtio driver, given that it's a paravirtualized device that is 5x slower than the "slow" hardware emulation. I get no meaningful change in performance between kvm/qemu.
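Since the attachment is not inlined in this archive, the following is a minimal reconstruction (not the original udp-pong.c) of the echo/pinger pair the report describes; error handling is omitted and the round trips are timed externally with time(1), as in the example above.

#include <stdlib.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in a = { .sin_family = AF_INET };
    char buf[32] = "ping";

    if (argc == 2) {                              /* echo server on <port> */
        a.sin_addr.s_addr = htonl(INADDR_ANY);
        a.sin_port = htons(atoi(argv[1]));
        bind(s, (struct sockaddr *)&a, sizeof(a));
        for (;;) {
            struct sockaddr_in peer;
            socklen_t len = sizeof(peer);
            ssize_t n = recvfrom(s, buf, sizeof(buf), 0,
                                 (struct sockaddr *)&peer, &len);
            if (n > 0)
                sendto(s, buf, n, 0, (struct sockaddr *)&peer, len);
        }
    } else if (argc == 4) {                       /* counted pinger: <host> <port> <count> */
        long i, count = atol(argv[3]);
        a.sin_port = htons(atoi(argv[2]));
        inet_pton(AF_INET, argv[1], &a.sin_addr);
        for (i = 0; i < count; i++) {
            sendto(s, buf, sizeof(buf), 0, (struct sockaddr *)&a, sizeof(a));
            recv(s, buf, sizeof(buf), 0);
        }
    }
    return 0;
}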
[Qemu-devel] qemu-kvm problem with DOS/4GW extender and EMM386.EXE
pu = 0x pid = 0x1997 [ errorcode = 0x, virt = 0x 0001a072 ] 28471049807815 (+4000) VMENTRY vcpu = 0x pid = 0x1997 28471049811815 (+4000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x, rip = 0x 2a69 ] 0 (+ 0) CR_READ vcpu = 0x pid = 0x1997 [ CR# = 0, value = 0x 8011 ] 28471049815815 (+4000) VMENTRY vcpu = 0x pid = 0x1997 28471049818815 (+3000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x0010, rip = 0x 2a73 ] 0 (+ 0) LMSW vcpu = 0x pid = 0x1997 [ value = 0x8010 ] 28471049840815 (+ 22000) VMENTRY vcpu = 0x pid = 0x1997 28471049844815 (+4000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x007b, rip = 0x 1fd6 ] 0 (+ 0) IO_WRITE vcpu = 0x pid = 0x1997 [ port = 0x0020, size = 1 ] 28471049846815 (+2000) VMENTRY vcpu = 0x pid = 0x1997 28471049849815 (+3000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x007b, rip = 0x 1fd9 ] 0 (+ 0) IO_READ vcpu = 0x pid = 0x1997 [ port = 0x0020, size = 1 ] 28471049851815 (+2000) VMENTRY vcpu = 0x pid = 0x1997 28471049855815 (+4000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x, rip = 0x 2a69 ] 0 (+ 0) CR_READ vcpu = 0x pid = 0x1997 [ CR# = 0, value = 0x 8011 ] 28471049858815 (+3000) VMENTRY vcpu = 0x pid = 0x1997 28471049861815 (+3000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x0010, rip = 0x 2a73 ] 0 (+ 0) LMSW vcpu = 0x pid = 0x1997 [ value = 0x8010 ] 28471049882815 (+ 21000) VMENTRY vcpu = 0x pid = 0x1997 28471049885815 (+3000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x007b, rip = 0x 1fd6 ] 0 (+ 0) IO_WRITE vcpu = 0x pid = 0x1997 [ port = 0x0020, size = 1 ] 28471049887815 (+2000) VMENTRY vcpu = 0x pid = 0x1997 28471049890815 (+3000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x007b, rip = 0x 1fd9 ] 0 (+ 0) IO_READ vcpu = 0x pid = 0x1997 [ port = 0x0020, size = 1 ] 28471049892815 (+2000) VMENTRY vcpu = 0x pid = 0x1997 28471049896815 (+4000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x, rip = 0x 2a69 ] 0 (+ 0) CR_READ vcpu = 0x pid = 0x1997 [ CR# = 0, value = 0x 8011 ] 28471049900815 (+4000) VMENTRY vcpu = 0x pid = 0x1997 28471049903815 (+3000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x0010, rip = 0x 2a73 ] 0 (+ 0) LMSW vcpu = 0x pid = 0x1997 [ value = 0x8010 ] 28471049933815 (+ 3) VMENTRY vcpu = 0x pid = 0x1997 28471049936815 (+3000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x007b, rip = 0x 1fd6 ] To me it appears EMM386.EXE enables paging, and the DOS/4GW DOS extender tries to manipulate the PE bit in CR0 with LMSW but doesn't succeed. These programs appear to work fine in VMWare and on real hardware. Any ideas on how to make EMM386.EXE and the DOS/$GW extender work in qemu-kvm? Regards, Andy
Re: [Qemu-devel] qemu-kvm problem with DOS/4GW extender and EMM386.EXE
On Wed, 2010-05-12 at 00:09 +0300, Mohammed Gamal wrote: > On Tue, May 11, 2010 at 11:56 PM, Andy Walls wrote: > > Running an MS-DOS 6.22 image with qemu-kvm on a RedHat Linux OS, I > > noticed the guest OS becomes hung and my dmesg gets spammed with > > > >set_cr0: #GP, set PG flag with a clear PE flag > > > > That message appears to be the linux kernel's kvm emulator griping about > > Paging Enable bit being enabled while the Protection Enable bit is set > > for real mode. (The Intel manual says this should be a protection > > fault). > > > > The program that causes this has the DOS/4GW DOS extender runtime > > compiled into it. > > > > I found that when I don't load the EMM386.EXE memory manager, the > > problem doesn't occur. > > > > Here's a kvmtrace segment of when things are not working: > > > > 0 (+ 0) IO_WRITE vcpu = 0x pid = 0x1997 [ port > > = 0x0070, size = 1 ] > > 28471049668815 (+4000) VMENTRY vcpu = 0x pid = > > 0x1997 > > 28471049671815 (+3000) VMEXITvcpu = 0x pid = > > 0x1997 [ exitcode = 0x004e, rip = 0x 2a18 ] > > 0 (+ 0) PAGE_FAULTvcpu = 0x pid = 0x1997 [ > > errorcode = 0x, virt = 0x 0001ba28 ] > > 28471049675815 (+4000) VMENTRY vcpu = 0x pid = > > 0x1997 > > 28471049678815 (+3000) VMEXITvcpu = 0x pid = > > 0x1997 [ exitcode = 0x004e, rip = 0x 0334 ] > > 0 (+ 0) PAGE_FAULTvcpu = 0x pid = 0x1997 [ > > errorcode = 0x, virt = 0x 00019344 ] > > 28471049681815 (+3000) VMENTRY vcpu = 0x pid = > > 0x1997 > > 28471049685815 (+4000) VMEXITvcpu = 0x pid = > > 0x1997 [ exitcode = 0x, rip = 0x 02a7 ] > > 0 (+ 0) CR_READ vcpu = 0x pid = 0x1997 [ CR# > > = 0, value = 0x 8011 ] > > 28471049688815 (+3000) VMENTRY vcpu = 0x pid = > > 0x1997 > > 28471049691815 (+3000) VMEXITvcpu = 0x pid = > > 0x1997 [ exitcode = 0x0010, rip = 0x 02ae ] > > 0 (+ 0) LMSW vcpu = 0x pid = 0x1997 [ > > value = 0x8011 ] > > 28471049696815 (+5000) VMENTRY vcpu = 0x pid = > > 0x1997 > > 28471049699815 (+3000) VMEXITvcpu = 0x pid = > > 0x1997 [ exitcode = 0x004e, rip = 0x 5593 ] > > 0 (+ 0) PAGE_FAULTvcpu = 0x pid = 0x1997 [ > > errorcode = 0x, virt = 0x 000262e3 ] > > 28471049703815 (+4000) VMENTRY vcpu = 0x pid = > > 0x1997 > > 28471049706815 (+3000) VMEXITvcpu = 0x pid = > > 0x1997 [ exitcode = 0x004e, rip = 0x 44d6 ] > > 0 (+ 0) PAGE_FAULTvcpu = 0x pid = 0x1997 [ > > errorcode = 0x, virt = 0x 00025226 ] > > 28471049709815 (+3000) VMENTRY vcpu = 0x pid = > > 0x1997 > > 28471049713815 (+4000) VMEXITvcpu = 0x pid = > > 0x1997 [ exitcode = 0x004e, rip = 0x 55c0 ] > > 0 (+ 0) PAGE_FAULTvcpu = 0x pid = 0x1997 [ > > errorcode = 0x0002, virt = 0x 00024f79 ] > > 28471049717815 (+4000) VMENTRY vcpu = 0x pid = > > 0x1997 > > 28471049721815 (+4000) VMEXITvcpu = 0x pid = > > 0x1997 [ exitcode = 0x, rip = 0x 2a69 ] > > 0 (+ 0) CR_READ vcpu = 0x pid = 0x1997 [ CR# > > = 0, value = 0x 8011 ] > > 28471049723815 (+2000) VMENTRY vcpu = 0x pid = > > 0x1997 > > 28471049726815 (+3000) VMEXITvcpu = 0x pid = > > 0x1997 [ exitcode = 0x0010, rip = 0x 2a73 ] > > 0 (+ 0) LMSW vcpu = 0x pid = 0x1997 [ > > value = 0x8010 ] > > 28471049781815 (+ 55000) VMENTRY vcpu = 0x pid = > > 0x1997 > > 28471049784815 (+3000) VMEXITvcpu = 0x pid = > > 0x1997 [ exitcode = 0x004e, rip = 0x 1fb8 ] > > 0 (+ 0) PAGE_FAULTvcpu = 0x pid = 0x1997 [ > > errorcode = 0x,
[Qemu-devel] Qemu-win
Hello, sorry, my English is not very good, but I have a problem: when I start QEMU with this command: qemu.exe -hda suse.img -vnc 55 -k de I get this message: Could not load keymap files '/C/Program Files/Qemu The "de" file is there in c:\program files\qemu\keymap\. I have looked in many forums but can't find anything. Can you help me? by Andy
Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
On Tue, Oct 25, 2022, at 8:13 AM, Chao Peng wrote: > This new KVM exit allows userspace to handle memory-related errors. It > indicates an error happens in KVM at guest memory range [gpa, gpa+size). > The flags includes additional information for userspace to handle the > error. Currently bit 0 is defined as 'private memory' where '1' > indicates error happens due to private memory access and '0' indicates > error happens due to shared memory access. > > When private memory is enabled, this new exit will be used for KVM to > exit to userspace for shared <-> private memory conversion in memory > encryption usage. In such usage, typically there are two kind of memory > conversions: > - explicit conversion: happens when guest explicitly calls into KVM > to map a range (as private or shared), KVM then exits to userspace > to perform the map/unmap operations. > - implicit conversion: happens in KVM page fault handler where KVM > exits to userspace for an implicit conversion when the page is in a > different state than requested (private or shared). > > Suggested-by: Sean Christopherson > Co-developed-by: Yu Zhang > Signed-off-by: Yu Zhang > Signed-off-by: Chao Peng > --- > Documentation/virt/kvm/api.rst | 23 +++ > include/uapi/linux/kvm.h | 9 + > 2 files changed, 32 insertions(+) > > diff --git a/Documentation/virt/kvm/api.rst > b/Documentation/virt/kvm/api.rst > index f3fa75649a78..975688912b8c 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -6537,6 +6537,29 @@ array field represents return values. The > userspace should update the return > values of SBI call before resuming the VCPU. For more details on > RISC-V SBI > spec refer, https://github.com/riscv/riscv-sbi-doc. > > +:: > + > + /* KVM_EXIT_MEMORY_FAULT */ > + struct { > + #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0) > + __u32 flags; > + __u32 padding; > + __u64 gpa; > + __u64 size; > + } memory; > + Would it make sense to also have a field for the access type (read, write, execute, etc)? I realize that shared <-> private conversion doesn't strictly need this, but it seems like it could be useful for logging failures and also for avoiding a second immediate fault if the type gets converted but doesn't have the right protection yet. (Obviously, if this were changed, KVM would need the ability to report that it doesn't actually know the mode.) --Andy
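For context, the userspace side of this exit would look roughly like the fragment below, using the field layout from this patch; convert_range_to_private()/convert_range_to_shared() are hypothetical helpers that would perform the actual conversion (e.g. via the conversion ioctls elsewhere in the series).

/* Sketch of a VMM handler for the proposed exit (not a complete VMM). */
static void handle_memory_fault(struct kvm_run *run)
{
    if (run->exit_reason != KVM_EXIT_MEMORY_FAULT)
        return;

    /* Bit 0 of flags says which way the implicit/explicit conversion goes. */
    if (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)
        convert_range_to_private(run->memory.gpa, run->memory.size);
    else
        convert_range_to_shared(run->memory.gpa, run->memory.size);
}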
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On 7/21/22 14:19, Sean Christopherson wrote: On Thu, Jul 21, 2022, Gupta, Pankaj wrote: I view it as a performance problem because nothing stops KVM from copying from userspace into the private fd during the SEV ioctl(). What's missing is the ability for userspace to directly initialze the private fd, which may or may not avoid an extra memcpy() depending on how clever userspace is. Can you please elaborate more what you see as a performance problem? And possible ways to solve it? Oh, I'm not saying there actually _is_ a performance problem. What I'm saying is that in-place encryption is not a functional requirement, which means it's purely an optimization, and thus we should other bother supporting in-place encryption _if_ it would solve a performane bottleneck. Even if we end up having a performance problem, I think we need to understand the workloads that we want to optimize before getting too excited about designing a speedup. In particular, there's (depending on the specific technology, perhaps, and also architecture) a possible tradeoff between trying to reduce copying and trying to reduce unmapping and the associated flushes. If a user program maps an fd, populates it, and then converts it in place into private memory (especially if it doesn't do it in a single shot), then that memory needs to get unmapped both from the user mm and probably from the kernel direct map. On the flip side, it's possible to imagine an ioctl that does copy-and-add-to-private-fd that uses a private mm and doesn't need any TLB IPIs. All of this is to say that trying to optimize right now seems quite premature to me.
[PATCH] hw/vhost-user-blk: turn on VIRTIO_BLK_F_SIZE_MAX feature for virtio blk device
Turn on the pre-defined feature VIRTIO_BLK_F_SIZE_MAX for the virtio blk device so that guest DMA requests cannot grow larger than the hardware spec allows. Signed-off-by: Andy Pei --- hw/block/vhost-user-blk.c | 1 + 1 file changed, 1 insertion(+) diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c index ba13cb8..eb1264a 100644 --- a/hw/block/vhost-user-blk.c +++ b/hw/block/vhost-user-blk.c @@ -252,6 +252,7 @@ static uint64_t vhost_user_blk_get_features(VirtIODevice *vdev, VHostUserBlk *s = VHOST_USER_BLK(vdev); /* Turn on pre-defined features */ +virtio_add_feature(&features, VIRTIO_BLK_F_SIZE_MAX); virtio_add_feature(&features, VIRTIO_BLK_F_SEG_MAX); virtio_add_feature(&features, VIRTIO_BLK_F_GEOMETRY); virtio_add_feature(&features, VIRTIO_BLK_F_TOPOLOGY); -- 1.8.3.1
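For reference, a sketch of roughly how a Linux guest consumes this bit once the backend offers it, simplified from the guest virtio-blk probe path; it assumes the usual <linux/virtio_config.h>, <linux/virtio_blk.h> and <linux/blkdev.h> definitions.

static void virtblk_apply_size_max(struct virtio_device *vdev,
                                   struct request_queue *q)
{
    u32 size_max;

    if (!virtio_has_feature(vdev, VIRTIO_BLK_F_SIZE_MAX))
        return;

    /* size_max is the device's limit on any single segment, so cap the
     * block layer's segment size and DMA requests stay within spec. */
    virtio_cread(vdev, struct virtio_blk_config, size_max, &size_max);
    blk_queue_max_segment_size(q, size_max);
}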
Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
On 5/19/22 08:37, Chao Peng wrote: Extend the memslot definition to provide guest private memory through a file descriptor(fd) instead of userspace_addr(hva). Such guest private memory(fd) may never be mapped into userspace so no userspace_addr(hva) can be used. Instead add another two new fields (private_fd/private_offset), plus the existing memory_size to represent the private memory range. Such memslot can still have the existing userspace_addr(hva). When use, a single memslot can maintain both private memory through private fd(private_fd/private_offset) and shared memory through hva(userspace_addr). A GPA is considered private by KVM if the memslot has private fd and that corresponding page in the private fd is populated, otherwise, it's shared. So this is a strange API and, IMO, a layering violation. I want to make sure that we're all actually on board with making this a permanent part of the Linux API. Specifically, we end up with a multiplexing situation as you have described. For a given GPA, there are *two* possible host backings: an fd-backed one (from the fd, which is private for now might might end up potentially shared depending on future extensions) and a VMA-backed one. The selection of which one backs the address is made internally by whatever backs the fd. This, IMO, a clear layering violation. Normally, an fd has an associated address space, and pages in that address space can have contents, can be holes that appear to contain all zeros, or could have holes that are inaccessible. If you try to access a hole, you get whatever is in the hole. But now, with this patchset, the fd is more of an overlay and you get *something else* if you try to access through the hole. This results in operations on the fd bubbling up to the KVM mapping in what is, IMO, a strange way. If the user punches a hole, KVM has to modify its mappings such that the GPA goes to whatever VMA may be there. (And update the RMP, the hypervisor's tables, or whatever else might actually control privateness.) Conversely, if the user does fallocate to fill a hole, the guest mapping *to an unrelated page* has to be zapped so that the fd's page shows up. And the RMP needs updating, etc. I am lukewarm on this for a few reasons. 1. This is weird. AFAIK nothing else works like this. Obviously this is subjecting, but "weird" and "layering violation" sometimes translate to "problematic locking". 2. fd-backed private memory can't have normal holes. If I make a memfd, punch a hole in it, and mmap(MAP_SHARED) it, I end up with a page that reads as zero. If I write to it, the page gets allocated. But with this new mechanism, if I punch a hole and put it in a memslot, reads and writes go somewhere else. So what if I actually wanted lazily allocated private zeros? 2b. For a hypothetical future extension in which an fd can also have shared pages (for conversion, for example, or simply because the fd backing might actually be more efficient than indirecting through VMAs and therefore get used for shared memory or entirely-non-confidential VMs), lazy fd-backed zeros sound genuinely useful. 3. TDX hardware capability is not fully exposed. TDX can have a private page and a shared page at GPAs that differ only by the private bit. Sure, no one plans to use this today, but baking this into the user ABI throws away half the potential address space. 3b. Any software solution that works like TDX (which IMO seems like an eminently reasonable design to me) has the same issue. 
The alternative would be to have some kind of separate table or bitmap (part of the memslot?) that tells KVM whether a GPA should map to the fd. What do you all think?
Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
On Fri, May 20, 2022, at 11:31 AM, Sean Christopherson wrote: > But a dedicated KVM ioctl() to add/remove shared ranges would be easy > to implement > and wouldn't necessarily even need to interact with the memslots. It > could be a > consumer of memslots, e.g. if we wanted to disallow registering regions > without an > associated memslot, but I think we'd want to avoid even that because > things will > get messy during memslot updates, e.g. if dirty logging is toggled or a > shared > memory region is temporarily removed then we wouldn't want to destroy > the tracking. > > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray > should be far more efficient. > > One benefit to explicitly tracking this in KVM is that it might be > useful for > software-only protected VMs, e.g. KVM could mark a region in the XArray > as "pending" > based on guest hypercalls to share/unshare memory, and then complete > the transaction > when userspace invokes the ioctl() to complete the share/unshare. That makes sense. If KVM goes this route, perhaps there the allowed states for a GPA should include private, shared, and also private-and-shared. Then anyone who wanted to use the same masked GPA for shared and private on TDX could do so if they wanted to.
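As a rough sketch of the XArray-based tracking suggested here, including the private-and-shared state mentioned above; the names and the default-to-shared policy are made up for illustration and not taken from any posted patch.

#include <linux/types.h>
#include <linux/xarray.h>

enum gpa_state { GPA_SHARED = 1, GPA_PRIVATE = 2, GPA_BOTH = 3 };

struct kvm_gpa_attrs {
    struct xarray states;          /* index: gfn, value: enum gpa_state */
};

static void init_gpa_attrs(struct kvm_gpa_attrs *a)
{
    xa_init(&a->states);
}

static int set_gpa_state(struct kvm_gpa_attrs *a, u64 gfn, enum gpa_state s)
{
    return xa_err(xa_store(&a->states, gfn, xa_mk_value(s), GFP_KERNEL));
}

static enum gpa_state get_gpa_state(struct kvm_gpa_attrs *a, u64 gfn)
{
    void *entry = xa_load(&a->states, gfn);

    /* Untracked GFNs default to shared in this sketch. */
    return entry ? xa_to_value(entry) : GPA_SHARED;
}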
Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE
On 1/18/22 05:21, Chao Peng wrote: From: "Kirill A. Shutemov" Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of the file is inaccessible from userspace through ordinary MMU access (e.g., read/write/mmap). However, the file content can be accessed via a different mechanism (e.g. KVM MMU) indirectly. It provides semantics required for KVM guest private memory support that a file descriptor with this seal set is going to be used as the source of guest memory in confidential computing environments such as Intel TDX/AMD SEV but may not be accessible from host userspace. At this time only shmem implements this seal. I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE essentially transmutes a memfd into a different type of object. While this can apparently be done successfully and without races (as in this code), it's at least awkward. I think that either creating a special inaccessible memfd should be a single operation that create the correct type of object or there should be a clear justification for why it's a two-step process. (Imagine if the way to create an eventfd would be to call timerfd_create() and then do a special fcntl to turn it into an eventfd but only if it's not currently armed. This would be weird.)
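To make the two-step shape of the API concrete, this is roughly what creating such an object looks like from userspace with the series applied. F_SEAL_INACCESSIBLE exists only in these patches, so the fallback value below is a placeholder rather than an upstream constant, and whether the seal may be applied to an already-populated file is exactly the open question in this thread.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef F_SEAL_INACCESSIBLE
#define F_SEAL_INACCESSIBLE 0x0020   /* placeholder value for illustration */
#endif

/* Step 1: an ordinary sealable memfd; step 2: transmute it by sealing. */
int create_private_backing(size_t size)
{
    int fd = memfd_create("guest-private", MFD_ALLOW_SEALING);

    if (fd < 0 || ftruncate(fd, size) < 0)
        goto err;

    /* After this, ordinary read/write/mmap from userspace fail; only an
     * in-kernel consumer such as KVM can reach the contents. */
    if (fcntl(fd, F_ADD_SEALS, F_SEAL_INACCESSIBLE) < 0)
        goto err;

    return fd;
err:
    if (fd >= 0)
        close(fd);
    return -1;
}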
Re: [PATCH v4 04/12] mm/shmem: Support memfile_notifier
On 1/18/22 05:21, Chao Peng wrote: It maintains a memfile_notifier list in shmem_inode_info structure and implements memfile_pfn_ops callbacks defined by memfile_notifier. It then exposes them to memfile_notifier via shmem_get_memfile_notifier_info. We use SGP_NOALLOC in shmem_get_lock_pfn since the pages should be allocated by userspace for private memory. If there is no pages allocated at the offset then error should be returned so KVM knows that the memory is not private memory. Signed-off-by: Kirill A. Shutemov Signed-off-by: Chao Peng static int memfile_get_notifier_info(struct inode *inode, struct memfile_notifier_list **list, struct memfile_pfn_ops **ops) { - return -EOPNOTSUPP; + int ret = -EOPNOTSUPP; +#ifdef CONFIG_SHMEM + ret = shmem_get_memfile_notifier_info(inode, list, ops); +#endif + return ret; } +int shmem_get_memfile_notifier_info(struct inode *inode, + struct memfile_notifier_list **list, + struct memfile_pfn_ops **ops) +{ + struct shmem_inode_info *info; + + if (!shmem_mapping(inode->i_mapping)) + return -EINVAL; + + info = SHMEM_I(inode); + *list = &info->memfile_notifiers; + if (ops) + *ops = &shmem_pfn_ops; + + return 0; I can't wrap my head around exactly who is supposed to call these functions and when, but there appears to be a missing check that the inode is actually a shmem inode. What is this code trying to do? It's very abstract.
Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE
On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote:
> On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote:
>> On 1/18/22 05:21, Chao Peng wrote:
>> > From: "Kirill A. Shutemov"
>> >
>> > Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of the file is inaccessible from userspace through ordinary MMU access (e.g., read/write/mmap). However, the file content can be accessed via a different mechanism (e.g. KVM MMU) indirectly.
>> >
>> > It provides semantics required for KVM guest private memory support that a file descriptor with this seal set is going to be used as the source of guest memory in confidential computing environments such as Intel TDX/AMD SEV but may not be accessible from host userspace.
>> >
>> > At this time only shmem implements this seal.
>> >
>> I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE essentially transmutes a memfd into a different type of object. While this can apparently be done successfully and without races (as in this code), it's at least awkward. I think that either creating a special inaccessible memfd should be a single operation that creates the correct type of object or there should be a clear justification for why it's a two-step process.
>
> Now one justification may be from Steven's comment to patch-00: for ARM usage it can be used with creating a normal memfd, (partially) populate it with initial guest memory content (e.g. firmware), and then F_SEAL_INACCESSIBLE it just before the first launch of the guest in KVM (definitely the current code needs to be changed to support that).

Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right? So this won't work.

In any case, the whole confidential VM initialization story is a bit muddy. From the earlier emails, it sounds like ARM expects the host to fill in guest memory and measure it. From my recollection of Intel's scheme (which may well be wrong, and I could easily be confusing it with SGX), TDX instead measures what is essentially a transcript of the series of operations that initializes the VM. These are fundamentally not the same thing even if they accomplish the same end goal. For TDX, we unavoidably need an operation (ioctl or similar) that initializes things according to the VM's instructions, and ARM ought to be able to use roughly the same mechanism.

Also, if we ever get fancy and teach the page allocator about memory with reduced directmap permissions, it may well be more efficient for userspace to shove data into a memfd via ioctl than it is to mmap it and write the data.
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Mon, Apr 25, 2022 at 1:31 PM Sean Christopherson wrote: > > On Mon, Apr 25, 2022, Andy Lutomirski wrote: > > > > > > On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote: > > > On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote: > > >> > > > > >> > > >> 2. Bind the memfile to a VM (or at least to a VM technology). Now it's > > >> in > > >> the initial state appropriate for that VM. > > >> > > >> For TDX, this completely bypasses the cases where the data is > > >> prepopulated > > >> and TDX can't handle it cleanly. > > I believe TDX can handle this cleanly, TDH.MEM.PAGE.ADD doesn't require that > the > source and destination have different HPAs. There's just no pressing need to > support such behavior because userspace is highly motivated to keep the > initial > image small for performance reasons, i.e. burning a few extra pages while > building > the guest is a non-issue. Following up on this, rather belatedly. After re-reading the docs, TDX can populate guest memory using TDH.MEM.PAGE.ADD, but see Intel® TDX Module Base Spec v1.5, section 2.3, step D.4 substeps 1 and 2 here: https://www.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1.5-base-spec-348549001.pdf For each TD page: 1. The host VMM specifies a TDR as a parameter and calls the TDH.MEM.PAGE.ADD function. It copies the contents from the TD image page into the target TD page which is encrypted with the TD ephemeral key. TDH.MEM.PAGE.ADD also extends the TD measurement with the page GPA. 2. The host VMM extends the TD measurement with the contents of the new page by calling the TDH.MR.EXTEND function on each 256- byte chunk of the new TD page. So this is a bit like SGX. There is a specific series of operations that have to be done in precisely the right order to reproduce the intended TD measurement. Otherwise the guest will boot and run until it tries to get a report and then it will have a hard time getting anyone to believe its report. So I don't think the host kernel can get away with host userspace just providing pre-populated memory. Userspace needs to tell the host kernel exactly what sequence of adds, extends, etc to perform and in what order, and the host kernel needs to do precisely what userspace asks it to do. "Here's the contents of memory" doesn't cut it unless the tooling that builds the guest image matches the exact semantics that the host kernel provides. --Andy
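To make the ordering constraint explicit, the build flow that userspace would have to describe to the host kernel looks roughly like the sketch below. The names tdx_page_add() and tdx_mr_extend() are invented stand-ins for whatever KVM ioctl()s end up wrapping the TDH.MEM.PAGE.ADD and TDH.MR.EXTEND SEAMCALLs, and struct td / struct td_image_page are illustrative; only the ordering is the point.

/* Illustrative only: the TD measurement depends on doing exactly this, in order. */
#include <stddef.h>
#include <stdint.h>

#define TD_PAGE_SIZE	4096
#define EXTEND_CHUNK	256

struct td;				/* opaque handle to the TD being built */
struct td_image_page {
	uint64_t gpa;
	const void *data;
};

int tdx_page_add(struct td *td, uint64_t gpa, const void *data);	/* hypothetical */
int tdx_mr_extend(struct td *td, uint64_t gpa);				/* hypothetical */

static int build_td_image(struct td *td, const struct td_image_page *pages,
			  size_t nr_pages)
{
	for (size_t i = 0; i < nr_pages; i++) {
		/* Copy + encrypt the page; the measurement is extended with its GPA. */
		int ret = tdx_page_add(td, pages[i].gpa, pages[i].data);

		if (ret)
			return ret;

		/* Extend the measurement with the contents, 256 bytes at a time. */
		for (size_t off = 0; off < TD_PAGE_SIZE; off += EXTEND_CHUNK) {
			ret = tdx_mr_extend(td, pages[i].gpa + off);
			if (ret)
				return ret;
		}
	}
	return 0;
}

Reordering the adds, skipping an extend, or extending before adding produces a different measurement, which is why "here's the contents of memory" is not enough information for the host kernel.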
Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
On Tue, Jun 14, 2022 at 12:32 AM Chao Peng wrote:
>
> On Thu, Jun 09, 2022 at 08:29:06PM +0000, Sean Christopherson wrote:
> > On Wed, Jun 08, 2022, Vishal Annapurve wrote:
> >
> > One argument is that userspace can simply rely on cgroups to detect misbehaving guests, but (a) those types of OOMs will be a nightmare to debug and (b) an OOM kill from the host is typically considered a _host_ issue and will be treated as a missed SLO.
> >
> > An idea for handling this in the kernel without too much complexity would be to add F_SEAL_FAULT_ALLOCATIONS (terrible name) that would prevent page faults from allocating pages, i.e. holes can only be filled by an explicit fallocate(). Minor faults, e.g. due to NUMA balancing stupidity, and major faults due to swap would still work, but writes to previously unreserved/unallocated memory would get a SIGSEGV on something it has mapped. That would allow the userspace VMM to prevent unintentional allocations without having to coordinate unmapping/remapping across multiple processes.
>
> Since this is mainly for shared memory and the motivation is catching misbehaved access, can we use mprotect(PROT_NONE) for this? We can mark those ranges backed by the private fd as PROT_NONE during the conversion so subsequent misbehaved accesses will be blocked instead of causing double allocation silently.

This patch series is fairly close to implementing a rather more efficient solution. I'm not familiar enough with hypervisor userspace to really know if this would work, but: what if shared guest memory could also be file-backed, either in the same fd or with a second fd covering the shared portion of a memslot? This would allow changes to the backing store (punching holes, etc.) to be done without mmap_lock or host-userspace TLB flushes. Depending on what the guest is doing with its shared memory, userspace might need the memory mapped or it might not.

--Andy
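As an aside, the mprotect() suggestion above would look something like the following in the VMM. This is just ordinary mprotect() usage (the function names and page-aligned arguments are assumptions for illustration), with the caveat raised later in the thread that mprotect() manipulates VMAs and so shares munmap()'s downsides.

/* Illustrative VMM-side handling of a shared<->private conversion request.
 * shared_base is the start of the mmap()ed shared backing; offset and len
 * are assumed to be page-aligned. */
#include <sys/mman.h>
#include <err.h>

static void on_convert_to_private(void *shared_base, size_t offset, size_t len)
{
	/* Block further userspace access: a later stray access gets SIGSEGV
	 * instead of silently allocating a second (shared) copy of the page. */
	if (mprotect((char *)shared_base + offset, len, PROT_NONE))
		err(1, "mprotect");
}

static void on_convert_to_shared(void *shared_base, size_t offset, size_t len)
{
	if (mprotect((char *)shared_base + offset, len, PROT_READ | PROT_WRITE))
		err(1, "mprotect");
}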
Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
On Tue, Jun 14, 2022 at 12:09 PM Sean Christopherson wrote: > > On Tue, Jun 14, 2022, Andy Lutomirski wrote: > > On Tue, Jun 14, 2022 at 12:32 AM Chao Peng > > wrote: > > > > > > On Thu, Jun 09, 2022 at 08:29:06PM +, Sean Christopherson wrote: > > > > On Wed, Jun 08, 2022, Vishal Annapurve wrote: > > > > > > > > One argument is that userspace can simply rely on cgroups to detect > > > > misbehaving > > > > guests, but (a) those types of OOMs will be a nightmare to debug and > > > > (b) an OOM > > > > kill from the host is typically considered a _host_ issue and will be > > > > treated as > > > > a missed SLO. > > > > > > > > An idea for handling this in the kernel without too much complexity > > > > would be to > > > > add F_SEAL_FAULT_ALLOCATIONS (terrible name) that would prevent page > > > > faults from > > > > allocating pages, i.e. holes can only be filled by an explicit > > > > fallocate(). Minor > > > > faults, e.g. due to NUMA balancing stupidity, and major faults due to > > > > swap would > > > > still work, but writes to previously unreserved/unallocated memory > > > > would get a > > > > SIGSEGV on something it has mapped. That would allow the userspace VMM > > > > to prevent > > > > unintentional allocations without having to coordinate > > > > unmapping/remapping across > > > > multiple processes. > > > > > > Since this is mainly for shared memory and the motivation is catching > > > misbehaved access, can we use mprotect(PROT_NONE) for this? We can mark > > > those range backed by private fd as PROT_NONE during the conversion so > > > subsequence misbehaved accesses will be blocked instead of causing double > > > allocation silently. > > PROT_NONE, a.k.a. mprotect(), has the same vma downsides as munmap(). > > > This patch series is fairly close to implementing a rather more > > efficient solution. I'm not familiar enough with hypervisor userspace > > to really know if this would work, but: > > > > What if shared guest memory could also be file-backed, either in the > > same fd or with a second fd covering the shared portion of a memslot? > > This would allow changes to the backing store (punching holes, etc) to > > be some without mmap_lock or host-userspace TLB flushes? Depending on > > what the guest is doing with its shared memory, userspace might need > > the memory mapped or it might not. > > That's what I'm angling for with the F_SEAL_FAULT_ALLOCATIONS idea. The > issue, > unless I'm misreading code, is that punching a hole in the shared memory > backing > store doesn't prevent reallocating that hole on fault, i.e. a helper process > that > keeps a valid mapping of guest shared memory can silently fill the hole. > > What we're hoping to achieve is a way to prevent allocating memory without a > very > explicit action from userspace, e.g. fallocate(). Ah, I misunderstood. I thought your goal was to mmap it and prevent page faults from allocating. It is indeed the case (and has been since before quite a few of us were born) that a hole in a sparse file is logically just a bunch of zeros. A way to make a file for which a hole is an actual hole seems like it would solve this problem nicely. It could also be solved more specifically for KVM by making sure that the private/shared mode that userspace programs is strict enough to prevent accidental allocations -- if a GPA is definitively private, shared, neither, or (potentially, on TDX only) both, then a page that *isn't* shared will never be accidentally allocated by KVM. 
If the shared backing is not mmapped, it also won't be accidentally allocated by host userspace on a stray or careless write. --Andy
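To spell out the failure mode being described: with today's shmem semantics, punching a hole only frees the pages, and any process that still has the file mapped can fault them straight back in. A minimal self-contained illustration using only existing APIs (the ordinary memfd here stands in for the shared backing store):

/* Demonstrates that a punched hole in a shared mapping silently refills. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <err.h>

int main(void)
{
	size_t len = 2 * 1024 * 1024;
	int fd = memfd_create("guest-shared", 0);
	char *p;

	if (fd < 0 || ftruncate(fd, len))
		err(1, "setup");
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");

	memset(p, 0xaa, len);			/* pages allocated */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, len))
		err(1, "fallocate");		/* pages freed ("converted away") */
	p[0] = 0x55;				/* fault silently re-allocates a page */
	return 0;
}

The double-allocation problem is that the page refilled by that last write coexists with the private copy, which is exactly what the explicit-allocation / F_SEAL_FAULT_ALLOCATIONS idea is meant to prevent.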
Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE
On 2/23/22 04:05, Steven Price wrote: On 23/02/2022 11:49, Chao Peng wrote: On Thu, Feb 17, 2022 at 11:09:35AM -0800, Andy Lutomirski wrote: On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote: On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote: On 1/18/22 05:21, Chao Peng wrote: From: "Kirill A. Shutemov" Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of the file is inaccessible from userspace through ordinary MMU access (e.g., read/write/mmap). However, the file content can be accessed via a different mechanism (e.g. KVM MMU) indirectly. It provides semantics required for KVM guest private memory support that a file descriptor with this seal set is going to be used as the source of guest memory in confidential computing environments such as Intel TDX/AMD SEV but may not be accessible from host userspace. At this time only shmem implements this seal. I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE essentially transmutes a memfd into a different type of object. While this can apparently be done successfully and without races (as in this code), it's at least awkward. I think that either creating a special inaccessible memfd should be a single operation that create the correct type of object or there should be a clear justification for why it's a two-step process. Now one justification maybe from Stever's comment to patch-00: for ARM usage it can be used with creating a normal memfd, (partially)populate it with initial guest memory content (e.g. firmware), and then F_SEAL_INACCESSIBLE it just before the first time lunch of the guest in KVM (definitely the current code needs to be changed to support that). Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right? So this won't work. Hmm, right, if we set F_SEAL_INACCESSIBLE on a non-empty file, we will need to make sure access to existing mmap-ed area should be prevented, but that is hard. In any case, the whole confidential VM initialization story is a bit buddy. From the earlier emails, it sounds like ARM expects the host to fill in guest memory and measure it. From my recollection of Intel's scheme (which may well be wrong, and I could easily be confusing it with SGX), TDX instead measures what is essentially a transcript of the series of operations that initializes the VM. These are fundamentally not the same thing even if they accomplish the same end goal. For TDX, we unavoidably need an operation (ioctl or similar) that initializes things according to the VM's instructions, and ARM ought to be able to use roughly the same mechanism. Yes, TDX requires a ioctl. Steven may comment on the ARM part. The Arm story is evolving so I can't give a definite answer yet. Our current prototyping works by creating the initial VM content in a memslot as with a normal VM and then calling an ioctl which throws the big switch and converts all the (populated) pages to be protected. At this point the RMM performs a measurement of the data that the VM is being populated with. The above (in our prototype) suffers from all the expected problems with a malicious VMM being able to trick the host kernel into accessing those pages after they have been protected (causing a fault detected by the hardware). The ideal (from our perspective) approach would be to follow the same flow but where the VMM populates a memfd rather than normal anonymous pages. The memfd could then be sealed and the pages converted to protected ones (with the RMM measuring them in the process). 
The question becomes how is that memfd populated? It would be nice if that could be done using normal operations on a memfd (i.e. using mmap()) and therefore this code could be (relatively) portable. This would mean that any pages mapped from the memfd would either need to block the sealing or be revoked at the time of sealing. The other approach is we could of course implement a special ioctl which effectively does a memcpy into the (created empty and sealed) memfd and does the necessary dance with the RMM to measure the contents. This would match the "transcript of the series of operations" described above - but seems much less ideal from the viewpoint of the VMM.

A VMM that supports Other Vendors will need to understand this sort of model regardless.

I don't particularly mind the idea of having the kernel consume a normal memfd and spit out a new object, but I find the concept of changing the type of the object in place, even if it has other references, and trying to control all the resulting races to be somewhat alarming. In pseudo-Rust, this is the difference between:

    fn convert_to_private(in: &mut Memfd)

and

    fn convert_to_private(in: Memfd) -> PrivateMemoryFd

This doesn't map particularly nicely to the kernel, though.

--Andy
Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
On Wed, Jul 13, 2022, at 3:35 AM, Gupta, Pankaj wrote: >>>> This is the v7 of this series which tries to implement the fd-based KVM >>>> guest private memory. The patches are based on latest kvm/queue branch >>>> commit: >>>> >>>> b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU >>>> split_desc_cache only by default capacity >>>> >>>> Introduction >>>> >>>> In general this patch series introduce fd-based memslot which provides >>>> guest memory through memory file descriptor fd[offset,size] instead of >>>> hva/size. The fd can be created from a supported memory filesystem >>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM >>> >>> Thinking a bit, As host side fd on tmpfs or shmem will store memory on host >>> page cache instead of mapping pages into userspace address space. Can we hit >>> double (un-coordinated) page cache problem with this when guest page cache >>> is also used? >> >> This is my understanding: in host it will be indeed in page cache (in >> current shmem implementation) but that's just the way it allocates and >> provides the physical memory for the guest. In guest, guest OS will not >> see this fd (absolutely), it only sees guest memory, on top of which it >> can build its own page cache system for its own file-mapped content but >> that is unrelated to host page cache. > > yes. If guest fills its page cache with file backed memory, this at host > side(on shmem fd backend) will also fill the host page cache fast. This > can have an impact on performance of guest VM's if host goes to memory > pressure situation sooner. Or else we end up utilizing way less System > RAM. Is this in any meaningful way different from a regular VM? --Andy
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Thu, Mar 10, 2022 at 6:09 AM Chao Peng wrote: > > This is the v5 of this series which tries to implement the fd-based KVM > guest private memory. The patches are based on latest kvm/queue branch > commit: > > d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2 Can this series be run and a VM booted without TDX? A feature like that might help push it forward. --Andy
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Wed, Mar 30, 2022, at 10:58 AM, Sean Christopherson wrote: > On Wed, Mar 30, 2022, Quentin Perret wrote: >> On Wednesday 30 Mar 2022 at 09:58:27 (+0100), Steven Price wrote: >> > On 29/03/2022 18:01, Quentin Perret wrote: >> > > Is implicit sharing a thing? E.g., if a guest makes a memory access in >> > > the shared gpa range at an address that doesn't have a backing memslot, >> > > will KVM check whether there is a corresponding private memslot at the >> > > right offset with a hole punched and report a KVM_EXIT_MEMORY_ERROR? Or >> > > would that just generate an MMIO exit as usual? >> > >> > My understanding is that the guest needs some way of tagging whether a >> > page is expected to be shared or private. On the architectures I'm aware >> > of this is done by effectively stealing a bit from the IPA space and >> > pretending it's a flag bit. >> >> Right, and that is in fact the main point of divergence we have I think. >> While I understand this might be necessary for TDX and the likes, this >> makes little sense for pKVM. This would effectively embed into the IPA a >> purely software-defined non-architectural property/protocol although we >> don't actually need to: we (pKVM) can reasonably expect the guest to >> explicitly issue hypercalls to share pages in-place. So I'd be really >> keen to avoid baking in assumptions about that model too deep in the >> host mm bits if at all possible. > > There is no assumption about stealing PA bits baked into this API. Even > within > x86 KVM, I consider it a hard requirement that the common flows not assume the > private vs. shared information is communicated through the PA. Quentin, I think we might need a clarification. The API in this patchset indeed has no requirement that a PA bit distinguish between private and shared, but I think it makes at least a weak assumption that *something*, a priori, distinguishes them. In particular, there are private memslots and shared memslots, so the logical flow of resolving a guest memory access looks like: 1. guest accesses a GVA 2. read guest paging structures 3. determine whether this is a shared or private access 4. read host (KVM memslots and anything else, EPT, NPT, RMP, etc) structures accordingly. In particular, the memslot to reference is different depending on the access type. For TDX, this maps on to the fd-based model perfectly: the host-side paging structures for the shared and private slots are completely separate. For SEV, the structures are shared and KVM will need to figure out what to do in case a private and shared memslot overlap. Presumably it's sufficient to declare that one of them wins, although actually determining which one is active for a given GPA may involve checking whether the backing store for a given page actually exists. But I don't understand pKVM well enough to understand how it fits in. Quentin, how is the shared vs private mode of a memory access determined? How do the paging structures work? Can a guest switch between shared and private by issuing a hypercall without changing any guest-side paging structures or anything else? It's plausible that SEV and (maybe) pKVM would be better served if memslots could be sparse or if there was otherwise a direct way for host userspace to indicate to KVM which address ranges are actually active (not hole-punched) in a given memslot or to otherwise be able to make a rule that two different memslots (one shared and one private) can't claim the same address.
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Fri, Apr 1, 2022, at 7:59 AM, Quentin Perret wrote: > On Thursday 31 Mar 2022 at 09:04:56 (-0700), Andy Lutomirski wrote: > To answer your original question about memory 'conversion', the key > thing is that the pKVM hypervisor controls the stage-2 page-tables for > everyone in the system, all guests as well as the host. As such, a page > 'conversion' is nothing more than a permission change in the relevant > page-tables. > So I can see two different ways to approach this. One is that you split the whole address space in half and, just like SEV and TDX, allocate one bit to indicate the shared/private status of a page. This makes it work a lot like SEV and TDX. The other is to have shared and private pages be distinguished only by their hypercall history and the (protected) page tables. This saves some address space and some page table allocations, but it opens some cans of worms too. In particular, the guest and the hypervisor need to coordinate, in a way that the guest can trust, to ensure that the guest's idea of which pages are private match the host's. This model seems a bit harder to support nicely with the private memory fd model, but not necessarily impossible. Also, what are you trying to accomplish by having the host userspace mmap private pages? Is the idea that multiple guest could share the same page until such time as one of them tries to write to it? That would be kind of like having a third kind of memory that's visible to host and guests but is read-only for everyone. TDX and SEV can't support this at all (a private page belongs to one guest and one guest only, at least in SEV and in the current TDX SEAM spec). I imagine that this could be supported with private memory fds with some care without mmap, though -- the host could still populate the page with memcpy. Or I suppose a memslot could support using MAP_PRIVATE fds and have approximately the right semantics. --Andy
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Mon, Apr 4, 2022, at 10:06 AM, Sean Christopherson wrote: > On Mon, Apr 04, 2022, Quentin Perret wrote: >> On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote: >> FWIW, there are a couple of reasons why I'd like to have in-place >> conversions: >> >> - one goal of pKVM is to migrate some things away from the Arm >>Trustzone environment (e.g. DRM and the likes) and into protected VMs >>instead. This will give Linux a fighting chance to defend itself >>against these things -- they currently have access to _all_ memory. >>And transitioning pages between Linux and Trustzone (donations and >>shares) is fast and non-destructive, so we really do not want pKVM to >>regress by requiring the hypervisor to memcpy things; > > Is there actually a _need_ for the conversion to be non-destructive? > E.g. I assume > the "trusted" side of things will need to be reworked to run as a pKVM > guest, at > which point reworking its logic to understand that conversions are > destructive and > slow-ish doesn't seem too onerous. > >> - it can be very useful for protected VMs to do shared=>private >>conversions. Think of a VM receiving some data from the host in a >>shared buffer, and then it wants to operate on that buffer without >>risking to leak confidential informations in a transient state. In >>that case the most logical thing to do is to convert the buffer back >>to private, do whatever needs to be done on that buffer (decrypting a >>frame, ...), and then share it back with the host to consume it; > > If performance is a motivation, why would the guest want to do two > conversions > instead of just doing internal memcpy() to/from a private page? I > would be quite > surprised if multiple exits and TLB shootdowns is actually faster, > especially at > any kind of scale where zapping stage-2 PTEs will cause lock contention > and IPIs. I don't know the numbers or all the details, but this is arm64, which is a rather better architecture than x86 in this regard. So maybe it's not so bad, at least in very simple cases, ignoring all implementation details. (But see below.) Also the systems in question tend to have fewer CPUs than some of the massive x86 systems out there. If we actually wanted to support transitioning the same page between shared and private, though, we have a bit of an awkward situation. Private to shared is conceptually easy -- do some bookkeeping, reconstitute the direct map entry, and it's done. The other direction is a mess: all existing uses of the page need to be torn down. If the page has been recently used for DMA, this includes IOMMU entries. Quentin: let's ignore any API issues for now. Do you have a concept of how a nondestructive shared -> private transition could work well, even in principle? The best I can come up with is a special type of shared page that is not GUP-able and maybe not even mmappable, having a clear option for transitions to fail, and generally preventing the nasty cases from happening in the first place. Maybe there could be a special mode for the private memory fds in which specific pages are marked as "managed by this fd but actually shared". pread() and pwrite() would work on those pages, but not mmap(). (Or maybe mmap() but the resulting mappings would not permit GUP.) And transitioning them would be a special operation on the fd that is specific to pKVM and wouldn't work on TDX or SEV. Hmm. Sean and Chao, are we making a bit of a mistake by making these fds technology-agnostic? 
That is, would we want to distinguish between a TDX backing fd, a SEV backing fd, a software-based backing fd, etc? API-wise this could work by requiring the fd to be bound to a KVM VM instance and possibly even configured a bit before any other operations would be allowed. (Destructive transitions nicely avoid all the nasty cases. If something is still pinning a shared page when it's "transitioned" to private (really just replaced with a new page), then the old page continues existing for as long as needed as a separate object.)
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Tue, Apr 5, 2022, at 3:36 AM, Quentin Perret wrote: > On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote: >> >> >> On Mon, Apr 4, 2022, at 10:06 AM, Sean Christopherson wrote: >> > On Mon, Apr 04, 2022, Quentin Perret wrote: >> >> On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote: >> >> FWIW, there are a couple of reasons why I'd like to have in-place >> >> conversions: >> >> >> >> - one goal of pKVM is to migrate some things away from the Arm >> >>Trustzone environment (e.g. DRM and the likes) and into protected VMs >> >>instead. This will give Linux a fighting chance to defend itself >> >>against these things -- they currently have access to _all_ memory. >> >>And transitioning pages between Linux and Trustzone (donations and >> >>shares) is fast and non-destructive, so we really do not want pKVM to >> >>regress by requiring the hypervisor to memcpy things; >> > >> > Is there actually a _need_ for the conversion to be non-destructive? >> > E.g. I assume >> > the "trusted" side of things will need to be reworked to run as a pKVM >> > guest, at >> > which point reworking its logic to understand that conversions are >> > destructive and >> > slow-ish doesn't seem too onerous. >> > >> >> - it can be very useful for protected VMs to do shared=>private >> >>conversions. Think of a VM receiving some data from the host in a >> >>shared buffer, and then it wants to operate on that buffer without >> >>risking to leak confidential informations in a transient state. In >> >>that case the most logical thing to do is to convert the buffer back >> >>to private, do whatever needs to be done on that buffer (decrypting a >> >>frame, ...), and then share it back with the host to consume it; >> > >> > If performance is a motivation, why would the guest want to do two >> > conversions >> > instead of just doing internal memcpy() to/from a private page? I >> > would be quite >> > surprised if multiple exits and TLB shootdowns is actually faster, >> > especially at >> > any kind of scale where zapping stage-2 PTEs will cause lock contention >> > and IPIs. >> >> I don't know the numbers or all the details, but this is arm64, which is a >> rather better architecture than x86 in this regard. So maybe it's not so >> bad, at least in very simple cases, ignoring all implementation details. >> (But see below.) Also the systems in question tend to have fewer CPUs than >> some of the massive x86 systems out there. > > Yep. I can try and do some measurements if that's really necessary, but > I'm really convinced the cost of the TLBI for the shared->private > conversion is going to be significantly smaller than the cost of memcpy > the buffer twice in the guest for us. To be fair, although the cost for > the CPU update is going to be low, the cost for IOMMU updates _might_ be > higher, but that very much depends on the hardware. On systems that use > e.g. the Arm SMMU, the IOMMUs can use the CPU page-tables directly, and > the iotlb invalidation is done on the back of the CPU invalidation. So, > on systems with sane hardware the overhead is *really* quite small. > > Also, memcpy requires double the memory, it is pretty bad for power, and > it causes memory traffic which can't be a good thing for things running > concurrently. > >> If we actually wanted to support transitioning the same page between shared >> and private, though, we have a bit of an awkward situation. Private to >> shared is conceptually easy -- do some bookkeeping, reconstitute the direct >> map entry, and it's done. 
The other direction is a mess: all existing uses >> of the page need to be torn down. If the page has been recently used for >> DMA, this includes IOMMU entries. >> >> Quentin: let's ignore any API issues for now. Do you have a concept of how >> a nondestructive shared -> private transition could work well, even in >> principle? > > I had a high level idea for the workflow, but I haven't looked into the > implementation details. > > The idea would be to allow KVM *or* userspace to take a reference > to a page in the fd in an exclusive manner. KVM could take a reference > on a page (which would be necessary before to donating it to a guest) > using some kind of memfile_notifier as proposed in this series, and >
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Tue, Apr 5, 2022, at 11:30 AM, Sean Christopherson wrote: > On Tue, Apr 05, 2022, Andy Lutomirski wrote: > >> resume guest >> *** host -> hypervisor -> guest *** >> Guest unshares the page. >> *** guest -> hypervisor *** >> Hypervisor removes PTE. TLBI. >> *** hypervisor -> guest *** >> >> Obviously considerable cleverness is needed to make a virt IOMMU like this >> work well, but still. >> >> Anyway, my suggestion is that the fd backing proposal get slightly modified >> to get it ready for multiple subtypes of backing object, which should be a >> pretty minimal change. Then, if someone actually needs any of this >> cleverness, it can be added later. In the mean time, the >> pread()/pwrite()/splice() scheme is pretty good. > > Tangentially related to getting private-fd ready for multiple things, > what about > implementing the pread()/pwrite()/splice() scheme in pKVM itself? I.e. > read() on > the VM fd, with the offset corresponding to gfn in some way. > Hmm, could make sense.
Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
On Thu, Apr 7, 2022, at 9:05 AM, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
>> Since page migration / swapping is not supported yet, MFD_INACCESSIBLE memory behaves like longterm pinned pages and thus should be accounted to mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
>>
>> Signed-off-by: Chao Peng
>> ---
>>  mm/shmem.c | 25 -
>>  1 file changed, 24 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 7b43e274c9a2..ae46fb96494b 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
>>  static void notify_invalidate_page(struct inode *inode, struct folio *folio,
>>                                     pgoff_t start, pgoff_t end)
>>  {
>> -#ifdef CONFIG_MEMFILE_NOTIFIER
>>         struct shmem_inode_info *info = SHMEM_I(inode);
>>
>> +#ifdef CONFIG_MEMFILE_NOTIFIER
>>         start = max(start, folio->index);
>>         end = min(end, folio->index + folio_nr_pages(folio));
>>
>>         memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
>>  #endif
>> +
>> +       if (info->xflags & SHM_F_INACCESSIBLE)
>> +               atomic64_sub(end - start, &current->mm->pinned_vm);
>
> As Vishal's to-be-posted selftest discovered, this is broken as current->mm may be NULL. Or it may be a completely different mm, e.g. AFAICT there's nothing that prevents a different process from punching hole in the shmem backing.

How about just not charging the mm in the first place? There's precedent: ramfs and hugetlbfs (at least sometimes — I've lost track of the current status).

In any case, for an administrator to try to assemble the various rlimits into a coherent policy is, and always has been, quite messy. ISTM cgroup limits, which can actually add across processes usefully, are much better.

So, aside from the fact that these fds aren't in a filesystem and are thus available by default, I'm not convinced that this accounting is useful or necessary. Maybe we could just have some switch required to enable creation of private memory in the first place, and anyone who flips that switch without configuring cgroups is subject to DoS.
Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
On Tue, Apr 12, 2022, at 7:36 AM, Jason Gunthorpe wrote: > On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote: > >> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered int he >> past already with secretmem, it's not 100% that good of a fit (unmovable >> is worth than mlocked). But it gets the job done for now at least. > > No, it doesn't. There are too many different interpretations how > MELOCK is supposed to work > > eg VFIO accounts per-process so hostile users can just fork to go past > it. > > RDMA is per-process but uses a different counter, so you can double up > > iouring is per-user and users a 3rd counter, so it can triple up on > the above two > >> So I'm open for alternative to limit the amount of unmovable memory we >> might allocate for user space, and then we could convert seretmem as well. > > I think it has to be cgroup based considering where we are now :\ > So this is another situation where the actual backend (TDX, SEV, pKVM, pure software) makes a difference -- depending on exactly what backend we're using, the memory may not be unmoveable. It might even be swappable (in the potentially distant future). Anyway, here's a concrete proposal, with a bit of handwaving: We add new cgroup limits: memory.unmoveable memory.locked These can be set to an actual number or they can be set to the special value ROOT_CAP. If they're set to ROOT_CAP, then anyone in the cgroup with capable(CAP_SYS_RESOURCE) (i.e. the global capability) can allocate movable or locked memory with this (and potentially other) new APIs. If it's 0, then they can't. If it's another value, then the memory can be allocated, charged to the cgroup, up to the limit, with no particular capability needed. The default at boot is ROOT_CAP. Anyone who wants to configure it differently is free to do so. This avoids introducing a DoS, makes it easy to run tests without configuring cgroup, and lets serious users set up their cgroups. Nothing is charge per mm. To make this fully sensible, we need to know what the backend is for the private memory before allocating any so that we can charge it accordingly.
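A sketch of how the proposed limits might behave, purely to pin down the semantics. Nothing below exists today: memory.unmoveable is not a real cgroup knob, struct mem_cgroup_like and its fields are invented, and locking is omitted.

/* Hypothetical charging policy for the proposed memory.unmoveable limit. */
#define UNMOVEABLE_ROOT_CAP	(~0UL)	/* the special "ROOT_CAP" setting */

struct mem_cgroup_like {		/* stand-in for the real mem_cgroup */
	unsigned long unmoveable_limit;	/* ROOT_CAP by default at boot */
	unsigned long unmoveable_usage;
};

static int charge_unmoveable(struct mem_cgroup_like *memcg, unsigned long nr_pages)
{
	if (memcg->unmoveable_limit == UNMOVEABLE_ROOT_CAP)
		return capable(CAP_SYS_RESOURCE) ? 0 : -EPERM;

	if (memcg->unmoveable_usage + nr_pages > memcg->unmoveable_limit)
		return -ENOMEM;

	memcg->unmoveable_usage += nr_pages;
	return 0;
}

The key property is that the charge follows the cgroup, not an mm, so forking or spreading a VMM across processes cannot multiply the allowance the way the per-process RLIMIT_MEMLOCK interpretations listed above can.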
Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
On 8/18/22 06:24, Kirill A . Shutemov wrote: On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote: On Wed, 6 Jul 2022, Chao Peng wrote: This is the v7 of this series which tries to implement the fd-based KVM guest private memory. Here at last are my reluctant thoughts on this patchset. fd-based approach for supporting KVM guest private memory: fine. Use or abuse of memfd and shmem.c: mistaken. memfd_create() was an excellent way to put together the initial prototype. But since then, TDX in particular has forced an effort into preventing (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs. Are any of the shmem.c mods useful to existing users of shmem.c? No. Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No. What use do you have for a filesystem here? Almost none. IIUC, what you want is an fd through which QEMU can allocate kernel memory, selectively free that memory, and communicate fd+offset+length to KVM. And perhaps an interface to initialize a little of that memory from a template (presumably copied from a real file on disk somewhere). You don't need shmem.c or a filesystem for that! If your memory could be swapped, that would be enough of a good reason to make use of shmem.c: but it cannot be swapped; and although there are some references in the mailthreads to it perhaps being swappable in future, I get the impression that will not happen soon if ever. If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated. Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping theoretically possible, but I'm not aware of any plans as of now. [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html This thing? https://cdrdv2.intel.com/v1/dl/getContent/733578 That looks like migration between computers, not between NUMA nodes. Or am I missing something?
Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
On 8/19/22 17:27, Kirill A. Shutemov wrote: On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote: On Thu, 18 Aug 2022, Kirill A . Shutemov wrote: On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote: If your memory could be swapped, that would be enough of a good reason to make use of shmem.c: but it cannot be swapped; and although there are some references in the mailthreads to it perhaps being swappable in future, I get the impression that will not happen soon if ever. If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated. Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping theoretically possible, but I'm not aware of any plans as of now. [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html I always forget, migration means different things to different audiences. As an mm person, I was meaning page migration, whereas a virtualization person thinks VM live migration (which that reference appears to be about), a scheduler person task migration, an ornithologist bird migration, etc. But you're an mm person too: you may have cited that reference in the knowledge that TDX 1.5 Live Migration will entail page migration of the kind I'm thinking of. (Anyway, it's not important to clarify that here.) TDX 1.5 brings both. In TDX speak, mm migration called relocation. See TDH.MEM.PAGE.RELOCATE. This seems to be a pretty bad fit for the way that the core mm migrates pages. The core mm unmaps the page, then moves (in software) the contents to a new address, then faults it in. TDH.MEM.PAGE.RELOCATE doesn't fit into that workflow very well. I'm not saying it can't be done, but it won't just work.
Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
On 8/24/22 02:41, Chao Peng wrote: On Tue, Aug 23, 2022 at 04:05:27PM +, Sean Christopherson wrote: On Tue, Aug 23, 2022, David Hildenbrand wrote: On 19.08.22 05:38, Hugh Dickins wrote: On Fri, 19 Aug 2022, Sean Christopherson wrote: On Thu, Aug 18, 2022, Kirill A . Shutemov wrote: On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote: On Wed, 6 Jul 2022, Chao Peng wrote: But since then, TDX in particular has forced an effort into preventing (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs. Are any of the shmem.c mods useful to existing users of shmem.c? No. Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No. But QEMU and other VMMs are users of shmem and memfd. The new features certainly aren't useful for _all_ existing users, but I don't think it's fair to say that they're not useful for _any_ existing users. Okay, I stand corrected: there exist some users of memfd_create() who will also have use for "INACCESSIBLE" memory. As raised in reply to the relevant patch, I'm not sure if we really have to/want to expose MFD_INACCESSIBLE to user space. I feel like this is a requirement of specific memfd_notifer (memfile_notifier) implementations -- such as TDX that will convert the memory and MCE-kill the machine on ordinary write access. We might be able to set/enforce this when registering a notifier internally instead, and fail notifier registration if a condition isn't met (e.g., existing mmap). So I'd be curious, which other users of shmem/memfd would benefit from (MMU)-"INACCESSIBLE" memory obtained via memfd_create()? I agree that there's no need to expose the inaccessible behavior via uAPI. Making it a kernel-internal thing that's negotiated/resolved when KVM binds to the fd would align INACCESSIBLE with the UNMOVABLE and UNRECLAIMABLE flags (and any other flags that get added in the future). AFAICT, the user-visible flag is a holdover from the early RFCs and doesn't provide any unique functionality. That's also what I'm thinking. And I don't see problem immediately if user has populated the fd at the binding time. Actually that looks an advantage for previously discussed guest payload pre-loading. I think this gets awkward. Trying to define sensible semantics for what happens if a shmem or similar fd gets used as secret guest memory and that fd isn't utterly and completely empty can get quite nasty. For example: If there are already mmaps, then TDX (much more so than SEV) really doesn't want to also use it as guest memory. If there is already data in the fd, then maybe some technologies can use this for pre-population, but TDX needs explicit instructions in order to get the guest's hash right. In general, it seems like it will be much more likely to actually work well if the user (uAPI) is required to declare to the kernel exactly what the fd is for (e.g. TDX secret memory, software-only secret memory, etc) before doing anything at all with it other than binding it to KVM. INACCESSIBLE is a way to achieve this. Maybe it's not the prettiest in the world -- I personally would rather see an explicit request for, say, TDX or SEV memory or maybe the memory that works for a particular KVM instance instead of something generic like INACCESSIBLE, but this is a pretty weak preference. But I think that just starting with a plain memfd is a can of worms.
Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
On Fri, Sep 9, 2022, at 7:32 AM, Kirill A . Shutemov wrote: > On Thu, Sep 08, 2022 at 09:48:35PM -0700, Andy Lutomirski wrote: >> On 8/19/22 17:27, Kirill A. Shutemov wrote: >> > On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote: >> > > On Thu, 18 Aug 2022, Kirill A . Shutemov wrote: >> > > > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote: >> > > > > >> > > > > If your memory could be swapped, that would be enough of a good >> > > > > reason >> > > > > to make use of shmem.c: but it cannot be swapped; and although there >> > > > > are some references in the mailthreads to it perhaps being swappable >> > > > > in future, I get the impression that will not happen soon if ever. >> > > > > >> > > > > If your memory could be migrated, that would be some reason to use >> > > > > filesystem page cache (because page migration happens to understand >> > > > > that type of memory): but it cannot be migrated. >> > > > >> > > > Migration support is in pipeline. It is part of TDX 1.5 [1]. And >> > > > swapping >> > > > theoretically possible, but I'm not aware of any plans as of now. >> > > > >> > > > [1] >> > > > https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html >> > > >> > > I always forget, migration means different things to different audiences. >> > > As an mm person, I was meaning page migration, whereas a virtualization >> > > person thinks VM live migration (which that reference appears to be >> > > about), >> > > a scheduler person task migration, an ornithologist bird migration, etc. >> > > >> > > But you're an mm person too: you may have cited that reference in the >> > > knowledge that TDX 1.5 Live Migration will entail page migration of the >> > > kind I'm thinking of. (Anyway, it's not important to clarify that here.) >> > >> > TDX 1.5 brings both. >> > >> > In TDX speak, mm migration called relocation. See TDH.MEM.PAGE.RELOCATE. >> > >> >> This seems to be a pretty bad fit for the way that the core mm migrates >> pages. The core mm unmaps the page, then moves (in software) the contents >> to a new address, then faults it in. TDH.MEM.PAGE.RELOCATE doesn't fit into >> that workflow very well. I'm not saying it can't be done, but it won't just >> work. > > Hm. From what I see we have all necessary infrastructure in place. > > Unmaping is NOP for inaccessible pages as it is never mapped and we have > mapping->a_ops->migrate_folio() callback that allows to replace software > copying with whatever is needed, like TDH.MEM.PAGE.RELOCATE. > > What do I miss? Hmm, maybe this isn't as bad as I thought. Right now, unless I've missed something, the migration workflow is to unmap (via try_to_migrate) all mappings, then migrate the backing store (with ->migrate_folio(), although it seems like most callers expect the actual copy to happen outside of ->migrate_folio(), and then make new mappings. With the *current* (vma-based, not fd-based) model for KVM memory, this won't work -- we can't unmap before calling TDH.MEM.PAGE.RELOCATE. But maybe it's actually okay with some care or maybe mild modifications with the fd-based model. We don't have any mmaps, per se, to unmap for secret / INACCESSIBLE memory. So maybe we can get all the way to ->migrate_folio() without zapping anything in the secure EPT and just call TDH-MEM.PAGE.RELOCATE from inside migrate_folio(). And there will be nothing to fault back in. From the core code's perspective, it's like migrating a memfd that doesn't happen to have my mappings at the time. --Andy
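If that pans out, the natural hook is the backing store's migrate_folio() address_space operation. A heavily simplified sketch follows; tdx_relocate_private_page() is a hypothetical wrapper for TDH.MEM.PAGE.RELOCATE, and error handling plus the secure-EPT bookkeeping are hand-waved.

int tdx_relocate_private_page(unsigned long dst_pfn, unsigned long src_pfn);	/* hypothetical */

/* Sketch: aops->migrate_folio for userspace-inaccessible guest memory. */
static int guest_mem_migrate_folio(struct address_space *mapping,
				   struct folio *dst, struct folio *src,
				   enum migrate_mode mode)
{
	int ret;

	/* There are no userspace PTEs to unmap for inaccessible memory. */
	ret = folio_migrate_mapping(mapping, dst, src, 0);
	if (ret != MIGRATEPAGE_SUCCESS)
		return ret;

	/* Hardware-assisted copy instead of the usual folio_migrate_copy(). */
	if (tdx_relocate_private_page(folio_pfn(dst), folio_pfn(src)))
		return -EAGAIN;

	folio_migrate_flags(dst, src);
	return MIGRATEPAGE_SUCCESS;
}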
Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
(please excuse any formatting disasters. my internet went out as I was composing this, and i did my best to rescue it.) On Mon, Sep 19, 2022, at 12:10 PM, Sean Christopherson wrote: > +Will, Marc and Fuad (apologies if I missed other pKVM folks) > > On Mon, Sep 19, 2022, David Hildenbrand wrote: >> On 15.09.22 16:29, Chao Peng wrote: >> > From: "Kirill A. Shutemov" >> > >> > KVM can use memfd-provided memory for guest memory. For normal userspace >> > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its >> > virtual address space and then tells KVM to use the virtual address to >> > setup the mapping in the secondary page table (e.g. EPT). >> > >> > With confidential computing technologies like Intel TDX, the >> > memfd-provided memory may be encrypted with special key for special >> > software domain (e.g. KVM guest) and is not expected to be directly >> > accessed by userspace. Precisely, userspace access to such encrypted >> > memory may lead to host crash so it should be prevented. >> >> Initially my thaught was that this whole inaccessible thing is TDX specific >> and there is no need to force that on other mechanisms. That's why I >> suggested to not expose this to user space but handle the notifier >> requirements internally. >> >> IIUC now, protected KVM has similar demands. Either access (read/write) of >> guest RAM would result in a fault and possibly crash the hypervisor (at >> least not the whole machine IIUC). > > Yep. The missing piece for pKVM is the ability to convert from shared > to private > while preserving the contents, e.g. to hand off a large buffer > (hundreds of MiB) > for processing in the protected VM. Thoughts on this at the bottom. > >> > This patch introduces userspace inaccessible memfd (created with >> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through >> > ordinary MMU access (e.g. read/write/mmap) but can be accessed via >> > in-kernel interface so KVM can directly interact with core-mm without >> > the need to map the memory into KVM userspace. >> >> With secretmem we decided to not add such "concept switch" flags and instead >> use a dedicated syscall. >> > > I have no personal preference whatsoever between a flag and a dedicated > syscall, > but a dedicated syscall does seem like it would give the kernel a bit more > flexibility. The third option is a device node, e.g. /dev/kvm_secretmem or /dev/kvm_tdxmem or similar. But if we need flags or other details in the future, maybe this isn't ideal. > >> What about memfd_inaccessible()? Especially, sealing and hugetlb are not >> even supported and it might take a while to support either. > > Don't know about sealing, but hugetlb support for "inaccessible" memory > needs to > come sooner than later. "inaccessible" in quotes because we might want > to choose > a less binary name, e.g. "restricted"?. > > Regarding pKVM's use case, with the shim approach I believe this can be done > by > allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions > piled on top. > > My first thought was to make the uAPI a set of KVM ioctls so that KVM > could tightly > tightly control usage without taking on too much complexity in the > kernel, but > working through things, routing the behavior through the shim itself > might not be > all that horrific. > > IIRC, we discarded the idea of allowing userspace to map the "private" > fd because > things got too complex, but with the shim it doesn't seem _that_ bad. What's the exact use case? Is it just to pre-populate the memory? 
> > E.g. on the memfd side: > > 1. The entire memfd must be mapped, and at most one mapping is allowed, i.e. > mapping is all or nothing. > > 2. Acquiring a reference via get_pfn() is disallowed if there's a mapping > for > the restricted memfd. > > 3. Add notifier hooks to allow downstream users to further restrict things. > > 4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything > in > one shot. > > 5. Require that there are no outstanding references at munmap(). Or if this > can't be guaranteed by userspace, maybe add some way for userspace to > wait > until it's ok to convert to private? E.g. so that get_pfn() doesn't need > to do an expensive check every time. Hmm. I haven't looked at the code to see if this would really work, but I think this could be done more in line with how the rest of the kernel works by using the rmap infrastructure. When the pKVM memfd is in not-yet-private mode, just let it be mmapped as usual (but don't allow any form of GUP or pinning). Then have an ioctl to switch to to shared mode that takes locks or sets flags so that no new faults can be serviced and does unmap_mapping_range. As long as the shim arranges to have its own vm_ops, I don't immediately see any reason this can't work.
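A rough sketch of that switch-to-private step, for concreteness. The flag, accessor, and function names are invented; the one real ingredient is unmap_mapping_range(), which tears down every existing userspace PTE for the file so that the (now restricted) fault handler is the only way back in.

/* Sketch: flip a pKVM-style restricted memfd from "populate" mode to
 * "private" mode. Illustrative only; GUEST_MEM_PRIVATE and GUEST_MEM_I()
 * are invented for this example. */
static long guest_mem_make_private(struct file *file)
{
	struct inode *inode = file_inode(file);
	struct address_space *mapping = inode->i_mapping;

	/* Invented flag: once set, the shim's vm_ops->fault handler refuses
	 * to map pages, so nothing can be faulted back in. */
	set_bit(GUEST_MEM_PRIVATE, &GUEST_MEM_I(inode)->flags);

	/* Zap every existing PTE for this file (holelen == 0 means "to the
	 * end of the file"); include private COW copies as well. */
	unmap_mapping_range(mapping, 0, 0, 1);

	return 0;
}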
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Fri, Apr 22, 2022, at 3:56 AM, Chao Peng wrote:
> On Tue, Apr 05, 2022 at 06:03:21PM +0000, Sean Christopherson wrote:
>> On Tue, Apr 05, 2022, Quentin Perret wrote:
>> > On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
>
> Only when the register succeeds, the fd is converted into a private fd, before that, the fd is just a normal (shared) one. During this conversion, the previous data is preserved so you can put some initial data in guest pages (whether the architecture allows this is architecture-specific and out of the scope of this patch).

I think this can be made to work, but it will be awkward. On TDX, for example, what exactly are the semantics supposed to be? An error code if the memory isn't all zero? An error code if it has ever been written?

Fundamentally, I think this is because your proposed lifecycle for these memfiles results in a lightweight API but is awkward for the intended use cases. You're proposing, roughly:

1. Create a memfile. Now it's in a shared state with an unknown virt technology. It can be read and written. Let's call this state BRAND_NEW.

2. Bind to a VM. Now it's in a bound state. For TDX, for example, let's call the new state BOUND_TDX. In this state, the TDX rules are followed (private memory can't be converted, etc).

The problem here is that the BRAND_NEW state allows things that are nonsensical in TDX, and the binding step needs to invent some kind of semantics for what happens when binding a nonempty memfile.

So I would propose a somewhat different order:

1. Create a memfile. It's in the UNBOUND state and no operations whatsoever are allowed except binding or closing.

2. Bind the memfile to a VM (or at least to a VM technology). Now it's in the initial state appropriate for that VM.

For TDX, this completely bypasses the cases where the data is prepopulated and TDX can't handle it cleanly. For SEV, it bypasses a situation in which data might be written to the memory before we find out whether that data will be unreclaimable or unmovable.

--

Now I have a question, since I don't think anyone has really answered it: how does this all work with SEV- or pKVM-like technologies in which private and shared pages share the same address space? It sounds like you're proposing to have a big memfile that contains private and shared pages and to use that same memfile as pages are converted back and forth. IO and even real physical DMA could be done on that memfile. Am I understanding correctly?

If so, I think this makes sense, but I'm wondering if the actual memslot setup should be different. For TDX, private memory lives in a logically separate memslot space. For SEV and pKVM, it doesn't. I assume the API can reflect this straightforwardly.

And the corresponding TDX question: is the intent still that shared pages aren't allowed at all in a TDX memfile? If so, that would be the most direct mapping to what the hardware actually does.

--Andy
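The proposed lifecycle can be summarized as a small state machine. The sketch below is purely illustrative: struct memfile and the state names beyond UNBOUND/BOUND_TDX are invented here and are not an API proposal from the thread.

/* Sketch of the proposed memfile lifecycle (illustrative names only). */
enum memfile_state {
	MEMFILE_UNBOUND,	/* freshly created: only bind or close allowed */
	MEMFILE_BOUND_TDX,	/* bound to a TDX VM: TDX conversion rules apply */
	MEMFILE_BOUND_SEV,	/* bound to an SEV VM */
	MEMFILE_BOUND_SW,	/* bound to a software-only protected VM */
};

struct memfile {
	enum memfile_state state;
	/* backing pages, bindings, etc. omitted */
};

/* The bind step picks the initial technology-specific state. Because the
 * file was UNBOUND (empty, never readable or writable), there is no
 * "what does pre-populated data mean?" question to answer. */
static int memfile_bind(struct memfile *mf, enum memfile_state backend)
{
	if (mf->state != MEMFILE_UNBOUND)
		return -EBUSY;
	mf->state = backend;
	return 0;
}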
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote: > On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote: >> >> >> 2. Bind the memfile to a VM (or at least to a VM technology). Now it's in >> the initial state appropriate for that VM. >> >> For TDX, this completely bypasses the cases where the data is prepopulated >> and TDX can't handle it cleanly. For SEV, it bypasses a situation in which >> data might be written to the memory before we find out whether that data >> will be unreclaimable or unmovable. > > This sounds a more strict rule to avoid semantics unclear. > > So userspace needs to know what excatly happens for a 'bind' operation. > This is different when binds to different technologies. E.g. for SEV, it > may imply after this call, the memfile can be accessed (through mmap or > what ever) from userspace, while for current TDX this should be not allowed. I think this is actually a good thing. While SEV, TDX, pKVM, etc achieve similar goals and have broadly similar ways of achieving them, they really are different, and having userspace be aware of the differences seems okay to me. (Although I don't think that allowing userspace to mmap SEV shared pages is particularly wise -- it will result in faults or cache incoherence depending on the variant of SEV in use.) > > And I feel we still need a third flow/operation to indicate the > completion of the initialization on the memfile before the guest's > first-time launch. SEV needs to check previous mmap-ed areas are munmap-ed > and prevent future userspace access. After this point, then the memfile > becomes truely private fd. Even that is technology-dependent. For TDX, this operation doesn't really exist. For SEV, I'm not sure (I haven't read the specs in nearly enough detail). For pKVM, I guess it does exist and isn't quite the same as a shared->private conversion. Maybe this could be generalized a bit as an operation "measure and make private" that would be supported by the technologies for which it's useful. > >> >> >> -- >> >> Now I have a question, since I don't think anyone has really answered it: >> how does this all work with SEV- or pKVM-like technologies in which private >> and shared pages share the same address space? I sounds like you're >> proposing to have a big memfile that contains private and shared pages and >> to use that same memfile as pages are converted back and forth. IO and even >> real physical DMA could be done on that memfile. Am I understanding >> correctly? > > For TDX case, and probably SEV as well, this memfile contains private memory > only. But this design at least makes it possible for usage cases like > pKVM which wants both private/shared memory in the same memfile and rely > on other ways like mmap/munmap or mprotect to toggle private/shared instead > of fallocate/hole punching. Hmm. Then we still need some way to get KVM to generate the correct SEV pagetables. For TDX, there are private memslots and shared memslots, and they can overlap. If they overlap and both contain valid pages at the same address, then the results may not be what the guest-side ABI expects, but everything will work. So, when a single logical guest page transitions between shared and private, no change to the memslots is needed. For SEV, this is not the case: everything is in one set of pagetables, and there isn't a natural way to resolve overlaps. If the memslot code becomes efficient enough, then the memslots could be fragmented. Or the memfile could support private and shared data in the same memslot. 
And if pKVM does this, I don't see why SEV couldn't also do it and hopefully reuse the same code. > >> >> If so, I think this makes sense, but I'm wondering if the actual memslot >> setup should be different. For TDX, private memory lives in a logically >> separate memslot space. For SEV and pKVM, it doesn't. I assume the API can >> reflect this straightforwardly. > > I believe so. The flow should be similar but we do need pass different > flags during the 'bind' to the backing store for different usages. That > should be some new flags for pKVM but the callbacks (API here) between > memfile_notifile and its consumers can be reused. And also some different flag in the operation that installs the fd as a memslot? > >> >> And the corresponding TDX question: is the intent still that shared pages >> aren't allowed at all in a TDX memfile? If so, that would be the most >> direct mapping to what the hardware actually does. > > Exactly. TDX
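[Editorial note] To keep the lifecycle debated in this thread straight, here is a minimal userspace sketch. Only step 1 uses an existing interface (memfd_create()); the "bind" and "measure and make private" steps appear only as comments, because no upstream API for them has been settled yet.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* Step 1: create the memfile that will back guest private memory. */
int make_guest_memfile(size_t size)
{
    int fd = memfd_create("guest-private-mem", MFD_CLOEXEC);
    if (fd < 0 || ftruncate(fd, size) < 0)
        return -1;

    /*
     * Step 2 (no upstream interface yet): bind the memfile to a VM
     * technology (TDX, SEV, pKVM).  Whether userspace may still mmap()
     * the fd afterwards is technology-dependent, as discussed above.
     *
     * Step 3 (technology-dependent, no upstream interface yet):
     * "measure and make private" before the guest's first launch; for
     * SEV-like flows this is where any remaining userspace mappings
     * would have to be gone.  After this point the fd is truly private.
     */
    return fd;
}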
Re: [RFC v2 PATCH 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On 11/19/21 05:47, Chao Peng wrote: > This RFC series tries to implement the fd-based KVM guest private memory proposal described at [1] and an improved 'New Proposal' described at [2]. I generally like this. Thanks!
Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST
On 11/19/21 05:47, Chao Peng wrote: From: "Kirill A. Shutemov" The new seal type provides semantics required for KVM guest private memory support. A file descriptor with the seal set is going to be used as source of guest memory in confidential computing environments such as Intel TDX and AMD SEV. F_SEAL_GUEST can only be set on empty memfd. After the seal is set userspace cannot read, write or mmap the memfd. I don't have a strong objection here, but, given that you're only supporting it for memfd, would a memfd_create() flag be more straightforward? If nothing else, it would avoid any possible locking issue. I'm also very very slightly nervous about a situation in which one program sends a memfd to an untrusted other process and that process truncates the memfd and then F_SEAL_GUESTs it. This could be mostly mitigated by also requiring that no other seals be set when F_SEAL_GUEST happens, but the alternative MFD_GUEST would eliminate this issue too.
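[Editorial note] To make the comparison concrete, the two shapes of the interface are sketched below. F_SEAL_GUEST is the seal proposed in this patch and MFD_GUEST is the alternative flag floated in this reply; neither exists in mainline, so the numeric values are placeholders only so the sketch compiles.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>

#ifndef F_SEAL_GUEST
#define F_SEAL_GUEST 0x0020   /* placeholder value; seal proposed by this series */
#endif
#ifndef MFD_GUEST
#define MFD_GUEST    0x0010   /* placeholder value; suggested alternative flag */
#endif

/* As proposed: seal an existing, still-empty memfd. */
int guest_memfd_via_seal(void)
{
    int fd = memfd_create("kvm-guest-mem", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0 || fcntl(fd, F_ADD_SEALS, F_SEAL_GUEST) < 0)
        return -1;
    return fd;
}

/* As suggested above: request the property at creation time, which closes
 * the window in which another holder of the fd could truncate it or add
 * other seals first. */
int guest_memfd_via_flag(void)
{
    return memfd_create("kvm-guest-mem", MFD_CLOEXEC | MFD_GUEST);
}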
RE: [PATCH] hw/vhost-user-blk: turn on VIRTIO_BLK_F_SIZE_MAX feature for virtio blk device
Hi Raphael, Thanks for your reply. I will fix the grammar mistake in V2. -Original Message- From: Raphael Norwitz Sent: Tuesday, November 30, 2021 5:58 AM To: Pei, Andy Cc: qemu-devel@nongnu.org; qemu-bl...@nongnu.org; Liu, Changpeng ; Raphael Norwitz ; m...@redhat.com; kw...@redhat.com; mre...@redhat.com Subject: Re: [PATCH] hw/vhost-user-blk: turn on VIRTIO_BLK_F_SIZE_MAX feature for virtio blk device Just a commit message nit. Otherwise I'm happy with this. OFC should not be queued for 6.2. On Fri, Nov 26, 2021 at 10:00:18AM +0800, Andy Pei wrote: > Turn on pre-defined feature VIRTIO_BLK_F_SIZE_MAX virtio blk device to > avoid guest DMA request size is too large to exceed hardware spec. Grammar here. Should be something like "...DMA request sizes which are too large for the hardware spec". > > Signed-off-by: Andy Pei Acked-by: Raphael Norwitz > --- > hw/block/vhost-user-blk.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c > index ba13cb8..eb1264a 100644 > --- a/hw/block/vhost-user-blk.c > +++ b/hw/block/vhost-user-blk.c > @@ -252,6 +252,7 @@ static uint64_t vhost_user_blk_get_features(VirtIODevice > *vdev, > VHostUserBlk *s = VHOST_USER_BLK(vdev); > > /* Turn on pre-defined features */ > +virtio_add_feature(&features, VIRTIO_BLK_F_SIZE_MAX); > virtio_add_feature(&features, VIRTIO_BLK_F_SEG_MAX); > virtio_add_feature(&features, VIRTIO_BLK_F_GEOMETRY); > virtio_add_feature(&features, VIRTIO_BLK_F_TOPOLOGY); > -- > 1.8.3.1 >
[PATCH v2] hw/vhost-user-blk: turn on VIRTIO_BLK_F_SIZE_MAX feature for virtio blk device
Turn on pre-defined feature VIRTIO_BLK_F_SIZE_MAX for virtio blk device to avoid guest DMA request sizes which are too large for hardware spec. Signed-off-by: Andy Pei --- hw/block/vhost-user-blk.c | 1 + 1 file changed, 1 insertion(+) diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c index ba13cb8..eb1264a 100644 --- a/hw/block/vhost-user-blk.c +++ b/hw/block/vhost-user-blk.c @@ -252,6 +252,7 @@ static uint64_t vhost_user_blk_get_features(VirtIODevice *vdev, VHostUserBlk *s = VHOST_USER_BLK(vdev); /* Turn on pre-defined features */ +virtio_add_feature(&features, VIRTIO_BLK_F_SIZE_MAX); virtio_add_feature(&features, VIRTIO_BLK_F_SEG_MAX); virtio_add_feature(&features, VIRTIO_BLK_F_GEOMETRY); virtio_add_feature(&features, VIRTIO_BLK_F_TOPOLOGY); -- 1.8.3.1
Re: [Qemu-devel] [PATCH RFC] tcmu: Introduce qemu-tcmu
On 10/20/2016 07:30 AM, Fam Zheng wrote: On Thu, 10/20 15:08, Stefan Hajnoczi wrote: If a corrupt image is able to execute arbitrary code in the qemu-tcmu process, does /dev/uio0 or the tcmu shared memory interface allow get root or kernel privileges? I haven't audited the code, but target_core_user.ko should contain the access to /dev/uioX and make sure there is no security risk regarding buggy or malicious handlers. Otherwise it's a bug that should be fixed. Andy can correct me if I'm wrong. Yes... well, TCMU ensures that a bad handler can't scribble to kernel memory outside the shared memory area. UIO devices are basically a "device drivers in userspace" kind of API so they require root to use. I seem to remember somebody mentioning ways this might work for less-privileged handlers (fd-passing??) but no way to do this exists just yet. Regards -- Andy
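[Editorial note] For readers unfamiliar with UIO, the userspace side of a handler follows the generic pattern below (root required, as noted above). The real map size comes from /sys/class/uio/uio0/maps/map0/size, and for TCMU the mapping holds the mailbox, command ring and data area; the 4-byte write used as a completion doorbell is how I understand the tcmu interface, so treat the details as a sketch rather than a reference.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/uio0", O_RDWR);
    if (fd < 0)
        return 1;

    /* Map the shared memory region exported by the kernel driver. */
    size_t map_size = 4096;   /* illustrative; read the real size from sysfs */
    void *shm = mmap(NULL, map_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (shm == MAP_FAILED)
        return 1;

    /* read() blocks until the kernel signals new work; the value returned
     * is an event count. */
    uint32_t events;
    if (read(fd, &events, sizeof(events)) != sizeof(events))
        return 1;
    printf("events so far: %u\n", events);

    /* ...process commands found in 'shm'..., then kick the kernel; for
     * TCMU this is a 4-byte write back to the uio fd. */
    uint32_t kick = 1;
    write(fd, &kick, sizeof(kick));
    return 0;
}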
Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation
> On Dec 27, 2018, at 10:18 AM, Florian Weimer wrote: > > We have a bit of an interesting problem with respect to the d_off > field in struct dirent. > > When running a 64-bit kernel on certain file systems, notably ext4, > this field uses the full 63 bits even for small directories (strace -v > output, wrapped here for readability): > > getdents(3, [ > {d_ino=1494304, d_off=3901177228673045825, d_reclen=40, > d_name="authorized_keys", d_type=DT_REG}, > {d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".", > d_type=DT_DIR}, > {d_ino=1314655, d_off=9223372036854775807, d_reclen=24, d_name="..", > d_type=DT_DIR} > ], 32768) = 88 > > When running in 32-bit compat mode, this value is somehow truncated to > 31 bits, for both the getdents and the getdents64 (!) system call (at > least on i386). I imagine you’re encountering this bug: https://lkml.org/lkml/2018/10/18/859 Presumably the right fix involves modifying the relevant VFS file operations to indicate the relevant ABI to the implementations. I would guess that 9p is triggering the “not really in the syscall you think you’re in” issue.
Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation
[sending again, slightly edited, due to email client issues] On Thu, Dec 27, 2018 at 9:25 AM Florian Weimer wrote: > > We have a bit of an interesting problem with respect to the d_off > field in struct dirent. > > When running a 64-bit kernel on certain file systems, notably ext4, > this field uses the full 63 bits even for small directories (strace -v > output, wrapped here for readability): > > getdents(3, [ > {d_ino=1494304, d_off=3901177228673045825, d_reclen=40, > d_name="authorized_keys", d_type=DT_REG}, > {d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".", > d_type=DT_DIR}, > {d_ino=1314655, d_off=9223372036854775807, d_reclen=24, d_name="..", > d_type=DT_DIR} > ], 32768) = 88 > > When running in 32-bit compat mode, this value is somehow truncated to > 31 bits, for both the getdents and the getdents64 (!) system call (at > least on i386). > ... > > However, both qemu-user and the 9p file system can run in such a way > that the kernel is entered from a 64-bit process, but the actual usage > is from a 32-bit process: I imagine that at least some of the problems you're seeing are due to this bug: https://lkml.org/lkml/2018/10/18/859 Presumably the right fix involves modifying the relevant VFS file operations to indicate the relevant ABI to the implementations. I would guess that 9p is triggering the “not really in the syscall you think you’re in” issue.
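[Editorial note] A small self-contained program makes the width problem visible; run it on ext4 and the printed d_off values use most of the 63-bit range, which is exactly what cannot be narrowed to 31 bits for a 32-bit target without loss. struct linux_dirent64 is declared locally because the kernel does not export it to userspace headers.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[];
};

int main(void)
{
    static char buf[32768];                 /* static => suitably aligned */
    int fd = open(".", O_RDONLY | O_DIRECTORY);
    long n = syscall(SYS_getdents64, fd, buf, sizeof(buf));

    for (long pos = 0; pos < n; ) {
        struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + pos);
        /* On ext4 these are hash-based offsets spanning most of 63 bits,
         * which is what cannot be squeezed into 31 bits losslessly. */
        printf("%-20s d_off=%lld\n", d->d_name, (long long)d->d_off);
        pos += d->d_reclen;
    }
    return 0;
}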
Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation
> On Dec 28, 2018, at 6:54 PM, Matthew Wilcox wrote: > >> On Sat, Dec 29, 2018 at 12:12:27AM +, Peter Maydell wrote: >> On Fri, 28 Dec 2018 at 23:16, Andreas Dilger wrot >>> On Dec 28, 2018, at 4:18 AM, Peter Maydell wrote: The problem is that there is no 32-bit API in some cases (unless I have misunderstood the kernel code) -- not all host architectures implement compat syscalls or allow them to be called from 64-bit processes or implement all the older syscall variants that had smaller offets. If there was a guaranteed "this syscall always exists and always gives me 32-bit offsets" we could use it. >>> >>> The "32bitapi" mount option would use 32-bit hash for seekdir >>> and telldir, regardless of what kernel API was used. That would >>> just set the FMODE_32BITHASH flag in the file->f_mode for all files. >> >> A mount option wouldn't be much use to QEMU -- we can't tell >> our users how to mount their filesystems, which they're >> often doing lots of other things with besides running QEMU. >> (Otherwise we could just tell them "don't use ext4", which >> would also solve the problem :-)) We need something we can >> use at the individual-syscall level. > > Could you use a prctl to set whether you were running in 32 or 64 bit > mode? Or do you change which kind of task you're emulating too often > to make this a good idea? How would this work? We already have the separate COMPAT_DEFINE_SYSCALL entries *and* in_compat_syscall(). Now we’d have a third degree of freedom. Either the arches people care about should add reasonable ways to issue 32-bit syscalls from 64-bit mode or there should be an explicit way to ask for the 32-bit directory offsets.
Re: [Qemu-devel] [PATCH] headers: fix linux/mod_devicetable.h inclusions
On Mon, Jul 9, 2018 at 6:19 PM, Arnd Bergmann wrote: > A couple of drivers produced build errors after the mod_devicetable.h > header was split out from the platform_device one, e.g. > > drivers/media/platform/davinci/vpbe_osd.c:42:40: error: array type has > incomplete element type 'struct platform_device_id' > drivers/media/platform/davinci/vpbe_venc.c:42:40: error: array type has > incomplete element type 'struct platform_device_id' > > This adds the inclusion where needed. > > Fixes: ac3167257b9f ("headers: separate linux/mod_devicetable.h from > linux/platform_device.h") > Signed-off-by: Arnd Bergmann > drivers/platform/x86/intel_punit_ipc.c | 1 + > --- a/drivers/platform/x86/intel_punit_ipc.c > +++ b/drivers/platform/x86/intel_punit_ipc.c > @@ -12,6 +12,7 @@ > */ > > #include > +#include > #include > #include > #include Acked-by: Andy Shevchenko for the above bits. -- With Best Regards, Andy Shevchenko
Re: [Qemu-devel] [PATCH v21 1/5] xbitmap: Introduce xbitmap
On Fri, Feb 16, 2018 at 8:30 PM, Matthew Wilcox wrote: > On Fri, Feb 16, 2018 at 07:44:50PM +0200, Andy Shevchenko wrote: >> On Tue, Jan 9, 2018 at 1:10 PM, Wei Wang wrote: >> > From: Matthew Wilcox >> > >> > The eXtensible Bitmap is a sparse bitmap representation which is >> > efficient for set bits which tend to cluster. It supports up to >> > 'unsigned long' worth of bits. >> >> > lib/xbitmap.c| 444 >> > +++ >> >> Please, split tests to a separate module. > > Hah, I just did this two days ago! I didn't publish it yet, but I also made > it compile both in userspace and as a kernel module. > > It's the top two commits here: > > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2018-02-12 > Thanks! > Note this is a complete rewrite compared to the version presented here; it > sits on top of the XArray and no longer has a preload interface. It has a > superset of the IDA functionality. Noted. Now, the question about test case. Why do you heavily use BUG_ON? Isn't resulting statistics enough? See how other lib/test_* modules do. -- With Best Regards, Andy Shevchenko
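[Editorial note] For reference, the pattern being pointed at here -- count failures and report a summary, instead of BUG_ON() killing the box on the first mismatch -- looks roughly like the skeleton below in existing lib/test_*.c modules. This is an illustrative skeleton, not the actual xbitmap test code.

#include <linux/errno.h>
#include <linux/module.h>
#include <linux/printk.h>

static unsigned int tests_run, tests_failed;

#define check(cond)                                                    \
    do {                                                               \
        tests_run++;                                                   \
        if (!(cond)) {                                                 \
            tests_failed++;                                            \
            pr_err("FAIL %s:%d: %s\n", __FILE__, __LINE__, #cond);     \
        }                                                              \
    } while (0)

static int __init xb_selftest_init(void)
{
    /* Real tests would exercise the xbitmap API here. */
    check(1 + 1 == 2);

    pr_info("xb selftest: %u tests, %u failures\n", tests_run, tests_failed);
    return tests_failed ? -EINVAL : 0;
}

static void __exit xb_selftest_exit(void)
{
}

module_init(xb_selftest_init);
module_exit(xb_selftest_exit);
MODULE_LICENSE("GPL");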
Re: [Qemu-devel] [PATCH v21 1/5] xbitmap: Introduce xbitmap
On Tue, Jan 9, 2018 at 1:10 PM, Wei Wang wrote: > From: Matthew Wilcox > > The eXtensible Bitmap is a sparse bitmap representation which is > efficient for set bits which tend to cluster. It supports up to > 'unsigned long' worth of bits. > lib/xbitmap.c| 444 > +++ Please, split tests to a separate module. -- With Best Regards, Andy Shevchenko
[Bug 1915063] Re: Windows 10 wil not install using qemu-system-x86_64
The commit in question is marked for stable: commit 841c2be09fe4f495fe5224952a419bd8c7e5b455 Author: Maxim Levitsky Date: Wed Jul 8 14:57:31 2020 +0300 kvm: x86: replace kvm_spec_ctrl_test_value with runtime test on the host To avoid complex and in some cases incorrect logic in kvm_spec_ctrl_test_value, just try the guest's given value on the host processor instead, and if it doesn't #GP, allow the guest to set it. One such case is when host CPU supports STIBP mitigation but doesn't support IBRS (as is the case with some Zen2 AMD cpus), and in this case we were giving guest #GP when it tried to use STIBP The reason why can can do the host test is that IA32_SPEC_CTRL msr is passed to the guest, after the guest sets it to a non zero value for the first time (due to performance reasons), and as as result of this, it is pointless to emulate #GP condition on this first access, in a different way than what the host CPU does. This is based on a patch from Sean Christopherson, who suggested this idea. Fixes: 6441fa6178f5 ("KVM: x86: avoid incorrect writes to host MSR_IA32_SPEC_CTRL") Cc: sta...@vger.kernel.org Suggested-by: Sean Christopherson Signed-off-by: Maxim Levitsky Message-Id: <20200708115731.180097-1-mlevi...@redhat.com> Signed-off-by: Paolo Bonzini It appears to be in `v5.4.102` which is currently queued up for the cycle following the one just starting. -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1915063 Title: Windows 10 wil not install using qemu-system-x86_64 Status in QEMU: New Status in linux package in Ubuntu: Confirmed Status in linux-oem-5.10 package in Ubuntu: Fix Released Status in linux-oem-5.6 package in Ubuntu: Confirmed Status in qemu package in Ubuntu: Invalid Bug description: Steps to reproduce install virt-manager and ovmf if nopt already there copy windows and virtio iso files to /var/lib/libvirt/images Use virt-manager from local machine to create your VMs with the disk, CPUs and memory required Select customize configuration then select OVMF(UEFI) instead of seabios set first CDROM to the windows installation iso (enable in boot options) add a second CDROM and load with the virtio iso change spice display to VNC Always get a security error from windows and it fails to launch the installer (works on RHEL and Fedora) I tried updating the qemu version from Focals 4.2 to Groovy 5.0 which was of no help --- ProblemType: Bug ApportVersion: 2.20.11-0ubuntu27.14 Architecture: amd64 CasperMD5CheckResult: skip CurrentDesktop: ubuntu:GNOME DistributionChannelDescriptor: # This is the distribution channel descriptor for the OEM CDs # For more information see http://wiki.ubuntu.com/DistributionChannelDescriptor canonical-oem-sutton-focal-amd64-20201030-422+pc-sutton-bachman-focal-amd64+X00 DistroRelease: Ubuntu 20.04 InstallationDate: Installed on 2021-01-20 (19 days ago) InstallationMedia: Ubuntu 20.04 "Focal" - Build amd64 LIVE Binary 20201030-14:39 MachineType: LENOVO 30E102Z NonfreeKernelModules: nvidia_modeset nvidia Package: linux (not installed) ProcEnviron: TERM=xterm-256color PATH=(custom, no user) XDG_RUNTIME_DIR= LANG=en_US.UTF-8 SHELL=/bin/bash ProcFB: 0 EFI VGA ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.6.0-1042-oem root=UUID=389cd165-fc52-4814-b837-a1090b9c2387 ro locale=en_US quiet splash vt.handoff=7 ProcVersionSignature: Ubuntu 5.6.0-1042.46-oem 5.6.19 RelatedPackageVersions: linux-restricted-modules-5.6.0-1042-oem N/A linux-backports-modules-5.6.0-1042-oem N/A 
linux-firmware 1.187.8 RfKill: Tags: focal Uname: Linux 5.6.0-1042-oem x86_64 UpgradeStatus: No upgrade log present (probably fresh install) UserGroups: adm cdrom dip docker kvm libvirt lpadmin plugdev sambashare sudo _MarkForUpload: True dmi.bios.date: 07/29/2020 dmi.bios.vendor: LENOVO dmi.bios.version: S07KT08A dmi.board.name: 1046 dmi.board.vendor: LENOVO dmi.board.version: Not Defined dmi.chassis.type: 3 dmi.chassis.vendor: LENOVO dmi.chassis.version: None dmi.modalias: dmi:bvnLENOVO:bvrS07KT08A:bd07/29/2020:svnLENOVO:pn30E102Z:pvrThinkStationP620:rvnLENOVO:rn1046:rvrNotDefined:cvnLENOVO:ct3:cvrNone: dmi.product.family: INVALID dmi.product.name: 30E102Z dmi.product.sku: LENOVO_MT_30E1_BU_Think_FM_ThinkStation P620 dmi.product.version: ThinkStation P620 dmi.sys.vendor: LENOVO To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1915063/+subscriptions
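[Editorial note] As background on the backported commit quoted in the comment above: the "runtime test on the host" it describes boils down to probing MSR_IA32_SPEC_CTRL with the guest's requested value and rejecting it only if the write actually faults. A rough sketch of that idea (paraphrased, not the exact upstream function) is:

#include <linux/irqflags.h>
#include <linux/types.h>
#include <asm/msr.h>
#include <asm/msr-index.h>

/* rdmsrl_safe()/wrmsrl_safe() return non-zero if the access faults (#GP),
 * which is precisely the condition being tested. */
static int spec_ctrl_test_value(u64 value)
{
    u64 saved;
    unsigned long flags;
    int faulted = 0;

    local_irq_save(flags);

    if (rdmsrl_safe(MSR_IA32_SPEC_CTRL, &saved))
        faulted = 1;
    else if (wrmsrl_safe(MSR_IA32_SPEC_CTRL, value))
        faulted = 1;
    else
        wrmsrl(MSR_IA32_SPEC_CTRL, saved);   /* restore the host value */

    local_irq_restore(flags);
    return faulted;
}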
Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver
On Fri, Oct 16, 2020 at 6:40 PM Jann Horn wrote: > > [adding some more people who are interested in RNG stuff: Andy, Jason, > Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this > concerns some pretty fundamental API stuff related to RNG usage] > > On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin > wrote: > > - Background > > > > The VM Generation ID is a feature defined by Microsoft (paper: > > http://go.microsoft.com/fwlink/?LinkId=260709) and supported by > > multiple hypervisor vendors. > > > > The feature is required in virtualized environments by apps that work > > with local copies/caches of world-unique data such as random values, > > uuids, monotonically increasing counters, etc. > > Such apps can be negatively affected by VM snapshotting when the VM > > is either cloned or returned to an earlier point in time. > > > > The VM Generation ID is a simple concept meant to alleviate the issue > > by providing a unique ID that changes each time the VM is restored > > from a snapshot. The hw provided UUID value can be used to > > differentiate between VMs or different generations of the same VM. > > > > - Problem > > > > The VM Generation ID is exposed through an ACPI device by multiple > > hypervisor vendors but neither the vendors or upstream Linux have no > > default driver for it leaving users to fend for themselves. > > > > Furthermore, simply finding out about a VM generation change is only > > the starting point of a process to renew internal states of possibly > > multiple applications across the system. This process could benefit > > from a driver that provides an interface through which orchestration > > can be easily done. > > > > - Solution > > > > This patch is a driver which exposes the Virtual Machine Generation ID > > via a char-dev FS interface that provides ID update sync and async > > notification, retrieval and confirmation mechanisms: > > > > When the device is 'open()'ed a copy of the current vm UUID is > > associated with the file handle. 'read()' operations block until the > > associated UUID is no longer up to date - until HW vm gen id changes - > > at which point the new UUID is provided/returned. Nonblocking 'read()' > > uses EWOULDBLOCK to signal that there is no _new_ UUID available. > > > > 'poll()' is implemented to allow polling for UUID updates. Such > > updates result in 'EPOLLIN' events. > > > > Subsequent read()s following a UUID update no longer block, but return > > the updated UUID. The application needs to acknowledge the UUID update > > by confirming it through a 'write()'. > > Only on writing back to the driver the right/latest UUID, will the > > driver mark this "watcher" as up to date and remove EPOLLIN status. > > > > 'mmap()' support allows mapping a single read-only shared page which > > will always contain the latest UUID value at offset 0. > > It would be nicer if that page just contained an incrementing counter, > instead of a UUID. It's not like the application cares *what* the UUID > changed to, just that it *did* change and all RNGs state now needs to > be reseeded from the kernel, right? And an application can't reliably > read the entire UUID from the memory mapping anyway, because the VM > might be forked in the middle. > > So I think your kernel driver should detect UUID changes and then turn > those into a monotonically incrementing counter. (Probably 64 bits > wide?) (That's probably also a little bit faster than comparing an > entire UUID.) 
> > An option might be to put that counter into the vDSO, instead of a > separate VMA; but I don't know how the other folks feel about that. > Andy, do you have opinions on this? That way, normal userspace code > that uses this infrastructure wouldn't have to mess around with a > special device at all. And it'd be usable in seccomp sandboxes and so > on without needing special plumbing. And libraries wouldn't have to > call open() and mess with file descriptor numbers. The vDSO might be annoyingly slow for this. Something like the rseq page might make sense. It could be a generic indication of "system went through some form of suspend".
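[Editorial note] Putting the open()/poll()/read()/write() contract described in the quoted patch together, a consumer would look roughly like the following; the /dev/vmgenid path and the 16-byte read size are assumptions on my part, not taken from the patch's code.

#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

int main(void)
{
    unsigned char uuid[16];                 /* 128-bit VM generation ID */
    int fd = open("/dev/vmgenid", O_RDWR);  /* device node name assumed */
    if (fd < 0)
        return 1;

    for (;;) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        poll(&pfd, 1, -1);                  /* POLLIN => generation changed */
        if (read(fd, uuid, sizeof(uuid)) != sizeof(uuid))
            continue;

        /* ...reseed RNGs, drop cached UUIDs/counters, etc... */

        write(fd, uuid, sizeof(uuid));      /* ack: this watcher is up to date */
    }
}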
Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver
On Sun, Oct 18, 2020 at 8:52 AM Michael S. Tsirkin wrote: > > On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote: > > 4c. The guest kernel maintains an array of physical addresses that are > > MADV_WIPEONFORK. The hypervisor knows about this array and its > > location through whatever protocol, and before resuming a > > moved/snapshotted/duplicated VM, it takes the responsibility for > > memzeroing this memory. The huge pro here would be that this > > eliminates all races, and reduces complexity quite a bit, because the > > hypervisor can perfectly synchronize its bringup (and SMP bringup) > > with this, and it can even optimize things like on-disk memory > > snapshots to simply not write out those pages to disk. > > > > A 4c-like approach seems like it'd be a lot of bang for the buck -- we > > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new > > userspace API to deal with, and it'd be race free, and eliminate a lot > > of kernel complexity. > > Clearly this has a chance to break applications, right? > If there's an app that uses this as a non-system-calls way > to find out whether there was a fork, it will break > when wipe triggers without a fork ... > For example, imagine: > > MADV_WIPEONFORK > copy secret data to MADV_DONTFORK > fork > > > used to work, with this change it gets 0s instead of the secret data. > > > I am also not sure it's wise to expose each guest process > to the hypervisor like this. E.g. each process needs a > guest physical address of its own then. This is a finite resource. > > > The mmap interface proposed here is somewhat baroque, but it is > certainly simple to implement ... Wipe of fork/vmgenid/whatever could end up being much more problematic than it naively appears -- it could be wiped in the middle of a read. Either the API needs to handle this cleanly, or we need something more aggressive like signal-on-fork. --Andy
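[Editorial note] For readers who have not used it, MADV_WIPEONFORK (Linux 4.14+) already provides the fork-time half of this behaviour; the "4c" proposal is essentially to extend the same wipe to the snapshot/resume path. A minimal demo of the existing semantics:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    size_t len = 4096;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED || madvise(p, len, MADV_WIPEONFORK) != 0)
        return 1;
    strcpy(p, "seed material");

    if (fork() == 0) {
        /* The child sees the page zeroed, so it knows it must reseed. */
        printf("child sees:  '%s'\n", p);
        _exit(0);
    }
    wait(NULL);
    printf("parent sees: '%s'\n", p);
    return 0;
}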
Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver
On Sun, Oct 18, 2020 at 8:59 AM Michael S. Tsirkin wrote: > > On Sun, Oct 18, 2020 at 08:54:36AM -0700, Andy Lutomirski wrote: > > On Sun, Oct 18, 2020 at 8:52 AM Michael S. Tsirkin wrote: > > > > > > On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote: > > > > 4c. The guest kernel maintains an array of physical addresses that are > > > > MADV_WIPEONFORK. The hypervisor knows about this array and its > > > > location through whatever protocol, and before resuming a > > > > moved/snapshotted/duplicated VM, it takes the responsibility for > > > > memzeroing this memory. The huge pro here would be that this > > > > eliminates all races, and reduces complexity quite a bit, because the > > > > hypervisor can perfectly synchronize its bringup (and SMP bringup) > > > > with this, and it can even optimize things like on-disk memory > > > > snapshots to simply not write out those pages to disk. > > > > > > > > A 4c-like approach seems like it'd be a lot of bang for the buck -- we > > > > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new > > > > userspace API to deal with, and it'd be race free, and eliminate a lot > > > > of kernel complexity. > > > > > > Clearly this has a chance to break applications, right? > > > If there's an app that uses this as a non-system-calls way > > > to find out whether there was a fork, it will break > > > when wipe triggers without a fork ... > > > For example, imagine: > > > > > > MADV_WIPEONFORK > > > copy secret data to MADV_DONTFORK > > > fork > > > > > > > > > used to work, with this change it gets 0s instead of the secret data. > > > > > > > > > I am also not sure it's wise to expose each guest process > > > to the hypervisor like this. E.g. each process needs a > > > guest physical address of its own then. This is a finite resource. > > > > > > > > > The mmap interface proposed here is somewhat baroque, but it is > > > certainly simple to implement ... > > > > Wipe of fork/vmgenid/whatever could end up being much more problematic > > than it naively appears -- it could be wiped in the middle of a read. > > Either the API needs to handle this cleanly, or we need something more > > aggressive like signal-on-fork. > > > > --Andy > > > Right, it's not on fork, it's actually when process is snapshotted. > > If we assume it's CRIU we care about, then I > wonder what's wrong with something like > MADV_CHANGEONPTRACE_SEIZE > and basically say it's X bytes which change the value... I feel like we may be approaching this from the wrong end. Rather than saying "what data structure can the kernel expose that might plausibly be useful", how about we try identifying some specific userspace needs and see what a good solution could look like. I can identify two major cryptographic use cases: 1. A userspace RNG. The API exposed by the userspace end is a function that generates random numbers. The userspace code in turn wants to know some things from the kernel: it wants some best-quality-available random seed data from the kernel (and possibly an indication of how good it is) as well as an indication of whether the userspace memory may have been cloned or rolled back, or, failing that, an indication of whether a reseed is needed. Userspace could implement a wide variety of algorithms on top depending on its goals and compliance requirements, but the end goal is for the userspace part to be very, very fast. 2. A userspace crypto stack that wants to avoid shooting itself in the foot due to inadvertently doing the same thing twice. 
For example, an AES-GCM stack does not want to reuse an IV, *expecially* if there is even the slightest chance that it might reuse the IV for different data. This use case doesn't necessarily involve random numbers, but, if anything, it needs to be even faster than #1. The threats here are not really the same. For #1, a userspace RNG should be able to recover from a scenario in which an adversary clones the entire process *and gets to own the clone*. For example, in Android, an adversary can often gain complete control of a fork of the zygote -- this shouldn't adversely affect the security properties of other forks. Similarly, a server farm could operate by having one booted server that is cloned to create more workers. Those clones could be provisioned with secrets and permissions post-clone, and at attacker gaining control of a fresh clone could be considered acceptable. For #2, in contrast, if an adversary gains control of a clone of an AES-GCM session, they learn the key outright -- the relevant attack scenario is that the adversary gets to interact with two clones without compromising either clone per se. It's worth noting that, in both cases, there could possibly be more than one instance of an RNG or an AES-GCM session in the same process. This means that using signals is awkward but not necessarily impossibly. (This is an area in which Linux, and POSIX in general, is much weaker than Windows.)
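[Editorial note] To make use case #1 concrete, the fast path being argued for is a single load-and-compare against some kernel-maintained generation value, falling back to the kernel only when it changes. In the sketch below the generation variable merely stands in for whatever mapping/vDSO/rseq-style mechanism ends up existing -- that part is exactly what is being debated.

#include <stdint.h>
#include <string.h>
#include <sys/random.h>

/* Stand-in for the kernel-provided value; in a real design this would be a
 * mapping the kernel/hypervisor bumps on clone or rollback. */
static volatile uint64_t vm_generation;

static uint64_t seen_generation = (uint64_t)-1;
static unsigned char pool[64];

static void reseed(void)
{
    /* Best-quality seed data from the kernel. */
    getrandom(pool, sizeof(pool), 0);
    seen_generation = vm_generation;
}

void fast_random_bytes(unsigned char *out, size_t n)
{
    /* One cheap compare on the hot path; reseed only when the VM may have
     * been cloned or rolled back (or on first use). */
    if (vm_generation != seen_generation)
        reseed();

    /* A real implementation would expand 'pool' with a userspace DRBG;
     * this placeholder only marks where that step goes. */
    memset(out, 0, n);
}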
Re: [PATCH v7 0/6] gpio: Add GPIO Aggregator
On Wed, May 20, 2020 at 3:40 PM Andy Shevchenko wrote: > On Wed, May 20, 2020 at 3:38 PM Geert Uytterhoeven > wrote: > > On Wed, May 20, 2020 at 2:14 PM Andy Shevchenko > > wrote: > > > On Mon, May 11, 2020 at 04:52:51PM +0200, Geert Uytterhoeven wrote: > > ... > > > > Sorry for late reply, recently noticed this nice idea. > > > The comment I have is, please, can we reuse bitmap parse algorithm and > > > syntax? > > > We have too many different formats and parsers in the kernel and bitmap's > > > one > > > seems suitable here. > > > > Thank you, I wasn't aware of that. > > > > Which one do you mean? The documentation seems to be confusing, > > and incomplete. > > My first guess was bitmap_parse(), but that one assumes hex values? > > And given it processes the unsigned long bitmap in u32 chunks, I guess > > it doesn't work as expected on big-endian 64-bit? > > > > bitmap_parselist() looks more suitable, and the format seems to be > > compatible with what's currently used, so it won't change ABI. What ABI? We didn't have a release with it, right? So, we are quite flexible for few more weeks to amend it. > > Is that the one you propose? > > Yes, sorry for the confusion. > > > > (Despite other small clean ups, like strstrip() use) > > > > Aka strim()? There are too many of them, to know all of them by heart ;-) > > The difference between them is __must_check flag. But yes. -- With Best Regards, Andy Shevchenko
Re: [PATCH v7 0/6] gpio: Add GPIO Aggregator
On Wed, May 20, 2020 at 3:38 PM Geert Uytterhoeven wrote: > On Wed, May 20, 2020 at 2:14 PM Andy Shevchenko > wrote: > > On Mon, May 11, 2020 at 04:52:51PM +0200, Geert Uytterhoeven wrote: ... > > Sorry for late reply, recently noticed this nice idea. > > The comment I have is, please, can we reuse bitmap parse algorithm and > > syntax? > > We have too many different formats and parsers in the kernel and bitmap's > > one > > seems suitable here. > > Thank you, I wasn't aware of that. > > Which one do you mean? The documentation seems to be confusing, > and incomplete. > My first guess was bitmap_parse(), but that one assumes hex values? > And given it processes the unsigned long bitmap in u32 chunks, I guess > it doesn't work as expected on big-endian 64-bit? > > bitmap_parselist() looks more suitable, and the format seems to be > compatible with what's currently used, so it won't change ABI. > Is that the one you propose? Yes, sorry for the confusion. > > (Despite other small clean ups, like strstrip() use) > > Aka strim()? There are too many of them, to know all of them by heart ;-) The difference between them is __must_check flag. But yes. -- With Best Regards, Andy Shevchenko
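[Editorial note] For reference, bitmap_parselist() already understands the comma/range syntax used in the aggregator examples in this thread ("19,20", "20-22", ...), which is why switching to it keeps the accepted input format while dropping a hand-rolled parser. A minimal kernel-side sketch of the call:

#include <linux/bitmap.h>
#include <linux/errno.h>

/* Parse a GPIO offset list such as "19,20" or "20-22" into a bitmap. */
static int parse_gpio_offsets(const char *spec, unsigned long *lines,
                              unsigned int nlines)
{
    int err = bitmap_parselist(spec, lines, nlines);

    if (err)
        return err;             /* e.g. -EINVAL on malformed input */

    /* "20-22" sets bits 20, 21 and 22 in 'lines'. */
    return 0;
}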
Re: [PATCH v7 0/6] gpio: Add GPIO Aggregator
gt; destroyed by writing to atribute files in sysfs. > Sample session on the Renesas Koelsch development board: > > - Unbind LEDs from leds-gpio driver: > > echo leds > /sys/bus/platform/drivers/leds-gpio/unbind > > - Create aggregators: > > $ echo e6052000.gpio 19,20 \ > > /sys/bus/platform/drivers/gpio-aggregator/new_device > > gpio-aggregator gpio-aggregator.0: gpio 0 => gpio-953 > gpio-aggregator gpio-aggregator.0: gpio 1 => gpio-954 > gpiochip_find_base: found new base at 758 > gpio gpiochip12: (gpio-aggregator.0): added GPIO chardev (254:13) > gpiochip_setup_dev: registered GPIOs 758 to 759 on device: gpiochip12 > (gpio-aggregator.0) > > $ echo e6052000.gpio 21 e605.gpio 20-22 \ > > /sys/bus/platform/drivers/gpio-aggregator/new_device > > gpio-aggregator gpio-aggregator.1: gpio 0 => gpio-955 > gpio-aggregator gpio-aggregator.1: gpio 1 => gpio-1012 > gpio-aggregator gpio-aggregator.1: gpio 2 => gpio-1013 > gpio-aggregator gpio-aggregator.1: gpio 3 => gpio-1014 > gpiochip_find_base: found new base at 754 > gpio gpiochip13: (gpio-aggregator.1): added GPIO chardev (254:13) > gpiochip_setup_dev: registered GPIOs 754 to 757 on device: gpiochip13 > (gpio-aggregator.1) > > - Adjust permissions on /dev/gpiochip1[23] (optional) > > - Control LEDs: > > $ gpioset gpiochip12 0=0 1=1 # LED6 OFF, LED7 ON > $ gpioset gpiochip12 0=1 1=0 # LED6 ON, LED7 OFF > $ gpioset gpiochip13 0=1 # LED8 ON > $ gpioset gpiochip13 0=0 # LED8 OFF > > - Destroy aggregators: > > $ echo gpio-aggregator.0 \ > > /sys/bus/platform/drivers/gpio-aggregator/delete_device > $ echo gpio-aggregator.1 \ > > /sys/bus/platform/drivers/gpio-aggregator/delete_device > > To ease testing, I have pushed this series to the > topic/gpio-aggregator-v7 branch of my renesas-drivers repository at > git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-drivers.git. > > Thanks! 
> > References: > [1] "[PATCH QEMU v2 0/5] Add a GPIO backend" > > (https://lore.kernel.org/linux-gpio/20200423090118.11199-1-geert+rene...@glider.be/) > [2] "[PATCH V4 2/2] gpio: inverter: document the inverter bindings" > > (https://lore.kernel.org/r/1561699236-18620-3-git-send-email-harish_kand...@mentor.com/) > [3] "[PATCH v6 0/8] gpio: Add GPIO Aggregator" > > (https://lore.kernel.org/linux-doc/20200324135328.5796-1-geert+rene...@glider.be/) > [4] "[PATCH v5 0/5] gpio: Add GPIO Aggregator" > > (https://lore.kernel.org/r/20200218151812.7816-1-geert+rene...@glider.be/) > [5] "[PATCH v4 0/5] gpio: Add GPIO Aggregator" > > (https://lore.kernel.org/r/20200115181523.23556-1-geert+rene...@glider.be) > [6] "[PATCH v3 0/7] gpio: Add GPIO Aggregator/Repeater" > > (https://lore.kernel.org/r/20191127084253.16356-1-geert+rene...@glider.be/) > [7] "[PATCH/RFC v2 0/5] gpio: Add GPIO Aggregator Driver" > > (https://lore.kernel.org/r/20190911143858.13024-1-geert+rene...@glider.be/) > [8] "[PATCH RFC] gpio: Add Virtual Aggregator GPIO Driver" > > (https://lore.kernel.org/r/20190705160536.12047-1-geert+rene...@glider.be/) > [9] "[PATCH QEMU POC] Add a GPIO backend" > > (https://lore.kernel.org/r/20181003152521.23144-1-geert+rene...@glider.be/) > [10] "Getting To Blinky: Virt Edition / Making device pass-through >work on embedded ARM" > (https://fosdem.org/2019/schedule/event/vai_getting_to_blinky/) > > Geert Uytterhoeven (6): > i2c: i801: Use GPIO_LOOKUP() helper macro > mfd: sm501: Use GPIO_LOOKUP_IDX() helper macro > gpiolib: Add support for GPIO lookup by line name > gpio: Add GPIO Aggregator > docs: gpio: Add GPIO Aggregator documentation > MAINTAINERS: Add GPIO Aggregator section > > .../admin-guide/gpio/gpio-aggregator.rst | 111 > Documentation/admin-guide/gpio/index.rst | 1 + > Documentation/driver-api/gpio/board.rst | 15 +- > MAINTAINERS | 7 + > drivers/gpio/Kconfig | 12 + > drivers/gpio/Makefile | 1 + > drivers/gpio/gpio-aggregator.c| 568 ++ > drivers/gpio/gpiolib.c| 22 +- > drivers/i2c/busses/i2c-i801.c | 6 +- > drivers/mfd/sm501.c | 24 +- > include/linux/gpio/machine.h | 17 +- > 11 files changed, 748 insertions(+), 36 deletions(-) > create mode 100644 Documentation/admin-guide/gpio/gpio-aggregator.rst > create mode 100644 drivers/gpio/gpio-aggregator.c > > -- > 2.17.1 > > Gr{oetje,eeting}s, > > Geert > > -- > Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- > ge...@linux-m68k.org > > In personal conversations with technical people, I call myself a hacker. But > when I'm talking to journalists I just say "programmer" or something like > that. > -- Linus Torvalds -- With Best Regards, Andy Shevchenko
[Qemu-devel] [Bug 586175] Re: Windows XP/2003 doesn't boot
Great solution Andreas, it worked for a Win2k image which I could only boot previously using an iso from http://www.resoo.org/docs/ntldr/files/ However, I have a w7 image that I have never managed to boot, apart from its installation cd image using virt-install 20Gb w7 image: # losetup /dev/loop0 /vm/w7.img; kpartx -a /dev/loop0 # fdisk -l /dev/loop0 Disk /dev/loop0: 21.5 GB, 21474836480 bytes 255 heads, 63 sectors/track, 2610 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0xaf12c11f Device Boot Start End Blocks Id System /dev/loop0p1 * 1 13 1024007 HPFS/NTFS Partition 1 does not end on cylinder boundary. /dev/loop0p2 132611208670727 HPFS/NTFS Partition 2 does not end on cylinder boundary. # hexedit /dev/mapper/loop0p1 EB 52 90 4E 54 46 53 20 20 20 20 00 02 08 00 00 00 00 00 00 00 F8 00 00 3F 00 10 00 00 08 00 00 .R.NTFS.?... 0020 00 00 00 00 80 00 80 00 FF 1F 03 00 00 00 00 00 55 21 00 00 00 00 00 00 02 00 00 00 00 00 00 00 U!.. # hexedit /dev/mapper/loop0p2 EB 52 90 4E 54 46 53 20 20 20 20 00 02 08 00 00 00 00 00 00 00 F8 00 00 3F 00 10 00 00 28 03 00 .R.NTFS.?(.. 0020 00 00 00 00 80 00 80 00 FF CF 7C 02 00 00 00 00 00 00 0C 00 00 00 00 00 02 00 00 00 00 00 00 00 ..|. # kpartx -d /dev/loop0; losetup -d /dev/loop0 I changed location 0x1a to 0xFF on one or other or both partitions and it still will not boot in virt-manager. Cheers, Andy. -- Windows XP/2003 doesn't boot https://bugs.launchpad.net/bugs/586175 You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. Status in QEMU: Incomplete Status in “qemu-kvm” package in Ubuntu: New Status in Debian GNU/Linux: New Status in Fedora: Unknown Bug description: Hello everyone, my qemu doesn't boot any Windows XP/2003 installations if I try to boot the image. If I boot the install cd first, it's boot manager counts down and triggers the boot on it's own. That's kinda stupid. I'm using libvirt, but even by a simple > qemu-kvm -drive file=image.img,media=disk,if=ide,boot=on it won't boot. Qemu hangs at the message "Booting from Hard Disk..." I'm using qemu-kvm-0.12.4 with SeaBIOS 0.5.1 on Gentoo (No-Multilib and AMD64). It's a server, that means I'm using VNC as the primary graphic output but i don't think it should be an issue.
[Qemu-devel] [Bug 586175] Re: Windows XP/2003 doesn't boot
Andreas, The program that created the disk image seems confused, but it worked for creating a VM for FC11. Windows install seems to run fine, until wanting to boot from the drive it created. I don't know what creates the drive image and geometry, but it is broken. I think this is what I used to create the VM, but I have messed around with so many configurations and methods, I'm not sure what is what anymore. virt-install --connect qemu:///system -n w7 -r 2048 --vcpus=2 \ --disk path=/vm/w7.img,size=20,sparse=false,format=qcow2 \ -c /vm/w7cd.iso --vnc --noautoconsole \ --os-type windows --os-variant win7 --accelerate --network=bridge:br0 --hvm How many thousands of people have struggled with this and also got nowhere? It just looks like the virt-install developers have not tasted their own dogfood! LVM is supposed to be easy - just select vm image and boot, but the more I read about VMs, kvm, qemu, virtualbox, virsh etc, the more confused I get on how they relate to each other. testdisk reports this: ~~ Disk /dev/loop0 - 21 GB / 20 GiB - CHS 41943040 1 1 (wtf ??) Partition StartEndSize in sectors 1 * HPFS - NTFS 2048 206847 204800 [System Reserved] 2 P HPFS - NTFS 206848 41940991 41734144 Select 1: Disk /dev/loop0 - 21 GB / 20 GiB - CHS 41943040 1 1 Partition StartEndSize in sectors 1 * HPFS - NTFS 2048 206847 204800 [System Reserved] Boot sector Warning: Incorrect number of heads/cylinder 16 (NTFS) != 1 (HD) Warning: Incorrect number of sectors per track 63 (NTFS) != 1 (HD) Status: OK Backup boot sector Warning: Incorrect number of heads/cylinder 16 (NTFS) != 1 (HD) Warning: Incorrect number of sectors per track 63 (NTFS) != 1 (HD) Status: OK Sectors are identical. A valid NTFS Boot sector must be present in order to access any data; even if the partition is not bootable. ~~ Rebuild BS: ~~ Disk /dev/loop0 - 21 GB / 20 GiB - CHS 41943040 1 1 Partition StartEndSize in sectors 1 * HPFS - NTFS 2048 206847 204800 [System Reserved] filesystem size 204800 204800 sectors_per_cluster 8 8 mft_lcn 8533 8533 mftmirr_lcn 2 2 clusters_per_mft_record -10 -10 clusters_per_index_record 1 1 Extrapolated boot sector and current boot sector are different. ~~ Q Select 2: ~~ Disk /dev/loop0 - 21 GB / 20 GiB - CHS 41943040 1 1 Partition StartEndSize in sectors 2 P HPFS - NTFS 206848 41940991 41734144 Boot sector Warning: Incorrect number of heads/cylinder 16 (NTFS) != 1 (HD) Warning: Incorrect number of sectors per track 63 (NTFS) != 1 (HD) Status: OK Backup boot sector Warning: Incorrect number of heads/cylinder 16 (NTFS) != 1 (HD) Warning: Incorrect number of sectors per track 63 (NTFS) != 1 (HD) Status: OK Sectors are identical. A valid NTFS Boot sector must be present in order to access any data; even if the partition is not bootable. ~~ Rebuild BS: ~~ Disk /dev/loop0 - 21 GB / 20 GiB - CHS 41943040 1 1 Partition StartEndSize in sectors 2 P HPFS - NTFS 206848 41940991 41734144 filesystem size 41734144 41734144 sectors_per_cluster 8 8 mft_lcn 786432 786432 mftmirr_lcn 2 2 clusters_per_mft_record -10 -10 clusters_per_index_record 1 1 Extrapolated boot sector and current boot sector are different. ~~ It looks a mess. -- Windows XP/2003 doesn't boot https://bugs.launchpad.net/bugs/586175 You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. 
Status in QEMU: Incomplete Status in “qemu-kvm” package in Ubuntu: New Status in Debian GNU/Linux: New Status in Fedora: Unknown Bug description: Hello everyone, my qemu doesn't boot any Windows XP/2003 installations if I try to boot the image. If I boot the install cd first, it's boot manager counts down and triggers the boot on it's own. That's kinda stupid. I'm using libvirt, but even by a simple > qemu-kvm -drive file=image.img,media=disk,if=ide,boot=on it won't boot. Qemu hangs at the message "Booting from Hard Disk..." I'm using qemu-kvm-0.12.4 with SeaBIOS 0.5.1 on Gentoo (No-Multilib and AMD64). It's a server, that means I'm using VNC as the primary graphic output but i don't think it should be an issue.
Re: [Qemu-devel] [PATCH] ahci: enable pci bus master MemoryRegion before loading ahci engines
(0x7fcc4e19b4a0)[0]: cmd done ahci_port_write ahci(0x7fcc4e19b4a0)[0]: port write [reg:PxSERR] @ 0x30: 0x ahci_port_write ahci(0x7fcc4e19b4a0)[0]: port write [reg:PxIS] @ 0x10: 0x0001 ahci_mem_write_host ahci(0x7fcc4e19b4a0) write4 [reg:IS] @ 0x8: 0x0001 ahci_port_write ahci(0x7fcc4e19b4a0)[0]: port write [reg:PxCI] @ 0x38: 0x8000 handle_cmd_fis_dump ahci(0x7fcc4e19b4a0)[0]: FIS: 0x00: 27 80 ef 02 00 00 00 a0 00 00 00 00 00 00 00 00 0x10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ahci_cmd_done ahci(0x7fcc4e19b4a0)[0]: cmd done ahci_port_write ahci(0x7fcc4e19b4a0)[0]: port write [reg:PxSERR] @ 0x30: 0x ahci_port_write ahci(0x7fcc4e19b4a0)[0]: port write [reg:PxIS] @ 0x10: 0x0001 ahci_mem_write_host ahci(0x7fcc4e19b4a0) write4 [reg:IS] @ 0x8: 0x0001 ahci_port_write ahci(0x7fcc4e19b4a0)[0]: port write [reg:PxCI] @ 0x38: 0x0001 handle_cmd_fis_dump ahci(0x7fcc4e19b4a0)[0]: FIS: 0x00: 27 80 ec 00 00 00 00 a0 00 00 00 00 00 00 00 00 0x10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ahci_populate_sglist ahci(0x7fcc4e19b4a0)[0] ahci_dma_prepare_buf ahci(0x7fcc4e19b4a0)[0]: prepare buf limit=512 prepared=512 ahci_start_transfer ahci(0x7fcc4e19b4a0)[0]: reading 512 bytes on ata w/ sglist ahci_cmd_done ahci(0x7fcc4e19b4a0)[0]: cmd done ahci_port_write ahci(0x7fcc4e19b4a0)[0]: port write [reg:PxSERR] @ 0x30: 0x ahci_port_write ahci(0x7fcc4e19b4a0)[0]: port write [reg:PxIS] @ 0x10: 0x0001 ahci_port_write ahci(0x7fcc4e19b4a0)[0]: port write [reg:PxIS] @ 0x10: 0x0002 ahci_mem_write_host ahci(0x7fcc4e19b4a0) write4 [reg:IS] @ 0x8: 0x0001 --- -- Best regards, Andy Chiu On 2019/9/10 上午2:13, John Snow wrote: On 9/9/19 1:18 PM, andychiu via Qemu-devel wrote: If Windows 10 guests have enabled 'turn off hard disk after idle' option in power settings, and the guest has a SATA disk plugged in, the SATA disk will be turned off after a specified idle time. If the guest is live migrated or saved/loaded with its SATA disk turned off, the following error will occur: qemu-system-x86_64: AHCI: Failed to start FIS receive engine: bad FIS receive buffer address qemu-system-x86_64: Failed to load ich9_ahci:ahci qemu-system-x86_64: error while loading state for instance 0x0 of device ':00:1a.0/ich9_ahci' qemu-system-x86_64: load of migration failed: Operation not permitted Oof. That can't have been fun to discover. Observation from trace logs shows that a while after Windows 10 turns off a SATA disk (IDE disks don't have the following behavior), it will disable the PCI_COMMAND_MASTER flag of the pci device containing the ahci device. When the the disk is turning back on, the PCI_COMMAND_MASTER flag will be restored first. 
But if the guest is migrated or saved/loaded while the disk is off, the post_load callback of ahci device, ahci_state_post_load(), will fail at ahci_cond_start_engines() if the MemoryRegion pci_dev->bus_master_enable_region is not enabled, with pci_dev pointing to the PCIDevice struct containing the ahci device. This patch enables pci_dev->bus_master_enable_region before calling ahci_cond_start_engines() in ahci_state_post_load(), and restore the MemoryRegion to its original state afterwards.> This looks good to me from an AHCI perspective, but I'm not as clear on the implications of toggling the MemoryRegion, so I have some doubts. MST, can you chime in and clear my confusion? I suppose when the PCI_COMMAND_MASTER bit is turned off, we disable the memory region, as a guest would be unable to establish a new mapping in this time, so it makes sense that the attempt to map it fails. What's less clear to me is what happens to existing mappings when a region is disabled. Are they invalidated? If so, does it make sense that we are trying to establish a mapping here at all? Maybe it's absolutely correct that this fails. (I suppose, though, that the simple toggling of the region won't be a guest-visible event, so it's probably safe to do. Right?) What I find weird for AHCI is this: We try to engage the CLB mapping before the FIS mapping, but we fail at the FIS mapping. So why is PORT_CMD_FIS_RX set while PORT_CMD_START is unset? It