[Qemu-devel] [Bug 1490611] Re: Using qemu >=2.2.1 to convert raw->VHD (fixed) adds extra padding to the result file, which Microsoft Azure rejects as invalid
I'm using this version on xenial:

  andy@bastion:~/temp$ qemu-img -h
  qemu-img version 2.5.0 (Debian 1:2.5+dfsg-5ubuntu10.31), Copyright (c) 2004-2008 Fabrice Bellard

  qemu-img convert -f raw -O vpc -o subformat=fixed,force_size /tmp/azure_config_disk_image20180901-22672-16zxelu papapa2.vhd

Unfortunately the papapa2.vhd size is 25166336 != 25165824, which means it is not MiB-aligned. Could you please help?

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1490611

Title:
  Using qemu >=2.2.1 to convert raw->VHD (fixed) adds extra padding to the result file, which Microsoft Azure rejects as invalid

Status in QEMU: Fix Released
Status in qemu package in Ubuntu: Fix Released
Status in qemu source package in Xenial: Fix Released

Bug description:

  [Impact]

  * Starting with a raw disk image, using "qemu-img convert" to convert from raw to VHD results in the output VHD file's virtual size being aligned to the nearest 516096 bytes (16 heads x 63 sectors per head x 512 bytes per sector), instead of preserving the input file's size as the output VHD's virtual disk size.

  * Microsoft Azure requires that disk images (VHDs) submitted for upload have virtual sizes aligned to a megabyte boundary. (E.g. 4096MB, 4097MB, 4098MB, etc. are OK; 4096.5MB is rejected with an error.) This is reflected in Microsoft's documentation: https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-create-upload-vhd-generic/

  * The fix for this bug is a backport from upstream: http://git.qemu.org/?p=qemu.git;a=commitdiff;h=fb9245c2610932d33ce14

  [Test Case]

  * This is reproducible with the following set of commands (including the Azure command line tools from https://github.com/Azure/azure-xplat-cli). For the following example, I used qemu version 2.2.1:

    $ dd if=/dev/zero of=source-disk.img bs=1M count=4096

    $ stat source-disk.img
      File: ‘source-disk.img’
      Size: 4294967296   Blocks: 798656   IO Block: 4096   regular file
      Device: fc01h/64513d   Inode: 13247963   Links: 1
      Access: (0644/-rw-r--r--)  Uid: ( 1000/ smkent)   Gid: ( 1000/ smkent)
      Access: 2015-08-18 09:48:02.613988480 -0700
      Modify: 2015-08-18 09:48:02.825985646 -0700
      Change: 2015-08-18 09:48:02.825985646 -0700
      Birth: -

    $ qemu-img convert -f raw -o subformat=fixed -O vpc source-disk.img dest-disk.vhd

    $ stat dest-disk.vhd
      File: ‘dest-disk.vhd’
      Size: 4296499712   Blocks: 535216   IO Block: 4096   regular file
      Device: fc01h/64513d   Inode: 13247964   Links: 1
      Access: (0644/-rw-r--r--)  Uid: ( 1000/ smkent)   Gid: ( 1000/ smkent)
      Access: 2015-08-18 09:50:22.252077624 -0700
      Modify: 2015-08-18 09:49:24.424868868 -0700
      Change: 2015-08-18 09:49:24.424868868 -0700
      Birth: -

    $ azure vm image create testimage1 dest-disk.vhd -o linux -l "West US"
      info: Executing command vm image create
      + Retrieving storage accounts
      info: VHD size : 4097 MB
      info: Uploading 4195800.5 KB
      Requested:100.0% Completed:100.0% Running: 0 Time: 1m 0s Speed: 6744 KB/s
      info: https://[redacted].blob.core.windows.net/vm-images/dest-disk.vhd was uploaded successfully
      error: The VHD https://[redacted].blob.core.windows.net/vm-images/dest-disk.vhd has an unsupported virtual size of 4296499200 bytes. The size must be a whole number (in MBs).
      info: Error information has been recorded to /home/smkent/.azure/azure.err
      error: vm image create command failed

  * A fixed qemu-img will not result in an error during Azure image creation. It will require passing -o force_size, which will leverage the backported functionality.

  [Regression Potential]

  * The upstream fix introduces a qemu-img option (-o force_size) which is unset by default. The regression potential is very low as a result.

  ...

  I also ran the above commands using qemu 2.4.0, which resulted in the same error, as the conversion behavior is the same. However, qemu 2.1.1 and earlier (including qemu 2.0.0 installed by Ubuntu 14.04) does not pad the virtual disk size during conversion. Using qemu-img convert from qemu versions <=2.1.1 results in a VHD that is exactly the size of the raw input file plus 512 bytes (for the VHD footer). Those qemu versions do not attempt to realign the disk. As a result, Azure accepts VHD files created using those versions of qemu-img convert for upload.

  Is there a reason why newer qemu realigns the converted VHD file? It would be useful if an option were added to disable this feature, as current versions of qemu cannot be used to create VHD files for Azure using Microsoft's official instructions.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1490611/+subscriptions
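For readers hitting this on a qemu-img without the backported force_size option, a common workaround (a sketch, not taken from this bug report; file names are placeholders) is to grow the raw image to the next whole MiB before converting, so the vpc driver has nothing to pad:

  # round the raw image up to the next 1 MiB boundary (no-op if already aligned)
  size=$(stat -c %s source-disk.img)
  aligned=$(( (size + 1048576 - 1) / 1048576 * 1048576 ))
  truncate -s "$aligned" source-disk.img

  # then convert; on builds that have force_size, the virtual size is kept as-is
  qemu-img convert -f raw -O vpc -o subformat=fixed,force_size source-disk.img dest-disk.vhd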
[Qemu-devel] [Bug 1790268] [NEW] the vhd generated by qemu-img not align with MiB again.
Public bug reported:

I'm using this version on xenial:

  andy@bastion:~/temp$ qemu-img -h
  qemu-img version 2.5.0 (Debian 1:2.5+dfsg-5ubuntu10.31), Copyright (c) 2004-2008 Fabrice Bellard

Steps to reproduce:

  dd if=/dev/zero of=/tmp/azure_config_disk_image20180901-22672-16zxelu bs=1048576 count=24
  mkfs.ext4 -F /tmp/azure_config_disk_image20180901-22672-16zxelu -L azure_cfg_dsk
  sudo -n mount -o loop /tmp/azure_config_disk_image20180901-22672-16zxelu /tmp/azure_config_disk_mount66c11d7a-5f2b-4ed5-b959-3b48dbc42a2a20180901-22672-1ejreat
  sudo -n chown andy /tmp/azure_config_disk_mount66c11d7a-5f2b-4ed5-b959-3b48dbc42a2a20180901-22672-1ejreat
  mkdir -p /tmp/azure_config_disk_mount66c11d7a-5f2b-4ed5-b959-3b48dbc42a2a20180901-22672-1ejreat/configs
  sudo -n umount /tmp/azure_config_disk_mount66c11d7a-5f2b-4ed5-b959-3b48dbc42a2a20180901-22672-1ejreat
  qemu-img convert -f raw -O vpc -o subformat=fixed,force_size /tmp/azure_config_disk_image20180901-22672-16zxelu papapa2.vhd

Unfortunately the papapa2.vhd size is 25166336 != 25165824, which means it is not MiB-aligned. Could you please help?

** Affects: qemu
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1790268
Title: the vhd generated by qemu-img not align with MiB again.
Status in QEMU: New

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1790268/+subscriptions
[Qemu-devel] [Bug 1790268] Re: the vhd generated by qemu-img not align with MiB again.
The previous report of this issue, which was fixed, is https://bugs.launchpad.net/qemu/+bug/1490611

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1790268
Title: the vhd generated by qemu-img not align with MiB again.
Status in QEMU: New
[Qemu-devel] [Bug 1790268] Re: the vhd generated by qemu-img not align with MiB again.
And qemu-img even reports the format as raw:

  andy@bastion:~/temp$ qemu-img info papapa2.vhd
  image: papapa2.vhd
  file format: raw
  virtual size: 24M (25166336 bytes)
  disk size: 152K

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1790268
Title: the vhd generated by qemu-img not align with MiB again.
Status in QEMU: New
[Qemu-devel] [Bug 1790268] Re: the vhd generated by qemu-img not align with MB again.
** Summary changed:

- the vhd generated by qemu-img not align with MiB again.
+ the vhd generated by qemu-img not align with MB again.

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1790268
Title: the vhd generated by qemu-img not align with MB again.
Status in QEMU: New
[Qemu-devel] [Bug 1790268] Re: the vhd generated by qemu-img not align with MB again.
** Description changed:

  + and also seems it's not a vhd file:
  + andy@bastion:~/temp$ qemu-img info papapa2.vhd
  + image: papapa2.vhd
  + file format: raw
  + virtual size: 24M (25166336 bytes)
  + disk size: 152K
  +
    could you please help?

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1790268
Title: the vhd generated by qemu-img not align with MB again.
Status in QEMU: New
[Qemu-devel] [Bug 1790268] Re: the vhd generated by qemu-img not align with MB again.
** Changed in: qemu
       Status: New => Invalid

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1790268
Title: the vhd generated by qemu-img not align with MB again.
Status in QEMU: Invalid
[Qemu-devel] [Bug 1790268] Re: the vhd generated by qemu-img not align with MB again.
** Changed in: qemu
     Assignee: (unassigned) => Andy (andyliuliming)

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1790268
Title: the vhd generated by qemu-img not align with MB again.
Status in QEMU: Invalid
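A quick check, based only on the numbers in this report and the 512-byte VHD footer mentioned in bug 1490611 (this is a sketch, not an official resolution note), suggests why the report was closed as Invalid: the extra 512 bytes are exactly one fixed-VHD footer, and format probing ignores a footer that sits at the end of the file unless the format is named explicitly:

  # 25166336 - 25165824 = 512, i.e. the size of a fixed-VHD footer
  expr 25166336 - 25165824

  # a fixed VHD is raw data plus a trailing 512-byte footer, so probing
  # reports "raw"; ask for the vpc driver explicitly to see the VHD view
  qemu-img info -f vpc papapa2.vhd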
[Qemu-devel] [Bug 1025244] Re: qcow2 image increasing disk size above the virtual limit
Any solution right now? I have a problem similar to Todor Andreev's. Our daily backup of some virtual machines (qcow2) looks like this (see the command sketch after this message):

1. Shut down the VM.
2. Create a snapshot via: "qemu-img snapshot -c nameofsnapshot..."
3. Boot the VM.
4. Back up the snapshot to another virtual disk via: "qemu-img convert -f qcow2 -O qcow2 -s nameofsnapshot..."
5. Delete the snapshot from the VM via: qemu-img snapshot -d nameofsnapshot...

But the problem is that our original VM image keeps growing steadily, although few changes were made?!

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1025244

Title: qcow2 image increasing disk size above the virtual limit
Status in QEMU: New
Status in “qemu-kvm” package in Ubuntu: Triaged

Bug description:

  Using qemu/kvm, qcow2 images, ext4 file systems on both guest and host.
  Host and Guest: Ubuntu server 12.04 64bit

  To create an image I did this:

    qemu-img create -f qcow2 -o preallocation=metadata ubuntu-pdc-vda.img 10737418240
    (not sure about the exact bytes, but around this)
    ls -l ubuntu-pdc-vda.img
    fallocate -l theSizeInBytesFromAbove ubuntu-pdc-vda.img

  The problem is that the image is growing progressively and has obviously no limit, although I gave it one. The root filesystem's image is the same case:

    qemu-img info ubuntu-pdc-vda.img
    image: ubuntu-pdc-vda.img
    file format: qcow2
    virtual size: 10G (10737418240 bytes)
    disk size: 14G
    cluster_size: 65536

  and for confirmation:

    du -sh ubuntu-pdc-vda.img
    15G ubuntu-pdc-vda.img

  I made a test and saw that when I delete something from the guest, the real size of the image is not decreasing (I read it is normal). OK, but when I write something again, it doesn't use the freed space, but instead grows the image. So for example:

  1. The initial physical size of the image is 1GB.
  2. I copy 1GB of data in the guest. Its physical size becomes 2GB.
  3. I delete this data (1GB). The physical size of the image remains 2GB.
  4. I copy another 1GB of data to the guest.
  5. The physical size of the image becomes 3GB.
  6. And so on with no limit. It doesn't care if the virtual size is less.

  Is this normal - the real/physical size of the image to be larger than the virtual limit???

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1025244/+subscriptions
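A minimal sketch of that backup cycle with concrete placeholder names (the VM name, snapshot name, and the use of virsh are assumptions, not from the report); the -s option to qemu-img convert selects which internal snapshot gets copied:

  # 1-2: stop the VM (however it is managed) and take an internal snapshot
  virsh shutdown myvm
  qemu-img snapshot -c nightly ubuntu-pdc-vda.img

  # 3: start the VM again
  virsh start myvm

  # 4: copy the snapshot's contents into a standalone backup image
  qemu-img convert -f qcow2 -O qcow2 -s nightly ubuntu-pdc-vda.img backup-$(date +%F).qcow2

  # 5: drop the internal snapshot (a later comment in this thread reports
  #    that doing this while the VM is shut down avoids the growth)
  qemu-img snapshot -d nightly ubuntu-pdc-vda.img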
[Qemu-devel] Which qemu version to use for CentOS6?
Folks, I have a project that needs to run on CentOS 6.3, but the qemu that comes with that is 0.12, and I'm sure that I should apply a newer version. I had backported the qemu 1.0-17 from Fedora 17, but I see that the upstream qemu-1.1 is stable. Are there kernel dependencies that would prevent running qemu-1.1 on the 2.6.32 CentOS 6 kernel? What would you recommend? Andy
Re: [Qemu-devel] Which qemu version to use for CentOS6?
Stefan,

That is what I needed to know. Thanks. It was somewhat surprising to see how many patches have been backported to the CentOS 6.3 source rpm.

Andy

-----Original Message-----
From: Stefan Hajnoczi [mailto:stefa...@gmail.com]
Sent: Monday, February 25, 2013 9:41 AM
To: Andy Cress
Cc: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] Which qemu version to use for CentOS6?

On Mon, Feb 25, 2013 at 02:24:01PM +, Andy Cress wrote:
> I have a project that needs to run on CentOS 6.3, but the qemu that comes with that is 0.12, and I'm sure that I should apply a newer version.
> I had backported the qemu 1.0-17 from Fedora 17, but I see that the upstream qemu-1.1 is stable. Are there kernel dependencies that would prevent running qemu-1.1 on the 2.6.32 CentOS 6 kernel?

The qemu-kvm in CentOS 6.3 is not a vanilla QEMU 0.12. Look at the source rpm and you'll see a lot of features and fixes have been backported. Is there a feature missing that forced you to build your own qemu-kvm? If not, don't worry about the version number and use the distro package.

Stefan
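One way to see those backports for yourself (a sketch; the package name is the CentOS 6 default and yumdownloader comes from yum-utils) is to read the package changelog or count the patches shipped in the source rpm:

  # list the fixes Red Hat has folded into the 0.12-based package
  rpm -q --changelog qemu-kvm | less

  # or fetch the source rpm and count its patch files
  yumdownloader --source qemu-kvm
  rpm -qpl qemu-kvm-*.src.rpm | grep -c '\.patch$'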
Re: [Qemu-devel] [PATCH 10/17] mm: rmap preparation for remap_anon_pages
On Tue, Oct 7, 2014 at 8:52 AM, Andrea Arcangeli wrote: > On Tue, Oct 07, 2014 at 04:19:13PM +0200, Andrea Arcangeli wrote: >> mremap like interface, or file+commands protocol interface. I tend to >> like mremap more, that's why I opted for a remap_anon_pages syscall >> kept orthogonal to the userfaultfd functionality (remap_anon_pages >> could be also used standalone as an accelerated mremap in some >> circumstances) but nothing prevents to just embed the same mechanism > > Sorry for the self followup, but something else comes to mind to > elaborate this further. > > In term of interfaces, the most efficient I could think of to minimize > the enter/exit kernel, would be to append the "source address" of the > data received from the network transport, to the userfaultfd_write() > command (by appending 8 bytes to the wakeup command). Said that, > mixing the mechanism to be notified about userfaults with the > mechanism to resolve an userfault to me looks a complication. I kind > of liked to keep the userfaultfd protocol is very simple and doing > just its thing. The userfaultfd doesn't need to know how the userfault > was resolved, even mremap would work theoretically (until we run out > of vmas). I thought it was simpler to keep it that way. However if we > want to resolve the fault with a "write()" syscall this may be the > most efficient way to do it, as we're already doing a write() into the > pseudofd to wakeup the page fault that contains the destination > address, I just need to append the source address to the wakeup command. > > I probably grossly overestimated the benefits of resolving the > userfault with a zerocopy page move, sorry. So if we entirely drop the > zerocopy behavior and the TLB flush of the old page like you > suggested, the way to keep the userfaultfd mechanism decoupled from > the userfault resolution mechanism would be to implement an > atomic-copy syscall. That would work for SIGBUS userfaults too without > requiring a pseudofd then. It would be enough then to call > mcopy_atomic(userfault_addr,tmp_addr,len) with the only constraints > that len must be a multiple of PAGE_SIZE. Of course mcopy_atomic > wouldn't page fault or call GUP into the destination address (it can't > otherwise the in-flight partial copy would be visible to the process, > breaking the atomicity of the copy), but it would fill in the > pte/trans_huge_pmd with the same strict behavior that remap_anon_pages > currently has (in turn it would by design bypass the VM_USERFAULT > check and be ideal for resolving userfaults). At the risk of asking a possibly useless question, would it make sense to splice data into a userfaultfd? --Andy > > mcopy_atomic could then be also extended to tmpfs and it would work > without requiring the source page to be a tmpfs page too without > having to convert page types on the fly. > > If I add mcopy_atomic, the patch in subject (10/17) can be dropped of > course so it'd be even less intrusive than the current > remap_anon_pages and it would require zero TLB flush during its > runtime (it would just require an atomic copy). > > So should I try to embed a mcopy_atomic inside userfault_write or can > I expose it to userland as a standalone new syscall? Or should I do > something different? Comments? > > Thanks, > Andrea -- Andy Lutomirski AMA Capital Management, LLC
Re: [Qemu-devel] [PATCH RFC 00/11] qemu: towards virtio-1 host support
On 10/07/2014 07:39 AM, Cornelia Huck wrote: > This patchset aims to get us some way to implement virtio-1 compliant > and transitional devices in qemu. Branch available at > > git://github.com/cohuck/qemu virtio-1 > > I've mainly focused on: > - endianness handling > - extended feature bits > - virtio-ccw new/changed commands At the risk of some distraction, would it be worth thinking about a solution to the IOMMU bypassing mess as part of this? --Andy
Re: [Qemu-devel] tcmu-runner and QEMU
On 08/29/2014 10:22 AM, Benoît Canet wrote:
> The truth is that QEMU block drivers don't know how to do much on their own so we probably must bring the whole QEMU block layer in a tcmu-runner handler plugin.

Woah! Really? ok...

> Another reason to do this is that the QEMU block layer brings features like taking snapshots or streaming snaphots that a cloud provider would want to keep while exporting QCOW2 as ISCSI or FCOE.
> Doing these operations is usually done by passing something like "--qmp tcp:localhost,,server,nowait" as a QEMU command line argument then connecting on this JSON processing socket then send orders to QEMU.

The LIO TCMU backend and tcmu-runner provide for a configstring that is associated with a given backstore. This is made available to the handler, and sounds like just what qmp needs.

> I made some patches to split this QMP machinery from the QEMU binary but still I don't know how a tcmu-runner plugin handler would be able to receive this command line configuration.

The flow would be (see the targetcli sketch after this message):

1) admin configures a LIO backstore of type "user", size 10G, and gives it a configstring like "qmp/tcp:localhost,,server,nowait"
2) admin exports the backstore via whatever LIO-supported fabric(s) (e.g. iSCSI)
3) tcmu-runner is notified of the new user backstore from step 1, finds the handler associated with "qmp", calls handler->open("tcp:localhost,,server,nowait")
4) qmp handler parses string and does whatever it needs to do
5) handler receives SCSI commands as they arrive

> The second problem is that the QEMU block layer is big and filled with scary stuff like threads and coroutines but I think only trying to write the tcmu-runner handler will tell if it's doable.

Yeah, could be tricky but would be pretty cool if it works.

Let me know how I can help, or with any questions.

Regards -- Andy
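A rough sketch of what the admin-side steps 1) and 2) above might look like with targetcli. Everything here is an assumption: the "qmp" user-backstore subtype, the create argument names, and the example IQN are placeholders, not existing tcmu-runner handlers or verified syntax:

  # 1) create a LIO userspace-passthrough backstore handled by a "qmp" plugin
  targetcli /backstores/user:qmp create name=disk1 size=10G cfgstring=tcp:localhost,,server,nowait

  # 2) export it over iSCSI
  targetcli /iscsi create iqn.2014-09.org.example:disk1
  targetcli /iscsi/iqn.2014-09.org.example:disk1/tpg1/luns create /backstores/user:qmp/disk1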
Re: [Qemu-devel] tcmu-runner and QEMU
On 08/29/2014 11:51 AM, Benoît Canet wrote: QMP is just a way to control QEMU via a socket: it is not particularly block related. On the other hand bringing the whole block layers into a tcmu-runner handler would mean that there would be _one_ QMP socket opened (by mean of wonderfull QEMU modules static variables :) to control multiple block devices exported. So I think the configuration passed must be done before an individual open occurs: being global to the .so implementing the tcmu-runner handler. But I don't see how to do it with the current API. This discussion leads me to think we need to step back and discuss our requirements. I am looking for flexible backstores for SCSI-based fabrics, with as little new code as possible. I think you are looking for a way to export QEMU block devices over iSCSI and other fabrics? I don't think making a LIO userspace handler into basically a full-fledged secondary QEMU server instance is the way to go. What I think better serves your requirements is to enable QEMU to configure LIO. In a previous email you wrote: Another reason to do this is that the QEMU block layer brings features like taking snapshots or streaming snaphots that a cloud provider would want to keep while exporting QCOW2 as ISCSI or FCOE. Whether a volume is exported over iSCSI or FCoE or not shouldn't affect how it is managed. QMP commands should go to the single QEMU server, which can then optionally configure LIO to export the volume. That leaves us with the issue that we'd need to arbitrate access to the backing file if taking a streaming snapshot (qemu and tcmu-runner processes both accessing the img), but that should be straightforward, or at least work that can be done in a second phase of development. Thoughts? Regards -- Andy p.s. offline Monday.
Re: [Qemu-devel] tcmu-runner and QEMU
On 08/30/2014 09:02 AM, Richard W.M. Jones wrote: On Sat, Aug 30, 2014 at 05:53:43PM +0200, Benoît Canet wrote: If the cloud provider want to be able to boot QCOW2 or QED images on bare metal machines he will need to export QCOW2 or QED images on the network. So far only qemu-nbd allows to do this and it is neither well performing nor really convenient to boot on a bare metal machine. So I think what you want is a `qemu-iscsi'? ie. the same as qemu-nbd, but with an iSCSI frontend (to replace the NBD server). You want qemu to be able to issue SCSI commands over iSCSI? I thought qemu used libiscsi for this, to be the initiator. What Benoit and I have been discussing is the other side, enabling qemu to configure LIO to handle requests from other initiators (either VMs or iron) over iSCSI or FCoE, but backed by qcow2 disk images. The problem being LIO doesn't speak qcow2 yet. I guess so. Are you planning to integrate bits of LIO into qemu, or bits of qemu into LIO? My current thinking is 1) enable qemu to configure the LIO kernel target (it's all straightforward via configfs, but add a nice library to qemu to hide the details) and 2) enable LIO to use qcow2 and other formats besides raw images to back exported LUNs. This is where the LIO userspace passthrough and tcmu-runner come in, because we want to do this in userspace, not as kernel code, so we have to pass SCSI commands up to a userspace helper daemon. The latter has been tried various times, without much success. See the many examples of people trying to make the qemu block driver code into a separate library, and failing. What's been the sticking point? Regards -- Andy
Re: [Qemu-devel] tcmu-runner and QEMU
On 09/02/2014 02:25 AM, Stefan Hajnoczi wrote: The easiest approach is to write a tool similar to qemu-nbd that speaks the userspace target protocol (i.e. mmap the shared memory). If the tcmu setup code is involved, maybe providing a libtcmu with the setup code would be useful. I suspect that other projects may want to integrate userspace target support too. It's easier to let people add it to their codebase rather than hope they bring their codebase into tcmu-runner. What other projects were you thinking of? From my perspective, QEMU is singular. QEMU's block support seems to cover just about everything, even ceph, gluster, and sheepdog! We certainly don't want to duplicate that code so a qemu-lio-tcmu in qemu.git like qemu-nbd, basically statically linking the BlockDriver object files, sounds like the first thing to try. We can make tcmu-runner a library (libtcmu) if it makes sense, but let's do some work to try the current way and see how it goes before "flipping" it. > The qemu-lio tool would live in the QEMU codebase and reuse all the > infrastructure. For example, it could include a QMP monitor just like > the one you are adding to qemu-nbd. Benoit and I talked a little about QMP on another part of the thread... I said I didn't think we needed a QMP monitor in qemu-lio-tcmu, but let me spin up on qemu a little more and I'll be able to speak more intelligently. -- Andy
Re: [Qemu-devel] tcmu-runner and QEMU
On 09/04/2014 06:24 AM, Benoît Canet wrote: There are other commands for snapshots and backup which are issued via QMP. It might even make sense to make the tcmu interface available at run-time in QEMU like the run-time NBD server. This allows you to get at read-only point-in-time snapshots while the guest is accessing the disk. See the nbd-server-start command in qapi/block.json. Stefan Andy: ping I hope we didn't scaried you with our monster block backend and it's associated QMP socket ;) Hi Benoît, No, I've gone off to work on a initial proof-of-concept implementation of a qemu-lio-tcmu.so module, hopefully it'll be ready to look at shortly and then we can shoot arrows at it. :) But in the meantime, do you have a use case or user story for the QMP support that might help me understand better how it might all fit together? Regards -- Andy
[Qemu-devel] [PATCH qemu] i386, linux-headers: Add support for kvm_get_rng_seed
This updates x86's kvm_para.h for the feature bit definition and target-i386/cpu.c for the feature name and default.

Signed-off-by: Andy Lutomirski
---
 linux-headers/asm-x86/kvm_para.h | 2 ++
 target-i386/cpu.c                | 5 +++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/linux-headers/asm-x86/kvm_para.h b/linux-headers/asm-x86/kvm_para.h
index e41c5c1..a9b27ce 100644
--- a/linux-headers/asm-x86/kvm_para.h
+++ b/linux-headers/asm-x86/kvm_para.h
@@ -24,6 +24,7 @@
 #define KVM_FEATURE_STEAL_TIME 5
 #define KVM_FEATURE_PV_EOI 6
 #define KVM_FEATURE_PV_UNHALT 7
+#define KVM_FEATURE_GET_RNG_SEED 8
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -40,6 +41,7 @@
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 #define MSR_KVM_STEAL_TIME 0x4b564d03
 #define MSR_KVM_PV_EOI_EN 0x4b564d04
+#define MSR_KVM_GET_RNG_SEED 0x4b564d05
 
 struct kvm_steal_time {
 	__u64 steal;
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 8fd1497..4ea7e6c 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -236,7 +236,7 @@ static const char *ext4_feature_name[] = {
 static const char *kvm_feature_name[] = {
     "kvmclock", "kvm_nopiodelay", "kvm_mmu", "kvmclock",
     "kvm_asyncpf", "kvm_steal_time", "kvm_pv_eoi", "kvm_pv_unhalt",
-    NULL, NULL, NULL, NULL,
+    "kvm_get_rng_seed", NULL, NULL, NULL,
     NULL, NULL, NULL, NULL,
     NULL, NULL, NULL, NULL,
     NULL, NULL, NULL, NULL,
@@ -368,7 +368,8 @@ static uint32_t kvm_default_features[FEATURE_WORDS] = {
         (1 << KVM_FEATURE_ASYNC_PF) |
         (1 << KVM_FEATURE_STEAL_TIME) |
         (1 << KVM_FEATURE_PV_EOI) |
-        (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT),
+        (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
+        (1 << KVM_FEATURE_GET_RNG_SEED),
     [FEAT_1_ECX] = CPUID_EXT_X2APIC,
 };
-- 
1.9.3
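Assuming this patch (and the matching KVM host support) is applied, the new bit should be toggleable like the other kvm_* feature names in the table above via the usual -cpu flag syntax; the command lines below are an illustration, not taken from the patch, and the trailing dots stand for the rest of the command line:

  # expose or mask the paravirt RNG-seed feature to the guest
  qemu-system-x86_64 -enable-kvm -cpu host,+kvm_get_rng_seed ...
  qemu-system-x86_64 -enable-kvm -cpu host,-kvm_get_rng_seed ...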
[Qemu-devel] [Bug 784977] [NEW] qemu-img convert fails to convert, generates a 512byte file output
Public bug reported:

I have a VMware image, so I have files like 'Ubuntu.vmdk', and want to convert it to VirtualBox .vdi format using qemu. The first stage of extracting the image with 'qemu-img convert Ubuntu.vmdk output.bin' just generates a 512-byte file:

{quote}
# Disk DescriptorFile
version=1
CID=36be9761
parentCID=
createType="twoGbMaxExtentSparse"

# Extent description
RW 4192256 SPARSE "Ubuntu-s001.vmdk"
RW 4192256 SPARSE "Ubuntu-s002.vmdk"
RW 4192256 SPARSE "Ubuntu-s003.vmdk"
RW 4192256 SPARSE "Ubuntu-s004.vmdk"
RW 4192256 SPARSE "Ubuntu-s005.vmdk"
RW 4192256 SPARSE "Ubuntu-s006.vmdk"
RW 4192256 SPARSE "Ubuntu-s007.vmdk"
RW 4192256 SPARSE "Ubuntu-s008.vmdk"
RW 4192256 SPARSE "Ubuntu-s009.vmdk"
RW 4192256 SPARSE "Ubuntu-s010.vmdk"
RW 20480 SPARSE "Ubunt
{quote}

Here is the input Ubuntu.vmdk file:

{quote}
# Disk DescriptorFile
version=1
CID=36be9761
parentCID=
createType="twoGbMaxExtentSparse"

# Extent description
RW 4192256 SPARSE "Ubuntu-s001.vmdk"
RW 4192256 SPARSE "Ubuntu-s002.vmdk"
RW 4192256 SPARSE "Ubuntu-s003.vmdk"
RW 4192256 SPARSE "Ubuntu-s004.vmdk"
RW 4192256 SPARSE "Ubuntu-s005.vmdk"
RW 4192256 SPARSE "Ubuntu-s006.vmdk"
RW 4192256 SPARSE "Ubuntu-s007.vmdk"
RW 4192256 SPARSE "Ubuntu-s008.vmdk"
RW 4192256 SPARSE "Ubuntu-s009.vmdk"
RW 4192256 SPARSE "Ubuntu-s010.vmdk"
RW 20480 SPARSE "Ubuntu-s011.vmdk"

# The Disk Data Base
#DDB

ddb.toolsVersion = "7240"
ddb.adapterType = "lsilogic"
ddb.geometry.sectors = "63"
ddb.geometry.heads = "255"
ddb.geometry.cylinders = "2610"
ddb.virtualHWVersion = "6"
{quote}

No stack trace or other output was found.

Is there anything I can add? (Other than the 20G VM image to reproduce, which I'll be happy to provide.)

** Affects: qemu
   Importance: Undecided
   Status: New

** Tags: convert quemu-img vmdk

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/784977
Title: qemu-img convert fails to convert, generates a 512byte file output
Status in QEMU: New
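A hedged suggestion rather than a confirmed fix: naming both formats explicitly rules out the descriptor file being mis-probed as raw, and requires the Ubuntu-s0*.vmdk extent files to sit next to the descriptor. File names follow the report; whether this qemu version handles twoGbMaxExtentSparse images, and whether this build has VDI write support, are exactly the open questions here:

  # run from the directory that also contains Ubuntu-s001.vmdk ... Ubuntu-s011.vmdk
  qemu-img info -f vmdk Ubuntu.vmdk        # should report the full virtual size, not ~512 bytes
  qemu-img convert -f vmdk -O vdi Ubuntu.vmdk output.vdi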
[Qemu-devel] Framebuffer corruption in QEMU or Linux's cirrus driver
Running: ./virtme-run --installed-kernel from this virtme commit: https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git/commit/?id=2b409a086d15b7a878c7d5204b1f44a6564a341f results in a bunch of missing lines of text once bootup finishes. Pressing enter a few times gradually fixes it. I don't know whether this is a qemu bug or a Linux bug. I'm seeing this on Fedora's 3.13.7 kernel and on a fairly recent 3.14-rc kernel. For the latter, cirrus is built-in (not a module), I'm running: virtme-run --kimg arch/x86/boot/bzImage and I see more profound corruption. --Andy
Re: [Qemu-devel] Framebuffer corruption in QEMU or Linux's cirrus driver
On Tue, Apr 1, 2014 at 3:09 PM, Andy Lutomirski wrote: > Running: > > ./virtme-run --installed-kernel > > from this virtme commit: > > https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git/commit/?id=2b409a086d15b7a878c7d5204b1f44a6564a341f > > results in a bunch of missing lines of text once bootup finishes. > Pressing enter a few times gradually fixes it. > > I don't know whether this is a qemu bug or a Linux bug. > > I'm seeing this on Fedora's 3.13.7 kernel and on a fairly recent > 3.14-rc kernel. For the latter, cirrus is built-in (not a module), > I'm running: > > virtme-run --kimg arch/x86/boot/bzImage > > and I see more profound corruption. I'm guessing this is a cirrus drm bug. bochs-drm (using virtme-run --installed-kernel --qemu-opts -vga std) does not appear to have the same issue. Neither does qxl. (qxl is painfully slow, though, and it doesn't seem to be using UC memory.) --Andy
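For anyone reproducing the comparison above, the three configurations map to qemu's -vga switch, passed through virtme's --qemu-opts (a sketch; cirrus was the default VGA device at the time):

  ./virtme-run --installed-kernel                          # default cirrus: corruption observed
  ./virtme-run --installed-kernel --qemu-opts -vga std     # bochs-drm: no corruption observed
  ./virtme-run --installed-kernel --qemu-opts -vga qxl     # qxl: no corruption, but slow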
[Qemu-devel] Turning off default storage devices?
Currently, -M q35 boots linux quite a bit slower than the default machine type. This seems to be because it takes a few hundred ms to determine that there's nothing attached to the AHCI controller. In virtio setups, there will probably never be anything attached to the AHCI controller. Would it be possible to add something like -machine default_storage=off to turn off default storage devices? This could include the AHCI on q35 and the cdrom and such on pc. There's precedent: -machine usb=off turns off the default USB controllers, which is great for setups that use xhci. Thanks, Andy
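To make the proposal concrete: the first line below shows the option being asked for (proposed only, it does not exist), while the second shows the usb=off precedent that does exist; the trailing dots stand for the rest of the command line:

  # proposed, not implemented: suppress the built-in AHCI/IDE devices
  qemu-system-x86_64 -M q35 -machine default_storage=off ...

  # existing precedent: suppress the default USB controllers
  qemu-system-x86_64 -M q35 -machine usb=off ...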
Re: [Qemu-devel] Turning off default storage devices?
On Wed, Apr 9, 2014 at 4:53 PM, Peter Crosthwaite wrote: > Hi Andy, > > On Thu, Apr 10, 2014 at 5:55 AM, Andy Lutomirski wrote: >> Currently, -M q35 boots linux quite a bit slower than the default >> machine type. This seems to be because it takes a few hundred ms to >> determine that there's nothing attached to the AHCI controller. >> >> In virtio setups, there will probably never be anything attached to >> the AHCI controller. Would it be possible to add something like >> -machine default_storage=off to turn off default storage devices? >> This could include the AHCI on q35 and the cdrom and such on pc. >> >> There's precedent: -machine usb=off turns off the default USB >> controllers, which is great for setups that use xhci. >> > > Is there a more generic solution to your problem? Can you implement > command line device removal in a non specific way and avoid having to > invent AHCI or even "storage" specific arguments. You could > considering bringing the xhci use case you mentioned under the same > umbrella. An option like -suppress-default-device foobar to turn off the device named foobar would work, but what happens if that device is a bus? Will this just cause QEMU to crash? Maybe the machine code would have to opt in to allowing this kind of suppression, and there could be a general error of you try to suppress a device that can't be suppressed. I can try to code this up, but I know nothing about QEMU internals. I'm just a user :) --Andy
Re: [Qemu-devel] Turning off default storage devices?
On Wed, Apr 9, 2014 at 8:13 PM, Peter Crosthwaite wrote:
> On Thu, Apr 10, 2014 at 9:57 AM, Andy Lutomirski wrote:
>> On Wed, Apr 9, 2014 at 4:53 PM, Peter Crosthwaite wrote:
>>> Hi Andy,
>>>
>>> On Thu, Apr 10, 2014 at 5:55 AM, Andy Lutomirski wrote:
>>>> Currently, -M q35 boots linux quite a bit slower than the default machine type. This seems to be because it takes a few hundred ms to determine that there's nothing attached to the AHCI controller.
>>>>
>>>> In virtio setups, there will probably never be anything attached to the AHCI controller. Would it be possible to add something like -machine default_storage=off to turn off default storage devices? This could include the AHCI on q35 and the cdrom and such on pc.
>>>>
>>>> There's precedent: -machine usb=off turns off the default USB controllers, which is great for setups that use xhci.
>>>>
>>> Is there a more generic solution to your problem? Can you implement command line device removal in a non specific way and avoid having to invent AHCI or even "storage" specific arguments. You could considering bringing the xhci use case you mentioned under the same umbrella.
>>
>> An option like -suppress-default-device foobar to turn off the device named foobar would work, but what happens if that device is a bus?
>
> Lets call that a misuse in the first instance. But in general, when attaching devices QEMU should be able to gracefully fail on unresolved deps. So it would be reasonable to work on that assumption given that every device should be able to handle a missing bus/gpio/interrupt etc. due to -device misuseability.
>
>> Will this just cause QEMU to crash? Maybe the machine code would have to opt in to allowing this kind of suppression, and there could be a general error of you try to suppress a device that can't be suppressed.
>
> I would argue that there is no such thing. You may end up with a useless machine but its still valid to supress something and then by extension all its dependants are non functional.

The q35 code is:

    /* ahci and SATA device, for q35 1 ahci controller is built-in */
    ahci = pci_create_simple_multifunction(host_bus,
                                           PCI_DEVFN(ICH9_SATA1_DEV,
                                                     ICH9_SATA1_FUNC),
                                           true, "ich9-ahci");
    idebus[0] = qdev_get_child_bus(&ahci->qdev, "ide.0");
    idebus[1] = qdev_get_child_bus(&ahci->qdev, "ide.1");

It looks like making pci_create_simple_multifunction return null will crash quite quickly. Even fixing the next two lines will just cause null pointer dereferences later on. Is there a different way to indicate that a device wasn't actually created?

--Andy
Re: [Qemu-devel] Turning off default storage devices?
On Mon, Apr 14, 2014 at 1:15 AM, Markus Armbruster wrote: > Peter Crosthwaite writes: > >> Hi Andy, >> >> On Thu, Apr 10, 2014 at 5:55 AM, Andy Lutomirski wrote: >>> Currently, -M q35 boots linux quite a bit slower than the default >>> machine type. This seems to be because it takes a few hundred ms to >>> determine that there's nothing attached to the AHCI controller. >>> >>> In virtio setups, there will probably never be anything attached to >>> the AHCI controller. Would it be possible to add something like >>> -machine default_storage=off to turn off default storage devices? >>> This could include the AHCI on q35 and the cdrom and such on pc. >>> >>> There's precedent: -machine usb=off turns off the default USB >>> controllers, which is great for setups that use xhci. >>> >> >> Is there a more generic solution to your problem? Can you implement >> command line device removal in a non specific way and avoid having to >> invent AHCI or even "storage" specific arguments. You could >> considering bringing the xhci use case you mentioned under the same >> umbrella. > > USB has always been off by default, at least for the boards I'm familiar > with, due to the USB emulation's non-trivial CPU use. > > There's no such thing as a Q35 board without USB in the physical world. > Can't stop us from making a virtual one, of course. > > Likewise, there's no such thing as a Q35 board without AHCI in the > physical world, and again that can't stop us from making a virtual one. > > The difference to USB is that our q35 machines have always had AHCI even > with -nodefaults. You seem to propose adding a switch to disable AHCI, > yet leave it enabled with -nodefaults. > > -nodefaults should give you a board with all the optional components > suppressed. Will this break libvirt, which may expect -nodefaults to still come with an IDE bus? > > On the one hand, I'd rather not add exceptions to -nodefaults "give me > the board with all its optional components suppressed" semantics. > > On the other hand, a few hundred ms are a long time. That's why I proposed a new option. Yes, it's ugly :/ --Andy
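A quick way to see the point being debated here (a sketch using the standard HMP monitor) is to start an otherwise empty q35 machine and list its device tree; as noted above, the built-in ICH9 AHCI controller is present even with -nodefaults:

  qemu-system-x86_64 -M q35 -nodefaults -display none -monitor stdio
  (qemu) info qtree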
[Qemu-devel] [Bug 1349277] Re: AArch64 emulation ignores SPSel=0 when taking (or returning from) an exception at EL1 or greater
** Also affects: qemu (Ubuntu)
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1349277

Title:
  AArch64 emulation ignores SPSel=0 when taking (or returning from) an exception at EL1 or greater

Status in QEMU: New
Status in qemu package in Ubuntu: New

Bug description:

  The AArch64 emulation ignores SPSel=0 when:

  (1) taking an interrupt from an exception level greater than EL0 (e.g., EL1t),
  (2) returning from an exception (via ERET) to an exception level greater than EL0 (e.g., EL1t), with SPSR_ELx[SPSel]=0.

  The attached patch fixes the problem in my application.

  Background: I'm running a standalone application (toy OS) that is performing preemptive multithreading between threads running at EL1t, with exception handling / context switching occurring at EL1h. This bug causes the stack pointer to be corrupted in the threads running at EL1t (they end up with a version of the EL1h stack pointer (SP_EL1)).

  Occurs in:
    qemu-2.1.0-rc1 (found in)
    commit c60a57ff497667780132a3fcdc1500c83af5d5c0 (current master)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1349277/+subscriptions
Re: [Qemu-devel] [PATCH v3 1/4] firmware: introduce sysfs driver for QEMU's fw_cfg device
On Sat, Oct 3, 2015 at 4:28 PM, Gabriel L. Somlo wrote: > From: Gabriel Somlo > > Make fw_cfg entries of type "file" available via sysfs. Entries > are listed under /sys/firmware/qemu_fw_cfg/by_key, in folders > named after each entry's selector key. Filename, selector value, > and size read-only attributes are included for each entry. Also, > a "raw" attribute allows retrieval of the full binary content of > each entry. > > This patch also provides a documentation file outlining the > guest-side "hardware" interface exposed by the QEMU fw_cfg device. > What's the status of "by_name"? There's a single (presumably incorrect) mention of it in a comment in this patch. I would prefer if the kernel populated by_name itself rather than deferring that to udev, since I'd like to use this facility in virtme, and I'd like to use fw_cfg very early on boot before I even start udev. --Andy
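For context, a sketch of how the sysfs layout described above would be browsed from inside a guest, assuming the per-entry attribute names match the patch description (name, size, raw under each selector-key directory); the key number 32 is an arbitrary example, and by_name is the part whose status is being questioned here:

  # list every fw_cfg entry the guest can see: selector key, size, file name
  for d in /sys/firmware/qemu_fw_cfg/by_key/*; do
      printf '%s\t%s\t%s\n' "${d##*/}" "$(cat "$d/size")" "$(cat "$d/name")"
  done

  # dump one entry's payload
  hexdump -C /sys/firmware/qemu_fw_cfg/by_key/32/raw | head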
[Qemu-devel] [Bug 1025244] Re: qcow2 image increasing disk size above the virtual limit
Thanks for your advice. I have no more problems with the VM size since I started deleting the snapshot while the VM is shut down. I reduced the overlarge qcow2 images by converting them to qcow2 again, which detects unused sectors and omits them (a command sketch follows this message).

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1025244
Title: qcow2 image increasing disk size above the virtual limit
Status in QEMU: New
Status in “qemu-kvm” package in Ubuntu: Triaged
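A minimal sketch of that compaction step, with placeholder file names; a qcow2-to-qcow2 convert rewrites only allocated, non-zero clusters, so the copy comes out smaller than the grown original:

  # run only while the VM is shut down
  qemu-img convert -f qcow2 -O qcow2 ubuntu-pdc-vda.img ubuntu-pdc-vda-compact.img
  mv ubuntu-pdc-vda-compact.img ubuntu-pdc-vda.img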
Re: [Qemu-devel] [PATCH 2/2] e1000: CTRL.RST emulation
On Tue, Sep 27, 2011 at 8:58 AM, Michael S. Tsirkin wrote:
> e1000 spec says CTRL.RST write should have the same effect
> as bus reset, except that is preserves PCI Config.
> Reset device registers and interrupts.
>
> Fix suggested by Andy Gospodarek
> Similar fix proposed by Anthony PERARD
>
> Signed-off-by: Michael S. Tsirkin
> ---
>  hw/e1000.c | 9 +++--
>  1 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/hw/e1000.c b/hw/e1000.c
> index 87a1104..b51e089 100644
> --- a/hw/e1000.c
> +++ b/hw/e1000.c
> @@ -241,8 +241,13 @@ static void e1000_reset(void *opaque)
>  static void
>  set_ctrl(E1000State *s, int index, uint32_t val)
>  {
> -    /* RST is self clearing */
> -    s->mac_reg[CTRL] = val & ~E1000_CTRL_RST;
> +    if (val & E1000_CTRL_RST) {
> +        e1000_reset(s);
> +        qemu_set_irq(s->dev.irq[0], 0);
> +        return;
> +    }
> +
> +    s->mac_reg[CTRL] = val;
>  }
>
>  static void
> --
> 1.7.5.53.gc233e
>

Looks good to me. Thanks for following up with this, Michael.
Re: [Qemu-devel] [PATCH 00/10] RFC: userfault
On 07/02/2014 09:50 AM, Andrea Arcangeli wrote: > Hello everyone, > > There's a large CC list for this RFC because this adds two new > syscalls (userfaultfd and remap_anon_pages) and > MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes to the API > or on a completely different API if somebody has better ideas are > welcome now. cc:linux-api -- this is certainly worthy of linux-api discussion. > > The combination of these features are what I would propose to > implement postcopy live migration in qemu, and in general demand > paging of remote memory, hosted in different cloud nodes. > > The MADV_USERFAULT feature should be generic enough that it can > provide the userfaults to the Android volatile range feature too, on > access of reclaimed volatile pages. > > If the access could ever happen in kernel context through syscalls > (not not just from userland context), then userfaultfd has to be used > to make the userfault unnoticeable to the syscall (no error will be > returned). This latter feature is more advanced than what volatile > ranges alone could do with SIGBUS so far (but it's optional, if the > process doesn't call userfaultfd, the regular SIGBUS will fire, if the > fd is closed SIGBUS will also fire for any blocked userfault that was > waiting a userfaultfd_write ack). > > userfaultfd is also a generic enough feature, that it allows KVM to > implement postcopy live migration without having to modify a single > line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all > other GUP features works just fine in combination with userfaults > (userfaults trigger async page faults in the guest scheduler so those > guest processes that aren't waiting for userfaults can keep running in > the guest vcpus). > > remap_anon_pages is the syscall to use to resolve the userfaults (it's > not mandatory, vmsplice will likely still be used in the case of local > postcopy live migration just to upgrade the qemu binary, but > remap_anon_pages is faster and ideal for transferring memory across > the network, it's zerocopy and doesn't touch the vma: it only holds > the mmap_sem for reading). > > The current behavior of remap_anon_pages is very strict to avoid any > chance of memory corruption going unnoticed. mremap is not strict like > that: if there's a synchronization bug it would drop the destination > range silently resulting in subtle memory corruption for > example. remap_anon_pages would return -EEXIST in that case. If there > are holes in the source range remap_anon_pages will return -ENOENT. > > If remap_anon_pages is used always with 2M naturally aligned > addresses, transparent hugepages will not be splitted. In there could > be 4k (or any size) holes in the 2M (or any size) source range, > remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to > relax some of its strict checks (-ENOENT won't be returned if > RAP_ALLOW_SRC_HOLES is set, remap_anon_pages then will just behave as > a noop on any hole in the source range). This flag is generally useful > when implementing userfaults with THP granularity, but it shouldn't be > set if doing the userfaults with PAGE_SIZE granularity if the > developer wants to benefit from the strict -ENOENT behavior. 
> > The remap_anon_pages syscall API is not vectored, as I expect it to be > used mainly for demand paging (where there can be just one faulting > range per userfault) or for large ranges (with the THP model as an > alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k > granularity before starting the guest in the destination node) where > vectoring isn't going to provide much performance advantages (thanks > to the THP coarser granularity). > > On the rmap side remap_anon_pages doesn't add much complexity: there's > no need of nonlinear anon vmas to support it because I added the > constraint that it will fail if the mapcount is more than 1. So in > general the source range of remap_anon_pages should be marked > MADV_DONTFORK to prevent any risk of failure if the process ever > forks (like qemu can in some case). > > One part that hasn't been tested is the poll() syscall on the > userfaultfd because the postcopy migration thread currently is more > efficient waiting on blocking read()s (I'll write some code to test > poll() too). I also appended below a patch to trinity to exercise > remap_anon_pages and userfaultfd and it completes trinity > successfully. > > The code can be found here: > > git clone --reference linux > git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault > > The branch is rebased so you can get updates for example with: > > git fetch && git checkout -f origin/userfault > > Comments welcome, thanks! > Andrea > > From cbe940e13b4cead41e0f862b3abfa3814f235ec3 Mon Sep 17 00:00:00 2001 > From: Andrea Arcangeli > Date: Wed, 2 Jul 2014 18:32:35 +0200 > Subject: [PATCH] add remap_anon_pages and userfaultfd > > Signed-off-by: Andrea Arcangeli > --- > include/syscall
Re: [Qemu-devel] [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization
On 07/02/2014 09:50 AM, Andrea Arcangeli wrote: > Once an userfaultfd is created MADV_USERFAULT regions talks through > the userfaultfd protocol with the thread responsible for doing the > memory externalization of the process. > > The protocol starts by userland writing the requested/preferred > USERFAULT_PROTOCOL version into the userfault fd (64bit write), if > kernel knows it, it will ack it by allowing userland to read 64bit > from the userfault fd that will contain the same 64bit > USERFAULT_PROTOCOL version that userland asked. Otherwise userfault > will read __u64 value -1ULL (aka USERFAULTFD_UNKNOWN_PROTOCOL) and it > will have to try again by writing an older protocol version if > suitable for its usage too, and read it back again until it stops > reading -1ULL. After that the userfaultfd protocol starts. > > The protocol consists in the userfault fd reads 64bit in size > providing userland the fault addresses. After a userfault address has > been read and the fault is resolved by userland, the application must > write back 128bits in the form of [ start, end ] range (64bit each) > that will tell the kernel such a range has been mapped. Multiple read > userfaults can be resolved in a single range write. poll() can be used > to know when there are new userfaults to read (POLLIN) and when there > are threads waiting a wakeup through a range write (POLLOUT). > > Signed-off-by: Andrea Arcangeli > +#ifdef CONFIG_PROC_FS > +static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f) > +{ > + struct userfaultfd_ctx *ctx = f->private_data; > + int ret; > + wait_queue_t *wq; > + struct userfaultfd_wait_queue *uwq; > + unsigned long pending = 0, total = 0; > + > + spin_lock(&ctx->fault_wqh.lock); > + list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) { > + uwq = container_of(wq, struct userfaultfd_wait_queue, wq); > + if (uwq->pending) > + pending++; > + total++; > + } > + spin_unlock(&ctx->fault_wqh.lock); > + > + ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total); This should show the protocol version, too. > + > +SYSCALL_DEFINE1(userfaultfd, int, flags) > +{ > + int fd, error; > + struct file *file; This looks like it can't be used more than once in a process. That will be unfortunate for libraries. Would it be feasible to either have userfaultfd claim a range of addresses or for a vma to be explicitly associated with a userfaultfd? (In the latter case, giant PROT_NONE MAP_NORESERVE mappings could be used.)
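For reference, a minimal user-space sketch of the handshake and fault loop described above. It follows this RFC's read/write protocol, not the ioctl-based API that was eventually merged; the USERFAULT_PROTOCOL value, the 4 KiB range handling, and the resolve_fault() helper are placeholders, and SYS_userfaultfd is assumed to be defined by the system headers.

#include <stdint.h>
#include <unistd.h>
#include <poll.h>
#include <sys/syscall.h>

#define USERFAULT_PROTOCOL 0x01        /* placeholder protocol version */

static void resolve_fault(uint64_t addr)
{
    /* Fetch the page (e.g. over the network) and map it in, for example
     * with the proposed remap_anon_pages(); details elided. */
    (void)addr;
}

int main(void)
{
    int ufd = syscall(SYS_userfaultfd, 0);
    uint64_t proto = USERFAULT_PROTOCOL, addr, range[2];

    /* Handshake: propose a protocol version, read back the ack or -1ULL. */
    write(ufd, &proto, sizeof(proto));
    read(ufd, &proto, sizeof(proto));
    if (proto == (uint64_t)-1)
        return 1;                      /* kernel does not speak this version */

    for (;;) {
        struct pollfd pfd = { .fd = ufd, .events = POLLIN };

        poll(&pfd, 1, -1);             /* POLLIN: userfaults pending */
        if (read(ufd, &addr, sizeof(addr)) != sizeof(addr))
            continue;                  /* each read returns one fault address */

        resolve_fault(addr);

        range[0] = addr & ~0xfffULL;   /* write back [ start, end ] to wake waiters */
        range[1] = range[0] + 0x1000;
        write(ufd, range, sizeof(range));
    }
}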
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse wrote: > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell > the truth, and even legacy kernels ought to cope with that. > FSVO 'ought to' where I suspect some of them will actually crash with a > NULL pointer dereference if there's no "catch-all" DMAR unit in the > tables, which puts it back into the same camp as ARM and Power. I think x86 may get a bit of a free pass here. AFAIK the QEMU IOMMU implementation on x86 has always been "experimental", so it just might be okay to change it in a way that causes some older kernels to OOPS. --Andy
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Tue, Apr 19, 2016 at 3:27 AM, Michael S. Tsirkin wrote: > On Mon, Apr 18, 2016 at 12:24:15PM -0700, Andy Lutomirski wrote: >> On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse >> wrote: >> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell >> > the truth, and even legacy kernels ought to cope with that. >> > FSVO 'ought to' where I suspect some of them will actually crash with a >> > NULL pointer dereference if there's no "catch-all" DMAR unit in the >> > tables, which puts it back into the same camp as ARM and Power. >> >> I think x86 may get a bit of a free pass here. AFAIK the QEMU IOMMU >> implementation on x86 has always been "experimental", so it just might >> be okay to change it in a way that causes some older kernels to OOPS. >> >> --Andy > > Since it's experimental, it might be OK to change *guest kernels* > such that they oops on old QEMU. > But guest kernels were not experimental - so we need a QEMU mode that > makes them work fine. The more functionality is available in this QEMU > mode, the betterm because it's going to be the default for a while. For > the same reason, it is preferable to also have new kernels not crash in > this mode. > People add QEMU features that need new guest kernels all time time. If you enable virtio-scsi and try to boot a guest that's too old, it won't work. So I don't see anything fundamentally wrong with saying that the non-experimental QEMU Q35 IOMMU mode won't boot if the guest kernel is too old. It might be annoying, since old kernels do work on actual Q35 hardware, but it at least seems to be that it might be okay. --Andy
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Apr 19, 2016 2:13 AM, "Michael S. Tsirkin" wrote: > > > I guess you are right in that we should split this part out. > What I wanted is really the combination > PASSTHROUGH && !PLATFORM so that we can say "ok we don't > need to guess, this device actually bypasses the IOMMU". What happens when you use a device like this on Xen or with a similar software translation layer? I think that a "please bypass IOMMU" feature would be better in the PCI, IOMMU, or platform code. For Xen, virtio would still want to use the DMA API, just without translating at the DMAR or hardware level. Doing it in virtio is awkward, because virtio is involved at the device level and the driver level, but the translation might be entirely in between. I think a nicer long-term approach would be to have a way to ask the guest to set up a full 1:1 mapping for performance, but to still handle the case where the guest refuses to do so or where there's more than one translation layer involved. But I agree that this part shouldn't delay the other part of your series. --Andy
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Tue, Apr 19, 2016 at 9:09 AM, Michael S. Tsirkin wrote: > On Tue, Apr 19, 2016 at 09:02:14AM -0700, Andy Lutomirski wrote: >> On Tue, Apr 19, 2016 at 3:27 AM, Michael S. Tsirkin wrote: >> > On Mon, Apr 18, 2016 at 12:24:15PM -0700, Andy Lutomirski wrote: >> >> On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse >> >> wrote: >> >> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell >> >> > the truth, and even legacy kernels ought to cope with that. >> >> > FSVO 'ought to' where I suspect some of them will actually crash with a >> >> > NULL pointer dereference if there's no "catch-all" DMAR unit in the >> >> > tables, which puts it back into the same camp as ARM and Power. >> >> >> >> I think x86 may get a bit of a free pass here. AFAIK the QEMU IOMMU >> >> implementation on x86 has always been "experimental", so it just might >> >> be okay to change it in a way that causes some older kernels to OOPS. >> >> >> >> --Andy >> > >> > Since it's experimental, it might be OK to change *guest kernels* >> > such that they oops on old QEMU. >> > But guest kernels were not experimental - so we need a QEMU mode that >> > makes them work fine. The more functionality is available in this QEMU >> > mode, the betterm because it's going to be the default for a while. For >> > the same reason, it is preferable to also have new kernels not crash in >> > this mode. >> > >> >> People add QEMU features that need new guest kernels all time time. >> If you enable virtio-scsi and try to boot a guest that's too old, it >> won't work. So I don't see anything fundamentally wrong with saying >> that the non-experimental QEMU Q35 IOMMU mode won't boot if the guest >> kernel is too old. It might be annoying, since old kernels do work on >> actual Q35 hardware, but it at least seems to be that it might be >> okay. >> >> --Andy > > Yes but we need a mode that makes both old and new kernels work, and > that should be the default for a while. this is what the > IOMMU_PASSTHROUGH flag was about: old kernels ignore it and bypass DMA > API, new kernels go "oh compatibility mode" and bypass the IOMMU > within DMA API. I thought that PLATFORM served that purpose. Woudn't the host advertise PLATFORM support and, if the guest doesn't ack it, the host device would skip translation? Or is that problematic for vfio? > > -- > MST -- Andy Lutomirski AMA Capital Management, LLC
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin wrote: > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote: >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote: >> > >> > > I thought that PLATFORM served that purpose. Woudn't the host >> > > advertise PLATFORM support and, if the guest doesn't ack it, the host >> > > device would skip translation? Or is that problematic for vfio? >> > >> > Exactly that's problematic for security. >> > You can't allow guest driver to decide whether device skips security. >> >> Right. Because fundamentally, this *isn't* a property of the endpoint >> device, and doesn't live in virtio itself. >> >> It's a property of the platform IOMMU, and lives there. > > It's a property of the hypervisor virtio implementation, and lives there. It is now, but QEMU could, in principle, change the way it thinks about it so that virtio devices would use the QEMU DMA API but ask QEMU to pass everything through 1:1. This would be entirely invisible to guests but would make it be a property of the IOMMU implementation. At that point, maybe QEMU could find a (platform dependent) way to tell the guest what's going on. FWIW, as far as I can tell, PPC and SPARC really could, in principle, set up 1:1 mappings in the guest so that the virtio devices would work regardless of whether QEMU is ignoring the IOMMU or not -- I think the only obstacle is that the PPC and SPARC 1:1 mappings are currectly set up with an offset. I don't know too much about those platforms, but presumably the layout could be changed so that 1:1 really was 1:1. --Andy
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin wrote: > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote: >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin wrote: >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote: >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote: >> >> > >> >> > > I thought that PLATFORM served that purpose. Woudn't the host >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the host >> >> > > device would skip translation? Or is that problematic for vfio? >> >> > >> >> > Exactly that's problematic for security. >> >> > You can't allow guest driver to decide whether device skips security. >> >> >> >> Right. Because fundamentally, this *isn't* a property of the endpoint >> >> device, and doesn't live in virtio itself. >> >> >> >> It's a property of the platform IOMMU, and lives there. >> > >> > It's a property of the hypervisor virtio implementation, and lives there. >> >> It is now, but QEMU could, in principle, change the way it thinks >> about it so that virtio devices would use the QEMU DMA API but ask >> QEMU to pass everything through 1:1. This would be entirely invisible >> to guests but would make it be a property of the IOMMU implementation. >> At that point, maybe QEMU could find a (platform dependent) way to >> tell the guest what's going on. >> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle, >> set up 1:1 mappings in the guest so that the virtio devices would work >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the >> only obstacle is that the PPC and SPARC 1:1 mappings are currectly set >> up with an offset. I don't know too much about those platforms, but >> presumably the layout could be changed so that 1:1 really was 1:1. >> >> --Andy > > Sure. Do you see any reason why the decision to do this can't be > keyed off the virtio feature bit? I can think of three types of virtio host: a) virtio always bypasses the IOMMU. b) virtio never bypasses the IOMMU (unless DMAR tables or similar say it does) -- i.e. virtio works like any other device. c) virtio may bypass the IOMMU depending on what the guest asks it to do. If this is keyed off a virtio feature bit and anyone tries to implement (c), the vfio is going to have a problem. And, if it's keyed off a virtio feature bit, then (a) won't work on Xen or similar setups unless the Xen hypervisor adds a giant and probably unreliable kludge to support it. Meanwhile, 4.6-rc works fine under Xen on a default x86 QEMU configuration, and I'd really like to keep it that way. What could plausibly work using a virtio feature bit is for a device to say "hey, I'm a new device and I support the platform-defined IOMMU mechanism". This bit would be *set* on default IOMMU-less QEMU configurations and on physical virtio PCI cards. The guest could operate accordingly. I'm not sure I see a good way for feature negotiation to work the other direction, though. PPC and SPARC could only set this bit on emulated devices if they know that new guest kernels are in use. --Andy
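As a rough illustration of the negotiation being discussed, the guest side could gate its DMA-API use on such a bit. This is a sketch only, assuming <linux/virtio_config.h>; VIRTIO_F_PLATFORM_IOMMU is a placeholder name for the bit under debate (comparable to what later became VIRTIO_F_IOMMU_PLATFORM), not an existing constant.

/* Sketch: should the virtio ring go through the DMA API for this device? */
static bool vring_use_dma_api(struct virtio_device *vdev)
{
    /* Device promises to honour the platform IOMMU: translate through the
     * DMA API so any IOMMU (or Xen-style translation layer) is respected. */
    if (virtio_has_feature(vdev, VIRTIO_F_PLATFORM_IOMMU))
        return true;

    /* Legacy behaviour: the device uses guest-physical addresses directly,
     * so the driver bypasses the DMA API entirely. */
    return false;
}

The Xen objection in this thread is precisely that the legacy branch above is wrong whenever a translation layer sits between the driver and the device, regardless of what the device advertises.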
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Tue, Apr 19, 2016 at 1:54 PM, Michael S. Tsirkin wrote: > On Tue, Apr 19, 2016 at 01:27:29PM -0700, Andy Lutomirski wrote: >> On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin wrote: >> > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote: >> >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin >> >> wrote: >> >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote: >> >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote: >> >> >> > >> >> >> > > I thought that PLATFORM served that purpose. Woudn't the host >> >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the >> >> >> > > host >> >> >> > > device would skip translation? Or is that problematic for vfio? >> >> >> > >> >> >> > Exactly that's problematic for security. >> >> >> > You can't allow guest driver to decide whether device skips security. >> >> >> >> >> >> Right. Because fundamentally, this *isn't* a property of the endpoint >> >> >> device, and doesn't live in virtio itself. >> >> >> >> >> >> It's a property of the platform IOMMU, and lives there. >> >> > >> >> > It's a property of the hypervisor virtio implementation, and lives >> >> > there. >> >> >> >> It is now, but QEMU could, in principle, change the way it thinks >> >> about it so that virtio devices would use the QEMU DMA API but ask >> >> QEMU to pass everything through 1:1. This would be entirely invisible >> >> to guests but would make it be a property of the IOMMU implementation. >> >> At that point, maybe QEMU could find a (platform dependent) way to >> >> tell the guest what's going on. >> >> >> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle, >> >> set up 1:1 mappings in the guest so that the virtio devices would work >> >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the >> >> only obstacle is that the PPC and SPARC 1:1 mappings are currectly set >> >> up with an offset. I don't know too much about those platforms, but >> >> presumably the layout could be changed so that 1:1 really was 1:1. >> >> >> >> --Andy >> > >> > Sure. Do you see any reason why the decision to do this can't be >> > keyed off the virtio feature bit? >> >> I can think of three types of virtio host: >> >> a) virtio always bypasses the IOMMU. >> >> b) virtio never bypasses the IOMMU (unless DMAR tables or similar say >> it does) -- i.e. virtio works like any other device. >> >> c) virtio may bypass the IOMMU depending on what the guest asks it to do. > > d) some virtio devices bypass the IOMMU and some don't, > e.g. it's harder to support IOMMU with vhost. > > >> If this is keyed off a virtio feature bit and anyone tries to >> implement (c), the vfio is going to have a problem. And, if it's >> keyed off a virtio feature bit, then (a) won't work on Xen or similar >> setups unless the Xen hypervisor adds a giant and probably unreliable >> kludge to support it. Meanwhile, 4.6-rc works fine under Xen on a >> default x86 QEMU configuration, and I'd really like to keep it that >> way. >> >> What could plausibly work using a virtio feature bit is for a device >> to say "hey, I'm a new device and I support the platform-defined IOMMU >> mechanism". This bit would be *set* on default IOMMU-less QEMU >> configurations and on physical virtio PCI cards. > > And clear on xen. How? QEMU has no idea that the guest is running Xen.
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Apr 20, 2016 6:14 AM, "Michael S. Tsirkin" wrote: > > On Tue, Apr 19, 2016 at 02:07:01PM -0700, Andy Lutomirski wrote: > > On Tue, Apr 19, 2016 at 1:54 PM, Michael S. Tsirkin wrote: > > > On Tue, Apr 19, 2016 at 01:27:29PM -0700, Andy Lutomirski wrote: > > >> On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin > > >> wrote: > > >> > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote: > > >> >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin > > >> >> wrote: > > >> >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote: > > >> >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote: > > >> >> >> > > > >> >> >> > > I thought that PLATFORM served that purpose. Woudn't the host > > >> >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, > > >> >> >> > > the host > > >> >> >> > > device would skip translation? Or is that problematic for > > >> >> >> > > vfio? > > >> >> >> > > > >> >> >> > Exactly that's problematic for security. > > >> >> >> > You can't allow guest driver to decide whether device skips > > >> >> >> > security. > > >> >> >> > > >> >> >> Right. Because fundamentally, this *isn't* a property of the > > >> >> >> endpoint > > >> >> >> device, and doesn't live in virtio itself. > > >> >> >> > > >> >> >> It's a property of the platform IOMMU, and lives there. > > >> >> > > > >> >> > It's a property of the hypervisor virtio implementation, and lives > > >> >> > there. > > >> >> > > >> >> It is now, but QEMU could, in principle, change the way it thinks > > >> >> about it so that virtio devices would use the QEMU DMA API but ask > > >> >> QEMU to pass everything through 1:1. This would be entirely invisible > > >> >> to guests but would make it be a property of the IOMMU implementation. > > >> >> At that point, maybe QEMU could find a (platform dependent) way to > > >> >> tell the guest what's going on. > > >> >> > > >> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle, > > >> >> set up 1:1 mappings in the guest so that the virtio devices would work > > >> >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the > > >> >> only obstacle is that the PPC and SPARC 1:1 mappings are currectly set > > >> >> up with an offset. I don't know too much about those platforms, but > > >> >> presumably the layout could be changed so that 1:1 really was 1:1. > > >> >> > > >> >> --Andy > > >> > > > >> > Sure. Do you see any reason why the decision to do this can't be > > >> > keyed off the virtio feature bit? > > >> > > >> I can think of three types of virtio host: > > >> > > >> a) virtio always bypasses the IOMMU. > > >> > > >> b) virtio never bypasses the IOMMU (unless DMAR tables or similar say > > >> it does) -- i.e. virtio works like any other device. > > >> > > >> c) virtio may bypass the IOMMU depending on what the guest asks it to do. > > > > > > d) some virtio devices bypass the IOMMU and some don't, > > > e.g. it's harder to support IOMMU with vhost. > > > > > > > > >> If this is keyed off a virtio feature bit and anyone tries to > > >> implement (c), the vfio is going to have a problem. And, if it's > > >> keyed off a virtio feature bit, then (a) won't work on Xen or similar > > >> setups unless the Xen hypervisor adds a giant and probably unreliable > > >> kludge to support it. Meanwhile, 4.6-rc works fine under Xen on a > > >> default x86 QEMU configuration, and I'd really like to keep it that > > >> way. 
> > >> > > >> What could plausibly work using a virtio feature bit is for a device > > >> to say "hey, I'm a new device and I support the platform-defined IOMMU > > >> mechanism". This bit would be *set* on default IOMMU-less QEMU > > >> configurations and on physical virtio PCI cards. > > > > > > And clear on xen. > > > > How? QEMU has no idea that the guest is running Xen. > > I was under impression xen_enabled() is true in QEMU. > Am I wrong? I'd be rather surprised, given that QEMU would have to inspect the guest kernel to figure it out. I'm talking about Xen under QEMU. For example, if you feed QEMU a guest disk image that contains Fedora with the xen packages installed, you can boot it and get a grub menu. If you ask grub to boot Xen, you get Xen. If you ask grub to boot Linux directly, you don't get Xen. I assume xen_enabled is for QEMU under Xen, i.e. QEMU, running under Xen, supplying emulated devices to a Xen domU guest. Since QEMU is seeing the guest address space directly, this should be much the same as QEMU !xen_enabled -- if you boot plain Linux, everything works, but if you do Xen -> QEMU -> HVM guest running Xen PV -> Linux, then virtio drivers in the Xen PV Linux guest need to translate addresses. --Andy > > -- > MST
[Qemu-devel] [Bug 1574346] [NEW] TCG: mov to segment register is incorrectly emulated for AMD CPUs
Public bug reported: In TCG mode, the effect of: xorl %eax, %eax movl %eax, %gs is to mark the GS segment unusable and set its base to zero. After doing this, reading MSR_GS_BASE will return zero and using a GS prefix in long mode will treat the GS base as zero. This is correct for Intel CPUs but is incorrect for AMD CPUs. On an AMD CPU, writing 0 to %gs using mov, pop, or (I think) lgs will leave the base unchanged. To make it easier to use TCG to validate behavior on different CPUs, please consider changing the TCG behavior to match actual CPU behavior when emulating an AMD CPU. ** Affects: qemu Importance: Undecided Status: New -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1574346 Title: TCG: mov to segment register is incorrectly emulated for AMD CPUs Status in QEMU: New Bug description: In TCG mode, the effect of: xorl %eax, %eax movl %eax, %gs is to mark the GS segment unusable and set its base to zero. After doing this, reading MSR_GS_BASE will return zero and using a GS prefix in long mode will treat the GS base as zero. This is correct for Intel CPUs but is incorrect for AMD CPUs. On an AMD CPU, writing 0 to %gs using mov, pop, or (I think) lgs will leave the base unchanged. To make it easier to use TCG to validate behavior on different CPUs, please consider changing the TCG behavior to match actual CPU behavior when emulating an AMD CPU. To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1574346/+subscriptions
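A small user-space test that exhibits the difference is sketched below. It assumes FSGSBASE is usable from ring 3 (a recent kernel and a CPU or TCG configuration exposing the feature); the 0x12340000 base value is arbitrary.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t base = 0x12340000;

    asm volatile ("wrgsbase %0" :: "r" (base));
    asm volatile ("xorl %%eax, %%eax\n\t"
                  "movl %%eax, %%gs" ::: "eax", "memory");
    asm volatile ("rdgsbase %0" : "=r" (base));

    /* Intel (and current TCG): prints 0.  AMD hardware: prints 0x12340000. */
    printf("GS base after null selector load: 0x%llx\n",
           (unsigned long long)base);
    return 0;
}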
Re: [Qemu-devel] [PATCH V2 RFC] fixup! virtio: convert to use DMA api
On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel wrote: > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote: >> One correction: it's a feature of the device in the system. >> There could be a mix of devices bypassing and not >> bypassing the IOMMU. > > No, it really is not. A device can't chose to bypass the IOMMU. But the > IOMMU can chose to let the device bypass. So any fix here belongs > into the platform/iommu code too and not into some driver. > >> Sounds good. And a way to detect appropriate devices could >> be by looking at the feature flag, perhaps? > > Again, no! The way to detect that is to look into the iommu description > structures provided by the firmware. They provide everything necessary > to tell the iommu code which devices are not translated. > Except on PPC and SPARC. As far as I know, those are the only problematic platforms. Is it too late to *disable* QEMU's q35-iommu thingy until it can be fixed to report correct data in the DMAR tables? --Andy
Re: [Qemu-devel] [PATCH V2 RFC] fixup! virtio: convert to use DMA api
On Wed, Apr 27, 2016 at 7:38 AM, Michael S. Tsirkin wrote: > On Wed, Apr 27, 2016 at 07:31:43AM -0700, Andy Lutomirski wrote: >> On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel wrote: >> > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote: >> >> One correction: it's a feature of the device in the system. >> >> There could be a mix of devices bypassing and not >> >> bypassing the IOMMU. >> > >> > No, it really is not. A device can't chose to bypass the IOMMU. But the >> > IOMMU can chose to let the device bypass. So any fix here belongs >> > into the platform/iommu code too and not into some driver. >> > >> >> Sounds good. And a way to detect appropriate devices could >> >> be by looking at the feature flag, perhaps? >> > >> > Again, no! The way to detect that is to look into the iommu description >> > structures provided by the firmware. They provide everything necessary >> > to tell the iommu code which devices are not translated. >> > >> >> Except on PPC and SPARC. As far as I know, those are the only >> problematic platforms. >> >> Is it too late to *disable* QEMU's q35-iommu thingy until it can be >> fixed to report correct data in the DMAR tables? >> >> --Andy > > Meaning virtio or assigned devices? > For virtio - it's way too late since these are working configurations. > For assigned devices - they don't work on x86 so it doesn't have > to be disabled, it's safe to ignore. I mean actually prevent QEMU from running in q35-iommu mode with any virtio devices attached or maybe even turn off q35-iommu mode entirely [1]. Doesn't it require that the user literally pass the word "experimental" into QEMU right now? It did at some point IIRC. The reason I'm asking is that, other than q35-iommu, QEMU's virtio devices *don't* bypass the IOMMU except on PPC and SPARC, simply because there is no other configuration AFAICT that has virtio and and IOMMU. So maybe the right solution is to fix q35-iommu to use DMAR correctly (thus breaking q35-iommu users with older guest kernels, which hopefully don't actually exist) and to come up with a PPC- and SPARC-specific solution, or maybe OpenFirmware-specific solution, to handle PPC and SPARC down the road. [1] I'm pretty sure I emailed the QEMU list before q35-iommu ever showed up in a release asking the QEMU team to please not do that until this issue was resolved. Sadly, that email was ignored :( --Andy
Re: [Qemu-devel] [PATCH V2 RFC] fixup! virtio: convert to use DMA api
On Wed, Apr 27, 2016 at 7:54 AM, Michael S. Tsirkin wrote: > On Wed, Apr 27, 2016 at 07:43:07AM -0700, Andy Lutomirski wrote: >> On Wed, Apr 27, 2016 at 7:38 AM, Michael S. Tsirkin wrote: >> > On Wed, Apr 27, 2016 at 07:31:43AM -0700, Andy Lutomirski wrote: >> >> On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel wrote: >> >> > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote: >> >> >> One correction: it's a feature of the device in the system. >> >> >> There could be a mix of devices bypassing and not >> >> >> bypassing the IOMMU. >> >> > >> >> > No, it really is not. A device can't chose to bypass the IOMMU. But the >> >> > IOMMU can chose to let the device bypass. So any fix here belongs >> >> > into the platform/iommu code too and not into some driver. >> >> > >> >> >> Sounds good. And a way to detect appropriate devices could >> >> >> be by looking at the feature flag, perhaps? >> >> > >> >> > Again, no! The way to detect that is to look into the iommu description >> >> > structures provided by the firmware. They provide everything necessary >> >> > to tell the iommu code which devices are not translated. >> >> > >> >> >> >> Except on PPC and SPARC. As far as I know, those are the only >> >> problematic platforms. >> >> >> >> Is it too late to *disable* QEMU's q35-iommu thingy until it can be >> >> fixed to report correct data in the DMAR tables? >> >> >> >> --Andy >> > >> > Meaning virtio or assigned devices? >> > For virtio - it's way too late since these are working configurations. >> > For assigned devices - they don't work on x86 so it doesn't have >> > to be disabled, it's safe to ignore. >> >> I mean actually prevent QEMU from running in q35-iommu mode with any >> virtio devices attached or maybe even turn off q35-iommu mode entirely >> [1]. Doesn't it require that the user literally pass the word >> "experimental" into QEMU right now? It did at some point IIRC. >> >> The reason I'm asking is that, other than q35-iommu, QEMU's virtio >> devices *don't* bypass the IOMMU except on PPC and SPARC, simply >> because there is no other configuration AFAICT that has virtio and and >> IOMMU. So maybe the right solution is to fix q35-iommu to use DMAR >> correctly (thus breaking q35-iommu users with older guest kernels, >> which hopefully don't actually exist) and to come up with a PPC- and >> SPARC-specific solution, or maybe OpenFirmware-specific solution, to >> handle PPC and SPARC down the road. >> >> [1] I'm pretty sure I emailed the QEMU list before q35-iommu ever >> showed up in a release asking the QEMU team to please not do that >> until this issue was resolved. Sadly, that email was ignored :( >> >> --Andy > > Sorry, I didn't make myself clear. > Point is, QEMU is not the only virtio implementation out there. > So we can't know no virtio implementations have an IOMMU as long as > linux supports this IOMMU. > virtio always used physical addresses since it was born and if it > changes that it must do this in a way that does not break existing > users. Is there any non-QEMU virtio implementation can provide an IOMMU-bypassing virtio device on a platform that has a nontrivial IOMMU? --Andy
[Qemu-devel] Re: [PATCH-V4 0/7] virtio-9p:Introducing security model for VirtFS
Venkateswararao Jujjuri (JV) wrote: This patch series introduces the security model for VirtFS. Brief description of this patch series: It introduces two type of security models for VirtFS. They are: mapped and passthrough. The following is common to both security models. * Client's VFS determines/enforces the access control. Largely server should never return EACCESS. * Client sends gid/mode-bit information as part of creation only. Changes from V3 --- o Return NULL instead of exit(1) on failure in virtio_9p_init() o Capitalized sm_passthrough, sm_mappe o Added handling for EINTR for read/write. o Corrected default permissions for mkdir in mapped mode. o Added additional error handling. Changes from V2 --- o Removed warnings resulting from chmod/chown. o Added code to fail normally if secuirty_model option is not specified. Changes from V1 --- o Added support for chmod and chown. o Used chmod/chown to set credentials instead of setuid/setgid. o Fixed a bug where uid used instated of uid. Security model: mapped -- VirtFS server(QEMU) intercepts and maps all the file object create requests. Files on the fileserver will be created with QEMU's user credentials and the client-user's credentials are stored in extended attributes. During getattr() server extracts the client-user's credentials from extended attributes and sends to the client. Given that only the user space extended attributes are available to regular files, special files are created as regular files on the fileserver and the appropriate mode bits are stored in xattrs and will be extracted during getattr. If the extended attributes are missing, server sends back the filesystem stat() unaltered. This provision will make the files created on the fileserver usable to client. Points to be considered * Filesystem will be VirtFS'ized. Meaning, other filesystems may not understand the credentials of the files created under this model. How hard would it be to make this compatible with rsync's --fake-super? (--fake-super already does almost what you're doing, and if you make the formats compatible, then rsync could be used to translate. OTOH, rsyncing a VirtFS-ified filesystem to a remote --fake-super system might have odd side-effects.) --Andy
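To make the mapped model concrete: the server-side bookkeeping amounts to creating the file with QEMU's own credentials and storing the client's identity in user extended attributes, roughly as in the sketch below. The attribute names are illustrative, not necessarily the exact ones the VirtFS code uses.

#include <sys/types.h>
#include <sys/xattr.h>

static int store_client_credentials(const char *path, uid_t uid, gid_t gid,
                                    mode_t mode, dev_t rdev)
{
    /* The file itself belongs to QEMU on the fileserver; the client-visible
     * owner, group, mode bits and device numbers live only in user xattrs,
     * and getattr() reconstructs them from here. */
    if (setxattr(path, "user.virtfs.uid", &uid, sizeof(uid), 0) ||
        setxattr(path, "user.virtfs.gid", &gid, sizeof(gid), 0) ||
        setxattr(path, "user.virtfs.mode", &mode, sizeof(mode), 0) ||
        setxattr(path, "user.virtfs.rdev", &rdev, sizeof(rdev), 0))
        return -1;
    return 0;
}

rsync's --fake-super stores equivalent information in its own user.* attribute, which is what makes the compatibility question raised above interesting.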
[Qemu-devel] [Bug 597351] Re: Slow UDP performance with virtio device
** Attachment added: "udp-pong.c" http://launchpadlibrarian.net/50751155/udp-pong.c -- Slow UDP performance with virtio device https://bugs.launchpad.net/bugs/597351 You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. Status in QEMU: New Bug description: I'm working on an app that is very sensitive to round-trip latency between the guest and host, and qemu/kvm seems to be significantly slower than it needs to be. The attached program is a ping/pong over UDP. Call it with a single argument to start a listener/echo server on that port. With three arguments it becomes a counted "pinger" that will exit after a specified number of round trips for performance measurements. For example: $ gcc -o udp-pong udp-pong.c $ ./udp-pong 12345 & # start a listener on port 12345 $ time ./udp-pong 127.0.0.1 12345 100 # time a million round trips When run on the loopback device on a single machine (true on the host or within a guest), I get about 100k/s. When run across a port forward using "user" networking on qemu (or kvm, the performance is the same) and the default rtl8139 driver (both the host and guest are Ubuntu Lucid), I get about 10k/s. This seems very slow, but perhaps unavoidably so? When run in the same configuration using the "virtio" driver, I get only 2k/s. This is almost certainly a bug in the virtio driver, given that it's a paravirtualized device that is 5x slower than the "slow" hardware emulation. I get no meaningful change in performance between kvm/qemu.
[Qemu-devel] [Bug 597351] [NEW] Slow UDP performance with virtio device
Public bug reported: I'm working on an app that is very sensitive to round-trip latency between the guest and host, and qemu/kvm seems to be significantly slower than it needs to be. The attached program is a ping/pong over UDP. Call it with a single argument to start a listener/echo server on that port. With three arguments it becomes a counted "pinger" that will exit after a specified number of round trips for performance measurements. For example: $ gcc -o udp-pong udp-pong.c $ ./udp-pong 12345 & # start a listener on port 12345 $ time ./udp-pong 127.0.0.1 12345 100 # time a million round trips When run on the loopback device on a single machine (true on the host or within a guest), I get about 100k/s. When run across a port forward using "user" networking on qemu (or kvm, the performance is the same) and the default rtl8139 driver (both the host and guest are Ubuntu Lucid), I get about 10k/s. This seems very slow, but perhaps unavoidably so? When run in the same configuration using the "virtio" driver, I get only 2k/s. This is almost certainly a bug in the virtio driver, given that it's a paravirtualized device that is 5x slower than the "slow" hardware emulation. I get no meaningful change in performance between kvm/qemu. ** Affects: qemu Importance: Undecided Status: New -- Slow UDP performance with virtio device https://bugs.launchpad.net/bugs/597351 You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. Status in QEMU: New Bug description: I'm working on an app that is very sensitive to round-trip latency between the guest and host, and qemu/kvm seems to be significantly slower than it needs to be. The attached program is a ping/pong over UDP. Call it with a single argument to start a listener/echo server on that port. With three arguments it becomes a counted "pinger" that will exit after a specified number of round trips for performance measurements. For example: $ gcc -o udp-pong udp-pong.c $ ./udp-pong 12345 & # start a listener on port 12345 $ time ./udp-pong 127.0.0.1 12345 100 # time a million round trips When run on the loopback device on a single machine (true on the host or within a guest), I get about 100k/s. When run across a port forward using "user" networking on qemu (or kvm, the performance is the same) and the default rtl8139 driver (both the host and guest are Ubuntu Lucid), I get about 10k/s. This seems very slow, but perhaps unavoidably so? When run in the same configuration using the "virtio" driver, I get only 2k/s. This is almost certainly a bug in the virtio driver, given that it's a paravirtualized device that is 5x slower than the "slow" hardware emulation. I get no meaningful change in performance between kvm/qemu.
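Since the attachment is not inlined in this archive, the following is a minimal reconstruction (not the original udp-pong.c) of the echo/pinger pair the report describes; error handling is omitted and the round trips are timed externally with time(1), as in the example above.

#include <stdlib.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in a = { .sin_family = AF_INET };
    char buf[32] = "ping";

    if (argc == 2) {                              /* echo server on <port> */
        a.sin_addr.s_addr = htonl(INADDR_ANY);
        a.sin_port = htons(atoi(argv[1]));
        bind(s, (struct sockaddr *)&a, sizeof(a));
        for (;;) {
            struct sockaddr_in peer;
            socklen_t len = sizeof(peer);
            ssize_t n = recvfrom(s, buf, sizeof(buf), 0,
                                 (struct sockaddr *)&peer, &len);
            if (n > 0)
                sendto(s, buf, n, 0, (struct sockaddr *)&peer, len);
        }
    } else if (argc == 4) {                       /* counted pinger: <host> <port> <count> */
        long i, count = atol(argv[3]);
        a.sin_port = htons(atoi(argv[2]));
        inet_pton(AF_INET, argv[1], &a.sin_addr);
        for (i = 0; i < count; i++) {
            sendto(s, buf, sizeof(buf), 0, (struct sockaddr *)&a, sizeof(a));
            recv(s, buf, sizeof(buf), 0);
        }
    }
    return 0;
}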
[Qemu-devel] qemu-kvm problem with DOS/4GW extender and EMM386.EXE
pu = 0x pid = 0x1997 [ errorcode = 0x, virt = 0x 0001a072 ] 28471049807815 (+4000) VMENTRY vcpu = 0x pid = 0x1997 28471049811815 (+4000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x, rip = 0x 2a69 ] 0 (+ 0) CR_READ vcpu = 0x pid = 0x1997 [ CR# = 0, value = 0x 8011 ] 28471049815815 (+4000) VMENTRY vcpu = 0x pid = 0x1997 28471049818815 (+3000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x0010, rip = 0x 2a73 ] 0 (+ 0) LMSW vcpu = 0x pid = 0x1997 [ value = 0x8010 ] 28471049840815 (+ 22000) VMENTRY vcpu = 0x pid = 0x1997 28471049844815 (+4000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x007b, rip = 0x 1fd6 ] 0 (+ 0) IO_WRITE vcpu = 0x pid = 0x1997 [ port = 0x0020, size = 1 ] 28471049846815 (+2000) VMENTRY vcpu = 0x pid = 0x1997 28471049849815 (+3000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x007b, rip = 0x 1fd9 ] 0 (+ 0) IO_READ vcpu = 0x pid = 0x1997 [ port = 0x0020, size = 1 ] 28471049851815 (+2000) VMENTRY vcpu = 0x pid = 0x1997 28471049855815 (+4000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x, rip = 0x 2a69 ] 0 (+ 0) CR_READ vcpu = 0x pid = 0x1997 [ CR# = 0, value = 0x 8011 ] 28471049858815 (+3000) VMENTRY vcpu = 0x pid = 0x1997 28471049861815 (+3000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x0010, rip = 0x 2a73 ] 0 (+ 0) LMSW vcpu = 0x pid = 0x1997 [ value = 0x8010 ] 28471049882815 (+ 21000) VMENTRY vcpu = 0x pid = 0x1997 28471049885815 (+3000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x007b, rip = 0x 1fd6 ] 0 (+ 0) IO_WRITE vcpu = 0x pid = 0x1997 [ port = 0x0020, size = 1 ] 28471049887815 (+2000) VMENTRY vcpu = 0x pid = 0x1997 28471049890815 (+3000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x007b, rip = 0x 1fd9 ] 0 (+ 0) IO_READ vcpu = 0x pid = 0x1997 [ port = 0x0020, size = 1 ] 28471049892815 (+2000) VMENTRY vcpu = 0x pid = 0x1997 28471049896815 (+4000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x, rip = 0x 2a69 ] 0 (+ 0) CR_READ vcpu = 0x pid = 0x1997 [ CR# = 0, value = 0x 8011 ] 28471049900815 (+4000) VMENTRY vcpu = 0x pid = 0x1997 28471049903815 (+3000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x0010, rip = 0x 2a73 ] 0 (+ 0) LMSW vcpu = 0x pid = 0x1997 [ value = 0x8010 ] 28471049933815 (+ 3) VMENTRY vcpu = 0x pid = 0x1997 28471049936815 (+3000) VMEXITvcpu = 0x pid = 0x1997 [ exitcode = 0x007b, rip = 0x 1fd6 ] To me it appears EMM386.EXE enables paging, and the DOS/4GW DOS extender tries to manipulate the PE bit in CR0 with LMSW but doesn't succeed. These programs appear to work fine in VMWare and on real hardware. Any ideas on how to make EMM386.EXE and the DOS/$GW extender work in qemu-kvm? Regards, Andy
Re: [Qemu-devel] qemu-kvm problem with DOS/4GW extender and EMM386.EXE
On Wed, 2010-05-12 at 00:09 +0300, Mohammed Gamal wrote: > On Tue, May 11, 2010 at 11:56 PM, Andy Walls wrote: > > Running an MS-DOS 6.22 image with qemu-kvm on a RedHat Linux OS, I > > noticed the guest OS becomes hung and my dmesg gets spammed with > > > >set_cr0: #GP, set PG flag with a clear PE flag > > > > That message appears to be the linux kernel's kvm emulator griping about > > Paging Enable bit being enabled while the Protection Enable bit is set > > for real mode. (The Intel manual says this should be a protection > > fault). > > > > The program that causes this has the DOS/4GW DOS extender runtime > > compiled into it. > > > > I found that when I don't load the EMM386.EXE memory manager, the > > problem doesn't occur. > > > > Here's a kvmtrace segment of when things are not working: > > > > 0 (+ 0) IO_WRITE vcpu = 0x pid = 0x1997 [ port > > = 0x0070, size = 1 ] > > 28471049668815 (+4000) VMENTRY vcpu = 0x pid = > > 0x1997 > > 28471049671815 (+3000) VMEXITvcpu = 0x pid = > > 0x1997 [ exitcode = 0x004e, rip = 0x 2a18 ] > > 0 (+ 0) PAGE_FAULTvcpu = 0x pid = 0x1997 [ > > errorcode = 0x, virt = 0x 0001ba28 ] > > 28471049675815 (+4000) VMENTRY vcpu = 0x pid = > > 0x1997 > > 28471049678815 (+3000) VMEXITvcpu = 0x pid = > > 0x1997 [ exitcode = 0x004e, rip = 0x 0334 ] > > 0 (+ 0) PAGE_FAULTvcpu = 0x pid = 0x1997 [ > > errorcode = 0x, virt = 0x 00019344 ] > > 28471049681815 (+3000) VMENTRY vcpu = 0x pid = > > 0x1997 > > 28471049685815 (+4000) VMEXITvcpu = 0x pid = > > 0x1997 [ exitcode = 0x, rip = 0x 02a7 ] > > 0 (+ 0) CR_READ vcpu = 0x pid = 0x1997 [ CR# > > = 0, value = 0x 8011 ] > > 28471049688815 (+3000) VMENTRY vcpu = 0x pid = > > 0x1997 > > 28471049691815 (+3000) VMEXITvcpu = 0x pid = > > 0x1997 [ exitcode = 0x0010, rip = 0x 02ae ] > > 0 (+ 0) LMSW vcpu = 0x pid = 0x1997 [ > > value = 0x8011 ] > > 28471049696815 (+5000) VMENTRY vcpu = 0x pid = > > 0x1997 > > 28471049699815 (+3000) VMEXITvcpu = 0x pid = > > 0x1997 [ exitcode = 0x004e, rip = 0x 5593 ] > > 0 (+ 0) PAGE_FAULTvcpu = 0x pid = 0x1997 [ > > errorcode = 0x, virt = 0x 000262e3 ] > > 28471049703815 (+4000) VMENTRY vcpu = 0x pid = > > 0x1997 > > 28471049706815 (+3000) VMEXITvcpu = 0x pid = > > 0x1997 [ exitcode = 0x004e, rip = 0x 44d6 ] > > 0 (+ 0) PAGE_FAULTvcpu = 0x pid = 0x1997 [ > > errorcode = 0x, virt = 0x 00025226 ] > > 28471049709815 (+3000) VMENTRY vcpu = 0x pid = > > 0x1997 > > 28471049713815 (+4000) VMEXITvcpu = 0x pid = > > 0x1997 [ exitcode = 0x004e, rip = 0x 55c0 ] > > 0 (+ 0) PAGE_FAULTvcpu = 0x pid = 0x1997 [ > > errorcode = 0x0002, virt = 0x 00024f79 ] > > 28471049717815 (+4000) VMENTRY vcpu = 0x pid = > > 0x1997 > > 28471049721815 (+4000) VMEXITvcpu = 0x pid = > > 0x1997 [ exitcode = 0x, rip = 0x 2a69 ] > > 0 (+ 0) CR_READ vcpu = 0x pid = 0x1997 [ CR# > > = 0, value = 0x 8011 ] > > 28471049723815 (+2000) VMENTRY vcpu = 0x pid = > > 0x1997 > > 28471049726815 (+3000) VMEXITvcpu = 0x pid = > > 0x1997 [ exitcode = 0x0010, rip = 0x 2a73 ] > > 0 (+ 0) LMSW vcpu = 0x pid = 0x1997 [ > > value = 0x8010 ] > > 28471049781815 (+ 55000) VMENTRY vcpu = 0x pid = > > 0x1997 > > 28471049784815 (+3000) VMEXITvcpu = 0x pid = > > 0x1997 [ exitcode = 0x004e, rip = 0x 1fb8 ] > > 0 (+ 0) PAGE_FAULTvcpu = 0x pid = 0x1997 [ > > errorcode = 0x,
[Qemu-devel] Qemu-win
Hello, sorry, my English is not very good, but I have a problem: when I start QEMU with this command: qemu.exe -hda suse.img -vnc 55 -k de I get this message: Could not load keymap files '/C/Program Files/Qemu The "de" file is there in c:\program files\qemu\keymap\. I have looked in many forums but can't find anything. Can you help me? by Andy
Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
On Tue, Oct 25, 2022, at 8:13 AM, Chao Peng wrote: > This new KVM exit allows userspace to handle memory-related errors. It > indicates an error happens in KVM at guest memory range [gpa, gpa+size). > The flags includes additional information for userspace to handle the > error. Currently bit 0 is defined as 'private memory' where '1' > indicates error happens due to private memory access and '0' indicates > error happens due to shared memory access. > > When private memory is enabled, this new exit will be used for KVM to > exit to userspace for shared <-> private memory conversion in memory > encryption usage. In such usage, typically there are two kind of memory > conversions: > - explicit conversion: happens when guest explicitly calls into KVM > to map a range (as private or shared), KVM then exits to userspace > to perform the map/unmap operations. > - implicit conversion: happens in KVM page fault handler where KVM > exits to userspace for an implicit conversion when the page is in a > different state than requested (private or shared). > > Suggested-by: Sean Christopherson > Co-developed-by: Yu Zhang > Signed-off-by: Yu Zhang > Signed-off-by: Chao Peng > --- > Documentation/virt/kvm/api.rst | 23 +++ > include/uapi/linux/kvm.h | 9 + > 2 files changed, 32 insertions(+) > > diff --git a/Documentation/virt/kvm/api.rst > b/Documentation/virt/kvm/api.rst > index f3fa75649a78..975688912b8c 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -6537,6 +6537,29 @@ array field represents return values. The > userspace should update the return > values of SBI call before resuming the VCPU. For more details on > RISC-V SBI > spec refer, https://github.com/riscv/riscv-sbi-doc. > > +:: > + > + /* KVM_EXIT_MEMORY_FAULT */ > + struct { > + #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0) > + __u32 flags; > + __u32 padding; > + __u64 gpa; > + __u64 size; > + } memory; > + Would it make sense to also have a field for the access type (read, write, execute, etc)? I realize that shared <-> private conversion doesn't strictly need this, but it seems like it could be useful for logging failures and also for avoiding a second immediate fault if the type gets converted but doesn't have the right protection yet. (Obviously, if this were changed, KVM would need the ability to report that it doesn't actually know the mode.) --Andy
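For context, the userspace side of this exit would look roughly like the fragment below, using the field layout from this patch; convert_range_to_private()/convert_range_to_shared() are hypothetical helpers that would perform the actual conversion (e.g. via the conversion ioctls elsewhere in the series).

/* Sketch of a VMM handler for the proposed exit (not a complete VMM). */
static void handle_memory_fault(struct kvm_run *run)
{
    if (run->exit_reason != KVM_EXIT_MEMORY_FAULT)
        return;

    /* Bit 0 of flags says which way the implicit/explicit conversion goes. */
    if (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)
        convert_range_to_private(run->memory.gpa, run->memory.size);
    else
        convert_range_to_shared(run->memory.gpa, run->memory.size);
}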
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On 7/21/22 14:19, Sean Christopherson wrote: On Thu, Jul 21, 2022, Gupta, Pankaj wrote: I view it as a performance problem because nothing stops KVM from copying from userspace into the private fd during the SEV ioctl(). What's missing is the ability for userspace to directly initialze the private fd, which may or may not avoid an extra memcpy() depending on how clever userspace is. Can you please elaborate more what you see as a performance problem? And possible ways to solve it? Oh, I'm not saying there actually _is_ a performance problem. What I'm saying is that in-place encryption is not a functional requirement, which means it's purely an optimization, and thus we should other bother supporting in-place encryption _if_ it would solve a performane bottleneck. Even if we end up having a performance problem, I think we need to understand the workloads that we want to optimize before getting too excited about designing a speedup. In particular, there's (depending on the specific technology, perhaps, and also architecture) a possible tradeoff between trying to reduce copying and trying to reduce unmapping and the associated flushes. If a user program maps an fd, populates it, and then converts it in place into private memory (especially if it doesn't do it in a single shot), then that memory needs to get unmapped both from the user mm and probably from the kernel direct map. On the flip side, it's possible to imagine an ioctl that does copy-and-add-to-private-fd that uses a private mm and doesn't need any TLB IPIs. All of this is to say that trying to optimize right now seems quite premature to me.
[PATCH] hw/vhost-user-blk: turn on VIRTIO_BLK_F_SIZE_MAX feature for virtio blk device
Turn on the pre-defined feature VIRTIO_BLK_F_SIZE_MAX for the virtio blk device so that guest DMA requests cannot grow larger than the hardware spec allows. Signed-off-by: Andy Pei --- hw/block/vhost-user-blk.c | 1 + 1 file changed, 1 insertion(+) diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c index ba13cb8..eb1264a 100644 --- a/hw/block/vhost-user-blk.c +++ b/hw/block/vhost-user-blk.c @@ -252,6 +252,7 @@ static uint64_t vhost_user_blk_get_features(VirtIODevice *vdev, VHostUserBlk *s = VHOST_USER_BLK(vdev); /* Turn on pre-defined features */ +virtio_add_feature(&features, VIRTIO_BLK_F_SIZE_MAX); virtio_add_feature(&features, VIRTIO_BLK_F_SEG_MAX); virtio_add_feature(&features, VIRTIO_BLK_F_GEOMETRY); virtio_add_feature(&features, VIRTIO_BLK_F_TOPOLOGY); -- 1.8.3.1
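For reference, a sketch of roughly how a Linux guest consumes this bit once the backend offers it, simplified from the guest virtio-blk probe path; it assumes the usual <linux/virtio_config.h>, <linux/virtio_blk.h> and <linux/blkdev.h> definitions.

static void virtblk_apply_size_max(struct virtio_device *vdev,
                                   struct request_queue *q)
{
    u32 size_max;

    if (!virtio_has_feature(vdev, VIRTIO_BLK_F_SIZE_MAX))
        return;

    /* size_max is the device's limit on any single segment, so cap the
     * block layer's segment size and DMA requests stay within spec. */
    virtio_cread(vdev, struct virtio_blk_config, size_max, &size_max);
    blk_queue_max_segment_size(q, size_max);
}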
Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
On 5/19/22 08:37, Chao Peng wrote: Extend the memslot definition to provide guest private memory through a file descriptor(fd) instead of userspace_addr(hva). Such guest private memory(fd) may never be mapped into userspace so no userspace_addr(hva) can be used. Instead add another two new fields (private_fd/private_offset), plus the existing memory_size to represent the private memory range. Such memslot can still have the existing userspace_addr(hva). When use, a single memslot can maintain both private memory through private fd(private_fd/private_offset) and shared memory through hva(userspace_addr). A GPA is considered private by KVM if the memslot has private fd and that corresponding page in the private fd is populated, otherwise, it's shared. So this is a strange API and, IMO, a layering violation. I want to make sure that we're all actually on board with making this a permanent part of the Linux API. Specifically, we end up with a multiplexing situation as you have described. For a given GPA, there are *two* possible host backings: an fd-backed one (from the fd, which is private for now might might end up potentially shared depending on future extensions) and a VMA-backed one. The selection of which one backs the address is made internally by whatever backs the fd. This, IMO, a clear layering violation. Normally, an fd has an associated address space, and pages in that address space can have contents, can be holes that appear to contain all zeros, or could have holes that are inaccessible. If you try to access a hole, you get whatever is in the hole. But now, with this patchset, the fd is more of an overlay and you get *something else* if you try to access through the hole. This results in operations on the fd bubbling up to the KVM mapping in what is, IMO, a strange way. If the user punches a hole, KVM has to modify its mappings such that the GPA goes to whatever VMA may be there. (And update the RMP, the hypervisor's tables, or whatever else might actually control privateness.) Conversely, if the user does fallocate to fill a hole, the guest mapping *to an unrelated page* has to be zapped so that the fd's page shows up. And the RMP needs updating, etc. I am lukewarm on this for a few reasons. 1. This is weird. AFAIK nothing else works like this. Obviously this is subjecting, but "weird" and "layering violation" sometimes translate to "problematic locking". 2. fd-backed private memory can't have normal holes. If I make a memfd, punch a hole in it, and mmap(MAP_SHARED) it, I end up with a page that reads as zero. If I write to it, the page gets allocated. But with this new mechanism, if I punch a hole and put it in a memslot, reads and writes go somewhere else. So what if I actually wanted lazily allocated private zeros? 2b. For a hypothetical future extension in which an fd can also have shared pages (for conversion, for example, or simply because the fd backing might actually be more efficient than indirecting through VMAs and therefore get used for shared memory or entirely-non-confidential VMs), lazy fd-backed zeros sound genuinely useful. 3. TDX hardware capability is not fully exposed. TDX can have a private page and a shared page at GPAs that differ only by the private bit. Sure, no one plans to use this today, but baking this into the user ABI throws away half the potential address space. 3b. Any software solution that works like TDX (which IMO seems like an eminently reasonable design to me) has the same issue. 
The alternative would be to have some kind of separate table or bitmap (part of the memslot?) that tells KVM whether a GPA should map to the fd. What do you all think?
Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
On Fri, May 20, 2022, at 11:31 AM, Sean Christopherson wrote: > But a dedicated KVM ioctl() to add/remove shared ranges would be easy > to implement > and wouldn't necessarily even need to interact with the memslots. It > could be a > consumer of memslots, e.g. if we wanted to disallow registering regions > without an > associated memslot, but I think we'd want to avoid even that because > things will > get messy during memslot updates, e.g. if dirty logging is toggled or a > shared > memory region is temporarily removed then we wouldn't want to destroy > the tracking. > > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray > should be far more efficient. > > One benefit to explicitly tracking this in KVM is that it might be > useful for > software-only protected VMs, e.g. KVM could mark a region in the XArray > as "pending" > based on guest hypercalls to share/unshare memory, and then complete > the transaction > when userspace invokes the ioctl() to complete the share/unshare. That makes sense. If KVM goes this route, perhaps there the allowed states for a GPA should include private, shared, and also private-and-shared. Then anyone who wanted to use the same masked GPA for shared and private on TDX could do so if they wanted to.
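As a rough sketch of the XArray-based tracking suggested here, including the private-and-shared state mentioned above; the names and the default-to-shared policy are made up for illustration and not taken from any posted patch.

#include <linux/types.h>
#include <linux/xarray.h>

enum gpa_state { GPA_SHARED = 1, GPA_PRIVATE = 2, GPA_BOTH = 3 };

struct kvm_gpa_attrs {
    struct xarray states;          /* index: gfn, value: enum gpa_state */
};

static void init_gpa_attrs(struct kvm_gpa_attrs *a)
{
    xa_init(&a->states);
}

static int set_gpa_state(struct kvm_gpa_attrs *a, u64 gfn, enum gpa_state s)
{
    return xa_err(xa_store(&a->states, gfn, xa_mk_value(s), GFP_KERNEL));
}

static enum gpa_state get_gpa_state(struct kvm_gpa_attrs *a, u64 gfn)
{
    void *entry = xa_load(&a->states, gfn);

    /* Untracked GFNs default to shared in this sketch. */
    return entry ? xa_to_value(entry) : GPA_SHARED;
}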
Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE
On 1/18/22 05:21, Chao Peng wrote: From: "Kirill A. Shutemov" Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of the file is inaccessible from userspace through ordinary MMU access (e.g., read/write/mmap). However, the file content can be accessed via a different mechanism (e.g. KVM MMU) indirectly. It provides semantics required for KVM guest private memory support that a file descriptor with this seal set is going to be used as the source of guest memory in confidential computing environments such as Intel TDX/AMD SEV but may not be accessible from host userspace. At this time only shmem implements this seal. I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE essentially transmutes a memfd into a different type of object. While this can apparently be done successfully and without races (as in this code), it's at least awkward. I think that either creating a special inaccessible memfd should be a single operation that create the correct type of object or there should be a clear justification for why it's a two-step process. (Imagine if the way to create an eventfd would be to call timerfd_create() and then do a special fcntl to turn it into an eventfd but only if it's not currently armed. This would be weird.)
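To make the two-step shape of the API concrete, this is roughly what creating such an object looks like from userspace with the series applied. F_SEAL_INACCESSIBLE exists only in these patches, so the fallback value below is a placeholder rather than an upstream constant, and whether the seal may be applied to an already-populated file is exactly the open question in this thread.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef F_SEAL_INACCESSIBLE
#define F_SEAL_INACCESSIBLE 0x0020   /* placeholder value for illustration */
#endif

/* Step 1: an ordinary sealable memfd; step 2: transmute it by sealing. */
int create_private_backing(size_t size)
{
    int fd = memfd_create("guest-private", MFD_ALLOW_SEALING);

    if (fd < 0 || ftruncate(fd, size) < 0)
        goto err;

    /* After this, ordinary read/write/mmap from userspace fail; only an
     * in-kernel consumer such as KVM can reach the contents. */
    if (fcntl(fd, F_ADD_SEALS, F_SEAL_INACCESSIBLE) < 0)
        goto err;

    return fd;
err:
    if (fd >= 0)
        close(fd);
    return -1;
}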
Re: [PATCH v4 04/12] mm/shmem: Support memfile_notifier
On 1/18/22 05:21, Chao Peng wrote: It maintains a memfile_notifier list in shmem_inode_info structure and implements memfile_pfn_ops callbacks defined by memfile_notifier. It then exposes them to memfile_notifier via shmem_get_memfile_notifier_info. We use SGP_NOALLOC in shmem_get_lock_pfn since the pages should be allocated by userspace for private memory. If there is no pages allocated at the offset then error should be returned so KVM knows that the memory is not private memory. Signed-off-by: Kirill A. Shutemov Signed-off-by: Chao Peng static int memfile_get_notifier_info(struct inode *inode, struct memfile_notifier_list **list, struct memfile_pfn_ops **ops) { - return -EOPNOTSUPP; + int ret = -EOPNOTSUPP; +#ifdef CONFIG_SHMEM + ret = shmem_get_memfile_notifier_info(inode, list, ops); +#endif + return ret; } +int shmem_get_memfile_notifier_info(struct inode *inode, + struct memfile_notifier_list **list, + struct memfile_pfn_ops **ops) +{ + struct shmem_inode_info *info; + + if (!shmem_mapping(inode->i_mapping)) + return -EINVAL; + + info = SHMEM_I(inode); + *list = &info->memfile_notifiers; + if (ops) + *ops = &shmem_pfn_ops; + + return 0; I can't wrap my head around exactly who is supposed to call these functions and when, but there appears to be a missing check that the inode is actually a shmem inode. What is this code trying to do? It's very abstract.
Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE
On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote:
> On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote:
>> On 1/18/22 05:21, Chao Peng wrote:
>> > From: "Kirill A. Shutemov"
>> >
>> > Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of the file is inaccessible from userspace through ordinary MMU access (e.g., read/write/mmap). However, the file content can be accessed via a different mechanism (e.g. KVM MMU) indirectly.
>> >
>> > It provides semantics required for KVM guest private memory support that a file descriptor with this seal set is going to be used as the source of guest memory in confidential computing environments such as Intel TDX/AMD SEV but may not be accessible from host userspace.
>> >
>> > At this time only shmem implements this seal.
>> >
>> I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE essentially transmutes a memfd into a different type of object. While this can apparently be done successfully and without races (as in this code), it's at least awkward. I think that either creating a special inaccessible memfd should be a single operation that creates the correct type of object or there should be a clear justification for why it's a two-step process.
>
> Now one justification may be from Steven's comment to patch-00: for ARM usage it can be used with creating a normal memfd, (partially) populate it with initial guest memory content (e.g. firmware), and then F_SEAL_INACCESSIBLE it just before the first launch of the guest in KVM (definitely the current code needs to be changed to support that).

Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right? So this won't work.

In any case, the whole confidential VM initialization story is a bit muddy. From the earlier emails, it sounds like ARM expects the host to fill in guest memory and measure it. From my recollection of Intel's scheme (which may well be wrong, and I could easily be confusing it with SGX), TDX instead measures what is essentially a transcript of the series of operations that initializes the VM. These are fundamentally not the same thing even if they accomplish the same end goal. For TDX, we unavoidably need an operation (ioctl or similar) that initializes things according to the VM's instructions, and ARM ought to be able to use roughly the same mechanism.

Also, if we ever get fancy and teach the page allocator about memory with reduced directmap permissions, it may well be more efficient for userspace to shove data into a memfd via ioctl than it is to mmap it and write the data.
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Mon, Apr 25, 2022 at 1:31 PM Sean Christopherson wrote: > > On Mon, Apr 25, 2022, Andy Lutomirski wrote: > > > > > > On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote: > > > On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote: > > >> > > > > >> > > >> 2. Bind the memfile to a VM (or at least to a VM technology). Now it's > > >> in > > >> the initial state appropriate for that VM. > > >> > > >> For TDX, this completely bypasses the cases where the data is > > >> prepopulated > > >> and TDX can't handle it cleanly. > > I believe TDX can handle this cleanly, TDH.MEM.PAGE.ADD doesn't require that > the > source and destination have different HPAs. There's just no pressing need to > support such behavior because userspace is highly motivated to keep the > initial > image small for performance reasons, i.e. burning a few extra pages while > building > the guest is a non-issue. Following up on this, rather belatedly. After re-reading the docs, TDX can populate guest memory using TDH.MEM.PAGE.ADD, but see Intel® TDX Module Base Spec v1.5, section 2.3, step D.4 substeps 1 and 2 here: https://www.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1.5-base-spec-348549001.pdf For each TD page: 1. The host VMM specifies a TDR as a parameter and calls the TDH.MEM.PAGE.ADD function. It copies the contents from the TD image page into the target TD page which is encrypted with the TD ephemeral key. TDH.MEM.PAGE.ADD also extends the TD measurement with the page GPA. 2. The host VMM extends the TD measurement with the contents of the new page by calling the TDH.MR.EXTEND function on each 256- byte chunk of the new TD page. So this is a bit like SGX. There is a specific series of operations that have to be done in precisely the right order to reproduce the intended TD measurement. Otherwise the guest will boot and run until it tries to get a report and then it will have a hard time getting anyone to believe its report. So I don't think the host kernel can get away with host userspace just providing pre-populated memory. Userspace needs to tell the host kernel exactly what sequence of adds, extends, etc to perform and in what order, and the host kernel needs to do precisely what userspace asks it to do. "Here's the contents of memory" doesn't cut it unless the tooling that builds the guest image matches the exact semantics that the host kernel provides. --Andy
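To make the ordering constraint explicit, the build flow that userspace would have to describe to the host kernel looks roughly like the sketch below. The names tdx_page_add() and tdx_mr_extend() are invented stand-ins for whatever KVM ioctl()s end up wrapping the TDH.MEM.PAGE.ADD and TDH.MR.EXTEND SEAMCALLs, and struct td / struct td_image_page are illustrative; only the ordering is the point.

/* Illustrative only: the TD measurement depends on doing exactly this, in order. */
#include <stddef.h>
#include <stdint.h>

#define TD_PAGE_SIZE	4096
#define EXTEND_CHUNK	256

struct td;				/* opaque handle to the TD being built */
struct td_image_page {
	uint64_t gpa;
	const void *data;
};

int tdx_page_add(struct td *td, uint64_t gpa, const void *data);	/* hypothetical */
int tdx_mr_extend(struct td *td, uint64_t gpa);				/* hypothetical */

static int build_td_image(struct td *td, const struct td_image_page *pages,
			  size_t nr_pages)
{
	for (size_t i = 0; i < nr_pages; i++) {
		/* Copy + encrypt the page; the measurement is extended with its GPA. */
		int ret = tdx_page_add(td, pages[i].gpa, pages[i].data);

		if (ret)
			return ret;

		/* Extend the measurement with the contents, 256 bytes at a time. */
		for (size_t off = 0; off < TD_PAGE_SIZE; off += EXTEND_CHUNK) {
			ret = tdx_mr_extend(td, pages[i].gpa + off);
			if (ret)
				return ret;
		}
	}
	return 0;
}

Reordering the adds, skipping an extend, or extending before adding produces a different measurement, which is why "here's the contents of memory" is not enough information for the host kernel.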
Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
On Tue, Jun 14, 2022 at 12:32 AM Chao Peng wrote:
>
> On Thu, Jun 09, 2022 at 08:29:06PM +0000, Sean Christopherson wrote:
> > On Wed, Jun 08, 2022, Vishal Annapurve wrote:
> >
> > One argument is that userspace can simply rely on cgroups to detect misbehaving guests, but (a) those types of OOMs will be a nightmare to debug and (b) an OOM kill from the host is typically considered a _host_ issue and will be treated as a missed SLO.
> >
> > An idea for handling this in the kernel without too much complexity would be to add F_SEAL_FAULT_ALLOCATIONS (terrible name) that would prevent page faults from allocating pages, i.e. holes can only be filled by an explicit fallocate(). Minor faults, e.g. due to NUMA balancing stupidity, and major faults due to swap would still work, but writes to previously unreserved/unallocated memory would get a SIGSEGV on something it has mapped. That would allow the userspace VMM to prevent unintentional allocations without having to coordinate unmapping/remapping across multiple processes.
>
> Since this is mainly for shared memory and the motivation is catching misbehaved access, can we use mprotect(PROT_NONE) for this? We can mark those ranges backed by the private fd as PROT_NONE during the conversion so subsequent misbehaved accesses will be blocked instead of causing double allocation silently.

This patch series is fairly close to implementing a rather more efficient solution. I'm not familiar enough with hypervisor userspace to really know if this would work, but: what if shared guest memory could also be file-backed, either in the same fd or with a second fd covering the shared portion of a memslot? This would allow changes to the backing store (punching holes, etc.) to be done without mmap_lock or host-userspace TLB flushes. Depending on what the guest is doing with its shared memory, userspace might need the memory mapped or it might not.

--Andy
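As an aside, the mprotect() suggestion above would look something like the following in the VMM. This is just ordinary mprotect() usage (the function names and page-aligned arguments are assumptions for illustration), with the caveat raised later in the thread that mprotect() manipulates VMAs and so shares munmap()'s downsides.

/* Illustrative VMM-side handling of a shared<->private conversion request.
 * shared_base is the start of the mmap()ed shared backing; offset and len
 * are assumed to be page-aligned. */
#include <sys/mman.h>
#include <err.h>

static void on_convert_to_private(void *shared_base, size_t offset, size_t len)
{
	/* Block further userspace access: a later stray access gets SIGSEGV
	 * instead of silently allocating a second (shared) copy of the page. */
	if (mprotect((char *)shared_base + offset, len, PROT_NONE))
		err(1, "mprotect");
}

static void on_convert_to_shared(void *shared_base, size_t offset, size_t len)
{
	if (mprotect((char *)shared_base + offset, len, PROT_READ | PROT_WRITE))
		err(1, "mprotect");
}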
Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
On Tue, Jun 14, 2022 at 12:09 PM Sean Christopherson wrote: > > On Tue, Jun 14, 2022, Andy Lutomirski wrote: > > On Tue, Jun 14, 2022 at 12:32 AM Chao Peng > > wrote: > > > > > > On Thu, Jun 09, 2022 at 08:29:06PM +, Sean Christopherson wrote: > > > > On Wed, Jun 08, 2022, Vishal Annapurve wrote: > > > > > > > > One argument is that userspace can simply rely on cgroups to detect > > > > misbehaving > > > > guests, but (a) those types of OOMs will be a nightmare to debug and > > > > (b) an OOM > > > > kill from the host is typically considered a _host_ issue and will be > > > > treated as > > > > a missed SLO. > > > > > > > > An idea for handling this in the kernel without too much complexity > > > > would be to > > > > add F_SEAL_FAULT_ALLOCATIONS (terrible name) that would prevent page > > > > faults from > > > > allocating pages, i.e. holes can only be filled by an explicit > > > > fallocate(). Minor > > > > faults, e.g. due to NUMA balancing stupidity, and major faults due to > > > > swap would > > > > still work, but writes to previously unreserved/unallocated memory > > > > would get a > > > > SIGSEGV on something it has mapped. That would allow the userspace VMM > > > > to prevent > > > > unintentional allocations without having to coordinate > > > > unmapping/remapping across > > > > multiple processes. > > > > > > Since this is mainly for shared memory and the motivation is catching > > > misbehaved access, can we use mprotect(PROT_NONE) for this? We can mark > > > those range backed by private fd as PROT_NONE during the conversion so > > > subsequence misbehaved accesses will be blocked instead of causing double > > > allocation silently. > > PROT_NONE, a.k.a. mprotect(), has the same vma downsides as munmap(). > > > This patch series is fairly close to implementing a rather more > > efficient solution. I'm not familiar enough with hypervisor userspace > > to really know if this would work, but: > > > > What if shared guest memory could also be file-backed, either in the > > same fd or with a second fd covering the shared portion of a memslot? > > This would allow changes to the backing store (punching holes, etc) to > > be some without mmap_lock or host-userspace TLB flushes? Depending on > > what the guest is doing with its shared memory, userspace might need > > the memory mapped or it might not. > > That's what I'm angling for with the F_SEAL_FAULT_ALLOCATIONS idea. The > issue, > unless I'm misreading code, is that punching a hole in the shared memory > backing > store doesn't prevent reallocating that hole on fault, i.e. a helper process > that > keeps a valid mapping of guest shared memory can silently fill the hole. > > What we're hoping to achieve is a way to prevent allocating memory without a > very > explicit action from userspace, e.g. fallocate(). Ah, I misunderstood. I thought your goal was to mmap it and prevent page faults from allocating. It is indeed the case (and has been since before quite a few of us were born) that a hole in a sparse file is logically just a bunch of zeros. A way to make a file for which a hole is an actual hole seems like it would solve this problem nicely. It could also be solved more specifically for KVM by making sure that the private/shared mode that userspace programs is strict enough to prevent accidental allocations -- if a GPA is definitively private, shared, neither, or (potentially, on TDX only) both, then a page that *isn't* shared will never be accidentally allocated by KVM. 
If the shared backing is not mmapped, it also won't be accidentally allocated by host userspace on a stray or careless write. --Andy
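To spell out the failure mode being described: with today's shmem semantics, punching a hole only frees the pages, and any process that still has the file mapped can fault them straight back in. A minimal self-contained illustration using only existing APIs (the ordinary memfd here stands in for the shared backing store):

/* Demonstrates that a punched hole in a shared mapping silently refills. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <err.h>

int main(void)
{
	size_t len = 2 * 1024 * 1024;
	int fd = memfd_create("guest-shared", 0);
	char *p;

	if (fd < 0 || ftruncate(fd, len))
		err(1, "setup");
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");

	memset(p, 0xaa, len);			/* pages allocated */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, len))
		err(1, "fallocate");		/* pages freed ("converted away") */
	p[0] = 0x55;				/* fault silently re-allocates a page */
	return 0;
}

The double-allocation problem is that the page refilled by that last write coexists with the private copy, which is exactly what the explicit-allocation / F_SEAL_FAULT_ALLOCATIONS idea is meant to prevent.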
Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE
On 2/23/22 04:05, Steven Price wrote: On 23/02/2022 11:49, Chao Peng wrote: On Thu, Feb 17, 2022 at 11:09:35AM -0800, Andy Lutomirski wrote: On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote: On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote: On 1/18/22 05:21, Chao Peng wrote: From: "Kirill A. Shutemov" Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of the file is inaccessible from userspace through ordinary MMU access (e.g., read/write/mmap). However, the file content can be accessed via a different mechanism (e.g. KVM MMU) indirectly. It provides semantics required for KVM guest private memory support that a file descriptor with this seal set is going to be used as the source of guest memory in confidential computing environments such as Intel TDX/AMD SEV but may not be accessible from host userspace. At this time only shmem implements this seal. I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE essentially transmutes a memfd into a different type of object. While this can apparently be done successfully and without races (as in this code), it's at least awkward. I think that either creating a special inaccessible memfd should be a single operation that create the correct type of object or there should be a clear justification for why it's a two-step process. Now one justification maybe from Stever's comment to patch-00: for ARM usage it can be used with creating a normal memfd, (partially)populate it with initial guest memory content (e.g. firmware), and then F_SEAL_INACCESSIBLE it just before the first time lunch of the guest in KVM (definitely the current code needs to be changed to support that). Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right? So this won't work. Hmm, right, if we set F_SEAL_INACCESSIBLE on a non-empty file, we will need to make sure access to existing mmap-ed area should be prevented, but that is hard. In any case, the whole confidential VM initialization story is a bit buddy. From the earlier emails, it sounds like ARM expects the host to fill in guest memory and measure it. From my recollection of Intel's scheme (which may well be wrong, and I could easily be confusing it with SGX), TDX instead measures what is essentially a transcript of the series of operations that initializes the VM. These are fundamentally not the same thing even if they accomplish the same end goal. For TDX, we unavoidably need an operation (ioctl or similar) that initializes things according to the VM's instructions, and ARM ought to be able to use roughly the same mechanism. Yes, TDX requires a ioctl. Steven may comment on the ARM part. The Arm story is evolving so I can't give a definite answer yet. Our current prototyping works by creating the initial VM content in a memslot as with a normal VM and then calling an ioctl which throws the big switch and converts all the (populated) pages to be protected. At this point the RMM performs a measurement of the data that the VM is being populated with. The above (in our prototype) suffers from all the expected problems with a malicious VMM being able to trick the host kernel into accessing those pages after they have been protected (causing a fault detected by the hardware). The ideal (from our perspective) approach would be to follow the same flow but where the VMM populates a memfd rather than normal anonymous pages. The memfd could then be sealed and the pages converted to protected ones (with the RMM measuring them in the process). 
The question becomes how is that memfd populated? It would be nice if that could be done using normal operations on a memfd (i.e. using mmap()) and therefore this code could be (relatively) portable. This would mean that any pages mapped from the memfd would either need to block the sealing or be revoked at the time of sealing. The other approach is we could of course implement a special ioctl which effectively does a memcpy into the (created empty and sealed) memfd and does the necessary dance with the RMM to measure the contents. This would match the "transcript of the series of operations" described above - but seems much less ideal from the viewpoint of the VMM.

A VMM that supports Other Vendors will need to understand this sort of model regardless.

I don't particularly mind the idea of having the kernel consume a normal memfd and spit out a new object, but I find the concept of changing the type of the object in place, even if it has other references, and trying to control all the resulting races to be somewhat alarming. In pseudo-Rust, this is the difference between:

    fn convert_to_private(in: &mut Memfd)

and

    fn convert_to_private(in: Memfd) -> PrivateMemoryFd

This doesn't map particularly nicely to the kernel, though.

--Andy
Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
On Wed, Jul 13, 2022, at 3:35 AM, Gupta, Pankaj wrote: >>>> This is the v7 of this series which tries to implement the fd-based KVM >>>> guest private memory. The patches are based on latest kvm/queue branch >>>> commit: >>>> >>>> b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU >>>> split_desc_cache only by default capacity >>>> >>>> Introduction >>>> >>>> In general this patch series introduce fd-based memslot which provides >>>> guest memory through memory file descriptor fd[offset,size] instead of >>>> hva/size. The fd can be created from a supported memory filesystem >>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM >>> >>> Thinking a bit, As host side fd on tmpfs or shmem will store memory on host >>> page cache instead of mapping pages into userspace address space. Can we hit >>> double (un-coordinated) page cache problem with this when guest page cache >>> is also used? >> >> This is my understanding: in host it will be indeed in page cache (in >> current shmem implementation) but that's just the way it allocates and >> provides the physical memory for the guest. In guest, guest OS will not >> see this fd (absolutely), it only sees guest memory, on top of which it >> can build its own page cache system for its own file-mapped content but >> that is unrelated to host page cache. > > yes. If guest fills its page cache with file backed memory, this at host > side(on shmem fd backend) will also fill the host page cache fast. This > can have an impact on performance of guest VM's if host goes to memory > pressure situation sooner. Or else we end up utilizing way less System > RAM. Is this in any meaningful way different from a regular VM? --Andy
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Thu, Mar 10, 2022 at 6:09 AM Chao Peng wrote: > > This is the v5 of this series which tries to implement the fd-based KVM > guest private memory. The patches are based on latest kvm/queue branch > commit: > > d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2 Can this series be run and a VM booted without TDX? A feature like that might help push it forward. --Andy
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Wed, Mar 30, 2022, at 10:58 AM, Sean Christopherson wrote: > On Wed, Mar 30, 2022, Quentin Perret wrote: >> On Wednesday 30 Mar 2022 at 09:58:27 (+0100), Steven Price wrote: >> > On 29/03/2022 18:01, Quentin Perret wrote: >> > > Is implicit sharing a thing? E.g., if a guest makes a memory access in >> > > the shared gpa range at an address that doesn't have a backing memslot, >> > > will KVM check whether there is a corresponding private memslot at the >> > > right offset with a hole punched and report a KVM_EXIT_MEMORY_ERROR? Or >> > > would that just generate an MMIO exit as usual? >> > >> > My understanding is that the guest needs some way of tagging whether a >> > page is expected to be shared or private. On the architectures I'm aware >> > of this is done by effectively stealing a bit from the IPA space and >> > pretending it's a flag bit. >> >> Right, and that is in fact the main point of divergence we have I think. >> While I understand this might be necessary for TDX and the likes, this >> makes little sense for pKVM. This would effectively embed into the IPA a >> purely software-defined non-architectural property/protocol although we >> don't actually need to: we (pKVM) can reasonably expect the guest to >> explicitly issue hypercalls to share pages in-place. So I'd be really >> keen to avoid baking in assumptions about that model too deep in the >> host mm bits if at all possible. > > There is no assumption about stealing PA bits baked into this API. Even > within > x86 KVM, I consider it a hard requirement that the common flows not assume the > private vs. shared information is communicated through the PA. Quentin, I think we might need a clarification. The API in this patchset indeed has no requirement that a PA bit distinguish between private and shared, but I think it makes at least a weak assumption that *something*, a priori, distinguishes them. In particular, there are private memslots and shared memslots, so the logical flow of resolving a guest memory access looks like: 1. guest accesses a GVA 2. read guest paging structures 3. determine whether this is a shared or private access 4. read host (KVM memslots and anything else, EPT, NPT, RMP, etc) structures accordingly. In particular, the memslot to reference is different depending on the access type. For TDX, this maps on to the fd-based model perfectly: the host-side paging structures for the shared and private slots are completely separate. For SEV, the structures are shared and KVM will need to figure out what to do in case a private and shared memslot overlap. Presumably it's sufficient to declare that one of them wins, although actually determining which one is active for a given GPA may involve checking whether the backing store for a given page actually exists. But I don't understand pKVM well enough to understand how it fits in. Quentin, how is the shared vs private mode of a memory access determined? How do the paging structures work? Can a guest switch between shared and private by issuing a hypercall without changing any guest-side paging structures or anything else? It's plausible that SEV and (maybe) pKVM would be better served if memslots could be sparse or if there was otherwise a direct way for host userspace to indicate to KVM which address ranges are actually active (not hole-punched) in a given memslot or to otherwise be able to make a rule that two different memslots (one shared and one private) can't claim the same address.
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Fri, Apr 1, 2022, at 7:59 AM, Quentin Perret wrote: > On Thursday 31 Mar 2022 at 09:04:56 (-0700), Andy Lutomirski wrote: > To answer your original question about memory 'conversion', the key > thing is that the pKVM hypervisor controls the stage-2 page-tables for > everyone in the system, all guests as well as the host. As such, a page > 'conversion' is nothing more than a permission change in the relevant > page-tables. > So I can see two different ways to approach this. One is that you split the whole address space in half and, just like SEV and TDX, allocate one bit to indicate the shared/private status of a page. This makes it work a lot like SEV and TDX. The other is to have shared and private pages be distinguished only by their hypercall history and the (protected) page tables. This saves some address space and some page table allocations, but it opens some cans of worms too. In particular, the guest and the hypervisor need to coordinate, in a way that the guest can trust, to ensure that the guest's idea of which pages are private match the host's. This model seems a bit harder to support nicely with the private memory fd model, but not necessarily impossible. Also, what are you trying to accomplish by having the host userspace mmap private pages? Is the idea that multiple guest could share the same page until such time as one of them tries to write to it? That would be kind of like having a third kind of memory that's visible to host and guests but is read-only for everyone. TDX and SEV can't support this at all (a private page belongs to one guest and one guest only, at least in SEV and in the current TDX SEAM spec). I imagine that this could be supported with private memory fds with some care without mmap, though -- the host could still populate the page with memcpy. Or I suppose a memslot could support using MAP_PRIVATE fds and have approximately the right semantics. --Andy
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Mon, Apr 4, 2022, at 10:06 AM, Sean Christopherson wrote: > On Mon, Apr 04, 2022, Quentin Perret wrote: >> On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote: >> FWIW, there are a couple of reasons why I'd like to have in-place >> conversions: >> >> - one goal of pKVM is to migrate some things away from the Arm >>Trustzone environment (e.g. DRM and the likes) and into protected VMs >>instead. This will give Linux a fighting chance to defend itself >>against these things -- they currently have access to _all_ memory. >>And transitioning pages between Linux and Trustzone (donations and >>shares) is fast and non-destructive, so we really do not want pKVM to >>regress by requiring the hypervisor to memcpy things; > > Is there actually a _need_ for the conversion to be non-destructive? > E.g. I assume > the "trusted" side of things will need to be reworked to run as a pKVM > guest, at > which point reworking its logic to understand that conversions are > destructive and > slow-ish doesn't seem too onerous. > >> - it can be very useful for protected VMs to do shared=>private >>conversions. Think of a VM receiving some data from the host in a >>shared buffer, and then it wants to operate on that buffer without >>risking to leak confidential informations in a transient state. In >>that case the most logical thing to do is to convert the buffer back >>to private, do whatever needs to be done on that buffer (decrypting a >>frame, ...), and then share it back with the host to consume it; > > If performance is a motivation, why would the guest want to do two > conversions > instead of just doing internal memcpy() to/from a private page? I > would be quite > surprised if multiple exits and TLB shootdowns is actually faster, > especially at > any kind of scale where zapping stage-2 PTEs will cause lock contention > and IPIs. I don't know the numbers or all the details, but this is arm64, which is a rather better architecture than x86 in this regard. So maybe it's not so bad, at least in very simple cases, ignoring all implementation details. (But see below.) Also the systems in question tend to have fewer CPUs than some of the massive x86 systems out there. If we actually wanted to support transitioning the same page between shared and private, though, we have a bit of an awkward situation. Private to shared is conceptually easy -- do some bookkeeping, reconstitute the direct map entry, and it's done. The other direction is a mess: all existing uses of the page need to be torn down. If the page has been recently used for DMA, this includes IOMMU entries. Quentin: let's ignore any API issues for now. Do you have a concept of how a nondestructive shared -> private transition could work well, even in principle? The best I can come up with is a special type of shared page that is not GUP-able and maybe not even mmappable, having a clear option for transitions to fail, and generally preventing the nasty cases from happening in the first place. Maybe there could be a special mode for the private memory fds in which specific pages are marked as "managed by this fd but actually shared". pread() and pwrite() would work on those pages, but not mmap(). (Or maybe mmap() but the resulting mappings would not permit GUP.) And transitioning them would be a special operation on the fd that is specific to pKVM and wouldn't work on TDX or SEV. Hmm. Sean and Chao, are we making a bit of a mistake by making these fds technology-agnostic? 
That is, would we want to distinguish between a TDX backing fd, a SEV backing fd, a software-based backing fd, etc? API-wise this could work by requiring the fd to be bound to a KVM VM instance and possibly even configured a bit before any other operations would be allowed. (Destructive transitions nicely avoid all the nasty cases. If something is still pinning a shared page when it's "transitioned" to private (really just replaced with a new page), then the old page continues existing for as long as needed as a separate object.)
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Tue, Apr 5, 2022, at 3:36 AM, Quentin Perret wrote: > On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote: >> >> >> On Mon, Apr 4, 2022, at 10:06 AM, Sean Christopherson wrote: >> > On Mon, Apr 04, 2022, Quentin Perret wrote: >> >> On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote: >> >> FWIW, there are a couple of reasons why I'd like to have in-place >> >> conversions: >> >> >> >> - one goal of pKVM is to migrate some things away from the Arm >> >>Trustzone environment (e.g. DRM and the likes) and into protected VMs >> >>instead. This will give Linux a fighting chance to defend itself >> >>against these things -- they currently have access to _all_ memory. >> >>And transitioning pages between Linux and Trustzone (donations and >> >>shares) is fast and non-destructive, so we really do not want pKVM to >> >>regress by requiring the hypervisor to memcpy things; >> > >> > Is there actually a _need_ for the conversion to be non-destructive? >> > E.g. I assume >> > the "trusted" side of things will need to be reworked to run as a pKVM >> > guest, at >> > which point reworking its logic to understand that conversions are >> > destructive and >> > slow-ish doesn't seem too onerous. >> > >> >> - it can be very useful for protected VMs to do shared=>private >> >>conversions. Think of a VM receiving some data from the host in a >> >>shared buffer, and then it wants to operate on that buffer without >> >>risking to leak confidential informations in a transient state. In >> >>that case the most logical thing to do is to convert the buffer back >> >>to private, do whatever needs to be done on that buffer (decrypting a >> >>frame, ...), and then share it back with the host to consume it; >> > >> > If performance is a motivation, why would the guest want to do two >> > conversions >> > instead of just doing internal memcpy() to/from a private page? I >> > would be quite >> > surprised if multiple exits and TLB shootdowns is actually faster, >> > especially at >> > any kind of scale where zapping stage-2 PTEs will cause lock contention >> > and IPIs. >> >> I don't know the numbers or all the details, but this is arm64, which is a >> rather better architecture than x86 in this regard. So maybe it's not so >> bad, at least in very simple cases, ignoring all implementation details. >> (But see below.) Also the systems in question tend to have fewer CPUs than >> some of the massive x86 systems out there. > > Yep. I can try and do some measurements if that's really necessary, but > I'm really convinced the cost of the TLBI for the shared->private > conversion is going to be significantly smaller than the cost of memcpy > the buffer twice in the guest for us. To be fair, although the cost for > the CPU update is going to be low, the cost for IOMMU updates _might_ be > higher, but that very much depends on the hardware. On systems that use > e.g. the Arm SMMU, the IOMMUs can use the CPU page-tables directly, and > the iotlb invalidation is done on the back of the CPU invalidation. So, > on systems with sane hardware the overhead is *really* quite small. > > Also, memcpy requires double the memory, it is pretty bad for power, and > it causes memory traffic which can't be a good thing for things running > concurrently. > >> If we actually wanted to support transitioning the same page between shared >> and private, though, we have a bit of an awkward situation. Private to >> shared is conceptually easy -- do some bookkeeping, reconstitute the direct >> map entry, and it's done. 
The other direction is a mess: all existing uses >> of the page need to be torn down. If the page has been recently used for >> DMA, this includes IOMMU entries. >> >> Quentin: let's ignore any API issues for now. Do you have a concept of how >> a nondestructive shared -> private transition could work well, even in >> principle? > > I had a high level idea for the workflow, but I haven't looked into the > implementation details. > > The idea would be to allow KVM *or* userspace to take a reference > to a page in the fd in an exclusive manner. KVM could take a reference > on a page (which would be necessary before to donating it to a guest) > using some kind of memfile_notifier as proposed in this series, and >
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Tue, Apr 5, 2022, at 11:30 AM, Sean Christopherson wrote: > On Tue, Apr 05, 2022, Andy Lutomirski wrote: > >> resume guest >> *** host -> hypervisor -> guest *** >> Guest unshares the page. >> *** guest -> hypervisor *** >> Hypervisor removes PTE. TLBI. >> *** hypervisor -> guest *** >> >> Obviously considerable cleverness is needed to make a virt IOMMU like this >> work well, but still. >> >> Anyway, my suggestion is that the fd backing proposal get slightly modified >> to get it ready for multiple subtypes of backing object, which should be a >> pretty minimal change. Then, if someone actually needs any of this >> cleverness, it can be added later. In the mean time, the >> pread()/pwrite()/splice() scheme is pretty good. > > Tangentially related to getting private-fd ready for multiple things, > what about > implementing the pread()/pwrite()/splice() scheme in pKVM itself? I.e. > read() on > the VM fd, with the offset corresponding to gfn in some way. > Hmm, could make sense.
Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
On Thu, Apr 7, 2022, at 9:05 AM, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
>> Since page migration / swapping is not supported yet, MFD_INACCESSIBLE memory behaves like longterm pinned pages and thus should be accounted to mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
>>
>> Signed-off-by: Chao Peng
>> ---
>>  mm/shmem.c | 25 -
>>  1 file changed, 24 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 7b43e274c9a2..ae46fb96494b 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
>>  static void notify_invalidate_page(struct inode *inode, struct folio *folio,
>>                                     pgoff_t start, pgoff_t end)
>>  {
>> -#ifdef CONFIG_MEMFILE_NOTIFIER
>>         struct shmem_inode_info *info = SHMEM_I(inode);
>>
>> +#ifdef CONFIG_MEMFILE_NOTIFIER
>>         start = max(start, folio->index);
>>         end = min(end, folio->index + folio_nr_pages(folio));
>>
>>         memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
>>  #endif
>> +
>> +       if (info->xflags & SHM_F_INACCESSIBLE)
>> +               atomic64_sub(end - start, &current->mm->pinned_vm);
>
> As Vishal's to-be-posted selftest discovered, this is broken as current->mm may be NULL. Or it may be a completely different mm, e.g. AFAICT there's nothing that prevents a different process from punching hole in the shmem backing.

How about just not charging the mm in the first place? There's precedent: ramfs and hugetlbfs (at least sometimes — I've lost track of the current status).

In any case, for an administrator to try to assemble the various rlimits into a coherent policy is, and always has been, quite messy. ISTM cgroup limits, which can actually add across processes usefully, are much better.

So, aside from the fact that these fds aren't in a filesystem and are thus available by default, I'm not convinced that this accounting is useful or necessary. Maybe we could just have some switch required to enable creation of private memory in the first place, and anyone who flips that switch without configuring cgroups is subject to DoS.
Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
On Tue, Apr 12, 2022, at 7:36 AM, Jason Gunthorpe wrote: > On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote: > >> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered int he >> past already with secretmem, it's not 100% that good of a fit (unmovable >> is worth than mlocked). But it gets the job done for now at least. > > No, it doesn't. There are too many different interpretations how > MELOCK is supposed to work > > eg VFIO accounts per-process so hostile users can just fork to go past > it. > > RDMA is per-process but uses a different counter, so you can double up > > iouring is per-user and users a 3rd counter, so it can triple up on > the above two > >> So I'm open for alternative to limit the amount of unmovable memory we >> might allocate for user space, and then we could convert seretmem as well. > > I think it has to be cgroup based considering where we are now :\ > So this is another situation where the actual backend (TDX, SEV, pKVM, pure software) makes a difference -- depending on exactly what backend we're using, the memory may not be unmoveable. It might even be swappable (in the potentially distant future). Anyway, here's a concrete proposal, with a bit of handwaving: We add new cgroup limits: memory.unmoveable memory.locked These can be set to an actual number or they can be set to the special value ROOT_CAP. If they're set to ROOT_CAP, then anyone in the cgroup with capable(CAP_SYS_RESOURCE) (i.e. the global capability) can allocate movable or locked memory with this (and potentially other) new APIs. If it's 0, then they can't. If it's another value, then the memory can be allocated, charged to the cgroup, up to the limit, with no particular capability needed. The default at boot is ROOT_CAP. Anyone who wants to configure it differently is free to do so. This avoids introducing a DoS, makes it easy to run tests without configuring cgroup, and lets serious users set up their cgroups. Nothing is charge per mm. To make this fully sensible, we need to know what the backend is for the private memory before allocating any so that we can charge it accordingly.
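A sketch of how the proposed limits might behave, purely to pin down the semantics. Nothing below exists today: memory.unmoveable is not a real cgroup knob, struct mem_cgroup_like and its fields are invented, and locking is omitted.

/* Hypothetical charging policy for the proposed memory.unmoveable limit. */
#define UNMOVEABLE_ROOT_CAP	(~0UL)	/* the special "ROOT_CAP" setting */

struct mem_cgroup_like {		/* stand-in for the real mem_cgroup */
	unsigned long unmoveable_limit;	/* ROOT_CAP by default at boot */
	unsigned long unmoveable_usage;
};

static int charge_unmoveable(struct mem_cgroup_like *memcg, unsigned long nr_pages)
{
	if (memcg->unmoveable_limit == UNMOVEABLE_ROOT_CAP)
		return capable(CAP_SYS_RESOURCE) ? 0 : -EPERM;

	if (memcg->unmoveable_usage + nr_pages > memcg->unmoveable_limit)
		return -ENOMEM;

	memcg->unmoveable_usage += nr_pages;
	return 0;
}

The key property is that the charge follows the cgroup, not an mm, so forking or spreading a VMM across processes cannot multiply the allowance the way the per-process RLIMIT_MEMLOCK interpretations listed above can.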
Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
On 8/18/22 06:24, Kirill A . Shutemov wrote: On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote: On Wed, 6 Jul 2022, Chao Peng wrote: This is the v7 of this series which tries to implement the fd-based KVM guest private memory. Here at last are my reluctant thoughts on this patchset. fd-based approach for supporting KVM guest private memory: fine. Use or abuse of memfd and shmem.c: mistaken. memfd_create() was an excellent way to put together the initial prototype. But since then, TDX in particular has forced an effort into preventing (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs. Are any of the shmem.c mods useful to existing users of shmem.c? No. Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No. What use do you have for a filesystem here? Almost none. IIUC, what you want is an fd through which QEMU can allocate kernel memory, selectively free that memory, and communicate fd+offset+length to KVM. And perhaps an interface to initialize a little of that memory from a template (presumably copied from a real file on disk somewhere). You don't need shmem.c or a filesystem for that! If your memory could be swapped, that would be enough of a good reason to make use of shmem.c: but it cannot be swapped; and although there are some references in the mailthreads to it perhaps being swappable in future, I get the impression that will not happen soon if ever. If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated. Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping theoretically possible, but I'm not aware of any plans as of now. [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html This thing? https://cdrdv2.intel.com/v1/dl/getContent/733578 That looks like migration between computers, not between NUMA nodes. Or am I missing something?
Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
On 8/19/22 17:27, Kirill A. Shutemov wrote: On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote: On Thu, 18 Aug 2022, Kirill A . Shutemov wrote: On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote: If your memory could be swapped, that would be enough of a good reason to make use of shmem.c: but it cannot be swapped; and although there are some references in the mailthreads to it perhaps being swappable in future, I get the impression that will not happen soon if ever. If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated. Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping theoretically possible, but I'm not aware of any plans as of now. [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html I always forget, migration means different things to different audiences. As an mm person, I was meaning page migration, whereas a virtualization person thinks VM live migration (which that reference appears to be about), a scheduler person task migration, an ornithologist bird migration, etc. But you're an mm person too: you may have cited that reference in the knowledge that TDX 1.5 Live Migration will entail page migration of the kind I'm thinking of. (Anyway, it's not important to clarify that here.) TDX 1.5 brings both. In TDX speak, mm migration called relocation. See TDH.MEM.PAGE.RELOCATE. This seems to be a pretty bad fit for the way that the core mm migrates pages. The core mm unmaps the page, then moves (in software) the contents to a new address, then faults it in. TDH.MEM.PAGE.RELOCATE doesn't fit into that workflow very well. I'm not saying it can't be done, but it won't just work.
Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
On 8/24/22 02:41, Chao Peng wrote: On Tue, Aug 23, 2022 at 04:05:27PM +, Sean Christopherson wrote: On Tue, Aug 23, 2022, David Hildenbrand wrote: On 19.08.22 05:38, Hugh Dickins wrote: On Fri, 19 Aug 2022, Sean Christopherson wrote: On Thu, Aug 18, 2022, Kirill A . Shutemov wrote: On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote: On Wed, 6 Jul 2022, Chao Peng wrote: But since then, TDX in particular has forced an effort into preventing (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs. Are any of the shmem.c mods useful to existing users of shmem.c? No. Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No. But QEMU and other VMMs are users of shmem and memfd. The new features certainly aren't useful for _all_ existing users, but I don't think it's fair to say that they're not useful for _any_ existing users. Okay, I stand corrected: there exist some users of memfd_create() who will also have use for "INACCESSIBLE" memory. As raised in reply to the relevant patch, I'm not sure if we really have to/want to expose MFD_INACCESSIBLE to user space. I feel like this is a requirement of specific memfd_notifer (memfile_notifier) implementations -- such as TDX that will convert the memory and MCE-kill the machine on ordinary write access. We might be able to set/enforce this when registering a notifier internally instead, and fail notifier registration if a condition isn't met (e.g., existing mmap). So I'd be curious, which other users of shmem/memfd would benefit from (MMU)-"INACCESSIBLE" memory obtained via memfd_create()? I agree that there's no need to expose the inaccessible behavior via uAPI. Making it a kernel-internal thing that's negotiated/resolved when KVM binds to the fd would align INACCESSIBLE with the UNMOVABLE and UNRECLAIMABLE flags (and any other flags that get added in the future). AFAICT, the user-visible flag is a holdover from the early RFCs and doesn't provide any unique functionality. That's also what I'm thinking. And I don't see problem immediately if user has populated the fd at the binding time. Actually that looks an advantage for previously discussed guest payload pre-loading. I think this gets awkward. Trying to define sensible semantics for what happens if a shmem or similar fd gets used as secret guest memory and that fd isn't utterly and completely empty can get quite nasty. For example: If there are already mmaps, then TDX (much more so than SEV) really doesn't want to also use it as guest memory. If there is already data in the fd, then maybe some technologies can use this for pre-population, but TDX needs explicit instructions in order to get the guest's hash right. In general, it seems like it will be much more likely to actually work well if the user (uAPI) is required to declare to the kernel exactly what the fd is for (e.g. TDX secret memory, software-only secret memory, etc) before doing anything at all with it other than binding it to KVM. INACCESSIBLE is a way to achieve this. Maybe it's not the prettiest in the world -- I personally would rather see an explicit request for, say, TDX or SEV memory or maybe the memory that works for a particular KVM instance instead of something generic like INACCESSIBLE, but this is a pretty weak preference. But I think that just starting with a plain memfd is a can of worms.
Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
On Fri, Sep 9, 2022, at 7:32 AM, Kirill A . Shutemov wrote: > On Thu, Sep 08, 2022 at 09:48:35PM -0700, Andy Lutomirski wrote: >> On 8/19/22 17:27, Kirill A. Shutemov wrote: >> > On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote: >> > > On Thu, 18 Aug 2022, Kirill A . Shutemov wrote: >> > > > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote: >> > > > > >> > > > > If your memory could be swapped, that would be enough of a good >> > > > > reason >> > > > > to make use of shmem.c: but it cannot be swapped; and although there >> > > > > are some references in the mailthreads to it perhaps being swappable >> > > > > in future, I get the impression that will not happen soon if ever. >> > > > > >> > > > > If your memory could be migrated, that would be some reason to use >> > > > > filesystem page cache (because page migration happens to understand >> > > > > that type of memory): but it cannot be migrated. >> > > > >> > > > Migration support is in pipeline. It is part of TDX 1.5 [1]. And >> > > > swapping >> > > > theoretically possible, but I'm not aware of any plans as of now. >> > > > >> > > > [1] >> > > > https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html >> > > >> > > I always forget, migration means different things to different audiences. >> > > As an mm person, I was meaning page migration, whereas a virtualization >> > > person thinks VM live migration (which that reference appears to be >> > > about), >> > > a scheduler person task migration, an ornithologist bird migration, etc. >> > > >> > > But you're an mm person too: you may have cited that reference in the >> > > knowledge that TDX 1.5 Live Migration will entail page migration of the >> > > kind I'm thinking of. (Anyway, it's not important to clarify that here.) >> > >> > TDX 1.5 brings both. >> > >> > In TDX speak, mm migration called relocation. See TDH.MEM.PAGE.RELOCATE. >> > >> >> This seems to be a pretty bad fit for the way that the core mm migrates >> pages. The core mm unmaps the page, then moves (in software) the contents >> to a new address, then faults it in. TDH.MEM.PAGE.RELOCATE doesn't fit into >> that workflow very well. I'm not saying it can't be done, but it won't just >> work. > > Hm. From what I see we have all necessary infrastructure in place. > > Unmaping is NOP for inaccessible pages as it is never mapped and we have > mapping->a_ops->migrate_folio() callback that allows to replace software > copying with whatever is needed, like TDH.MEM.PAGE.RELOCATE. > > What do I miss? Hmm, maybe this isn't as bad as I thought. Right now, unless I've missed something, the migration workflow is to unmap (via try_to_migrate) all mappings, then migrate the backing store (with ->migrate_folio(), although it seems like most callers expect the actual copy to happen outside of ->migrate_folio(), and then make new mappings. With the *current* (vma-based, not fd-based) model for KVM memory, this won't work -- we can't unmap before calling TDH.MEM.PAGE.RELOCATE. But maybe it's actually okay with some care or maybe mild modifications with the fd-based model. We don't have any mmaps, per se, to unmap for secret / INACCESSIBLE memory. So maybe we can get all the way to ->migrate_folio() without zapping anything in the secure EPT and just call TDH-MEM.PAGE.RELOCATE from inside migrate_folio(). And there will be nothing to fault back in. From the core code's perspective, it's like migrating a memfd that doesn't happen to have my mappings at the time. --Andy
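If that pans out, the natural hook is the backing store's migrate_folio() address_space operation. A heavily simplified sketch follows; tdx_relocate_private_page() is a hypothetical wrapper for TDH.MEM.PAGE.RELOCATE, and error handling plus the secure-EPT bookkeeping are hand-waved.

int tdx_relocate_private_page(unsigned long dst_pfn, unsigned long src_pfn);	/* hypothetical */

/* Sketch: aops->migrate_folio for userspace-inaccessible guest memory. */
static int guest_mem_migrate_folio(struct address_space *mapping,
				   struct folio *dst, struct folio *src,
				   enum migrate_mode mode)
{
	int ret;

	/* There are no userspace PTEs to unmap for inaccessible memory. */
	ret = folio_migrate_mapping(mapping, dst, src, 0);
	if (ret != MIGRATEPAGE_SUCCESS)
		return ret;

	/* Hardware-assisted copy instead of the usual folio_migrate_copy(). */
	if (tdx_relocate_private_page(folio_pfn(dst), folio_pfn(src)))
		return -EAGAIN;

	folio_migrate_flags(dst, src);
	return MIGRATEPAGE_SUCCESS;
}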
Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
(please excuse any formatting disasters. my internet went out as I was composing this, and i did my best to rescue it.) On Mon, Sep 19, 2022, at 12:10 PM, Sean Christopherson wrote: > +Will, Marc and Fuad (apologies if I missed other pKVM folks) > > On Mon, Sep 19, 2022, David Hildenbrand wrote: >> On 15.09.22 16:29, Chao Peng wrote: >> > From: "Kirill A. Shutemov" >> > >> > KVM can use memfd-provided memory for guest memory. For normal userspace >> > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its >> > virtual address space and then tells KVM to use the virtual address to >> > setup the mapping in the secondary page table (e.g. EPT). >> > >> > With confidential computing technologies like Intel TDX, the >> > memfd-provided memory may be encrypted with special key for special >> > software domain (e.g. KVM guest) and is not expected to be directly >> > accessed by userspace. Precisely, userspace access to such encrypted >> > memory may lead to host crash so it should be prevented. >> >> Initially my thaught was that this whole inaccessible thing is TDX specific >> and there is no need to force that on other mechanisms. That's why I >> suggested to not expose this to user space but handle the notifier >> requirements internally. >> >> IIUC now, protected KVM has similar demands. Either access (read/write) of >> guest RAM would result in a fault and possibly crash the hypervisor (at >> least not the whole machine IIUC). > > Yep. The missing piece for pKVM is the ability to convert from shared > to private > while preserving the contents, e.g. to hand off a large buffer > (hundreds of MiB) > for processing in the protected VM. Thoughts on this at the bottom. > >> > This patch introduces userspace inaccessible memfd (created with >> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through >> > ordinary MMU access (e.g. read/write/mmap) but can be accessed via >> > in-kernel interface so KVM can directly interact with core-mm without >> > the need to map the memory into KVM userspace. >> >> With secretmem we decided to not add such "concept switch" flags and instead >> use a dedicated syscall. >> > > I have no personal preference whatsoever between a flag and a dedicated > syscall, > but a dedicated syscall does seem like it would give the kernel a bit more > flexibility. The third option is a device node, e.g. /dev/kvm_secretmem or /dev/kvm_tdxmem or similar. But if we need flags or other details in the future, maybe this isn't ideal. > >> What about memfd_inaccessible()? Especially, sealing and hugetlb are not >> even supported and it might take a while to support either. > > Don't know about sealing, but hugetlb support for "inaccessible" memory > needs to > come sooner than later. "inaccessible" in quotes because we might want > to choose > a less binary name, e.g. "restricted"?. > > Regarding pKVM's use case, with the shim approach I believe this can be done > by > allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions > piled on top. > > My first thought was to make the uAPI a set of KVM ioctls so that KVM > could tightly > tightly control usage without taking on too much complexity in the > kernel, but > working through things, routing the behavior through the shim itself > might not be > all that horrific. > > IIRC, we discarded the idea of allowing userspace to map the "private" > fd because > things got too complex, but with the shim it doesn't seem _that_ bad. What's the exact use case? Is it just to pre-populate the memory? 
> > E.g. on the memfd side: > > 1. The entire memfd must be mapped, and at most one mapping is allowed, i.e. > mapping is all or nothing. > > 2. Acquiring a reference via get_pfn() is disallowed if there's a mapping > for > the restricted memfd. > > 3. Add notifier hooks to allow downstream users to further restrict things. > > 4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything > in > one shot. > > 5. Require that there are no outstanding references at munmap(). Or if this > can't be guaranteed by userspace, maybe add some way for userspace to > wait > until it's ok to convert to private? E.g. so that get_pfn() doesn't need > to do an expensive check every time. Hmm. I haven't looked at the code to see if this would really work, but I think this could be done more in line with how the rest of the kernel works by using the rmap infrastructure. When the pKVM memfd is in not-yet-private mode, just let it be mmapped as usual (but don't allow any form of GUP or pinning). Then have an ioctl to switch to to shared mode that takes locks or sets flags so that no new faults can be serviced and does unmap_mapping_range. As long as the shim arranges to have its own vm_ops, I don't immediately see any reason this can't work.
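A rough sketch of that switch-to-private step, for concreteness. The flag, accessor, and function names are invented; the one real ingredient is unmap_mapping_range(), which tears down every existing userspace PTE for the file so that the (now restricted) fault handler is the only way back in.

/* Sketch: flip a pKVM-style restricted memfd from "populate" mode to
 * "private" mode. Illustrative only; GUEST_MEM_PRIVATE and GUEST_MEM_I()
 * are invented for this example. */
static long guest_mem_make_private(struct file *file)
{
	struct inode *inode = file_inode(file);
	struct address_space *mapping = inode->i_mapping;

	/* Invented flag: once set, the shim's vm_ops->fault handler refuses
	 * to map pages, so nothing can be faulted back in. */
	set_bit(GUEST_MEM_PRIVATE, &GUEST_MEM_I(inode)->flags);

	/* Zap every existing PTE for this file (holelen == 0 means "to the
	 * end of the file"); include private COW copies as well. */
	unmap_mapping_range(mapping, 0, 0, 1);

	return 0;
}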
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Fri, Apr 22, 2022, at 3:56 AM, Chao Peng wrote:
> On Tue, Apr 05, 2022 at 06:03:21PM +0000, Sean Christopherson wrote:
>> On Tue, Apr 05, 2022, Quentin Perret wrote:
>> > On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
>
> Only when the register succeeds, the fd is converted into a private fd, before that, the fd is just a normal (shared) one. During this conversion, the previous data is preserved so you can put some initial data in guest pages (whether the architecture allows this is architecture-specific and out of the scope of this patch).

I think this can be made to work, but it will be awkward. On TDX, for example, what exactly are the semantics supposed to be? An error code if the memory isn't all zero? An error code if it has ever been written?

Fundamentally, I think this is because your proposed lifecycle for these memfiles results in a lightweight API but is awkward for the intended use cases. You're proposing, roughly:

1. Create a memfile. Now it's in a shared state with an unknown virt technology. It can be read and written. Let's call this state BRAND_NEW.

2. Bind to a VM. Now it's in a bound state. For TDX, for example, let's call the new state BOUND_TDX. In this state, the TDX rules are followed (private memory can't be converted, etc).

The problem here is that the BRAND_NEW state allows things that are nonsensical in TDX, and the binding step needs to invent some kind of semantics for what happens when binding a nonempty memfile.

So I would propose a somewhat different order:

1. Create a memfile. It's in the UNBOUND state and no operations whatsoever are allowed except binding or closing.

2. Bind the memfile to a VM (or at least to a VM technology). Now it's in the initial state appropriate for that VM.

For TDX, this completely bypasses the cases where the data is prepopulated and TDX can't handle it cleanly. For SEV, it bypasses a situation in which data might be written to the memory before we find out whether that data will be unreclaimable or unmovable.

--

Now I have a question, since I don't think anyone has really answered it: how does this all work with SEV- or pKVM-like technologies in which private and shared pages share the same address space? It sounds like you're proposing to have a big memfile that contains private and shared pages and to use that same memfile as pages are converted back and forth. IO and even real physical DMA could be done on that memfile. Am I understanding correctly?

If so, I think this makes sense, but I'm wondering if the actual memslot setup should be different. For TDX, private memory lives in a logically separate memslot space. For SEV and pKVM, it doesn't. I assume the API can reflect this straightforwardly.

And the corresponding TDX question: is the intent still that shared pages aren't allowed at all in a TDX memfile? If so, that would be the most direct mapping to what the hardware actually does.

--Andy
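The proposed lifecycle can be summarized as a small state machine. The sketch below is purely illustrative: struct memfile and the state names beyond UNBOUND/BOUND_TDX are invented here and are not an API proposal from the thread.

/* Sketch of the proposed memfile lifecycle (illustrative names only). */
enum memfile_state {
	MEMFILE_UNBOUND,	/* freshly created: only bind or close allowed */
	MEMFILE_BOUND_TDX,	/* bound to a TDX VM: TDX conversion rules apply */
	MEMFILE_BOUND_SEV,	/* bound to an SEV VM */
	MEMFILE_BOUND_SW,	/* bound to a software-only protected VM */
};

struct memfile {
	enum memfile_state state;
	/* backing pages, bindings, etc. omitted */
};

/* The bind step picks the initial technology-specific state. Because the
 * file was UNBOUND (empty, never readable or writable), there is no
 * "what does pre-populated data mean?" question to answer. */
static int memfile_bind(struct memfile *mf, enum memfile_state backend)
{
	if (mf->state != MEMFILE_UNBOUND)
		return -EBUSY;
	mf->state = backend;
	return 0;
}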
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote: > On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote: >> >> >> 2. Bind the memfile to a VM (or at least to a VM technology). Now it's in >> the initial state appropriate for that VM. >> >> For TDX, this completely bypasses the cases where the data is prepopulated >> and TDX can't handle it cleanly. For SEV, it bypasses a situation in which >> data might be written to the memory before we find out whether that data >> will be unreclaimable or unmovable. > > This sounds a more strict rule to avoid semantics unclear. > > So userspace needs to know what excatly happens for a 'bind' operation. > This is different when binds to different technologies. E.g. for SEV, it > may imply after this call, the memfile can be accessed (through mmap or > what ever) from userspace, while for current TDX this should be not allowed. I think this is actually a good thing. While SEV, TDX, pKVM, etc achieve similar goals and have broadly similar ways of achieving them, they really are different, and having userspace be aware of the differences seems okay to me. (Although I don't think that allowing userspace to mmap SEV shared pages is particularly wise -- it will result in faults or cache incoherence depending on the variant of SEV in use.) > > And I feel we still need a third flow/operation to indicate the > completion of the initialization on the memfile before the guest's > first-time launch. SEV needs to check previous mmap-ed areas are munmap-ed > and prevent future userspace access. After this point, then the memfile > becomes truely private fd. Even that is technology-dependent. For TDX, this operation doesn't really exist. For SEV, I'm not sure (I haven't read the specs in nearly enough detail). For pKVM, I guess it does exist and isn't quite the same as a shared->private conversion. Maybe this could be generalized a bit as an operation "measure and make private" that would be supported by the technologies for which it's useful. > >> >> >> -- >> >> Now I have a question, since I don't think anyone has really answered it: >> how does this all work with SEV- or pKVM-like technologies in which private >> and shared pages share the same address space? I sounds like you're >> proposing to have a big memfile that contains private and shared pages and >> to use that same memfile as pages are converted back and forth. IO and even >> real physical DMA could be done on that memfile. Am I understanding >> correctly? > > For TDX case, and probably SEV as well, this memfile contains private memory > only. But this design at least makes it possible for usage cases like > pKVM which wants both private/shared memory in the same memfile and rely > on other ways like mmap/munmap or mprotect to toggle private/shared instead > of fallocate/hole punching. Hmm. Then we still need some way to get KVM to generate the correct SEV pagetables. For TDX, there are private memslots and shared memslots, and they can overlap. If they overlap and both contain valid pages at the same address, then the results may not be what the guest-side ABI expects, but everything will work. So, when a single logical guest page transitions between shared and private, no change to the memslots is needed. For SEV, this is not the case: everything is in one set of pagetables, and there isn't a natural way to resolve overlaps. If the memslot code becomes efficient enough, then the memslots could be fragmented. Or the memfile could support private and shared data in the same memslot. 
And if pKVM does this, I don't see why SEV couldn't also do it and hopefully reuse the same code. > >> >> If so, I think this makes sense, but I'm wondering if the actual memslot >> setup should be different. For TDX, private memory lives in a logically >> separate memslot space. For SEV and pKVM, it doesn't. I assume the API can >> reflect this straightforwardly. > > I believe so. The flow should be similar but we do need pass different > flags during the 'bind' to the backing store for different usages. That > should be some new flags for pKVM but the callbacks (API here) between > memfile_notifile and its consumers can be reused. And also some different flag in the operation that installs the fd as a memslot? > >> >> And the corresponding TDX question: is the intent still that shared pages >> aren't allowed at all in a TDX memfile? If so, that would be the most >> direct mapping to what the hardware actually does. > > Exactly. TDX
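[Editorial note] To keep the lifecycle debated in this thread straight, here is a minimal userspace sketch. Only step 1 uses an existing interface (memfd_create()); the "bind" and "measure and make private" steps appear only as comments, because no upstream API for them has been settled yet.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* Step 1: create the memfile that will back guest private memory. */
int make_guest_memfile(size_t size)
{
    int fd = memfd_create("guest-private-mem", MFD_CLOEXEC);
    if (fd < 0 || ftruncate(fd, size) < 0)
        return -1;

    /*
     * Step 2 (no upstream interface yet): bind the memfile to a VM
     * technology (TDX, SEV, pKVM).  Whether userspace may still mmap()
     * the fd afterwards is technology-dependent, as discussed above.
     *
     * Step 3 (technology-dependent, no upstream interface yet):
     * "measure and make private" before the guest's first launch; for
     * SEV-like flows this is where any remaining userspace mappings
     * would have to be gone.  After this point the fd is truly private.
     */
    return fd;
}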
Re: [RFC v2 PATCH 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On 11/19/21 05:47, Chao Peng wrote: > This RFC series tries to implement the fd-based KVM guest private memory proposal described at [1] and an improved 'New Proposal' described at [2]. I generally like this. Thanks!
Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST
On 11/19/21 05:47, Chao Peng wrote: From: "Kirill A. Shutemov" The new seal type provides semantics required for KVM guest private memory support. A file descriptor with the seal set is going to be used as source of guest memory in confidential computing environments such as Intel TDX and AMD SEV. F_SEAL_GUEST can only be set on empty memfd. After the seal is set userspace cannot read, write or mmap the memfd. I don't have a strong objection here, but, given that you're only supporting it for memfd, would a memfd_create() flag be more straightforward? If nothing else, it would avoid any possible locking issue. I'm also very very slightly nervous about a situation in which one program sends a memfd to an untrusted other process and that process truncates the memfd and then F_SEAL_GUESTs it. This could be mostly mitigated by also requiring that no other seals be set when F_SEAL_GUEST happens, but the alternative MFD_GUEST would eliminate this issue too.
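[Editorial note] To make the comparison concrete, the two shapes of the interface are sketched below. F_SEAL_GUEST is the seal proposed in this patch and MFD_GUEST is the alternative flag floated in this reply; neither exists in mainline, so the numeric values are placeholders only so the sketch compiles.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>

#ifndef F_SEAL_GUEST
#define F_SEAL_GUEST 0x0020   /* placeholder value; seal proposed by this series */
#endif
#ifndef MFD_GUEST
#define MFD_GUEST    0x0010   /* placeholder value; suggested alternative flag */
#endif

/* As proposed: seal an existing, still-empty memfd. */
int guest_memfd_via_seal(void)
{
    int fd = memfd_create("kvm-guest-mem", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0 || fcntl(fd, F_ADD_SEALS, F_SEAL_GUEST) < 0)
        return -1;
    return fd;
}

/* As suggested above: request the property at creation time, which closes
 * the window in which another holder of the fd could truncate it or add
 * other seals first. */
int guest_memfd_via_flag(void)
{
    return memfd_create("kvm-guest-mem", MFD_CLOEXEC | MFD_GUEST);
}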
RE: [PATCH] hw/vhost-user-blk: turn on VIRTIO_BLK_F_SIZE_MAX feature for virtio blk device
Hi Raphael, Thanks for your reply. I will fix the grammar mistake in V2. -Original Message- From: Raphael Norwitz Sent: Tuesday, November 30, 2021 5:58 AM To: Pei, Andy Cc: qemu-devel@nongnu.org; qemu-bl...@nongnu.org; Liu, Changpeng ; Raphael Norwitz ; m...@redhat.com; kw...@redhat.com; mre...@redhat.com Subject: Re: [PATCH] hw/vhost-user-blk: turn on VIRTIO_BLK_F_SIZE_MAX feature for virtio blk device Just a commit message nit. Otherwise I'm happy with this. OFC should not be queued for 6.2. On Fri, Nov 26, 2021 at 10:00:18AM +0800, Andy Pei wrote: > Turn on pre-defined feature VIRTIO_BLK_F_SIZE_MAX virtio blk device to > avoid guest DMA request size is too large to exceed hardware spec. Grammar here. Should be something like "...DMA request sizes which are too large for the hardware spec". > > Signed-off-by: Andy Pei Acked-by: Raphael Norwitz > --- > hw/block/vhost-user-blk.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c > index ba13cb8..eb1264a 100644 > --- a/hw/block/vhost-user-blk.c > +++ b/hw/block/vhost-user-blk.c > @@ -252,6 +252,7 @@ static uint64_t vhost_user_blk_get_features(VirtIODevice > *vdev, > VHostUserBlk *s = VHOST_USER_BLK(vdev); > > /* Turn on pre-defined features */ > +virtio_add_feature(&features, VIRTIO_BLK_F_SIZE_MAX); > virtio_add_feature(&features, VIRTIO_BLK_F_SEG_MAX); > virtio_add_feature(&features, VIRTIO_BLK_F_GEOMETRY); > virtio_add_feature(&features, VIRTIO_BLK_F_TOPOLOGY); > -- > 1.8.3.1 >
[PATCH v2] hw/vhost-user-blk: turn on VIRTIO_BLK_F_SIZE_MAX feature for virtio blk device
Turn on pre-defined feature VIRTIO_BLK_F_SIZE_MAX for virtio blk device to avoid guest DMA request sizes which are too large for hardware spec. Signed-off-by: Andy Pei --- hw/block/vhost-user-blk.c | 1 + 1 file changed, 1 insertion(+) diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c index ba13cb8..eb1264a 100644 --- a/hw/block/vhost-user-blk.c +++ b/hw/block/vhost-user-blk.c @@ -252,6 +252,7 @@ static uint64_t vhost_user_blk_get_features(VirtIODevice *vdev, VHostUserBlk *s = VHOST_USER_BLK(vdev); /* Turn on pre-defined features */ +virtio_add_feature(&features, VIRTIO_BLK_F_SIZE_MAX); virtio_add_feature(&features, VIRTIO_BLK_F_SEG_MAX); virtio_add_feature(&features, VIRTIO_BLK_F_GEOMETRY); virtio_add_feature(&features, VIRTIO_BLK_F_TOPOLOGY); -- 1.8.3.1
Re: [Qemu-devel] [PATCH RFC] tcmu: Introduce qemu-tcmu
On 10/20/2016 07:30 AM, Fam Zheng wrote: On Thu, 10/20 15:08, Stefan Hajnoczi wrote: If a corrupt image is able to execute arbitrary code in the qemu-tcmu process, does /dev/uio0 or the tcmu shared memory interface allow get root or kernel privileges? I haven't audited the code, but target_core_user.ko should contain the access to /dev/uioX and make sure there is no security risk regarding buggy or malicious handlers. Otherwise it's a bug that should be fixed. Andy can correct me if I'm wrong. Yes... well, TCMU ensures that a bad handler can't scribble to kernel memory outside the shared memory area. UIO devices are basically a "device drivers in userspace" kind of API so they require root to use. I seem to remember somebody mentioning ways this might work for less-privileged handlers (fd-passing??) but no way to do this exists just yet. Regards -- Andy
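[Editorial note] For readers unfamiliar with UIO, the userspace side of a handler follows the generic pattern below (root required, as noted above). The real map size comes from /sys/class/uio/uio0/maps/map0/size, and for TCMU the mapping holds the mailbox, command ring and data area; the 4-byte write used as a completion doorbell is how I understand the tcmu interface, so treat the details as a sketch rather than a reference.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/uio0", O_RDWR);
    if (fd < 0)
        return 1;

    /* Map the shared memory region exported by the kernel driver. */
    size_t map_size = 4096;   /* illustrative; read the real size from sysfs */
    void *shm = mmap(NULL, map_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (shm == MAP_FAILED)
        return 1;

    /* read() blocks until the kernel signals new work; the value returned
     * is an event count. */
    uint32_t events;
    if (read(fd, &events, sizeof(events)) != sizeof(events))
        return 1;
    printf("events so far: %u\n", events);

    /* ...process commands found in 'shm'..., then kick the kernel; for
     * TCMU this is a 4-byte write back to the uio fd. */
    uint32_t kick = 1;
    write(fd, &kick, sizeof(kick));
    return 0;
}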
Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation
> On Dec 27, 2018, at 10:18 AM, Florian Weimer wrote: > > We have a bit of an interesting problem with respect to the d_off > field in struct dirent. > > When running a 64-bit kernel on certain file systems, notably ext4, > this field uses the full 63 bits even for small directories (strace -v > output, wrapped here for readability): > > getdents(3, [ > {d_ino=1494304, d_off=3901177228673045825, d_reclen=40, > d_name="authorized_keys", d_type=DT_REG}, > {d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".", > d_type=DT_DIR}, > {d_ino=1314655, d_off=9223372036854775807, d_reclen=24, d_name="..", > d_type=DT_DIR} > ], 32768) = 88 > > When running in 32-bit compat mode, this value is somehow truncated to > 31 bits, for both the getdents and the getdents64 (!) system call (at > least on i386). I imagine you’re encountering this bug: https://lkml.org/lkml/2018/10/18/859 Presumably the right fix involves modifying the relevant VFS file operations to indicate the relevant ABI to the implementations. I would guess that 9p is triggering the “not really in the syscall you think you’re in” issue.
Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation
[sending again, slightly edited, due to email client issues] On Thu, Dec 27, 2018 at 9:25 AM Florian Weimer wrote: > > We have a bit of an interesting problem with respect to the d_off > field in struct dirent. > > When running a 64-bit kernel on certain file systems, notably ext4, > this field uses the full 63 bits even for small directories (strace -v > output, wrapped here for readability): > > getdents(3, [ > {d_ino=1494304, d_off=3901177228673045825, d_reclen=40, > d_name="authorized_keys", d_type=DT_REG}, > {d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".", > d_type=DT_DIR}, > {d_ino=1314655, d_off=9223372036854775807, d_reclen=24, d_name="..", > d_type=DT_DIR} > ], 32768) = 88 > > When running in 32-bit compat mode, this value is somehow truncated to > 31 bits, for both the getdents and the getdents64 (!) system call (at > least on i386). > ... > > However, both qemu-user and the 9p file system can run in such a way > that the kernel is entered from a 64-bit process, but the actual usage > is from a 32-bit process: I imagine that at least some of the problems you're seeing are due to this bug: https://lkml.org/lkml/2018/10/18/859 Presumably the right fix involves modifying the relevant VFS file operations to indicate the relevant ABI to the implementations. I would guess that 9p is triggering the “not really in the syscall you think you’re in” issue.
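[Editorial note] A small self-contained program makes the width problem visible; run it on ext4 and the printed d_off values use most of the 63-bit range, which is exactly what cannot be narrowed to 31 bits for a 32-bit target without loss. struct linux_dirent64 is declared locally because the kernel does not export it to userspace headers.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[];
};

int main(void)
{
    static char buf[32768];                 /* static => suitably aligned */
    int fd = open(".", O_RDONLY | O_DIRECTORY);
    long n = syscall(SYS_getdents64, fd, buf, sizeof(buf));

    for (long pos = 0; pos < n; ) {
        struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + pos);
        /* On ext4 these are hash-based offsets spanning most of 63 bits,
         * which is what cannot be squeezed into 31 bits losslessly. */
        printf("%-20s d_off=%lld\n", d->d_name, (long long)d->d_off);
        pos += d->d_reclen;
    }
    return 0;
}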
Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation
> On Dec 28, 2018, at 6:54 PM, Matthew Wilcox wrote: > >> On Sat, Dec 29, 2018 at 12:12:27AM +, Peter Maydell wrote: >> On Fri, 28 Dec 2018 at 23:16, Andreas Dilger wrot >>> On Dec 28, 2018, at 4:18 AM, Peter Maydell wrote: The problem is that there is no 32-bit API in some cases (unless I have misunderstood the kernel code) -- not all host architectures implement compat syscalls or allow them to be called from 64-bit processes or implement all the older syscall variants that had smaller offets. If there was a guaranteed "this syscall always exists and always gives me 32-bit offsets" we could use it. >>> >>> The "32bitapi" mount option would use 32-bit hash for seekdir >>> and telldir, regardless of what kernel API was used. That would >>> just set the FMODE_32BITHASH flag in the file->f_mode for all files. >> >> A mount option wouldn't be much use to QEMU -- we can't tell >> our users how to mount their filesystems, which they're >> often doing lots of other things with besides running QEMU. >> (Otherwise we could just tell them "don't use ext4", which >> would also solve the problem :-)) We need something we can >> use at the individual-syscall level. > > Could you use a prctl to set whether you were running in 32 or 64 bit > mode? Or do you change which kind of task you're emulating too often > to make this a good idea? How would this work? We already have the separate COMPAT_DEFINE_SYSCALL entries *and* in_compat_syscall(). Now we’d have a third degree of freedom. Either the arches people care about should add reasonable ways to issue 32-bit syscalls from 64-bit mode or there should be an explicit way to ask for the 32-bit directory offsets.
Re: [Qemu-devel] [PATCH] headers: fix linux/mod_devicetable.h inclusions
On Mon, Jul 9, 2018 at 6:19 PM, Arnd Bergmann wrote: > A couple of drivers produced build errors after the mod_devicetable.h > header was split out from the platform_device one, e.g. > > drivers/media/platform/davinci/vpbe_osd.c:42:40: error: array type has > incomplete element type 'struct platform_device_id' > drivers/media/platform/davinci/vpbe_venc.c:42:40: error: array type has > incomplete element type 'struct platform_device_id' > > This adds the inclusion where needed. > > Fixes: ac3167257b9f ("headers: separate linux/mod_devicetable.h from > linux/platform_device.h") > Signed-off-by: Arnd Bergmann > drivers/platform/x86/intel_punit_ipc.c | 1 + > --- a/drivers/platform/x86/intel_punit_ipc.c > +++ b/drivers/platform/x86/intel_punit_ipc.c > @@ -12,6 +12,7 @@ > */ > > #include > +#include > #include > #include > #include Acked-by: Andy Shevchenko for the above bits. -- With Best Regards, Andy Shevchenko
Re: [Qemu-devel] [PATCH v21 1/5] xbitmap: Introduce xbitmap
On Fri, Feb 16, 2018 at 8:30 PM, Matthew Wilcox wrote: > On Fri, Feb 16, 2018 at 07:44:50PM +0200, Andy Shevchenko wrote: >> On Tue, Jan 9, 2018 at 1:10 PM, Wei Wang wrote: >> > From: Matthew Wilcox >> > >> > The eXtensible Bitmap is a sparse bitmap representation which is >> > efficient for set bits which tend to cluster. It supports up to >> > 'unsigned long' worth of bits. >> >> > lib/xbitmap.c| 444 >> > +++ >> >> Please, split tests to a separate module. > > Hah, I just did this two days ago! I didn't publish it yet, but I also made > it compile both in userspace and as a kernel module. > > It's the top two commits here: > > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2018-02-12 > Thanks! > Note this is a complete rewrite compared to the version presented here; it > sits on top of the XArray and no longer has a preload interface. It has a > superset of the IDA functionality. Noted. Now, the question about test case. Why do you heavily use BUG_ON? Isn't resulting statistics enough? See how other lib/test_* modules do. -- With Best Regards, Andy Shevchenko
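[Editorial note] For reference, the pattern being pointed at here -- count failures and report a summary, instead of BUG_ON() killing the box on the first mismatch -- looks roughly like the skeleton below in existing lib/test_*.c modules. This is an illustrative skeleton, not the actual xbitmap test code.

#include <linux/errno.h>
#include <linux/module.h>
#include <linux/printk.h>

static unsigned int tests_run, tests_failed;

#define check(cond)                                                    \
    do {                                                               \
        tests_run++;                                                   \
        if (!(cond)) {                                                 \
            tests_failed++;                                            \
            pr_err("FAIL %s:%d: %s\n", __FILE__, __LINE__, #cond);     \
        }                                                              \
    } while (0)

static int __init xb_selftest_init(void)
{
    /* Real tests would exercise the xbitmap API here. */
    check(1 + 1 == 2);

    pr_info("xb selftest: %u tests, %u failures\n", tests_run, tests_failed);
    return tests_failed ? -EINVAL : 0;
}

static void __exit xb_selftest_exit(void)
{
}

module_init(xb_selftest_init);
module_exit(xb_selftest_exit);
MODULE_LICENSE("GPL");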
Re: [Qemu-devel] [PATCH v21 1/5] xbitmap: Introduce xbitmap
On Tue, Jan 9, 2018 at 1:10 PM, Wei Wang wrote: > From: Matthew Wilcox > > The eXtensible Bitmap is a sparse bitmap representation which is > efficient for set bits which tend to cluster. It supports up to > 'unsigned long' worth of bits. > lib/xbitmap.c| 444 > +++ Please, split tests to a separate module. -- With Best Regards, Andy Shevchenko
[Bug 1915063] Re: Windows 10 wil not install using qemu-system-x86_64
The commit in question is marked for stable: commit 841c2be09fe4f495fe5224952a419bd8c7e5b455 Author: Maxim Levitsky Date: Wed Jul 8 14:57:31 2020 +0300 kvm: x86: replace kvm_spec_ctrl_test_value with runtime test on the host To avoid complex and in some cases incorrect logic in kvm_spec_ctrl_test_value, just try the guest's given value on the host processor instead, and if it doesn't #GP, allow the guest to set it. One such case is when host CPU supports STIBP mitigation but doesn't support IBRS (as is the case with some Zen2 AMD cpus), and in this case we were giving guest #GP when it tried to use STIBP The reason why can can do the host test is that IA32_SPEC_CTRL msr is passed to the guest, after the guest sets it to a non zero value for the first time (due to performance reasons), and as as result of this, it is pointless to emulate #GP condition on this first access, in a different way than what the host CPU does. This is based on a patch from Sean Christopherson, who suggested this idea. Fixes: 6441fa6178f5 ("KVM: x86: avoid incorrect writes to host MSR_IA32_SPEC_CTRL") Cc: sta...@vger.kernel.org Suggested-by: Sean Christopherson Signed-off-by: Maxim Levitsky Message-Id: <20200708115731.180097-1-mlevi...@redhat.com> Signed-off-by: Paolo Bonzini It appears to be in `v5.4.102` which is currently queued up for the cycle following the one just starting. -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1915063 Title: Windows 10 wil not install using qemu-system-x86_64 Status in QEMU: New Status in linux package in Ubuntu: Confirmed Status in linux-oem-5.10 package in Ubuntu: Fix Released Status in linux-oem-5.6 package in Ubuntu: Confirmed Status in qemu package in Ubuntu: Invalid Bug description: Steps to reproduce install virt-manager and ovmf if nopt already there copy windows and virtio iso files to /var/lib/libvirt/images Use virt-manager from local machine to create your VMs with the disk, CPUs and memory required Select customize configuration then select OVMF(UEFI) instead of seabios set first CDROM to the windows installation iso (enable in boot options) add a second CDROM and load with the virtio iso change spice display to VNC Always get a security error from windows and it fails to launch the installer (works on RHEL and Fedora) I tried updating the qemu version from Focals 4.2 to Groovy 5.0 which was of no help --- ProblemType: Bug ApportVersion: 2.20.11-0ubuntu27.14 Architecture: amd64 CasperMD5CheckResult: skip CurrentDesktop: ubuntu:GNOME DistributionChannelDescriptor: # This is the distribution channel descriptor for the OEM CDs # For more information see http://wiki.ubuntu.com/DistributionChannelDescriptor canonical-oem-sutton-focal-amd64-20201030-422+pc-sutton-bachman-focal-amd64+X00 DistroRelease: Ubuntu 20.04 InstallationDate: Installed on 2021-01-20 (19 days ago) InstallationMedia: Ubuntu 20.04 "Focal" - Build amd64 LIVE Binary 20201030-14:39 MachineType: LENOVO 30E102Z NonfreeKernelModules: nvidia_modeset nvidia Package: linux (not installed) ProcEnviron: TERM=xterm-256color PATH=(custom, no user) XDG_RUNTIME_DIR= LANG=en_US.UTF-8 SHELL=/bin/bash ProcFB: 0 EFI VGA ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.6.0-1042-oem root=UUID=389cd165-fc52-4814-b837-a1090b9c2387 ro locale=en_US quiet splash vt.handoff=7 ProcVersionSignature: Ubuntu 5.6.0-1042.46-oem 5.6.19 RelatedPackageVersions: linux-restricted-modules-5.6.0-1042-oem N/A linux-backports-modules-5.6.0-1042-oem N/A 
linux-firmware 1.187.8 RfKill: Tags: focal Uname: Linux 5.6.0-1042-oem x86_64 UpgradeStatus: No upgrade log present (probably fresh install) UserGroups: adm cdrom dip docker kvm libvirt lpadmin plugdev sambashare sudo _MarkForUpload: True dmi.bios.date: 07/29/2020 dmi.bios.vendor: LENOVO dmi.bios.version: S07KT08A dmi.board.name: 1046 dmi.board.vendor: LENOVO dmi.board.version: Not Defined dmi.chassis.type: 3 dmi.chassis.vendor: LENOVO dmi.chassis.version: None dmi.modalias: dmi:bvnLENOVO:bvrS07KT08A:bd07/29/2020:svnLENOVO:pn30E102Z:pvrThinkStationP620:rvnLENOVO:rn1046:rvrNotDefined:cvnLENOVO:ct3:cvrNone: dmi.product.family: INVALID dmi.product.name: 30E102Z dmi.product.sku: LENOVO_MT_30E1_BU_Think_FM_ThinkStation P620 dmi.product.version: ThinkStation P620 dmi.sys.vendor: LENOVO To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1915063/+subscriptions
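[Editorial note] As background on the backported commit quoted in the comment above: the "runtime test on the host" it describes boils down to probing MSR_IA32_SPEC_CTRL with the guest's requested value and rejecting it only if the write actually faults. A rough sketch of that idea (paraphrased, not the exact upstream function) is:

#include <linux/irqflags.h>
#include <linux/types.h>
#include <asm/msr.h>
#include <asm/msr-index.h>

/* rdmsrl_safe()/wrmsrl_safe() return non-zero if the access faults (#GP),
 * which is precisely the condition being tested. */
static int spec_ctrl_test_value(u64 value)
{
    u64 saved;
    unsigned long flags;
    int faulted = 0;

    local_irq_save(flags);

    if (rdmsrl_safe(MSR_IA32_SPEC_CTRL, &saved))
        faulted = 1;
    else if (wrmsrl_safe(MSR_IA32_SPEC_CTRL, value))
        faulted = 1;
    else
        wrmsrl(MSR_IA32_SPEC_CTRL, saved);   /* restore the host value */

    local_irq_restore(flags);
    return faulted;
}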
Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver
On Fri, Oct 16, 2020 at 6:40 PM Jann Horn wrote: > > [adding some more people who are interested in RNG stuff: Andy, Jason, > Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this > concerns some pretty fundamental API stuff related to RNG usage] > > On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin > wrote: > > - Background > > > > The VM Generation ID is a feature defined by Microsoft (paper: > > http://go.microsoft.com/fwlink/?LinkId=260709) and supported by > > multiple hypervisor vendors. > > > > The feature is required in virtualized environments by apps that work > > with local copies/caches of world-unique data such as random values, > > uuids, monotonically increasing counters, etc. > > Such apps can be negatively affected by VM snapshotting when the VM > > is either cloned or returned to an earlier point in time. > > > > The VM Generation ID is a simple concept meant to alleviate the issue > > by providing a unique ID that changes each time the VM is restored > > from a snapshot. The hw provided UUID value can be used to > > differentiate between VMs or different generations of the same VM. > > > > - Problem > > > > The VM Generation ID is exposed through an ACPI device by multiple > > hypervisor vendors but neither the vendors or upstream Linux have no > > default driver for it leaving users to fend for themselves. > > > > Furthermore, simply finding out about a VM generation change is only > > the starting point of a process to renew internal states of possibly > > multiple applications across the system. This process could benefit > > from a driver that provides an interface through which orchestration > > can be easily done. > > > > - Solution > > > > This patch is a driver which exposes the Virtual Machine Generation ID > > via a char-dev FS interface that provides ID update sync and async > > notification, retrieval and confirmation mechanisms: > > > > When the device is 'open()'ed a copy of the current vm UUID is > > associated with the file handle. 'read()' operations block until the > > associated UUID is no longer up to date - until HW vm gen id changes - > > at which point the new UUID is provided/returned. Nonblocking 'read()' > > uses EWOULDBLOCK to signal that there is no _new_ UUID available. > > > > 'poll()' is implemented to allow polling for UUID updates. Such > > updates result in 'EPOLLIN' events. > > > > Subsequent read()s following a UUID update no longer block, but return > > the updated UUID. The application needs to acknowledge the UUID update > > by confirming it through a 'write()'. > > Only on writing back to the driver the right/latest UUID, will the > > driver mark this "watcher" as up to date and remove EPOLLIN status. > > > > 'mmap()' support allows mapping a single read-only shared page which > > will always contain the latest UUID value at offset 0. > > It would be nicer if that page just contained an incrementing counter, > instead of a UUID. It's not like the application cares *what* the UUID > changed to, just that it *did* change and all RNGs state now needs to > be reseeded from the kernel, right? And an application can't reliably > read the entire UUID from the memory mapping anyway, because the VM > might be forked in the middle. > > So I think your kernel driver should detect UUID changes and then turn > those into a monotonically incrementing counter. (Probably 64 bits > wide?) (That's probably also a little bit faster than comparing an > entire UUID.) 
> > An option might be to put that counter into the vDSO, instead of a > separate VMA; but I don't know how the other folks feel about that. > Andy, do you have opinions on this? That way, normal userspace code > that uses this infrastructure wouldn't have to mess around with a > special device at all. And it'd be usable in seccomp sandboxes and so > on without needing special plumbing. And libraries wouldn't have to > call open() and mess with file descriptor numbers. The vDSO might be annoyingly slow for this. Something like the rseq page might make sense. It could be a generic indication of "system went through some form of suspend".
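[Editorial note] Putting the open()/poll()/read()/write() contract described in the quoted patch together, a consumer would look roughly like the following; the /dev/vmgenid path and the 16-byte read size are assumptions on my part, not taken from the patch's code.

#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

int main(void)
{
    unsigned char uuid[16];                 /* 128-bit VM generation ID */
    int fd = open("/dev/vmgenid", O_RDWR);  /* device node name assumed */
    if (fd < 0)
        return 1;

    for (;;) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        poll(&pfd, 1, -1);                  /* POLLIN => generation changed */
        if (read(fd, uuid, sizeof(uuid)) != sizeof(uuid))
            continue;

        /* ...reseed RNGs, drop cached UUIDs/counters, etc... */

        write(fd, uuid, sizeof(uuid));      /* ack: this watcher is up to date */
    }
}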
Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver
On Sun, Oct 18, 2020 at 8:52 AM Michael S. Tsirkin wrote: > > On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote: > > 4c. The guest kernel maintains an array of physical addresses that are > > MADV_WIPEONFORK. The hypervisor knows about this array and its > > location through whatever protocol, and before resuming a > > moved/snapshotted/duplicated VM, it takes the responsibility for > > memzeroing this memory. The huge pro here would be that this > > eliminates all races, and reduces complexity quite a bit, because the > > hypervisor can perfectly synchronize its bringup (and SMP bringup) > > with this, and it can even optimize things like on-disk memory > > snapshots to simply not write out those pages to disk. > > > > A 4c-like approach seems like it'd be a lot of bang for the buck -- we > > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new > > userspace API to deal with, and it'd be race free, and eliminate a lot > > of kernel complexity. > > Clearly this has a chance to break applications, right? > If there's an app that uses this as a non-system-calls way > to find out whether there was a fork, it will break > when wipe triggers without a fork ... > For example, imagine: > > MADV_WIPEONFORK > copy secret data to MADV_DONTFORK > fork > > > used to work, with this change it gets 0s instead of the secret data. > > > I am also not sure it's wise to expose each guest process > to the hypervisor like this. E.g. each process needs a > guest physical address of its own then. This is a finite resource. > > > The mmap interface proposed here is somewhat baroque, but it is > certainly simple to implement ... Wipe of fork/vmgenid/whatever could end up being much more problematic than it naively appears -- it could be wiped in the middle of a read. Either the API needs to handle this cleanly, or we need something more aggressive like signal-on-fork. --Andy
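[Editorial note] For readers who have not used it, MADV_WIPEONFORK (Linux 4.14+) already provides the fork-time half of this behaviour; the "4c" proposal is essentially to extend the same wipe to the snapshot/resume path. A minimal demo of the existing semantics:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    size_t len = 4096;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED || madvise(p, len, MADV_WIPEONFORK) != 0)
        return 1;
    strcpy(p, "seed material");

    if (fork() == 0) {
        /* The child sees the page zeroed, so it knows it must reseed. */
        printf("child sees:  '%s'\n", p);
        _exit(0);
    }
    wait(NULL);
    printf("parent sees: '%s'\n", p);
    return 0;
}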
Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver
On Sun, Oct 18, 2020 at 8:59 AM Michael S. Tsirkin wrote: > > On Sun, Oct 18, 2020 at 08:54:36AM -0700, Andy Lutomirski wrote: > > On Sun, Oct 18, 2020 at 8:52 AM Michael S. Tsirkin wrote: > > > > > > On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote: > > > > 4c. The guest kernel maintains an array of physical addresses that are > > > > MADV_WIPEONFORK. The hypervisor knows about this array and its > > > > location through whatever protocol, and before resuming a > > > > moved/snapshotted/duplicated VM, it takes the responsibility for > > > > memzeroing this memory. The huge pro here would be that this > > > > eliminates all races, and reduces complexity quite a bit, because the > > > > hypervisor can perfectly synchronize its bringup (and SMP bringup) > > > > with this, and it can even optimize things like on-disk memory > > > > snapshots to simply not write out those pages to disk. > > > > > > > > A 4c-like approach seems like it'd be a lot of bang for the buck -- we > > > > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new > > > > userspace API to deal with, and it'd be race free, and eliminate a lot > > > > of kernel complexity. > > > > > > Clearly this has a chance to break applications, right? > > > If there's an app that uses this as a non-system-calls way > > > to find out whether there was a fork, it will break > > > when wipe triggers without a fork ... > > > For example, imagine: > > > > > > MADV_WIPEONFORK > > > copy secret data to MADV_DONTFORK > > > fork > > > > > > > > > used to work, with this change it gets 0s instead of the secret data. > > > > > > > > > I am also not sure it's wise to expose each guest process > > > to the hypervisor like this. E.g. each process needs a > > > guest physical address of its own then. This is a finite resource. > > > > > > > > > The mmap interface proposed here is somewhat baroque, but it is > > > certainly simple to implement ... > > > > Wipe of fork/vmgenid/whatever could end up being much more problematic > > than it naively appears -- it could be wiped in the middle of a read. > > Either the API needs to handle this cleanly, or we need something more > > aggressive like signal-on-fork. > > > > --Andy > > > Right, it's not on fork, it's actually when process is snapshotted. > > If we assume it's CRIU we care about, then I > wonder what's wrong with something like > MADV_CHANGEONPTRACE_SEIZE > and basically say it's X bytes which change the value... I feel like we may be approaching this from the wrong end. Rather than saying "what data structure can the kernel expose that might plausibly be useful", how about we try identifying some specific userspace needs and see what a good solution could look like. I can identify two major cryptographic use cases: 1. A userspace RNG. The API exposed by the userspace end is a function that generates random numbers. The userspace code in turn wants to know some things from the kernel: it wants some best-quality-available random seed data from the kernel (and possibly an indication of how good it is) as well as an indication of whether the userspace memory may have been cloned or rolled back, or, failing that, an indication of whether a reseed is needed. Userspace could implement a wide variety of algorithms on top depending on its goals and compliance requirements, but the end goal is for the userspace part to be very, very fast. 2. A userspace crypto stack that wants to avoid shooting itself in the foot due to inadvertently doing the same thing twice. 
For example, an AES-GCM stack does not want to reuse an IV, *expecially* if there is even the slightest chance that it might reuse the IV for different data. This use case doesn't necessarily involve random numbers, but, if anything, it needs to be even faster than #1. The threats here are not really the same. For #1, a userspace RNG should be able to recover from a scenario in which an adversary clones the entire process *and gets to own the clone*. For example, in Android, an adversary can often gain complete control of a fork of the zygote -- this shouldn't adversely affect the security properties of other forks. Similarly, a server farm could operate by having one booted server that is cloned to create more workers. Those clones could be provisioned with secrets and permissions post-clone, and at attacker gaining control of a fresh clone could be considered acceptable. For #2, in contrast, if an adversary gains control of a clone of an AES-GCM session, they learn the key outright -- the relevant attack scenario is that the adversary gets to interact with two clones without compromising either clone per se. It's worth noting that, in both cases, there could possibly be more than one instance of an RNG or an AES-GCM session in the same process. This means that using signals is awkward but not necessarily impossibly. (This is an area in which Linux, and POSIX in general, is much weaker than Windows.)
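[Editorial note] To make use case #1 concrete, the fast path being argued for is a single load-and-compare against some kernel-maintained generation value, falling back to the kernel only when it changes. In the sketch below the generation variable merely stands in for whatever mapping/vDSO/rseq-style mechanism ends up existing -- that part is exactly what is being debated.

#include <stdint.h>
#include <string.h>
#include <sys/random.h>

/* Stand-in for the kernel-provided value; in a real design this would be a
 * mapping the kernel/hypervisor bumps on clone or rollback. */
static volatile uint64_t vm_generation;

static uint64_t seen_generation = (uint64_t)-1;
static unsigned char pool[64];

static void reseed(void)
{
    /* Best-quality seed data from the kernel. */
    getrandom(pool, sizeof(pool), 0);
    seen_generation = vm_generation;
}

void fast_random_bytes(unsigned char *out, size_t n)
{
    /* One cheap compare on the hot path; reseed only when the VM may have
     * been cloned or rolled back (or on first use). */
    if (vm_generation != seen_generation)
        reseed();

    /* A real implementation would expand 'pool' with a userspace DRBG;
     * this placeholder only marks where that step goes. */
    memset(out, 0, n);
}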
Re: [PATCH v7 0/6] gpio: Add GPIO Aggregator
On Wed, May 20, 2020 at 3:40 PM Andy Shevchenko wrote: > On Wed, May 20, 2020 at 3:38 PM Geert Uytterhoeven > wrote: > > On Wed, May 20, 2020 at 2:14 PM Andy Shevchenko > > wrote: > > > On Mon, May 11, 2020 at 04:52:51PM +0200, Geert Uytterhoeven wrote: > > ... > > > > Sorry for late reply, recently noticed this nice idea. > > > The comment I have is, please, can we reuse bitmap parse algorithm and > > > syntax? > > > We have too many different formats and parsers in the kernel and bitmap's > > > one > > > seems suitable here. > > > > Thank you, I wasn't aware of that. > > > > Which one do you mean? The documentation seems to be confusing, > > and incomplete. > > My first guess was bitmap_parse(), but that one assumes hex values? > > And given it processes the unsigned long bitmap in u32 chunks, I guess > > it doesn't work as expected on big-endian 64-bit? > > > > bitmap_parselist() looks more suitable, and the format seems to be > > compatible with what's currently used, so it won't change ABI. What ABI? We didn't have a release with it, right? So, we are quite flexible for few more weeks to amend it. > > Is that the one you propose? > > Yes, sorry for the confusion. > > > > (Despite other small clean ups, like strstrip() use) > > > > Aka strim()? There are too many of them, to know all of them by heart ;-) > > The difference between them is __must_check flag. But yes. -- With Best Regards, Andy Shevchenko
Re: [PATCH v7 0/6] gpio: Add GPIO Aggregator
On Wed, May 20, 2020 at 3:38 PM Geert Uytterhoeven wrote: > On Wed, May 20, 2020 at 2:14 PM Andy Shevchenko > wrote: > > On Mon, May 11, 2020 at 04:52:51PM +0200, Geert Uytterhoeven wrote: ... > > Sorry for late reply, recently noticed this nice idea. > > The comment I have is, please, can we reuse bitmap parse algorithm and > > syntax? > > We have too many different formats and parsers in the kernel and bitmap's > > one > > seems suitable here. > > Thank you, I wasn't aware of that. > > Which one do you mean? The documentation seems to be confusing, > and incomplete. > My first guess was bitmap_parse(), but that one assumes hex values? > And given it processes the unsigned long bitmap in u32 chunks, I guess > it doesn't work as expected on big-endian 64-bit? > > bitmap_parselist() looks more suitable, and the format seems to be > compatible with what's currently used, so it won't change ABI. > Is that the one you propose? Yes, sorry for the confusion. > > (Despite other small clean ups, like strstrip() use) > > Aka strim()? There are too many of them, to know all of them by heart ;-) The difference between them is __must_check flag. But yes. -- With Best Regards, Andy Shevchenko
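[Editorial note] For reference, bitmap_parselist() already understands the comma/range syntax used in the aggregator examples in this thread ("19,20", "20-22", ...), which is why switching to it keeps the accepted input format while dropping a hand-rolled parser. A minimal kernel-side sketch of the call:

#include <linux/bitmap.h>
#include <linux/errno.h>

/* Parse a GPIO offset list such as "19,20" or "20-22" into a bitmap. */
static int parse_gpio_offsets(const char *spec, unsigned long *lines,
                              unsigned int nlines)
{
    int err = bitmap_parselist(spec, lines, nlines);

    if (err)
        return err;             /* e.g. -EINVAL on malformed input */

    /* "20-22" sets bits 20, 21 and 22 in 'lines'. */
    return 0;
}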
Re: [PATCH v7 0/6] gpio: Add GPIO Aggregator
gt; destroyed by writing to atribute files in sysfs. > Sample session on the Renesas Koelsch development board: > > - Unbind LEDs from leds-gpio driver: > > echo leds > /sys/bus/platform/drivers/leds-gpio/unbind > > - Create aggregators: > > $ echo e6052000.gpio 19,20 \ > > /sys/bus/platform/drivers/gpio-aggregator/new_device > > gpio-aggregator gpio-aggregator.0: gpio 0 => gpio-953 > gpio-aggregator gpio-aggregator.0: gpio 1 => gpio-954 > gpiochip_find_base: found new base at 758 > gpio gpiochip12: (gpio-aggregator.0): added GPIO chardev (254:13) > gpiochip_setup_dev: registered GPIOs 758 to 759 on device: gpiochip12 > (gpio-aggregator.0) > > $ echo e6052000.gpio 21 e605.gpio 20-22 \ > > /sys/bus/platform/drivers/gpio-aggregator/new_device > > gpio-aggregator gpio-aggregator.1: gpio 0 => gpio-955 > gpio-aggregator gpio-aggregator.1: gpio 1 => gpio-1012 > gpio-aggregator gpio-aggregator.1: gpio 2 => gpio-1013 > gpio-aggregator gpio-aggregator.1: gpio 3 => gpio-1014 > gpiochip_find_base: found new base at 754 > gpio gpiochip13: (gpio-aggregator.1): added GPIO chardev (254:13) > gpiochip_setup_dev: registered GPIOs 754 to 757 on device: gpiochip13 > (gpio-aggregator.1) > > - Adjust permissions on /dev/gpiochip1[23] (optional) > > - Control LEDs: > > $ gpioset gpiochip12 0=0 1=1 # LED6 OFF, LED7 ON > $ gpioset gpiochip12 0=1 1=0 # LED6 ON, LED7 OFF > $ gpioset gpiochip13 0=1 # LED8 ON > $ gpioset gpiochip13 0=0 # LED8 OFF > > - Destroy aggregators: > > $ echo gpio-aggregator.0 \ > > /sys/bus/platform/drivers/gpio-aggregator/delete_device > $ echo gpio-aggregator.1 \ > > /sys/bus/platform/drivers/gpio-aggregator/delete_device > > To ease testing, I have pushed this series to the > topic/gpio-aggregator-v7 branch of my renesas-drivers repository at > git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-drivers.git. > > Thanks! 
> > References: > [1] "[PATCH QEMU v2 0/5] Add a GPIO backend" > > (https://lore.kernel.org/linux-gpio/20200423090118.11199-1-geert+rene...@glider.be/) > [2] "[PATCH V4 2/2] gpio: inverter: document the inverter bindings" > > (https://lore.kernel.org/r/1561699236-18620-3-git-send-email-harish_kand...@mentor.com/) > [3] "[PATCH v6 0/8] gpio: Add GPIO Aggregator" > > (https://lore.kernel.org/linux-doc/20200324135328.5796-1-geert+rene...@glider.be/) > [4] "[PATCH v5 0/5] gpio: Add GPIO Aggregator" > > (https://lore.kernel.org/r/20200218151812.7816-1-geert+rene...@glider.be/) > [5] "[PATCH v4 0/5] gpio: Add GPIO Aggregator" > > (https://lore.kernel.org/r/20200115181523.23556-1-geert+rene...@glider.be) > [6] "[PATCH v3 0/7] gpio: Add GPIO Aggregator/Repeater" > > (https://lore.kernel.org/r/20191127084253.16356-1-geert+rene...@glider.be/) > [7] "[PATCH/RFC v2 0/5] gpio: Add GPIO Aggregator Driver" > > (https://lore.kernel.org/r/20190911143858.13024-1-geert+rene...@glider.be/) > [8] "[PATCH RFC] gpio: Add Virtual Aggregator GPIO Driver" > > (https://lore.kernel.org/r/20190705160536.12047-1-geert+rene...@glider.be/) > [9] "[PATCH QEMU POC] Add a GPIO backend" > > (https://lore.kernel.org/r/20181003152521.23144-1-geert+rene...@glider.be/) > [10] "Getting To Blinky: Virt Edition / Making device pass-through >work on embedded ARM" > (https://fosdem.org/2019/schedule/event/vai_getting_to_blinky/) > > Geert Uytterhoeven (6): > i2c: i801: Use GPIO_LOOKUP() helper macro > mfd: sm501: Use GPIO_LOOKUP_IDX() helper macro > gpiolib: Add support for GPIO lookup by line name > gpio: Add GPIO Aggregator > docs: gpio: Add GPIO Aggregator documentation > MAINTAINERS: Add GPIO Aggregator section > > .../admin-guide/gpio/gpio-aggregator.rst | 111 > Documentation/admin-guide/gpio/index.rst | 1 + > Documentation/driver-api/gpio/board.rst | 15 +- > MAINTAINERS | 7 + > drivers/gpio/Kconfig | 12 + > drivers/gpio/Makefile | 1 + > drivers/gpio/gpio-aggregator.c| 568 ++ > drivers/gpio/gpiolib.c| 22 +- > drivers/i2c/busses/i2c-i801.c | 6 +- > drivers/mfd/sm501.c | 24 +- > include/linux/gpio/machine.h | 17 +- > 11 files changed, 748 insertions(+), 36 deletions(-) > create mode 100644 Documentation/admin-guide/gpio/gpio-aggregator.rst > create mode 100644 drivers/gpio/gpio-aggregator.c > > -- > 2.17.1 > > Gr{oetje,eeting}s, > > Geert > > -- > Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- > ge...@linux-m68k.org > > In personal conversations with technical people, I call myself a hacker. But > when I'm talking to journalists I just say "programmer" or something like > that. > -- Linus Torvalds -- With Best Regards, Andy Shevchenko
[Qemu-devel] [Bug 586175] Re: Windows XP/2003 doesn't boot
Great solution Andreas, it worked for a Win2k image which I could only boot previously using an iso from http://www.resoo.org/docs/ntldr/files/ However, I have a w7 image that I have never managed to boot, apart from its installation cd image using virt-install 20Gb w7 image: # losetup /dev/loop0 /vm/w7.img; kpartx -a /dev/loop0 # fdisk -l /dev/loop0 Disk /dev/loop0: 21.5 GB, 21474836480 bytes 255 heads, 63 sectors/track, 2610 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0xaf12c11f Device Boot Start End Blocks Id System /dev/loop0p1 * 1 13 1024007 HPFS/NTFS Partition 1 does not end on cylinder boundary. /dev/loop0p2 132611208670727 HPFS/NTFS Partition 2 does not end on cylinder boundary. # hexedit /dev/mapper/loop0p1 EB 52 90 4E 54 46 53 20 20 20 20 00 02 08 00 00 00 00 00 00 00 F8 00 00 3F 00 10 00 00 08 00 00 .R.NTFS.?... 0020 00 00 00 00 80 00 80 00 FF 1F 03 00 00 00 00 00 55 21 00 00 00 00 00 00 02 00 00 00 00 00 00 00 U!.. # hexedit /dev/mapper/loop0p2 EB 52 90 4E 54 46 53 20 20 20 20 00 02 08 00 00 00 00 00 00 00 F8 00 00 3F 00 10 00 00 28 03 00 .R.NTFS.?(.. 0020 00 00 00 00 80 00 80 00 FF CF 7C 02 00 00 00 00 00 00 0C 00 00 00 00 00 02 00 00 00 00 00 00 00 ..|. # kpartx -d /dev/loop0; losetup -d /dev/loop0 I changed location 0x1a to 0xFF on one or other or both partitions and it still will not boot in virt-manager. Cheers, Andy. -- Windows XP/2003 doesn't boot https://bugs.launchpad.net/bugs/586175 You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. Status in QEMU: Incomplete Status in “qemu-kvm” package in Ubuntu: New Status in Debian GNU/Linux: New Status in Fedora: Unknown Bug description: Hello everyone, my qemu doesn't boot any Windows XP/2003 installations if I try to boot the image. If I boot the install cd first, it's boot manager counts down and triggers the boot on it's own. That's kinda stupid. I'm using libvirt, but even by a simple > qemu-kvm -drive file=image.img,media=disk,if=ide,boot=on it won't boot. Qemu hangs at the message "Booting from Hard Disk..." I'm using qemu-kvm-0.12.4 with SeaBIOS 0.5.1 on Gentoo (No-Multilib and AMD64). It's a server, that means I'm using VNC as the primary graphic output but i don't think it should be an issue.
[Qemu-devel] [Bug 586175] Re: Windows XP/2003 doesn't boot
Andreas, The program that created the disk image seems confused, but it worked for creating a VM for FC11. Windows install seems to run fine, until wanting to boot from the drive it created. I don't know what creates the drive image and geometry, but it is broken. I think this is what I used to create the VM, but I have messed around with so many configurations and methods, I'm not sure what is what anymore. virt-install --connect qemu:///system -n w7 -r 2048 --vcpus=2 \ --disk path=/vm/w7.img,size=20,sparse=false,format=qcow2 \ -c /vm/w7cd.iso --vnc --noautoconsole \ --os-type windows --os-variant win7 --accelerate --network=bridge:br0 --hvm How many thousands of people have struggled with this and also got nowhere? It just looks like the virt-install developers have not tasted their own dogfood! LVM is supposed to be easy - just select vm image and boot, but the more I read about VMs, kvm, qemu, virtualbox, virsh etc, the more confused I get on how they relate to each other. testdisk reports this: ~~ Disk /dev/loop0 - 21 GB / 20 GiB - CHS 41943040 1 1 (wtf ??) Partition StartEndSize in sectors 1 * HPFS - NTFS 2048 206847 204800 [System Reserved] 2 P HPFS - NTFS 206848 41940991 41734144 Select 1: Disk /dev/loop0 - 21 GB / 20 GiB - CHS 41943040 1 1 Partition StartEndSize in sectors 1 * HPFS - NTFS 2048 206847 204800 [System Reserved] Boot sector Warning: Incorrect number of heads/cylinder 16 (NTFS) != 1 (HD) Warning: Incorrect number of sectors per track 63 (NTFS) != 1 (HD) Status: OK Backup boot sector Warning: Incorrect number of heads/cylinder 16 (NTFS) != 1 (HD) Warning: Incorrect number of sectors per track 63 (NTFS) != 1 (HD) Status: OK Sectors are identical. A valid NTFS Boot sector must be present in order to access any data; even if the partition is not bootable. ~~ Rebuild BS: ~~ Disk /dev/loop0 - 21 GB / 20 GiB - CHS 41943040 1 1 Partition StartEndSize in sectors 1 * HPFS - NTFS 2048 206847 204800 [System Reserved] filesystem size 204800 204800 sectors_per_cluster 8 8 mft_lcn 8533 8533 mftmirr_lcn 2 2 clusters_per_mft_record -10 -10 clusters_per_index_record 1 1 Extrapolated boot sector and current boot sector are different. ~~ Q Select 2: ~~ Disk /dev/loop0 - 21 GB / 20 GiB - CHS 41943040 1 1 Partition StartEndSize in sectors 2 P HPFS - NTFS 206848 41940991 41734144 Boot sector Warning: Incorrect number of heads/cylinder 16 (NTFS) != 1 (HD) Warning: Incorrect number of sectors per track 63 (NTFS) != 1 (HD) Status: OK Backup boot sector Warning: Incorrect number of heads/cylinder 16 (NTFS) != 1 (HD) Warning: Incorrect number of sectors per track 63 (NTFS) != 1 (HD) Status: OK Sectors are identical. A valid NTFS Boot sector must be present in order to access any data; even if the partition is not bootable. ~~ Rebuild BS: ~~ Disk /dev/loop0 - 21 GB / 20 GiB - CHS 41943040 1 1 Partition StartEndSize in sectors 2 P HPFS - NTFS 206848 41940991 41734144 filesystem size 41734144 41734144 sectors_per_cluster 8 8 mft_lcn 786432 786432 mftmirr_lcn 2 2 clusters_per_mft_record -10 -10 clusters_per_index_record 1 1 Extrapolated boot sector and current boot sector are different. ~~ It looks a mess. -- Windows XP/2003 doesn't boot https://bugs.launchpad.net/bugs/586175 You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. 
Status in QEMU: Incomplete Status in “qemu-kvm” package in Ubuntu: New Status in Debian GNU/Linux: New Status in Fedora: Unknown Bug description: Hello everyone, my qemu doesn't boot any Windows XP/2003 installations if I try to boot the image. If I boot the install cd first, it's boot manager counts down and triggers the boot on it's own. That's kinda stupid. I'm using libvirt, but even by a simple > qemu-kvm -drive file=image.img,media=disk,if=ide,boot=on it won't boot. Qemu hangs at the message "Booting from Hard Disk..." I'm using qemu-kvm-0.12.4 with SeaBIOS 0.5.1 on Gentoo (No-Multilib and AMD64). It's a server, that means I'm using VNC as the primary graphic output but i don't think it should be an issue.
Re: [Qemu-devel] [PATCH] ahci: enable pci bus master MemoryRegion before loading ahci engines
(0x7fcc4e19b4a0)[0]: cmd done ahci_port_write ahci(0x7fcc4e19b4a0)[0]: port write [reg:PxSERR] @ 0x30: 0x ahci_port_write ahci(0x7fcc4e19b4a0)[0]: port write [reg:PxIS] @ 0x10: 0x0001 ahci_mem_write_host ahci(0x7fcc4e19b4a0) write4 [reg:IS] @ 0x8: 0x0001 ahci_port_write ahci(0x7fcc4e19b4a0)[0]: port write [reg:PxCI] @ 0x38: 0x8000 handle_cmd_fis_dump ahci(0x7fcc4e19b4a0)[0]: FIS: 0x00: 27 80 ef 02 00 00 00 a0 00 00 00 00 00 00 00 00 0x10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ahci_cmd_done ahci(0x7fcc4e19b4a0)[0]: cmd done ahci_port_write ahci(0x7fcc4e19b4a0)[0]: port write [reg:PxSERR] @ 0x30: 0x ahci_port_write ahci(0x7fcc4e19b4a0)[0]: port write [reg:PxIS] @ 0x10: 0x0001 ahci_mem_write_host ahci(0x7fcc4e19b4a0) write4 [reg:IS] @ 0x8: 0x0001 ahci_port_write ahci(0x7fcc4e19b4a0)[0]: port write [reg:PxCI] @ 0x38: 0x0001 handle_cmd_fis_dump ahci(0x7fcc4e19b4a0)[0]: FIS: 0x00: 27 80 ec 00 00 00 00 a0 00 00 00 00 00 00 00 00 0x10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ahci_populate_sglist ahci(0x7fcc4e19b4a0)[0] ahci_dma_prepare_buf ahci(0x7fcc4e19b4a0)[0]: prepare buf limit=512 prepared=512 ahci_start_transfer ahci(0x7fcc4e19b4a0)[0]: reading 512 bytes on ata w/ sglist ahci_cmd_done ahci(0x7fcc4e19b4a0)[0]: cmd done ahci_port_write ahci(0x7fcc4e19b4a0)[0]: port write [reg:PxSERR] @ 0x30: 0x ahci_port_write ahci(0x7fcc4e19b4a0)[0]: port write [reg:PxIS] @ 0x10: 0x0001 ahci_port_write ahci(0x7fcc4e19b4a0)[0]: port write [reg:PxIS] @ 0x10: 0x0002 ahci_mem_write_host ahci(0x7fcc4e19b4a0) write4 [reg:IS] @ 0x8: 0x0001 --- -- Best regards, Andy Chiu On 2019/9/10 上午2:13, John Snow wrote: On 9/9/19 1:18 PM, andychiu via Qemu-devel wrote: If Windows 10 guests have enabled 'turn off hard disk after idle' option in power settings, and the guest has a SATA disk plugged in, the SATA disk will be turned off after a specified idle time. If the guest is live migrated or saved/loaded with its SATA disk turned off, the following error will occur: qemu-system-x86_64: AHCI: Failed to start FIS receive engine: bad FIS receive buffer address qemu-system-x86_64: Failed to load ich9_ahci:ahci qemu-system-x86_64: error while loading state for instance 0x0 of device ':00:1a.0/ich9_ahci' qemu-system-x86_64: load of migration failed: Operation not permitted Oof. That can't have been fun to discover. Observation from trace logs shows that a while after Windows 10 turns off a SATA disk (IDE disks don't have the following behavior), it will disable the PCI_COMMAND_MASTER flag of the pci device containing the ahci device. When the the disk is turning back on, the PCI_COMMAND_MASTER flag will be restored first. 
But if the guest is migrated or saved/loaded while the disk is off, the post_load callback of ahci device, ahci_state_post_load(), will fail at ahci_cond_start_engines() if the MemoryRegion pci_dev->bus_master_enable_region is not enabled, with pci_dev pointing to the PCIDevice struct containing the ahci device. This patch enables pci_dev->bus_master_enable_region before calling ahci_cond_start_engines() in ahci_state_post_load(), and restore the MemoryRegion to its original state afterwards.> This looks good to me from an AHCI perspective, but I'm not as clear on the implications of toggling the MemoryRegion, so I have some doubts. MST, can you chime in and clear my confusion? I suppose when the PCI_COMMAND_MASTER bit is turned off, we disable the memory region, as a guest would be unable to establish a new mapping in this time, so it makes sense that the attempt to map it fails. What's less clear to me is what happens to existing mappings when a region is disabled. Are they invalidated? If so, does it make sense that we are trying to establish a mapping here at all? Maybe it's absolutely correct that this fails. (I suppose, though, that the simple toggling of the region won't be a guest-visible event, so it's probably safe to do. Right?) What I find weird for AHCI is this: We try to engage the CLB mapping before the FIS mapping, but we fail at the FIS mapping. So why is PORT_CMD_FIS_RX set while PORT_CMD_START is unset? It