[Bug 1747393] Re: nvme is missing support for NVME_ADM_CMD_ASYNC_EV_REQ

2021-04-21 Thread Klaus Jensen
This was fixed in 5.2.0.

** Changed in: qemu
   Status: Incomplete => Fix Released

** Changed in: qemu
 Assignee: (unassigned) => Klaus Jensen (birkelund)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1747393

Title:
  nvme is missing support for NVME_ADM_CMD_ASYNC_EV_REQ

Status in QEMU:
  Fix Released

Bug description:
  NVME_ADM_CMD_ASYNC_EV_REQ is required by the specification, but apparently
  an error is returned when this command is used.

  The Asynchronous Event Request is a mandatory opcode required by
  specification (Figure 40, Section 5 in NVMe 1.2; Figure 41, Section 5
  in NVMe 1.3). A simple way to deal with this in an emulator that
  doesn't really want to use async events is to queue up the requests
  and not do anything with them, and only complete them when the driver
  aborts the command.
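
  A minimal sketch of this queue-and-defer approach (hypothetical structure
  and field names, not the actual QEMU nvme device code):

      #define MAX_PENDING_AERS 4

      typedef struct NvmeRequest NvmeRequest;

      typedef struct NvmeAerQueue {
          NvmeRequest *reqs[MAX_PENDING_AERS]; /* parked AER commands */
          int n;
      } NvmeAerQueue;

      /* Park the AER without posting a completion; a completion is posted
       * only when an event occurs or when the driver aborts the command. */
      static int nvme_queue_aer(NvmeAerQueue *q, NvmeRequest *req)
      {
          if (q->n == MAX_PENDING_AERS) {
              return -1; /* spec: AER Limit Exceeded status */
          }
          q->reqs[q->n++] = req;
          return 0; /* no completion queue entry yet */
      }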

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1747393/+subscriptions



Re: [PATCH v2] hw/riscv: Fix OT IBEX reset vector

2021-04-21 Thread Alexander Wagner



On 21.04.21 02:00, Alistair Francis wrote:

On Tue, Apr 20, 2021 at 6:01 PM Alexander Wagner
 wrote:

The IBEX documentation [1] specifies the reset vector to be "the most
significant 3 bytes of the boot address and the reset value (0x80) as
the least significant byte".

[1] 
https://github.com/lowRISC/ibex/blob/master/doc/03_reference/exception_interrupts.rst
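
A one-line computation implements this; a minimal sketch, assuming a 32-bit
boot address:

    #include <stdint.h>

    /* Reset vector: most significant 3 bytes of the boot address,
     * with the reset value 0x80 as the least significant byte. */
    static uint32_t ibex_resetvec(uint32_t boot_addr)
    {
        return (boot_addr & 0xffffff00u) | 0x80u;
    }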

Signed-off-by: Alexander Wagner 
Reviewed-by: Alistair Francis 

Thanks!

Applied to riscv-to-apply.next


Perfect, thank you :)

Regards

Alex




Re: [RFC PATCH v2 0/6] hw/arm/virt: Introduce cpu topology support

2021-04-21 Thread wangyanan (Y)

Hey guys, any comments will be really welcomed and appreciated! 😉

Thanks,
Yanan
On 2021/4/13 16:07, Yanan Wang wrote:

Hi,

This series is a new version of [0], recently posted by Ying Fang
to introduce cpu topology support for the ARM platform. I have now
taken over this work; thanks to Ying for his contribution.

Description:
An accurate cpu topology may help improve the cpu scheduler's decision
making when dealing with multi-core systems. So a cpu topology description
is helpful to provide the guest with the right view. Dario Faggioli's talk
in [1] also shows that the virtual topology can have an impact on scheduling
performance. Thus this patch series introduces cpu topology support for the
ARM platform.

This series originally comes from Andrew Jones's patches [2], but with
some re-arrangement. Thanks for Andrew's contribution. In this series,
both an fdt cpu-map and an ACPI PPTT table are introduced to present the
cpu topology to the guest. A new function virt_smp_parse(), unlike the
default smp_parse(), prefers cores over sockets (a minimal sketch follows
the reference links below).

[0] 
https://patchwork.kernel.org/project/qemu-devel/cover/20210225085627.2263-1-fangyi...@huawei.com/
[1] 
https://kvmforum2020.sched.com/event/eE1y/virtual-topology-for-virtual-machines-friend-or-foe-dario-faggioli-suse
[2] 
https://github.com/rhdrjones/qemu/commit/ecfc1565f22187d2c715a99bbcd35cf3a7e428fa
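
A minimal sketch of the cores-over-sockets preference (illustrative only,
not the actual virt_smp_parse() implementation):

    /* When counts are unspecified, assign the remaining cpus to cores
     * rather than sockets. */
    static void smp_parse_prefer_cores(unsigned cpus, unsigned *sockets,
                                       unsigned *cores, unsigned *threads)
    {
        if (*threads == 0) {
            *threads = 1;
        }
        if (*cores == 0) {
            if (*sockets == 0) {
                *sockets = 1;
            }
            *cores = cpus / (*sockets * *threads); /* prefer cores */
        } else if (*sockets == 0) {
            *sockets = cpus / (*cores * *threads);
        }
    }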

Test results:
After applying this patch series and launching a guest with virt-6.0 and
the cpu topology configured with -smp 96,sockets=2,clusters=6,cores=4,threads=2,
the VM's cpu topology description (lscpu output) shows as below.

Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           2
NUMA node(s):        1
Vendor ID:           0x48
Model:               0
Stepping:            0x1
BogoMIPS:            200.00
NUMA node0 CPU(s):   0-95

---

Changelogs:
v1->v2:
- Address Andrew Jones's comments
- Address Michael S. Tsirkin's comments
- Pick up one more patch (patch #6) from Andrew Jones
- Rebased on v6.0.0-rc2 release

---

Andrew Jones (3):
   device_tree: Add qemu_fdt_add_path
   hw/arm/virt: DT: Add cpu-map
   hw/arm/virt: Replace smp_parse with one that prefers cores

Yanan Wang (2):
   hw/acpi/aml-build: Add processor hierarchy node structure
   hw/arm/virt-acpi-build: Add PPTT table

Ying Fang (1):
   hw/arm/virt-acpi-build: Distinguish possible and present cpus

  hw/acpi/aml-build.c          |  27
  hw/arm/virt-acpi-build.c     |  77
  hw/arm/virt.c                | 120
  include/hw/acpi/aml-build.h  |   4
  include/hw/arm/virt.h        |   1
  include/sysemu/device_tree.h |   1
  softmmu/device_tree.c        |  45
  7 files changed, 268 insertions(+), 7 deletions(-)





Re: [Virtio-fs] [for-6.1 2/2] virtiofsd: Add support for FUSE_SYNCFS request

2021-04-21 Thread Greg Kurz
On Tue, 20 Apr 2021 14:57:19 -0400
Vivek Goyal  wrote:

> On Mon, Apr 19, 2021 at 05:11:42PM +0200, Greg Kurz wrote:
> > Honor the expected behavior of syncfs() to synchronously flush all
> > data and metadata on linux systems. Like the ->sync_fs() superblock
> > operation in the linux kernel, FUSE_SYNCFS has a 'wait' argument that
> > tells whether the server should wait for outstanding I/Os to complete
> > before replying to the client. Anything virtiofsd can do to flush
> > the caches implies blocking syscalls, so nothing is done if waiting
> > isn't requested.
> > 
> > Flushing is done with syncfs(). This is suboptimal as it will also
> > flush writes performed by any other process on the same file system,
> > and thus add an unbounded time penalty to syncfs(). This may be
> > optimized in the future, but enforce correctness first.
> > 
> > Signed-off-by: Greg Kurz 
> > ---
> >  tools/virtiofsd/fuse_lowlevel.c   | 19 ++
> >  tools/virtiofsd/fuse_lowlevel.h   | 13 
> >  tools/virtiofsd/passthrough_ll.c  | 29 +++
> >  tools/virtiofsd/passthrough_seccomp.c |  1 +
> >  4 files changed, 62 insertions(+)
> > 
> > diff --git a/tools/virtiofsd/fuse_lowlevel.c 
> > b/tools/virtiofsd/fuse_lowlevel.c
> > index 58e32fc96369..2d0c47a7a60e 100644
> > --- a/tools/virtiofsd/fuse_lowlevel.c
> > +++ b/tools/virtiofsd/fuse_lowlevel.c
> > @@ -1870,6 +1870,24 @@ static void do_lseek(fuse_req_t req, fuse_ino_t 
> > nodeid,
> >  }
> >  }
> >  
> > +static void do_syncfs(fuse_req_t req, fuse_ino_t nodeid,
> > +  struct fuse_mbuf_iter *iter)
> > +{
> > +struct fuse_syncfs_in *arg;
> > +
> > +arg = fuse_mbuf_iter_advance(iter, sizeof(*arg));
> > +if (!arg) {
> > +fuse_reply_err(req, EINVAL);
> > +return;
> > +}
> > +
> > +if (req->se->op.syncfs) {
> > +req->se->op.syncfs(req, arg->wait);
> > +} else {
> > +fuse_reply_err(req, ENOSYS);
> > +}
> > +}
> > +
> >  static void do_init(fuse_req_t req, fuse_ino_t nodeid,
> >  struct fuse_mbuf_iter *iter)
> >  {
> > @@ -2267,6 +2285,7 @@ static struct {
> >  [FUSE_RENAME2] = { do_rename2, "RENAME2" },
> >  [FUSE_COPY_FILE_RANGE] = { do_copy_file_range, "COPY_FILE_RANGE" },
> >  [FUSE_LSEEK] = { do_lseek, "LSEEK" },
> > +[FUSE_SYNCFS] = { do_syncfs, "SYNCFS" },
> >  };
> >  
> >  #define FUSE_MAXOP (sizeof(fuse_ll_ops) / sizeof(fuse_ll_ops[0]))
> > diff --git a/tools/virtiofsd/fuse_lowlevel.h 
> > b/tools/virtiofsd/fuse_lowlevel.h
> > index 3bf786b03485..b5ac42c31799 100644
> > --- a/tools/virtiofsd/fuse_lowlevel.h
> > +++ b/tools/virtiofsd/fuse_lowlevel.h
> > @@ -1225,6 +1225,19 @@ struct fuse_lowlevel_ops {
> >   */
> >  void (*lseek)(fuse_req_t req, fuse_ino_t ino, off_t off, int whence,
> >struct fuse_file_info *fi);
> > +
> > +/**
> > + * Synchronize file system content
> > + *
> > + * If this request is answered with an error code of ENOSYS,
> > + * this is treated as success and future calls to syncfs() will
> > + * succeed automatically without being sent to the filesystem
> > + * process.
> > + *
> > + * @param req request handle
> > + * @param wait whether to wait for outstanding I/Os to complete
> > + */
> > +void (*syncfs)(fuse_req_t req, int wait);
> >  };
> >  
> >  /**
> > diff --git a/tools/virtiofsd/passthrough_ll.c 
> > b/tools/virtiofsd/passthrough_ll.c
> > index 1553d2ef454f..6b66f3208be0 100644
> > --- a/tools/virtiofsd/passthrough_ll.c
> > +++ b/tools/virtiofsd/passthrough_ll.c
> > @@ -3124,6 +3124,34 @@ static void lo_lseek(fuse_req_t req, fuse_ino_t ino, 
> > off_t off, int whence,
> >  }
> >  }
> >  
> > +static void lo_syncfs(fuse_req_t req, int wait)
> > +{
> > +if (wait) {
> > +struct lo_data *lo = lo_data(req);
> > +int fd, ret;
> > +
> > +fd = lo_inode_open(lo, &lo->root, O_RDONLY);
> > +if (fd < 0) {
> > +fuse_reply_err(req, errno);
> > +return;
> > +}
> > +
> > +/*
> > + * FIXME: this is suboptimal because it will also flush unrelated
> > + *writes not coming from the client. This can dramatically
> > + *increase the time spent in syncfs() if some process is
> > + *writing lots of data on the same filesystem as virtiofsd.
> > + */
> > +ret = syncfs(fd);
> > +/* syncfs() never fails on a valid fd */
> 
> Where does this come from? The man page says:
> 
>syncfs() can fail for at least the following reason:
> 
>EBADF  fd is not a valid file descriptor.
> 
> It says "can fail for at least the following reason". It's not ruling out
> failures due to other reasons?
> 
> Also kernel implementation of syscall is as follows.
> 
> SYSCALL_DEFINE1(syncfs, int, fd)
> {
>   if (!f.file)
> return -EBADF;
> sb = f.file->

[PATCH] mirror: stop cancelling in-flight requests on non-force cancel in READY

2021-04-21 Thread Vladimir Sementsov-Ogievskiy
If mirror is READY then the cancel operation does not discard the whole
result of the operation; instead, it's a documented way to get a
point-in-time snapshot of the source disk.

So, we should not cancel any requests if mirror is READY and
force=false. Let's fix that case.

Note that the bug we have before this commit is not critical, as the
only .bdrv_cancel_in_flight implementation is nbd_cancel_in_flight()
and it cancels only requests waiting for reconnection, so it should be
a rare case.

Fixes: 521ff8b779b11c394dbdc43f02e158dd99df308a
Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block_int.h | 2 +-
 include/qemu/job.h| 2 +-
 block/backup.c| 2 +-
 block/mirror.c| 6 --
 job.c | 2 +-
 tests/qemu-iotests/264| 2 +-
 6 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 88e4111939..584381fdb0 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -357,7 +357,7 @@ struct BlockDriver {
  * of in-flight requests, so don't waste the time if possible.
  *
  * One example usage is to avoid waiting for an nbd target node reconnect
- * timeout during job-cancel.
+ * timeout during job-cancel with force=true.
  */
 void (*bdrv_cancel_in_flight)(BlockDriverState *bs);
 
diff --git a/include/qemu/job.h b/include/qemu/job.h
index efc6fa7544..41162ed494 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -254,7 +254,7 @@ struct JobDriver {
 /**
  * If the callback is not NULL, it will be invoked in job_cancel_async
  */
-void (*cancel)(Job *job);
+void (*cancel)(Job *job, bool force);
 
 
 /** Called when the job is freed */
diff --git a/block/backup.c b/block/backup.c
index 6cf2f974aa..bd3614ce70 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -331,7 +331,7 @@ static void coroutine_fn backup_set_speed(BlockJob *job, 
int64_t speed)
 }
 }
 
-static void backup_cancel(Job *job)
+static void backup_cancel(Job *job, bool force)
 {
 BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
 
diff --git a/block/mirror.c b/block/mirror.c
index 5a71bd8bbc..fcd1b56991 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1178,12 +1178,14 @@ static bool mirror_drained_poll(BlockJob *job)
 return !!s->in_flight;
 }
 
-static void mirror_cancel(Job *job)
+static void mirror_cancel(Job *job, bool force)
 {
 MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
 BlockDriverState *target = blk_bs(s->target);
 
-bdrv_cancel_in_flight(target);
+if (force || !job_is_ready(job)) {
+bdrv_cancel_in_flight(target);
+}
 }
 
 static const BlockJobDriver mirror_job_driver = {
diff --git a/job.c b/job.c
index 4aff13d95a..8775c1803b 100644
--- a/job.c
+++ b/job.c
@@ -716,7 +716,7 @@ static int job_finalize_single(Job *job)
 static void job_cancel_async(Job *job, bool force)
 {
 if (job->driver->cancel) {
-job->driver->cancel(job);
+job->driver->cancel(job, force);
 }
 if (job->user_paused) {
 /* Do not call job_enter here, the caller will handle it.  */
diff --git a/tests/qemu-iotests/264 b/tests/qemu-iotests/264
index 4f96825a22..bc431d1a19 100755
--- a/tests/qemu-iotests/264
+++ b/tests/qemu-iotests/264
@@ -95,7 +95,7 @@ class TestNbdReconnect(iotests.QMPTestCase):
 self.assert_qmp(result, 'return', {})
 
 def cancel_job(self):
-result = self.vm.qmp('block-job-cancel', device='drive0')
+result = self.vm.qmp('block-job-cancel', device='drive0', force=True)
 self.assert_qmp(result, 'return', {})
 
 start_t = time.time()
-- 
2.29.2




[PATCH RFC v3 0/8] Introduce Bypass IOMMU Feature

2021-04-21 Thread Wang Xingang
From: Xingang Wang 

These patches add support for configuring bypass_iommu on/off for
pci root buses, including the primary bus and pxb root buses. At present,
all root buses go through the iommu when an iommu is configured,
which is not flexible, because in many situations the need to use the
iommu and to bypass the iommu often exists at the same time.

So this adds an option to enable/disable bypass_iommu for the primary bus
and pxb root buses. The bypass_iommu property is set to false by default,
meaning that devices will go through the iommu if no explicit configuration
is added. When bypass_iommu is enabled for a root bus, devices
attached to it will bypass the iommu, otherwise devices will go through
the iommu.

This feature can be used in this manner:
arm: -machine virt,iommu=smmuv3,bypass_iommu=true
x86: -machine q35,bypass_iommu=true
pxb: -device pxb-pcie,bus_nr=0x10,id=pci.10,bus=pcie.0,bypass_iommu=true 

History:

v2 -> v3:
- rebase on top of v6.0.0-rc4
- Took into account Eric's comments, replace with a bypass_iommu
  property
- When building the IORT idmap, cover the whole RID space

v1 -> v2:
- rebase on top of v6.0.0-rc0
- Fix some issues
- Took into account Eric's comments, and remove the PCI_BUS_IOMMU flag,
  replace it with a property in PCIHostState.
- Add support for x86 iommu option

Xingang Wang (8):
  hw/pci/pci_host: Allow bypass iommu for pci host
  hw/pxb: Add a bypass iommu property
  hw/arm/virt: Add a machine option to bypass iommu for primary bus
  hw/i386: Add a pc machine option to bypass iommu for primary bus
  hw/pci: Add pci_bus_range to get bus number range
  hw/arm/virt-acpi-build: Add explicit IORT idmap for smmuv3 node
  hw/i386/acpi-build: Add explicit scope in DMAR table
  hw/i386/acpi-build: Add bypass_iommu check when building IVRS table

 hw/arm/virt-acpi-build.c            | 128
 hw/arm/virt.c                       |  26
 hw/i386/acpi-build.c                |  70
 hw/i386/pc.c                        |  18
 hw/pci-bridge/pci_expander_bridge.c |   3
 hw/pci-host/q35.c                   |   1
 hw/pci/pci.c                        |  33
 hw/pci/pci_host.c                   |   2
 include/hw/arm/virt.h               |   1
 include/hw/i386/pc.h                |   1
 include/hw/pci/pci.h                |   2
 include/hw/pci/pci_host.h           |   1
 12 files changed, 263 insertions(+), 23 deletions(-)

-- 
2.19.1




[PATCH RFC v3 1/8] hw/pci/pci_host: Allow bypass iommu for pci host

2021-04-21 Thread Wang Xingang
From: Xingang Wang 

This adds a bypass_iommu property for the pci host, which indicates
whether devices attached to the pci root bus will bypass the iommu.
In pci_device_iommu_address_space(), add a bypass_iommu check
to avoid getting the iommu address space for devices that bypass the iommu.

Signed-off-by: Xingang Wang 
Signed-off-by: Jiahui Cen 
---
 hw/pci/pci.c  | 18 +-
 hw/pci/pci_host.c |  2 ++
 include/hw/pci/pci.h  |  1 +
 include/hw/pci/pci_host.h |  1 +
 4 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 8f35e13a0c..301addfb35 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -417,6 +417,22 @@ const char *pci_root_bus_path(PCIDevice *dev)
 return rootbus->qbus.name;
 }
 
+bool pci_bus_bypass_iommu(PCIBus *bus)
+{
+PCIBus *rootbus = bus;
+PCIHostState *host_bridge;
+
+if (!pci_bus_is_root(bus)) {
+rootbus = pci_device_root_bus(bus->parent_dev);
+}
+
+host_bridge = PCI_HOST_BRIDGE(rootbus->qbus.parent);
+
+assert(host_bridge->bus == rootbus);
+
+return host_bridge->bypass_iommu;
+}
+
 static void pci_root_bus_init(PCIBus *bus, DeviceState *parent,
   MemoryRegion *address_space_mem,
   MemoryRegion *address_space_io,
@@ -2719,7 +2735,7 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice 
*dev)
 
 iommu_bus = parent_bus;
 }
-if (iommu_bus && iommu_bus->iommu_fn) {
+if (!pci_bus_bypass_iommu(bus) && iommu_bus && iommu_bus->iommu_fn) {
 return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
 }
 return &address_space_memory;
diff --git a/hw/pci/pci_host.c b/hw/pci/pci_host.c
index 8ca5fadcbd..2768db53e6 100644
--- a/hw/pci/pci_host.c
+++ b/hw/pci/pci_host.c
@@ -222,6 +222,8 @@ const VMStateDescription vmstate_pcihost = {
 static Property pci_host_properties_common[] = {
 DEFINE_PROP_BOOL("x-config-reg-migration-enabled", PCIHostState,
  mig_enabled, true),
+DEFINE_PROP_BOOL("pci-host-bypass-iommu", PCIHostState,
+ bypass_iommu, false),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 6be4e0c460..f4d51b672b 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -480,6 +480,7 @@ void pci_for_each_bus(PCIBus *bus,
 
 PCIBus *pci_device_root_bus(const PCIDevice *d);
 const char *pci_root_bus_path(PCIDevice *dev);
+bool pci_bus_bypass_iommu(PCIBus *bus);
 PCIDevice *pci_find_device(PCIBus *bus, int bus_num, uint8_t devfn);
 int pci_qdev_find_device(const char *id, PCIDevice **pdev);
 void pci_bus_get_w64_range(PCIBus *bus, Range *range);
diff --git a/include/hw/pci/pci_host.h b/include/hw/pci/pci_host.h
index 52e038c019..c6f4eb4585 100644
--- a/include/hw/pci/pci_host.h
+++ b/include/hw/pci/pci_host.h
@@ -43,6 +43,7 @@ struct PCIHostState {
 uint32_t config_reg;
 bool mig_enabled;
 PCIBus *bus;
+bool bypass_iommu;
 
 QLIST_ENTRY(PCIHostState) next;
 };
-- 
2.19.1




[PATCH RFC v3 4/8] hw/i386: Add a pc machine option to bypass iommu for primary bus

2021-04-21 Thread Wang Xingang
From: Xingang Wang 

Add a bypass_iommu pc machine option to bypass iommu translation
for the primary root bus.
The option can be used in this manner:
qemu-system-x86_64 -machine q35,bypass_iommu=true

Signed-off-by: Xingang Wang 
Signed-off-by: Jiahui Cen 
---
 hw/i386/pc.c | 18 ++
 hw/pci-host/q35.c|  1 +
 include/hw/i386/pc.h |  1 +
 3 files changed, 20 insertions(+)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 8a84b25a03..2266a0520f 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1529,6 +1529,20 @@ static void pc_machine_set_hpet(Object *obj, bool value, 
Error **errp)
 pcms->hpet_enabled = value;
 }
 
+static bool pc_machine_get_bypass_iommu(Object *obj, Error **errp)
+{
+PCMachineState *pcms = PC_MACHINE(obj);
+
+return pcms->bypass_iommu;
+}
+
+static void pc_machine_set_bypass_iommu(Object *obj, bool value, Error **errp)
+{
+PCMachineState *pcms = PC_MACHINE(obj);
+
+pcms->bypass_iommu = value;
+}
+
 static void pc_machine_get_max_ram_below_4g(Object *obj, Visitor *v,
 const char *name, void *opaque,
 Error **errp)
@@ -1628,6 +1642,7 @@ static void pc_machine_initfn(Object *obj)
 #ifdef CONFIG_HPET
 pcms->hpet_enabled = true;
 #endif
+pcms->bypass_iommu = false;
 
 pc_system_flash_create(pcms);
 pcms->pcspk = isa_new(TYPE_PC_SPEAKER);
@@ -1752,6 +1767,9 @@ static void pc_machine_class_init(ObjectClass *oc, void 
*data)
 object_class_property_add_bool(oc, "hpet",
 pc_machine_get_hpet, pc_machine_set_hpet);
 
+object_class_property_add_bool(oc, "bypass_iommu",
+pc_machine_get_bypass_iommu, pc_machine_set_bypass_iommu);
+
 object_class_property_add(oc, PC_MACHINE_MAX_FW_SIZE, "size",
 pc_machine_get_max_fw_size, pc_machine_set_max_fw_size,
 NULL, NULL);
diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index 2eb729dff5..ade05a5539 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -64,6 +64,7 @@ static void q35_host_realize(DeviceState *dev, Error **errp)
 s->mch.address_space_io,
 0, TYPE_PCIE_BUS);
 PC_MACHINE(qdev_get_machine())->bus = pci->bus;
+pci->bypass_iommu = PC_MACHINE(qdev_get_machine())->bypass_iommu;
 qdev_realize(DEVICE(&s->mch), BUS(pci->bus), &error_fatal);
 }
 
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index dcf060b791..83ee8f2a01 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -45,6 +45,7 @@ typedef struct PCMachineState {
 bool sata_enabled;
 bool pit_enabled;
 bool hpet_enabled;
+bool bypass_iommu;
 uint64_t max_fw_size;
 
 /* NUMA information: */
-- 
2.19.1




[PATCH RFC v3 3/8] hw/arm/virt: Add a machine option to bypass iommu for primary bus

2021-04-21 Thread Wang Xingang
From: Xingang Wang 

This adds a bypass_iommu option for the arm virt machine;
the option can be used in this manner:
qemu -machine virt,iommu=smmuv3,bypass_iommu=true

Signed-off-by: Xingang Wang 
Signed-off-by: Jiahui Cen 
---
 hw/arm/virt.c | 26 ++
 include/hw/arm/virt.h |  1 +
 2 files changed, 27 insertions(+)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 9f01d9041b..0ce6167aab 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1366,6 +1366,7 @@ static void create_pcie(VirtMachineState *vms)
 }
 
 pci = PCI_HOST_BRIDGE(dev);
+pci->bypass_iommu = vms->bypass_iommu;
 vms->bus = pci->bus;
 if (vms->bus) {
 for (i = 0; i < nb_nics; i++) {
@@ -2319,6 +2320,21 @@ static void virt_set_iommu(Object *obj, const char 
*value, Error **errp)
 }
 }
 
+static bool virt_get_bypass_iommu(Object *obj, Error **errp)
+{
+VirtMachineState *vms = VIRT_MACHINE(obj);
+
+return vms->bypass_iommu;
+}
+
+static void virt_set_bypass_iommu(Object *obj, bool value,
+  Error **errp)
+{
+VirtMachineState *vms = VIRT_MACHINE(obj);
+
+vms->bypass_iommu = value;
+}
+
 static CpuInstanceProperties
 virt_cpu_index_to_props(MachineState *ms, unsigned cpu_index)
 {
@@ -2656,6 +2672,13 @@ static void virt_machine_class_init(ObjectClass *oc, 
void *data)
   "Set the IOMMU type. "
   "Valid values are none and smmuv3");
 
+object_class_property_add_bool(oc, "bypass_iommu",
+   virt_get_bypass_iommu,
+   virt_set_bypass_iommu);
+object_class_property_set_description(oc, "bypass_iommu",
+  "Set on/off to enable/disable "
+  "bypass_iommu for primary bus");
+
 object_class_property_add_bool(oc, "ras", virt_get_ras,
virt_set_ras);
 object_class_property_set_description(oc, "ras",
@@ -2723,6 +2746,9 @@ static void virt_instance_init(Object *obj)
 /* Default disallows iommu instantiation */
 vms->iommu = VIRT_IOMMU_NONE;
 
+/* The primary bus is attached to iommu by default */
+vms->bypass_iommu = false;
+
 /* Default disallows RAS instantiation */
 vms->ras = false;
 
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index 921416f918..82bceadb82 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -147,6 +147,7 @@ struct VirtMachineState {
 OnOffAuto acpi;
 VirtGICType gic_version;
 VirtIOMMUType iommu;
+bool bypass_iommu;
 VirtMSIControllerType msi_controller;
 uint16_t virtio_iommu_bdf;
 struct arm_boot_info bootinfo;
-- 
2.19.1




[PATCH RFC v3 8/8] hw/i386/acpi-build: Add bypass_iommu check when building IVRS table

2021-04-21 Thread Wang Xingang
From: Xingang Wang 

When building the IVRS table, only devices which go through the iommu
will be scanned, and the corresponding IVHD entries will be inserted.

Signed-off-by: Xingang Wang 
Signed-off-by: Jiahui Cen 
---
 hw/i386/acpi-build.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index fdb26682cb..71fb95737c 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2229,7 +2229,7 @@ ivrs_host_bridges(Object *obj, void *opaque)
 if (object_dynamic_cast(obj, TYPE_PCI_HOST_BRIDGE)) {
 PCIBus *bus = PCI_HOST_BRIDGE(obj)->bus;
 
-if (bus) {
+if (bus && !pci_bus_bypass_iommu(bus)) {
 pci_for_each_device(bus, pci_bus_num(bus), insert_ivhd, ivhd_blob);
 }
 }
-- 
2.19.1




[PATCH RFC v3 7/8] hw/i386/acpi-build: Add explicit scope in DMAR table

2021-04-21 Thread Wang Xingang
From: Xingang Wang 

In the DMAR table, the drhd is set to cover all pci devices when intel_iommu
is on. This patch adds explicit scope data, including only the pci devices
that go through the iommu.

Signed-off-by: Xingang Wang 
Signed-off-by: Jiahui Cen 
---
 hw/i386/acpi-build.c | 68 ++--
 1 file changed, 66 insertions(+), 2 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index de98750aef..fdb26682cb 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -1988,6 +1988,56 @@ build_srat(GArray *table_data, BIOSLinker *linker, 
MachineState *machine)
  x86ms->oem_table_id);
 }
 
+/*
+ * Insert DMAR scope for PCI bridges and endpoint devices
+ */
+static void
+insert_scope(PCIBus *bus, PCIDevice *dev, void *opaque)
+{
+GArray *scope_blob = opaque;
+AcpiDmarDeviceScope *scope = NULL;
+
+if (object_dynamic_cast(OBJECT(dev), TYPE_PCI_BRIDGE)) {
+/* Dmar Scope Type: 0x02 for PCI Bridge */
+build_append_int_noprefix(scope_blob, 0x02, 1);
+} else {
+/* Dmar Scope Type: 0x01 for PCI Endpoint Device */
+build_append_int_noprefix(scope_blob, 0x01, 1);
+}
+
+/* length */
+build_append_int_noprefix(scope_blob,
+  sizeof(*scope) + sizeof(scope->path[0]), 1);
+/* reserved */
+build_append_int_noprefix(scope_blob, 0, 2);
+/* enumeration_id */
+build_append_int_noprefix(scope_blob, 0, 1);
+/* bus */
+build_append_int_noprefix(scope_blob, pci_bus_num(bus), 1);
+/* device */
+build_append_int_noprefix(scope_blob, PCI_SLOT(dev->devfn), 1);
+/* function */
+build_append_int_noprefix(scope_blob, PCI_FUNC(dev->devfn), 1);
+}
+
+/* For a given PCI host bridge, walk and insert DMAR scope */
+static int
+dmar_host_bridges(Object *obj, void *opaque)
+{
+GArray *scope_blob = opaque;
+
+if (object_dynamic_cast(obj, TYPE_PCI_HOST_BRIDGE)) {
+PCIBus *bus = PCI_HOST_BRIDGE(obj)->bus;
+
+if (bus && !pci_bus_bypass_iommu(bus)) {
+pci_for_each_device(bus, pci_bus_num(bus), insert_scope,
+scope_blob);
+}
+}
+
+return 0;
+}
+
 /*
  * VT-d spec 8.1 DMA Remapping Reporting Structure
  * (version Oct. 2014 or later)
@@ -2007,6 +2057,15 @@ build_dmar_q35(GArray *table_data, BIOSLinker *linker, 
const char *oem_id,
 /* Root complex IOAPIC use one path[0] only */
 size_t ioapic_scope_size = sizeof(*scope) + sizeof(scope->path[0]);
 IntelIOMMUState *intel_iommu = INTEL_IOMMU_DEVICE(iommu);
+GArray *scope_blob = g_array_new(false, true, 1);
+
+/*
+ * A PCI bus walk, for each PCI host bridge.
+ * Insert scope for each PCI bridge and endpoint device which
+ * is attached to a bus with iommu enabled.
+ */
+object_child_foreach_recursive(object_get_root(),
+   dmar_host_bridges, scope_blob);
 
 assert(iommu);
 if (x86_iommu_ir_supported(iommu)) {
@@ -2020,8 +2079,9 @@ build_dmar_q35(GArray *table_data, BIOSLinker *linker, 
const char *oem_id,
 /* DMAR Remapping Hardware Unit Definition structure */
 drhd = acpi_data_push(table_data, sizeof(*drhd) + ioapic_scope_size);
 drhd->type = cpu_to_le16(ACPI_DMAR_TYPE_HARDWARE_UNIT);
-drhd->length = cpu_to_le16(sizeof(*drhd) + ioapic_scope_size);
-drhd->flags = ACPI_DMAR_INCLUDE_PCI_ALL;
+drhd->length =
+cpu_to_le16(sizeof(*drhd) + ioapic_scope_size + scope_blob->len);
+drhd->flags = 0; /* Don't include all pci devices */
 drhd->pci_segment = cpu_to_le16(0);
 drhd->address = cpu_to_le64(Q35_HOST_BRIDGE_IOMMU_ADDR);
 
@@ -2035,6 +2095,10 @@ build_dmar_q35(GArray *table_data, BIOSLinker *linker, 
const char *oem_id,
 scope->path[0].device = PCI_SLOT(Q35_PSEUDO_DEVFN_IOAPIC);
 scope->path[0].function = PCI_FUNC(Q35_PSEUDO_DEVFN_IOAPIC);
 
+/* Add scope found above */
+g_array_append_vals(table_data, scope_blob->data, scope_blob->len);
+g_array_free(scope_blob, true);
+
 if (iommu->dt_supported) {
 atsr = acpi_data_push(table_data, sizeof(*atsr));
 atsr->type = cpu_to_le16(ACPI_DMAR_TYPE_ATSR);
-- 
2.19.1




[PATCH RFC v3 2/8] hw/pxb: Add a bypass iommu property

2021-04-21 Thread Wang Xingang
From: Xingang Wang 

This adds a bypass_iommu property for pci_expander_bridge.
The property can be used as follows:
qemu -device pxb-pcie,bus_nr=0x10,addr=0x1,bypass_iommu=true

Signed-off-by: Xingang Wang 
Signed-off-by: Jiahui Cen 
---
 hw/pci-bridge/pci_expander_bridge.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/hw/pci-bridge/pci_expander_bridge.c 
b/hw/pci-bridge/pci_expander_bridge.c
index aedded1064..7112dc3062 100644
--- a/hw/pci-bridge/pci_expander_bridge.c
+++ b/hw/pci-bridge/pci_expander_bridge.c
@@ -57,6 +57,7 @@ struct PXBDev {
 
 uint8_t bus_nr;
 uint16_t numa_node;
+bool bypass_iommu;
 };
 
 static PXBDev *convert_to_pxb(PCIDevice *dev)
@@ -255,6 +256,7 @@ static void pxb_dev_realize_common(PCIDevice *dev, bool 
pcie, Error **errp)
 bus->map_irq = pxb_map_irq_fn;
 
 PCI_HOST_BRIDGE(ds)->bus = bus;
+PCI_HOST_BRIDGE(ds)->bypass_iommu = pxb->bypass_iommu;
 
 pxb_register_bus(dev, bus, &local_err);
 if (local_err) {
@@ -301,6 +303,7 @@ static Property pxb_dev_properties[] = {
 /* Note: 0 is not a legal PXB bus number. */
 DEFINE_PROP_UINT8("bus_nr", PXBDev, bus_nr, 0),
 DEFINE_PROP_UINT16("numa_node", PXBDev, numa_node, NUMA_NODE_UNASSIGNED),
+DEFINE_PROP_BOOL("bypass_iommu", PXBDev, bypass_iommu, false),
 DEFINE_PROP_END_OF_LIST(),
 };
 
-- 
2.19.1




[PATCH RFC v3 5/8] hw/pci: Add pci_bus_range to get bus number range

2021-04-21 Thread Wang Xingang
From: Xingang Wang 

This helps to get the bus number range of a pci bridge hierarchy.

Signed-off-by: Xingang Wang 
Signed-off-by: Jiahui Cen 
---
 hw/pci/pci.c | 15 +++
 include/hw/pci/pci.h |  1 +
 2 files changed, 16 insertions(+)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 301addfb35..2ac3b8d76c 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -538,6 +538,21 @@ int pci_bus_num(PCIBus *s)
 return PCI_BUS_GET_CLASS(s)->bus_num(s);
 }
 
+void pci_bus_range(PCIBus *bus, int *min_bus, int *max_bus)
+{
+int i;
+*min_bus = *max_bus = pci_bus_num(bus);
+
+for (i = 0; i < ARRAY_SIZE(bus->devices); ++i) {
+PCIDevice *dev = bus->devices[i];
+
+if (dev && PCI_DEVICE_GET_CLASS(dev)->is_bridge) {
+*min_bus = MIN(*min_bus, dev->config[PCI_SECONDARY_BUS]);
+*max_bus = MAX(*max_bus, dev->config[PCI_SUBORDINATE_BUS]);
+}
+}
+}
+
 int pci_bus_numa_node(PCIBus *bus)
 {
 return PCI_BUS_GET_CLASS(bus)->numa_node(bus);
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index f4d51b672b..d0f4266e37 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -450,6 +450,7 @@ static inline PCIBus *pci_get_bus(const PCIDevice *dev)
 return PCI_BUS(qdev_get_parent_bus(DEVICE(dev)));
 }
 int pci_bus_num(PCIBus *s);
+void pci_bus_range(PCIBus *bus, int *min_bus, int *max_bus);
 static inline int pci_dev_bus_num(const PCIDevice *dev)
 {
 return pci_bus_num(pci_get_bus(dev));
-- 
2.19.1




[PATCH RFC v3 6/8] hw/arm/virt-acpi-build: Add explicit IORT idmap for smmuv3 node

2021-04-21 Thread Wang Xingang
From: Xingang Wang 

This adds explicit IORT idmap info according to the pci root bus number
range, and only adds an smmu idmap for those buses which do not bypass
the iommu.

For idmaps directly to the ITS node, this splits the whole RID mapping
into smmu idmaps and ITS idmaps. So this should cover the whole RID space
for buses that go through or bypass the SMMUv3 node.

Signed-off-by: Xingang Wang 
Signed-off-by: Jiahui Cen 
---
 hw/arm/virt-acpi-build.c | 128
 1 file changed, 109 insertions(+), 19 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 60fe2e65a7..661b84edec 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -44,6 +44,7 @@
 #include "hw/acpi/tpm.h"
 #include "hw/pci/pcie_host.h"
 #include "hw/pci/pci.h"
+#include "hw/pci/pci_bus.h"
 #include "hw/pci-host/gpex.h"
 #include "hw/arm/virt.h"
 #include "hw/mem/nvdimm.h"
@@ -237,6 +238,41 @@ static void acpi_dsdt_add_tpm(Aml *scope, VirtMachineState 
*vms)
 aml_append(scope, dev);
 }
 
+/* Build the iort ID mapping to SMMUv3 for a given PCI host bridge */
+static int
+iort_host_bridges(Object *obj, void *opaque)
+{
+GArray *idmap_blob = opaque;
+
+if (object_dynamic_cast(obj, TYPE_PCI_HOST_BRIDGE)) {
+PCIBus *bus = PCI_HOST_BRIDGE(obj)->bus;
+
+if (bus && !pci_bus_bypass_iommu(bus)) {
+int min_bus, max_bus;
+pci_bus_range(bus, &min_bus, &max_bus);
+
+AcpiIortIdMapping idmap = {
+.input_base = cpu_to_le32(min_bus << 8),
+.id_count = cpu_to_le32((max_bus - min_bus + 1) << 8),
+.output_base = cpu_to_le32(min_bus << 8),
+.output_reference = cpu_to_le32(0),
+.flags = cpu_to_le32(0),
+};
+g_array_append_val(idmap_blob, idmap);
+}
+}
+
+return 0;
+}
+
+static int smmu_idmap_sort_func(gconstpointer a, gconstpointer b)
+{
+AcpiIortIdMapping *idmap_a = (AcpiIortIdMapping *)a;
+AcpiIortIdMapping *idmap_b = (AcpiIortIdMapping *)b;
+
+return idmap_a->input_base - idmap_b->input_base;
+}
+
 static void
 build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
 {
@@ -247,6 +283,45 @@ build_iort(GArray *table_data, BIOSLinker *linker, 
VirtMachineState *vms)
 AcpiIortSmmu3 *smmu;
 size_t node_size, iort_node_offset, iort_length, smmu_offset = 0;
 AcpiIortRC *rc;
+uint32_t base, i, rc_map_count;
+GArray *smmu_idmap_blob =
+g_array_new(false, true, sizeof(AcpiIortIdMapping));
+GArray *its_idmap_blob =
+g_array_new(false, true, sizeof(AcpiIortIdMapping));
+
+object_child_foreach_recursive(object_get_root(),
+   iort_host_bridges, smmu_idmap_blob);
+
+g_array_sort(smmu_idmap_blob, smmu_idmap_sort_func);
+
+/* Build the iort ID mapping to ITS directly */
+i = 0, base = 0;
+while (base < 0xffff && i <= smmu_idmap_blob->len) {
+AcpiIortIdMapping new_idmap = {
+.input_base = cpu_to_le32(base),
+.id_count = cpu_to_le32(0),
+.output_base = cpu_to_le32(base),
+.output_reference = cpu_to_le32(0),
+.flags = cpu_to_le32(0),
+};
+
+if (i == smmu_idmap_blob->len) {
+if (base < 0xffff) {
+new_idmap.id_count = cpu_to_le32(0xffff - base);
+g_array_append_val(its_idmap_blob, new_idmap);
+}
+break;
+}
+
+idmap = &g_array_index(smmu_idmap_blob, AcpiIortIdMapping, i);
+if (base < idmap->input_base) {
+new_idmap.id_count = cpu_to_le32(idmap->input_base - base);
+g_array_append_val(its_idmap_blob, new_idmap);
+}
+
+i++;
+base = idmap->input_base + idmap->id_count;
+}
 
 iort = acpi_data_push(table_data, sizeof(*iort));
 
@@ -280,13 +355,13 @@ build_iort(GArray *table_data, BIOSLinker *linker, 
VirtMachineState *vms)
 
 /* SMMUv3 node */
 smmu_offset = iort_node_offset + node_size;
-node_size = sizeof(*smmu) + sizeof(*idmap);
+node_size = sizeof(*smmu) + sizeof(*idmap) * smmu_idmap_blob->len;
 iort_length += node_size;
 smmu = acpi_data_push(table_data, node_size);
 
 smmu->type = ACPI_IORT_NODE_SMMU_V3;
 smmu->length = cpu_to_le16(node_size);
-smmu->mapping_count = cpu_to_le32(1);
+smmu->mapping_count = cpu_to_le32(smmu_idmap_blob->len);
 smmu->mapping_offset = cpu_to_le32(sizeof(*smmu));
 smmu->base_address = cpu_to_le64(vms->memmap[VIRT_SMMU].base);
 smmu->flags = cpu_to_le32(ACPI_IORT_SMMU_V3_COHACC_OVERRIDE);
@@ -295,23 +370,24 @@ build_iort(GArray *table_data, BIOSLinker *linker, 
VirtMachineState *vms)
 smmu->sync_gsiv = cpu_to_le32(irq + 2);
 smmu->gerr_gsiv = cpu_to_le32(irq + 3);
 
-/* Identity RID mapping covering the whole input RID range */
-idmap = &smmu->id_mapping_array[

[Bug 1924669] Re: VFP code cannot see CPACR write in the same TB

2021-04-21 Thread Hansni Bu
Sorry, it's because an "ISB" is missing after CPACR is changed. Not a bug
in qemu.

** Changed in: qemu
   Status: New => Invalid

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1924669

Title:
  VFP code cannot see CPACR write in the same TB

Status in QEMU:
  Invalid

Bug description:
  If the FPU is enabled by writing to CPACR, and the code is in the same
  translation block as the following VFP code, qemu generates "v7M NOCP
  UsageFault".

  This can be reproduced with git HEAD (commit
  8fe9f1f891eff4e37f82622b7480ee748bf4af74).

  The target binary is attached. The qemu command is:
  qemu-system-arm -nographic -monitor null -serial null -semihosting -machine 
mps2-an505 -cpu cortex-m33 -kernel cpacr_vfp.elf -d 
in_asm,int,exec,cpu,cpu_reset,unimp,guest_errors,nochain -D log

  If the code is changed a little, so that the two are not in the same
  block, the VFP code can see the effect of CPACR; -singlestep of qemu
  has the same result.
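
  For reference, the architecturally required sequence is a context-
  synchronizing barrier between the CPACR write and the first VFP
  instruction; a sketch using the standard CMSIS names:

      /* Enable CP10/CP11 (the FPU) and synchronize before any VFP code. */
      SCB->CPACR |= (0xFU << 20);
      __DSB();
      __ISB(); /* required before the VFP instructions that follow */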

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1924669/+subscriptions



Re: [PATCH v4 09/23] job: call job_enter from job_pause

2021-04-21 Thread Vladimir Sementsov-Ogievskiy

07.04.2021 14:38, Vladimir Sementsov-Ogievskiy wrote:

07.04.2021 14:19, Max Reitz wrote:

On 16.01.21 22:46, Vladimir Sementsov-Ogievskiy wrote:

If main job coroutine called job_yield (while some background process
is in progress), we should give it a chance to call job_pause_point().
It will be used in backup, when moved on async block-copy.

Note, that job_user_pause is not enough: we want to handle
child_job_drained_begin() as well, which call job_pause().

Still, if job is already in job_do_yield() in job_pause_point() we
should not enter it.

iotest 109 output is modified: on stop we do bdrv_drain_all() which now
triggers job pause immediately (and pause after ready is standby).

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  job.c  |  3 +++
  tests/qemu-iotests/109.out | 24 
  2 files changed, 27 insertions(+)


While looking into

https://lists.gnu.org/archive/html/qemu-block/2021-04/msg00035.html

I noticed this:

$ ./qemu-img create -f raw src.img 1G
$ ./qemu-img create -f raw dst.img 1G

$ (echo '
    {"execute":"qmp_capabilities"}
    {"execute":"blockdev-mirror",
 "arguments":{"job-id":"mirror",
  "device":"source",
  "target":"target",
  "sync":"full",
  "filter-node-name":"mirror-top"}}
'; sleep 3; echo '
    {"execute":"human-monitor-command",
 "arguments":{"command-line":
  "qemu-io mirror-top \"write 0 1G\""}}') \
| x86_64-softmmu/qemu-system-x86_64 \
    -qmp stdio \
    -blockdev file,node-name=source,filename=src.img \
    -blockdev file,node-name=target,filename=dst.img \
    -object iothread,id=iothr0 \
    -device virtio-blk,drive=source,iothread=iothr0

Before this commit, qemu-io returned an error that there is a permission 
conflict with virtio-blk.  After this commit, there is an abort (“qemu: 
qemu_mutex_unlock_impl: Operation not permitted”):

#0  0x7f8445a4eef5 in raise () at /usr/lib/libc.so.6
#1  0x7f8445a38862 in abort () at /usr/lib/libc.so.6
#2  0x55fbb14a36bf in error_exit
    (err=, msg=msg@entry=0x55fbb1634790 <__func__.27> 
"qemu_mutex_unlock_impl")
    at ../util/qemu-thread-posix.c:37
#3  0x55fbb14a3bc3 in qemu_mutex_unlock_impl
    (mutex=mutex@entry=0x55fbb25ab6e0, file=file@entry=0x55fbb1636957 
"../util/async.c", line=line@entry=650)
    at ../util/qemu-thread-posix.c:109
#4  0x55fbb14b2e75 in aio_context_release (ctx=ctx@entry=0x55fbb25ab680) at 
../util/async.c:650
#5  0x55fbb13d2029 in bdrv_do_drained_begin
    (bs=bs@entry=0x55fbb3a87000, recursive=recursive@entry=false, 
parent=parent@entry=0x0, ignore_bds_parents=ignore_bds_parents@entry=false, 
poll=poll@entry=true) at ../block/io.c:441
#6  0x55fbb13d2192 in bdrv_do_drained_begin
    (poll=true, ignore_bds_parents=false, parent=0x0, recursive=false, 
bs=0x55fbb3a87000) at ../block/io.c:448
#7  0x55fbb13c71a7 in blk_drain (blk=0x55fbb26c5a00) at 
../block/block-backend.c:1718
#8  0x55fbb13c8bbd in blk_unref (blk=0x55fbb26c5a00) at 
../block/block-backend.c:498
#9  blk_unref (blk=0x55fbb26c5a00) at ../block/block-backend.c:491
#10 0x55fbb1024863 in hmp_qemu_io (mon=0x7fffaf3fc7d0, qdict=)
    at ../block/monitor/block-hmp-cmds.c:628

Can you make anything out of this?



Hmm.. Interesting.

man pthread_mutex_unlock
...
     EPERM  The  mutex type is PTHREAD_MUTEX_ERRORCHECK or 
PTHREAD_MUTEX_RECURSIVE, or the mutex is a
  robust mutex, and the current thread does not own the mutex.

So, thread doesn't own the mutex.. We have an iothread here.

AIO_WAIT_WHILE() documents that ctx must be acquired exactly once by the
caller. But I don't see where it is acquired in the call stack.

The other question is why the permission conflict is lost with the commit.
Strange. I see that hmp_qemu_io creates blk with perm=0 and
shared=BLK_PERM_ALL. How could it conflict even before the considered commit?




Sorry, I answered and then forgot about this thread. Now, looking through my
series, I found this again. It seems the problem is really a lack of
aio-context locking around blk_unref(). I'll send a patch now.


--
Best regards,
Vladimir



[PATCH] monitor: hmp_qemu_io: acquire aio contex, fix crash

2021-04-21 Thread Vladimir Sementsov-Ogievskiy
Max reported the following bug:

$ ./qemu-img create -f raw src.img 1G
$ ./qemu-img create -f raw dst.img 1G

$ (echo '
   {"execute":"qmp_capabilities"}
   {"execute":"blockdev-mirror",
"arguments":{"job-id":"mirror",
 "device":"source",
 "target":"target",
 "sync":"full",
 "filter-node-name":"mirror-top"}}
'; sleep 3; echo '
   {"execute":"human-monitor-command",
"arguments":{"command-line":
 "qemu-io mirror-top \"write 0 1G\""}}') \
| x86_64-softmmu/qemu-system-x86_64 \
   -qmp stdio \
   -blockdev file,node-name=source,filename=src.img \
   -blockdev file,node-name=target,filename=dst.img \
   -object iothread,id=iothr0 \
   -device virtio-blk,drive=source,iothread=iothr0

crashes:

0  raise () at /usr/lib/libc.so.6
1  abort () at /usr/lib/libc.so.6
2  error_exit
   (err=,
   msg=msg@entry=0x55fbb1634790 <__func__.27> "qemu_mutex_unlock_impl")
   at ../util/qemu-thread-posix.c:37
3  qemu_mutex_unlock_impl
   (mutex=mutex@entry=0x55fbb25ab6e0,
   file=file@entry=0x55fbb1636957 "../util/async.c",
   line=line@entry=650)
   at ../util/qemu-thread-posix.c:109
4  aio_context_release (ctx=ctx@entry=0x55fbb25ab680) at ../util/async.c:650
5  bdrv_do_drained_begin
   (bs=bs@entry=0x55fbb3a87000, recursive=recursive@entry=false,
   parent=parent@entry=0x0,
   ignore_bds_parents=ignore_bds_parents@entry=false,
   poll=poll@entry=true) at ../block/io.c:441
6  bdrv_do_drained_begin
   (poll=true, ignore_bds_parents=false, parent=0x0, recursive=false,
   bs=0x55fbb3a87000) at ../block/io.c:448
7  blk_drain (blk=0x55fbb26c5a00) at ../block/block-backend.c:1718
8  blk_unref (blk=0x55fbb26c5a00) at ../block/block-backend.c:498
9  blk_unref (blk=0x55fbb26c5a00) at ../block/block-backend.c:491
10 hmp_qemu_io (mon=0x7fffaf3fc7d0, qdict=)
   at ../block/monitor/block-hmp-cmds.c:628

man pthread_mutex_unlock
...
EPERM  The  mutex type is PTHREAD_MUTEX_ERRORCHECK or
PTHREAD_MUTEX_RECURSIVE, or the mutex is a robust mutex, and the
current thread does not own the mutex.

So, the thread doesn't own the mutex. And we have an iothread here.

Next, note that AIO_WAIT_WHILE() documents that ctx must be acquired
exactly once by the caller. But where is it acquired in the call stack?
Seemingly nowhere.

qemuio_command does acquire the aio context, but we need the context
acquired around blk_unref() as well. Let's do it.

Reported-by: Max Reitz 
Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/monitor/block-hmp-cmds.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index ebf1033f31..934100d0eb 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -559,6 +559,7 @@ void hmp_qemu_io(Monitor *mon, const QDict *qdict)
 {
 BlockBackend *blk;
 BlockBackend *local_blk = NULL;
+AioContext *ctx;
 bool qdev = qdict_get_try_bool(qdict, "qdev", false);
 const char *device = qdict_get_str(qdict, "device");
 const char *command = qdict_get_str(qdict, "command");
@@ -615,7 +616,13 @@ void hmp_qemu_io(Monitor *mon, const QDict *qdict)
 qemuio_command(blk, command);
 
 fail:
+ctx = blk_get_aio_context(blk);
+aio_context_acquire(ctx);
+
 blk_unref(local_blk);
+
+aio_context_release(ctx);
+
 hmp_handle_error(mon, err);
 }
 
-- 
2.29.2




Re: [PATCH 0/2] i386: Fix interrupt based Async PF enablement

2021-04-21 Thread Vitaly Kuznetsov
Eduardo Habkost  writes:

> On Thu, Apr 15, 2021 at 08:14:30PM +0100, Dr. David Alan Gilbert wrote:
>> * Paolo Bonzini (pbonz...@redhat.com) wrote:
>> > On 06/04/21 13:42, Vitaly Kuznetsov wrote:
>> > > older machine types are still available (I disable it for <= 5.1 but we
>> > > can consider disabling it for 5.2 too). The feature is upstream since
>> > > Linux 5.8, I know that QEMU supports much older kernels but this doesn't
>> > > probably mean that we can't enable new KVM PV features unless all
>> > > supported kernels have it, we'd have to wait many years otherwise.
>> > 
>> > Yes, this is a known problem in fact. :(  In 6.0 we even support RHEL 7,
>> > though that will go away in 6.1.
>> > 
>> > We should take the occasion of dropping RHEL7 to be clearer about which
>> > kernels are supported.
>> 
>> It would be nice to be able to define sets of KVM functonality that we
>> can either start given machine types with, or provide a separate switch
>> to limit kvm functionality back to some defined point.  We do trip over
>> the same things pretty regularly when accidentally turning on new
>> features.
>
> The same idea can apply to the hyperv=on stuff Vitaly is working
> on.  Maybe we should consider making a generic version of the
> s390x FeatGroup code, use it to define convenient sets of KVM and
> hyperv features.

True, the more I look at PV features enablement, the more I think that
we're missing something important in the logic. All machine types we
have are generally supposed to work with the oldest supported kernel so
we should wait many years before enabling some of the new PV features
(KVM or Hyper-V) by default.

This also links to our parallel discussion regarding migration
policies. Currently, we can't enable PV features by default based on
their availability on the host because of migration, the set may differ
on the destination host. What if we introduce (and maybe even switch to
it by default) something like

 -migratable opportunistic (stupid name, I know)

which would allow to enable all features supported by the source host
and then somehow checking that the destination host has them all. This
would effectively mean that it is possible to migrate a VM to a
same-or-newer software (both kernel and QEMU) but not the other way
around. This may be a reasonable choice.

-- 
Vitaly




Re: [RFC PATCH 0/3] block-copy: lock tasks and calls list

2021-04-21 Thread Paolo Bonzini

On 20/04/21 15:12, Vladimir Sementsov-Ogievskiy wrote:

20.04.2021 13:04, Emanuele Giuseppe Esposito wrote:

This series of patches continues Paolo's series on making the
block layer thread safe. Add a CoMutex lock for both the tasks and
calls lists present in block/block-copy.c



I think, we need more information about what kind of thread-safety we 
want. Should the whole interface of block-copy be thread safe? Or only 
part of it? What is going to be shared between different threads? Which 
functions will be called concurrently from different threads? This 
should be documented in include/block/block-copy.h.


I guess all of it.  So more state fields should be identified in 
BlockCopyState, especially in_flight_bytes.  ProgressMeter and 
SharedResource should be made thread-safe on their own, just like the 
patch I posted for RateLimit.


What I see here is that some things are protected by the mutex and some
things are not. What became thread-safe?


For example, in block_copy_dirty_clusters(), we modify task fields 
without any mutex held:


  block_copy_task_shrink doesn't take mutex.
  task->zeroes is set without mutex as well


Agreed, these are bugs in the series.


Still, all these accesses are done when the task is already added to the list.

Looping in block_copy_common() is not thread-safe either.


That one should be mostly safe, because only one coroutine ever writes 
to all fields except cancelled.  cancelled should be accessed with 
qatomic_read/qatomic_set, but there's also the problem that coroutine 
sleep/wake APIs are hard to use in a thread-safe manner (which affects 
block_copy_kick).  This is a different topic and it is something I'm 
working on.



You also forgot to protect the QLIST_REMOVE() call in block_copy_task_end().

Next, block-copy uses the co-shared-resource API, which is not thread-safe
(as is directly noted in include/qemu/co-shared-resource.h).


The same is true of the block/aio_task API, which is not thread-safe either.

So, we should bring thread-safety first to these smaller helper APIs.


Good point.  Emanuele, can you work on ProgressMeter and SharedResource? 
 AioTaskPool can also be converted to just use CoQueue instead of 
manually waking up coroutines.
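
A rough sketch of that CoQueue conversion (hypothetical AioTaskPool fields;
qemu_co_queue_wait() atomically drops and re-takes the lock):

    typedef struct AioTaskPool {
        CoMutex lock;
        CoQueue waiters;
        int busy_tasks;
        int max_busy_tasks;
    } AioTaskPool;

    static void coroutine_fn aio_task_pool_wait_slot(AioTaskPool *pool)
    {
        qemu_co_mutex_lock(&pool->lock);
        while (pool->busy_tasks >= pool->max_busy_tasks) {
            qemu_co_queue_wait(&pool->waiters, &pool->lock);
        }
        pool->busy_tasks++;
        qemu_co_mutex_unlock(&pool->lock);
    }

    static void coroutine_fn aio_task_pool_release_slot(AioTaskPool *pool)
    {
        qemu_co_mutex_lock(&pool->lock);
        pool->busy_tasks--;
        qemu_co_queue_next(&pool->waiters); /* wake one waiter */
        qemu_co_mutex_unlock(&pool->lock);
    }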





[PATCH] amd_iommu: Fix pte_override_page_mask()

2021-04-21 Thread Jean-Philippe Brucker
AMD IOMMU PTEs have a special mode that allows specifying an arbitrary page
size. Quoting the AMD IOMMU specification: "When the Next Level bits [of
a pte] are 7h, the size of the page is determined by the first zero bit
in the page address, starting from bit 12."

So if the lowest bit of the page address is 0, the page is 8kB. If the
lowest bits are 011, the page is 32kB. Currently pte_override_page_mask()
doesn't compute the right value for this page size and amdvi_translate()
can return the wrong guest-physical address. With a Linux guest, DMA
from SATA devices accesses the wrong memory and causes probe failure:

qemu-system-x86_64 ... -device amd-iommu -drive id=hd1,file=foo.bin,if=none \
-device ahci,id=ahci -device ide-hd,drive=hd1,bus=ahci.0
[6.613093] ata1.00: qc timeout (cmd 0xec)
[6.615062] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)

Fix the page mask.
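
As a worked example of the fixed logic (a standalone sketch, not the exact
QEMU function): with Next Level == 7h, low address bits 011 give the first
zero at bit 14, i.e. a 32kB page.

    #include <stdint.h>

    /* addr_51_12 is the PTE address field already shifted right by 12. */
    static uint8_t override_page_shift(uint64_t addr_51_12)
    {
        uint8_t shift = 13; /* smallest page in this mode is 8kB */
        while (addr_51_12 & 1) {
            shift++;
            addr_51_12 >>= 1;
        }
        return shift; /* e.g. low bits 011 -> 15 -> 32kB page */
    }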

Signed-off-by: Jean-Philippe Brucker 
---
 hw/i386/amd_iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 74a93a5d93f..43b6e9bf510 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -860,8 +860,8 @@ static inline uint8_t get_pte_translation_mode(uint64_t pte)
 
 static inline uint64_t pte_override_page_mask(uint64_t pte)
 {
-uint8_t page_mask = 12;
-uint64_t addr = (pte & AMDVI_DEV_PT_ROOT_MASK) ^ AMDVI_DEV_PT_ROOT_MASK;
+uint8_t page_mask = 13;
+uint64_t addr = (pte & AMDVI_DEV_PT_ROOT_MASK) >> 12;
 /* find the first zero bit */
 while (addr & 1) {
 page_mask++;
-- 
2.31.1




Re: [PATCH 0/2] i386: Fix interrupt based Async PF enablement

2021-04-21 Thread Daniel P . Berrangé
On Wed, Apr 21, 2021 at 10:38:06AM +0200, Vitaly Kuznetsov wrote:
> Eduardo Habkost  writes:
> 
> > On Thu, Apr 15, 2021 at 08:14:30PM +0100, Dr. David Alan Gilbert wrote:
> >> * Paolo Bonzini (pbonz...@redhat.com) wrote:
> >> > On 06/04/21 13:42, Vitaly Kuznetsov wrote:
> >> > > older machine types are still available (I disable it for <= 5.1 but we
> >> > > can consider disabling it for 5.2 too). The feature is upstream since
> >> > > Linux 5.8, I know that QEMU supports much older kernels but this 
> >> > > doesn't
> >> > > probably mean that we can't enable new KVM PV features unless all
> >> > > supported kernels have it, we'd have to wait many years otherwise.
> >> > 
> >> > Yes, this is a known problem in fact. :(  In 6.0 we even support RHEL 7,
> >> > though that will go away in 6.1.
> >> > 
> >> > We should take the occasion of dropping RHEL7 to be clearer about which
> >> > kernels are supported.
> >> 
> >> It would be nice to be able to define sets of KVM functonality that we
> >> can either start given machine types with, or provide a separate switch
> >> to limit kvm functionality back to some defined point.  We do trip over
> >> the same things pretty regularly when accidentally turning on new
> >> features.
> >
> > The same idea can apply to the hyperv=on stuff Vitaly is working
> > on.  Maybe we should consider making a generic version of the
> > s390x FeatGroup code, use it to define convenient sets of KVM and
> > hyperv features.
> 
> True, the more I look at PV features enablement, the more I think that
> we're missing something important in the logic. All machine types we
> > have are generally supposed to work with the oldest supported kernel so
> we should wait many years before enabling some of the new PV features
> (KVM or Hyper-V) by default.
> 
> This also links to our parallel discussion regarding migration
> policies. Currently, we can't enable PV features by default based on
> their availability on the host because of migration, the set may differ
> on the destination host. What if we introduce (and maybe even switch to
> it by default) something like
> 
>  -migratable opportunistic (stupid name, I know)
> 
> which would allow to enable all features supported by the source host
> and then somehow checking that the destination host has them all. This
> would effectively mean that it is possible to migrate a VM to a
> same-or-newer software (both kernel and QEMU) but not the other way
> around. This may be a reasonable choice.

I don't think this is usable in practice. Any large cloud or data center
mgmt app using QEMU relies on migration, so can't opportunistically
use arbitrary new features. They can only use features in the oldest
kernel their deployment cares about. This can be newer than the oldest
that QEMU supports, but still older than the newest that exists.

i.e. we have a situation where:

 - QEMU upstream minimum host is version 7
 - Latest possible host is version 45
 - A particular deployment has a mixture of hosts at version 24 and 37

"-migratable opportunistic"  would let QEMU use features from version 37
despite the deployment needing compatibility with host version 24 still.


It is almost as if we need a way to explicitly express a minimum
required host version that a VM requires compatibility with, so deployments
can set their own baseline that is newer than the QEMU minimum.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [RFC PATCH 0/3] block-copy: lock tasks and calls list

2021-04-21 Thread Vladimir Sementsov-Ogievskiy

21.04.2021 11:38, Paolo Bonzini wrote:

On 20/04/21 15:12, Vladimir Sementsov-Ogievskiy wrote:

20.04.2021 13:04, Emanuele Giuseppe Esposito wrote:

This series of patches continues Paolo's series on making the
block layer thread safe. Add a CoMutex lock for both the tasks and
calls lists present in block/block-copy.c



I think, we need more information about what kind of thread-safety we want. 
Should the whole interface of block-copy be thread safe? Or only part of it? 
What is going to be shared between different threads? Which functions will be 
called concurrently from different threads? This should be documented in 
include/block/block-copy.h.


I guess all of it.  So more state fields should be identified in 
BlockCopyState, especially in_flight_bytes.  ProgressMeter and SharedResource 
should be made thread-safe on their own, just like the patch I posted for 
RateLimit.


What I see here is that some things are protected by the mutex and some things
are not. What became thread-safe?

For example, in block_copy_dirty_clusters(), we modify task fields without any 
mutex held:

  block_copy_task_shrink doesn't take mutex.
  task->zeroes is set without mutex as well


Agreed, these are bugs in the series.


Still, all these accesses are done when the task is already added to the list.

Looping in block_copy_common() is not thread-safe either.


That one should be mostly safe, because only one coroutine ever writes to all 
fields except cancelled.  cancelled should be accessed with 
qatomic_read/qatomic_set, but there's also the problem that coroutine 
sleep/wake APIs are hard to use in a thread-safe manner (which affects 
block_copy_kick). This is a different topic and it is something I'm working on.


You also forgot to protect the QLIST_REMOVE() call in block_copy_task_end().

Next, block-copy uses the co-shared-resource API, which is not thread-safe (as
is directly noted in include/qemu/co-shared-resource.h).

The same is true of the block/aio_task API, which is not thread-safe either.

So, we should bring thread-safety first to these smaller helper APIs.


Good point.  Emanuele, can you work on ProgressMeter and SharedResource? 
AioTaskPool can also be converted to just use CoQueue instead of manually 
waking up coroutines.



That would be great.

I have one more question in mind:

Is it effective to use a CoMutex here? We are protecting only some fast
manipulations of data, not an io path or something like that. Would a simple
QemuMutex work better? Even if CoMutex doesn't have any overhead, I don't think
that if thread A wants to modify the task list while the mutex is held by
thread B (for a similar thing), there is a reason for thread A to yield and do
some other things: it can just wait a few moments on the mutex while B is
modifying the task list.

--
Best regards,
Vladimir



Re: [PATCH] tcg/ppc: Fix building with Clang

2021-04-21 Thread Peter Maydell
On Wed, 21 Apr 2021 at 02:15, Brad Smith  wrote:
>
> Fix building with Clang.
>
> At the moment Clang does not define _CALL_SYSV as GCC does. From
> clang/lib/Basic/Targets/PPC.cpp in getTargetDefines()..
>
>   // FIXME: The following are not yet generated here by Clang, but are
>   //generated by GCC:
>   //
>   //   _SOFT_FLOAT_
>   //   __RECIP_PRECISION__
>   //   __APPLE_ALTIVEC__
>   //   __RECIP__
>   //   __RECIPF__
>   //   __RSQRTE__
>   //   __RSQRTEF__
>   //   _SOFT_DOUBLE_
>   //   __NO_LWSYNC__
>   //   __CMODEL_MEDIUM__
>   //   __CMODEL_LARGE__
>   //   _CALL_SYSV
>   //   _CALL_DARWIN
>
> This is from the OpenBSD ports tree where we use it to build
> on OpenBSD/powerpc.
>
> Signed-off-by: Brad Smith 
>
> diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
> index 838ccfa42d..d2611832e5 100644
> --- a/tcg/ppc/tcg-target.c.inc
> +++ b/tcg/ppc/tcg-target.c.inc
> @@ -25,6 +25,11 @@
>  #include "elf.h"
>  #include "../tcg-pool.c.inc"
>
> +/* Clang does not define _CALL_* */
> +#if defined(__clang__) && defined(__ELF__) && !defined(_CALL_SYSV)
> +#define _CALL_SYSV 1
> +#endif

This is trying to identify the calling convention used by the OS.
That's not purely compiler specific (ie it is not the case that
all ELF output from clang is definitely using the calling convention
that _CALL_SYSV implies), so setting it purely based on "this is clang
producing ELF files" doesn't seem right.

I guess if clang doesn't reliably tell us the calling convention
maybe we should scrap the use of _CALL_SYSV and _CALL_ELF and
use the host OS defines to guess the calling convention?
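
Something along those lines could look like this (a sketch only: the macro
names are invented, and the exact set of hosts using the SysV ABI on 32-bit
PPC is an assumption that would need checking per platform):

/* Derive the convention from OS defines instead of trusting _CALL_*. */
#if defined(_CALL_DARWIN) || defined(__APPLE__)
# define TCG_TARGET_CALL_CONV_DARWIN 1
#elif defined(_CALL_SYSV) || defined(__linux__) || defined(__OpenBSD__) || \
      defined(__FreeBSD__) || defined(__NetBSD__)
# define TCG_TARGET_CALL_CONV_SYSV 1
#else
# error "Unknown 32-bit PPC calling convention"
#endif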

thanks
-- PMM



Re: [PATCH 0/2] i386: Fix interrupt based Async PF enablement

2021-04-21 Thread Dr. David Alan Gilbert
* Daniel P. Berrangé (berra...@redhat.com) wrote:
> On Wed, Apr 21, 2021 at 10:38:06AM +0200, Vitaly Kuznetsov wrote:
> > Eduardo Habkost  writes:
> > 
> > > On Thu, Apr 15, 2021 at 08:14:30PM +0100, Dr. David Alan Gilbert wrote:
> > >> * Paolo Bonzini (pbonz...@redhat.com) wrote:
> > >> > On 06/04/21 13:42, Vitaly Kuznetsov wrote:
> > >> > > older machine types are still available (I disable it for <= 5.1 but 
> > >> > > we
> > >> > > can consider disabling it for 5.2 too). The feature is upstream since
> > >> > > Linux 5.8, I know that QEMU supports much older kernels but this 
> > >> > > doesn't
> > >> > > probably mean that we can't enable new KVM PV features unless all
> > >> > > supported kernels have it, we'd have to wait many years otherwise.
> > >> > 
> > >> > Yes, this is a known problem in fact. :(  In 6.0 we even support RHEL 
> > >> > 7,
> > >> > though that will go away in 6.1.
> > >> > 
> > >> > We should take the occasion of dropping RHEL7 to be clearer about which
> > >> > kernels are supported.
> > >> 
> > >> It would be nice to be able to define sets of KVM functionality that we
> > >> can either start given machine types with, or provide a separate switch
> > >> to limit kvm functionality back to some defined point.  We do trip over
> > >> the same things pretty regularly when accidentally turning on new
> > >> features.
> > >
> > > The same idea can apply to the hyperv=on stuff Vitaly is working
> > > on.  Maybe we should consider making a generic version of the
> > > s390x FeatGroup code, use it to define convenient sets of KVM and
> > > hyperv features.
> > 
> > True, the more I look at PV features enablement, the more I think that
> > we're missing something important in the logic. All machine types we
> > have are generally supposed to work with the oldest supported kernel so
> > we should wait many years before enabling some of the new PV features
> > (KVM or Hyper-V) by default.
> > 
> > This also links to our parallel discussion regarding migration
> > policies. Currently, we can't enable PV features by default based on
> > their availability on the host because of migration, the set may differ
> > on the destination host. What if we introduce (and maybe even switch to
> > it by default) something like
> > 
> >  -migratable opportunistic (stupid name, I know)
> > 
> > which would allow to enable all features supported by the source host
> > and then somehow checking that the destination host has them all. This
> > would effectively mean that it is possible to migrate a VM to a
> > same-or-newer software (both kernel and QEMU) but not the other way
> > around. This may be a reasonable choice.
> 
> I don't think this is usable in practice. Any large cloud or data center
> mgmt app using QEMU relies on migration, so can't opportunistically
> use arbitrary new features. They can only use features in the oldest
> kernel their deployment cares about. This can be newer than the oldest
> that QEMU supports, but still older than the newest that exists.
> 
> ie we have situation where:
> 
>  - QEMU upstream minimum host is version 7
>  - Latest possible host is version 45
>  - A particular deployment has a mixture of hosts at version 24 and 37
> 
> "-migratable opportunistic"  would let QEMU use features from version 37
> despite the deployment needing compatibility with host version 24 still.
> 
> 
> It is almost as if we need to have a way to explicitly express a minimum
> required host version that VM requires compatibility with, so deployments
> can set their own baseline that is newer than QEMU minimum.

It's not a 'version' - it's just the set of capabilities, and the qemu
needs to check them at startup and fail if they're missing; I think
that's what the suggested FeatGroup is.

Just like we have machine type and CPU version we need a set of PV
features that we rely on the host kernel having, and we should only
expose those PV features to the guest.  It's possible that we might
define some machine types as relying on certain PV features, or that
some PV features wouldn't make sense on some machine types.

Dave

> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: [PATCH 0/2] i386: Fix interrupt based Async PF enablement

2021-04-21 Thread Vitaly Kuznetsov
Daniel P. Berrangé  writes:

> On Wed, Apr 21, 2021 at 10:38:06AM +0200, Vitaly Kuznetsov wrote:
>> Eduardo Habkost  writes:
>> 
>> > On Thu, Apr 15, 2021 at 08:14:30PM +0100, Dr. David Alan Gilbert wrote:
>> >> * Paolo Bonzini (pbonz...@redhat.com) wrote:
>> >> > On 06/04/21 13:42, Vitaly Kuznetsov wrote:
>> >> > > older machine types are still available (I disable it for <= 5.1 but 
>> >> > > we
>> >> > > can consider disabling it for 5.2 too). The feature is upstream since
>> >> > > Linux 5.8, I know that QEMU supports much older kernels but this 
>> >> > > doesn't
>> >> > > probably mean that we can't enable new KVM PV features unless all
>> >> > > supported kernels have it, we'd have to wait many years otherwise.
>> >> > 
>> >> > Yes, this is a known problem in fact. :(  In 6.0 we even support RHEL 7,
>> >> > though that will go away in 6.1.
>> >> > 
>> >> > We should take the occasion of dropping RHEL7 to be clearer about which
>> >> > kernels are supported.
>> >> 
>> >> It would be nice to be able to define sets of KVM functionality that we
>> >> can either start given machine types with, or provide a separate switch
>> >> to limit kvm functionality back to some defined point.  We do trip over
>> >> the same things pretty regularly when accidentally turning on new
>> >> features.
>> >
>> > The same idea can apply to the hyperv=on stuff Vitaly is working
>> > on.  Maybe we should consider making a generic version of the
>> > s390x FeatGroup code, use it to define convenient sets of KVM and
>> > hyperv features.
>> 
>> True, the more I look at PV features enablement, the more I think that
>> we're missing something important in the logic. All machine types we
>> have are generally supposed to work with the oldest supported kernel so
>> we should wait many years before enabling some of the new PV features
>> (KVM or Hyper-V) by default.
>> 
>> This also links to our parallel discussion regarding migration
>> policies. Currently, we can't enable PV features by default based on
>> their availability on the host because of migration, the set may differ
>> on the destination host. What if we introduce (and maybe even switch to
>> it by default) something like
>> 
>>  -migratable opportunistic (stupid name, I know)
>> 
>> which would allow to enable all features supported by the source host
>> and then somehow checking that the destination host has them all. This
>> would effectively mean that it is possible to migrate a VM to a
>> same-or-newer software (both kernel and QEMU) but not the other way
>> around. This may be a reasonable choice.
>
> I don't think this is usable in practice. Any large cloud or data center
> mgmt app using QEMU relies on migration, so can't opportunistically
> use arbitrary new features. They can only use features in the oldest
> kernel their deployment cares about. This can be newer than the oldest
> that QEMU supports, but still older than the newest that exists.
>
> ie we have situation where:
>
>  - QEMU upstream minimum host is version 7
>  - Latest possible host is version 45
>  - A particular deployment has a mixture of hosts at version 24 and 37
>
> "-migratable opportunistic"  would let QEMU use features from version 37
> despite the deployment needing compatibility with host version 24 still.
>

True; I was not really thinking about 'big' clouds/data centers, these
should have enough resources to carefully set all the required features
and not rely on the 'default'. My thoughts were around using migration
for host upgrade on smaller (several hosts) deployments and in this case
it's probably fairly reasonable to require to start with the oldest host
and upgrade them all if getting new features is one of the upgrade goals.

>
> It is almost as if we need to have a way to explicitly express a minimum
> required host version that VM requires compatibility with, so deployments
> can set their own baseline that is newer than QEMU minimum.

Yes, maybe, but setting the baseline is also a non-trivial task:
e.g. how would users know which PV features they can enable without
going through Linux kernel logs or just trying them on the oldest kernel
they need? This should probably be solved by some upper layer management
app which would collect feature sets from all hosts and come up with a
common subset. I'm not sure if this is done by some tools already.
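
As a toy illustration of what such an upper-layer tool would compute
(entirely hypothetical; no such API exists in QEMU or libvirt today):

#include <stdint.h>
#include <stddef.h>

/* Intersect per-host PV feature masks to derive a deployment baseline. */
static uint64_t compute_baseline(const uint64_t *host_features, size_t n_hosts)
{
    uint64_t baseline = ~(uint64_t)0;

    for (size_t i = 0; i < n_hosts; i++) {
        baseline &= host_features[i];   /* keep only commonly supported bits */
    }
    return baseline;
}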

-- 
Vitaly




Re: [RFC PATCH v2 0/6] hw/arm/virt: Introduce cpu topology support

2021-04-21 Thread wangyanan (Y)



On 2021/4/13 16:07, Yanan Wang wrote:

Hi,

This series is a new version of [0] recently posted by Ying Fang
to introduce cpu topology support for ARM platform. I have taken
over his work about this now, thanks for his contribution.

Description:
An accurate cpu topology may help improve the cpu scheduler's decision
making when dealing with multi-core systems. So a cpu topology description
is helpful to provide the guest with the right view. Dario Faggioli's talk
in [1] also shows that the virtual topology can have an impact on scheduling
performance. Thus this patch series introduces cpu topology support for
the ARM platform.

This series originally comes from Andrew Jones's patches [2], but with
some re-arrangement. Thanks for Andrew's contribution. In this series,
both fdt and ACPI PPTT table are introduced to present cpu topology to
the guest. And a new function virt_smp_parse() not like the default
smp_parse() is introduced, which prefers cores over sockets.

[0] 
https://patchwork.kernel.org/project/qemu-devel/cover/20210225085627.2263-1-fangyi...@huawei.com/
[1] 
https://kvmforum2020.sched.com/event/eE1y/virtual-topology-for-virtual-machines-friend-or-foe-dario-faggioli-suse
[2] 
https://github.com/rhdrjones/qemu/commit/ecfc1565f22187d2c715a99bbcd35cf3a7e428fa

Test results:
After applying this patch series, launch a guest with virt-6.0 and cpu
topology configured with: -smp 96,sockets=2,clusters=6,cores=4,threads=2,

Fix the incorrect statement:
Here the command line was "-smp 96, sockets=2, cores=24,threads=2" in reality.


Thanks,
Yanan

VM's cpu topology description shows as below.

Architecture:aarch64
Byte Order:  Little Endian
CPU(s):  96
On-line CPU(s) list: 0-95
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):   2
NUMA node(s):1
Vendor ID:   0x48
Model:   0
Stepping:0x1
BogoMIPS:200.00
NUMA node0 CPU(s):   0-95
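
For readers unfamiliar with the distinction, a rough sketch of the "prefer
cores over sockets" computation described in the cover letter (illustrative
only; this is not the series' actual virt_smp_parse() code, and it assumes
cpus was specified):

/* Fill in unspecified -smp values, assigning leftover cpus to cores. */
static void smp_parse_prefer_cores(unsigned *cpus, unsigned *sockets,
                                   unsigned *cores, unsigned *threads)
{
    *threads = *threads ?: 1;
    if (!*cores) {
        *sockets = *sockets ?: 1;
        *cores = *cpus / (*sockets * *threads);  /* cores soak up the rest */
    } else if (!*sockets) {
        *sockets = *cpus / (*cores * *threads);
    }
}

With cpus=96, sockets=2 and threads=2 given, this yields 24 cores per
socket, matching the lscpu output above.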

---

Changelogs:
v1->v2:
- Address Andrew Jones's comments
- Address Michael S. Tsirkin's comments
- Pick up one more patch(patch#6) of Andrew Jones
- Rebased on v6.0.0-rc2 release

---

Andrew Jones (3):
   device_tree: Add qemu_fdt_add_path
   hw/arm/virt: DT: Add cpu-map
   hw/arm/virt: Replace smp_parse with one that prefers cores

Yanan Wang (2):
   hw/acpi/aml-build: Add processor hierarchy node structure
   hw/arm/virt-acpi-build: Add PPTT table

Ying Fang (1):
   hw/arm/virt-acpi-build: Distinguish possible and present cpus

  hw/acpi/aml-build.c  |  27 
  hw/arm/virt-acpi-build.c |  77 --
  hw/arm/virt.c| 120 ++-
  include/hw/acpi/aml-build.h  |   4 ++
  include/hw/arm/virt.h|   1 +
  include/sysemu/device_tree.h |   1 +
  softmmu/device_tree.c|  45 -
  7 files changed, 268 insertions(+), 7 deletions(-)





Re: [PATCH 0/2] i386: Fix interrupt based Async PF enablement

2021-04-21 Thread Dr. David Alan Gilbert
* Vitaly Kuznetsov (vkuzn...@redhat.com) wrote:
> Daniel P. Berrangé  writes:
> 
> > On Wed, Apr 21, 2021 at 10:38:06AM +0200, Vitaly Kuznetsov wrote:
> >> Eduardo Habkost  writes:
> >> 
> >> > On Thu, Apr 15, 2021 at 08:14:30PM +0100, Dr. David Alan Gilbert wrote:
> >> >> * Paolo Bonzini (pbonz...@redhat.com) wrote:
> >> >> > On 06/04/21 13:42, Vitaly Kuznetsov wrote:
> >> >> > > older machine types are still available (I disable it for <= 5.1 
> >> >> > > but we
> >> >> > > can consider disabling it for 5.2 too). The feature is upstream 
> >> >> > > since
> >> >> > > Linux 5.8, I know that QEMU supports much older kernels but this 
> >> >> > > doesn't
> >> >> > > probably mean that we can't enable new KVM PV features unless all
> >> >> > > supported kernels have it, we'd have to wait many years otherwise.
> >> >> > 
> >> >> > Yes, this is a known problem in fact. :(  In 6.0 we even support RHEL 
> >> >> > 7,
> >> >> > though that will go away in 6.1.
> >> >> > 
> >> >> > We should take the occasion of dropping RHEL7 to be clearer about 
> >> >> > which
> >> >> > kernels are supported.
> >> >> 
> >> It would be nice to be able to define sets of KVM functionality that we
> >> >> can either start given machine types with, or provide a separate switch
> >> >> to limit kvm functionality back to some defined point.  We do trip over
> >> >> the same things pretty regularly when accidentally turning on new
> >> >> features.
> >> >
> >> > The same idea can apply to the hyperv=on stuff Vitaly is working
> >> > on.  Maybe we should consider making a generic version of the
> >> > s390x FeatGroup code, use it to define convenient sets of KVM and
> >> > hyperv features.
> >> 
> >> True, the more I look at PV features enablement, the more I think that
> >> we're missing something important in the logic. All machine types we
> >> have are generally supposed to work with the oldest supported kernel so
> >> we should wait many years before enabling some of the new PV features
> >> (KVM or Hyper-V) by default.
> >> 
> >> This also links to our parallel discussion regarding migration
> >> policies. Currently, we can't enable PV features by default based on
> >> their availability on the host because of migration, the set may differ
> >> on the destination host. What if we introduce (and maybe even switch to
> >> it by default) something like
> >> 
> >>  -migratable opportunistic (stupid name, I know)
> >> 
> >> which would allow to enable all features supported by the source host
> >> and then somehow checking that the destination host has them all. This
> >> would effectively mean that it is possible to migrate a VM to a
> >> same-or-newer software (both kernel and QEMU) but not the other way
> >> around. This may be a reasonable choice.
> >
> > I don't think this is usable in practice. Any large cloud or data center
> > mgmt app using QEMU relies on migration, so can't opportunistically
> > use arbitrary new features. They can only use features in the oldest
> > kernel their deployment cares about. This can be newer than the oldest
> > that QEMU supports, but still older than the newest that exists.
> >
> > ie we have situation where:
> >
> >  - QEMU upstream minimum host is version 7
> >  - Latest possible host is version 45
> >  - A particular deployment has a mixture of hosts at version 24 and 37
> >
> > "-migratable opportunistic"  would let QEMU use features from version 37
> > despite the deployment needing compatibility with host version 24 still.
> >
> 
> True; I was not really thinking about 'big' clouds/data centers, these
> should have enough resources to carefully set all the required features
> and not rely on the 'default'. My thoughts were around using migration
> for host upgrade on smaller (several hosts) deployments and in this case
> it's probably fairly reasonable to require to start with the oldest host
> and upgrade them all if getting new features is one of the upgrade goals.

It's not actually that simple.
Small installations tend to have less spare hardware available and/or
flexibility; if you've got say a 3 or 5 host cluster, once you start
upgrading one node you've now got nowhere to go if you hit a problem.

Dave

> >
> > It is almost as if we need to have a way to explicitly express a minimum
> > required host version that VM requires compatibility with, so deployments
> > can set their own baseline that is newer than QEMU minimum.
> 
> Yes, maybe, but setting the baseline is also a non-trivial task:
> e.g. how would users know which PV features they can enable without
> going through Linux kernel logs or just trying them on the oldest kernel
> they need? This should probably be solved by some upper layer management
> app which would collect feature sets from all hosts and come up with a
> common subset. I'm not sure if this is done by some tools already.
> 
> -- 
> Vitaly
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: [PATCH 0/2] i386: Fix interrupt based Async PF enablement

2021-04-21 Thread Daniel P . Berrangé
On Wed, Apr 21, 2021 at 11:29:45AM +0200, Vitaly Kuznetsov wrote:
> Daniel P. Berrangé  writes:
> 
> > On Wed, Apr 21, 2021 at 10:38:06AM +0200, Vitaly Kuznetsov wrote:
> >> Eduardo Habkost  writes:
> >> 
> >> > On Thu, Apr 15, 2021 at 08:14:30PM +0100, Dr. David Alan Gilbert wrote:
> >> >> * Paolo Bonzini (pbonz...@redhat.com) wrote:
> >> >> > On 06/04/21 13:42, Vitaly Kuznetsov wrote:
> >> >> > > older machine types are still available (I disable it for <= 5.1 
> >> >> > > but we
> >> >> > > can consider disabling it for 5.2 too). The feature is upstream 
> >> >> > > since
> >> >> > > Linux 5.8, I know that QEMU supports much older kernels but this 
> >> >> > > doesn't
> >> >> > > probably mean that we can't enable new KVM PV features unless all
> >> >> > > supported kernels have it, we'd have to wait many years otherwise.
> >> >> > 
> >> >> > Yes, this is a known problem in fact. :(  In 6.0 we even support RHEL 
> >> >> > 7,
> >> >> > though that will go away in 6.1.
> >> >> > 
> >> >> > We should take the occasion of dropping RHEL7 to be clearer about 
> >> >> > which
> >> >> > kernels are supported.
> >> >> 
> >> >> It would be nice to be able to define sets of KVM functionality that we
> >> >> can either start given machine types with, or provide a separate switch
> >> >> to limit kvm functionality back to some defined point.  We do trip over
> >> >> the same things pretty regularly when accidentally turning on new
> >> >> features.
> >> >
> >> > The same idea can apply to the hyperv=on stuff Vitaly is working
> >> > on.  Maybe we should consider making a generic version of the
> >> > s390x FeatGroup code, use it to define convenient sets of KVM and
> >> > hyperv features.
> >> 
> >> True, the more I look at PV features enablement, the more I think that
> >> we're missing something important in the logic. All machine types we
> >> have are generally supposed to work with the oldest supported kernel so
> >> we should wait many years before enabling some of the new PV features
> >> (KVM or Hyper-V) by default.
> >> 
> >> This also links to our parallel discussion regarding migration
> >> policies. Currently, we can't enable PV features by default based on
> >> their availability on the host because of migration, the set may differ
> >> on the destination host. What if we introduce (and maybe even switch to
> >> it by default) something like
> >> 
> >>  -migratable opportunistic (stupid name, I know)
> >> 
> >> which would allow to enable all features supported by the source host
> >> and then somehow checking that the destination host has them all. This
> >> would effectively mean that it is possible to migrate a VM to a
> >> same-or-newer software (both kernel and QEMU) but not the other way
> >> around. This may be a reasonable choice.
> >
> > I don't think this is usable in practice. Any large cloud or data center
> > mgmt app using QEMU relies on migration, so can't opportunistically
> > use arbitrary new features. They can only use features in the oldest
> > kernel their deployment cares about. This can be newer than the oldest
> > that QEMU supports, but still older than the newest that exists.
> >
> > ie we have situation where:
> >
> >  - QEMU upstream minimum host is version 7
> >  - Latest possible host is version 45
> >  - A particular deployment has a mixture of hosts at version 24 and 37
> >
> > "-migratable opportunistic"  would let QEMU use features from version 37
> > despite the deployment needing compatibility with host version 24 still.
> >
> 
> True; I was not really thinking about 'big' clouds/data centers, these
> should have enough resources to carefully set all the required features
> and not rely on the 'default'. My thoughts were around using migration
> for host upgrade on smaller (several hosts) deployments and in this case
> it's probably fairly reasonable to require to start with the oldest host
> and upgrade them all if getting new features is one of the upgrade goals.


> > It is almost as if we need to have a way to explicitly express a minimum
> > required host version that VM requires compatibility with, so deployments
> > can set their own baseline that is newer than QEMU minimum.
> 
> Yes, maybe, but setting the baseline is also a non-trivial task:
> e.g. how would users know which PV features they can enable without
> going through Linux kernel logs or just trying them on the oldest kernel
> they need? This should probably be solved by some upper layer management
> app which would collect feature sets from all hosts and come up with a
> common subset. I'm not sure if this is done by some tools already.

I specifically didn't talk in terms of features, because the problem you
describe is unreasonable to push onto applications.

Rather, QEMU could express host baselines:

   - "host-v1"  - features A and B
   - "host-v2"  - features A, B and C
   - "host-v3"  - features A, B, C, D, E and f

The mgmt app / admin only has to know which QEMU host baselines their
hosts support.
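
A sketch of how such named baselines might look in code (names and feature
bits are invented purely for illustration; QEMU has no such table today):

typedef struct HostBaseline {
    const char *name;
    uint64_t features;              /* bitmask of PV feature bits */
} HostBaseline;

enum { FEAT_A = 1u << 0, FEAT_B = 1u << 1, FEAT_C = 1u << 2,
       FEAT_D = 1u << 3, FEAT_E = 1u << 4, FEAT_F = 1u << 5 };

static const HostBaseline host_baselines[] = {
    { "host-v1", FEAT_A | FEAT_B },
    { "host-v2", FEAT_A | FEAT_B | FEAT_C },
    { "host-v3", FEAT_A | FEAT_B | FEAT_C | FEAT_D | FEAT_E | FEAT_F },
};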

Re: [PATCH 0/2] i386: Fix interrupt based Async PF enablement

2021-04-21 Thread Vitaly Kuznetsov
Daniel P. Berrangé  writes:

> On Wed, Apr 21, 2021 at 11:29:45AM +0200, Vitaly Kuznetsov wrote:
>> Daniel P. Berrangé  writes:
>> 
>> > On Wed, Apr 21, 2021 at 10:38:06AM +0200, Vitaly Kuznetsov wrote:
>> >> Eduardo Habkost  writes:
>> >> 
>> >> > On Thu, Apr 15, 2021 at 08:14:30PM +0100, Dr. David Alan Gilbert wrote:
>> >> >> * Paolo Bonzini (pbonz...@redhat.com) wrote:
>> >> >> > On 06/04/21 13:42, Vitaly Kuznetsov wrote:
>> >> >> > > older machine types are still available (I disable it for <= 5.1 
>> >> >> > > but we
>> >> >> > > can consider disabling it for 5.2 too). The feature is upstream 
>> >> >> > > since
>> >> >> > > Linux 5.8, I know that QEMU supports much older kernels but this 
>> >> >> > > doesn't
>> >> >> > > probably mean that we can't enable new KVM PV features unless all
>> >> >> > > supported kernels have it, we'd have to wait many years otherwise.
>> >> >> > 
>> >> >> > Yes, this is a known problem in fact. :(  In 6.0 we even support 
>> >> >> > RHEL 7,
>> >> >> > though that will go away in 6.1.
>> >> >> > 
>> >> >> > We should take the occasion of dropping RHEL7 to be clearer about 
>> >> >> > which
>> >> >> > kernels are supported.
>> >> >> 
>> >> >> It would be nice to be able to define sets of KVM functionality that we
>> >> >> can either start given machine types with, or provide a separate switch
>> >> >> to limit kvm functionality back to some defined point.  We do trip over
>> >> >> the same things pretty regularly when accidentally turning on new
>> >> >> features.
>> >> >
>> >> > The same idea can apply to the hyperv=on stuff Vitaly is working
>> >> > on.  Maybe we should consider making a generic version of the
>> >> > s390x FeatGroup code, use it to define convenient sets of KVM and
>> >> > hyperv features.
>> >> 
>> >> True, the more I look at PV features enablement, the more I think that
>> >> we're missing something important in the logic. All machine types we
>> >> have are generally supposed to work with the oldest supported kernel so
>> >> we should wait many years before enabling some of the new PV features
>> >> (KVM or Hyper-V) by default.
>> >> 
>> >> This also links to our parallel discussion regarding migration
>> >> policies. Currently, we can't enable PV features by default based on
>> >> their availability on the host because of migration, the set may differ
>> >> on the destination host. What if we introduce (and maybe even switch to
>> >> it by default) something like
>> >> 
>> >>  -migratable opportunistic (stupid name, I know)
>> >> 
>> >> which would allow to enable all features supported by the source host
>> >> and then somehow checking that the destination host has them all. This
>> >> would effectively mean that it is possible to migrate a VM to a
>> >> same-or-newer software (both kernel and QEMU) but not the other way
>> >> around. This may be a reasonable choice.
>> >
>> > I don't think this is usable in practice. Any large cloud or data center
>> > mgmt app using QEMU relies on migration, so can't opportunistically
>> > use arbitrary new features. They can only use features in the oldest
>> > kernel their deployment cares about. This can be newer than the oldest
>> > that QEMU supports, but still older than the newest that exists.
>> >
>> > ie we have situation where:
>> >
>> >  - QEMU upstream minimum host is version 7
>> >  - Latest possible host is version 45
>> >  - A particular deployment has a mixture of hosts at version 24 and 37
>> >
>> > "-migratable opportunistic"  would let QEMU use features from version 37
>> > despite the deployment needing compatibility with host version 24 still.
>> >
>> 
>> True; I was not really thinking about 'big' clouds/data centers, these
>> should have enough resources to carefully set all the required features
>> and not rely on the 'default'. My thoughts were around using migration
>> for host upgrade on smaller (several hosts) deployments and in this case
>> it's probably fairly reasonable to require to start with the oldest host
>> and upgrade them all if getting new features is one of the upgrade goals.
>
>
>> > It is almost as if we need to have a way to explicitly express a minimum
>> > required host version that VM requires compatibility with, so deployments
>> > can set their own baseline that is newer than QEMU minimum.
>> 
>> Yes, maybe, but setting the baseline is also a non-trivial task:
>> e.g. how would users know which PV features they can enable without
>> going through Linux kernel logs or just trying them on the oldest kernel
>> they need? This should probably be solved by some upper layer management
>> app which would collect feature sets from all hosts and come up with a
>> common subset. I'm not sure if this is done by some tools already.
>
> I specifically didn't talk in terms of features, because the problem you
> describe is unreasonable to push onto applications.
>
> Rather, QEMU could express host baselines:
>
>- "host-v1"  - features A and B
>- "host-v2"  - features A, B and 

firmware selection for SEV-ES

2021-04-21 Thread Laszlo Ersek
Hi Brijesh, Tom,

in QEMU's "docs/interop/firmware.json", the @FirmwareFeature enumeration
has a constant called @amd-sev. We should introduce an @amd-sev-es
constant as well, minimally for the following reason:

AMD document #56421 ("SEV-ES Guest-Hypervisor Communication Block
Standardization") revision 1.40 says in "4.6 System Management Mode
(SMM)" that "SMM will not be supported in this version of the
specification". This is reflected in OVMF, so an OVMF binary that's
supposed to run in a SEV-ES guest must be built without "-D
SMM_REQUIRE". (As a consequence, such a binary should be built also
without "-D SECURE_BOOT_ENABLE".)

At the level of "docs/interop/firmware.json", this means that management
applications should be enabled to look for the @amd-sev-es feature (and
it also means, for OS distributors, that any firmware descriptor
exposing @amd-sev-es will currently have to lack all three of:
@requires-smm, @secure-boot, @enrolled-keys).

I have three questions:


(1) According to
, SEV-ES is
explicitly requested in the domain XML via setting bit#2 in the "policy"
element.

Can this setting be used by libvirt to look for such a firmware
descriptor that exposes @amd-sev-es?


(2) "docs/interop/firmware.json" documents @amd-sev as follows:

# @amd-sev: The firmware supports running under AMD Secure Encrypted
#   Virtualization, as specified in the AMD64 Architecture
#   Programmer's Manual. QEMU command line options related to
#   this feature are documented in
#   "docs/amd-memory-encryption.txt".

Documenting the new @amd-sev-es enum constant with very slight
customizations of the same text should be possible, I reckon. However,
neither "docs/amd-memory-encryption.txt" nor
"docs/confidential-guest-support.txt" seems to mention SEV-ES.

Can you guys propose a patch for "docs/amd-memory-encryption.txt"?

I guess that would be next to this snippet:

> # ${QEMU} \
>    sev-guest,id=sev0,policy=0x1...\


(3) Is the "AMD64 Architecture Programmer's Manual" the specification
that we should reference under @amd-sev-es as well (i.e., same as with
@amd-sev), or is there a more specific document?

Thanks,
Laszlo




Re: [RFC PATCH] vfio-ccw: Permit missing IRQs

2021-04-21 Thread Cornelia Huck
On Mon, 19 Apr 2021 20:49:06 +0200
Eric Farman  wrote:

> Commit 690e29b91102 ("vfio-ccw: Refactor ccw irq handler") changed
> one of the checks for the IRQ notifier registration from saying
> "the host needs to recognize the only IRQ that exists" to saying
> "the host needs to recognize ANY IRQ that exists."
> 
> And this worked fine, because the subsequent change to support the
> CRW IRQ notifier doesn't get into this code when running on an older
> kernel, thanks to a guard by a capability region. The later addition
> of the REQ(uest) IRQ by commit b2f96f9e4f5f ("vfio-ccw: Connect the
> device request notifier") broke this assumption because there is no
> matching capability region. Thus, running new QEMU on an older
> kernel fails with:
> 
>   vfio: unexpected number of irqs 2
> 
> Let's simply remove the check (and the less-than-helpful message),
> and make the VFIO_DEVICE_GET_IRQ_INFO ioctl request for the IRQ
> being processed. If it returns with EINVAL, we can treat it as
> an unfortunate mismatch but not a fatal error for the guest.
> 
> Fixes: 690e29b91102 ("vfio-ccw: Refactor ccw irq handler")
> Fixes: b2f96f9e4f5f ("vfio-ccw: Connect the device request notifier")
> Signed-off-by: Eric Farman 
> ---
>  hw/vfio/ccw.c | 15 +++
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
> index b2df708e4b..cfbfc3d1a2 100644
> --- a/hw/vfio/ccw.c
> +++ b/hw/vfio/ccw.c
> @@ -411,20 +411,19 @@ static void vfio_ccw_register_irq_notifier(VFIOCCWDevice *vcdev,
>          return;
>      }
>  
> -    if (vdev->num_irqs < irq + 1) {
> -        error_setg(errp, "vfio: unexpected number of irqs %u",
> -                   vdev->num_irqs);

Alternative proposal: Change this message to

"vfio: IRQ %u not available (number of irqs %u)"

and still fail this function, while treating a failure of
vfio_ccw_register_irq_notifier(vcdev, VFIO_CCW_REQ_IRQ_INDEX, &err); in
vfio_ccw_realize() as a non-fatal error (maybe log a message).

This allows us to skip doing an ioctl call that we already know would
fail. Still, we can catch cases where a broken kernel e.g.
provides the crw region, but not the matching irq (I believe something
like that should indeed be a fatal error.)
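
In vfio_ccw_realize() that could look roughly like this (untested sketch of
the alternative; warn_report_err() both prints and frees the error):

    vfio_ccw_register_irq_notifier(vcdev, VFIO_CCW_REQ_IRQ_INDEX, &err);
    if (err) {
        warn_report_err(err);   /* missing REQ irq: log it and carry on */
        err = NULL;
    }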

> -        return;
> -    }
> -
>      argsz = sizeof(*irq_info);
>      irq_info = g_malloc0(argsz);
>      irq_info->index = irq;
>      irq_info->argsz = argsz;
>      if (ioctl(vdev->fd, VFIO_DEVICE_GET_IRQ_INFO,
>                irq_info) < 0 || irq_info->count < 1) {
> -        error_setg_errno(errp, errno, "vfio: Error getting irq info");
> -        goto out_free_info;
> +        if (errno == EINVAL) {
> +            warn_report("Unable to get information about IRQ %u", irq);
> +            goto out_free_info;
> +        } else {
> +            error_setg_errno(errp, errno, "vfio: Error getting irq info");
> +            goto out_free_info;
> +        }
>      }
>  
>      if (event_notifier_init(notifier, 0)) {




Re: [PATCH v3] hw/block/nvme: fix lbaf formats initialization

2021-04-21 Thread Gollu Appalanaidu

On Tue, Apr 20, 2021 at 09:47:00PM +0200, Klaus Jensen wrote:

On Apr 16 17:29, Gollu Appalanaidu wrote:

Currently, LBAF formats are initialized based on metadata
size if and only if the nvme-ns "ms" parameter has a non-zero
value. Since the Format NVM command is supported, the device
parameter "ms" may not be the right criterion for initializing
the supported LBAFs.

Signed-off-by: Gollu Appalanaidu 
---
-v3: Remove "mset" constraint  check if ms < 8, "mset" can be
set even when ms < 8 and non-zero.

-v2: Addressing review comments (Klaus)
Change the current "pi" and "ms" constraint check such that it
will throw the error if ms < 8 and if namespace protection info,
location and metadata settings are set.
Splitting this from compare fix patch series.

hw/block/nvme-ns.c | 58 --
1 file changed, 25 insertions(+), 33 deletions(-)

diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 7bb618f182..594b0003cf 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -85,38 +85,28 @@ static int nvme_ns_init(NvmeNamespace *ns, Error **errp)
     ds = 31 - clz32(ns->blkconf.logical_block_size);
     ms = ns->params.ms;
 
-    if (ns->params.ms) {
-        id_ns->mc = 0x3;
+    id_ns->mc = 0x3;
 
-        if (ns->params.mset) {
-            id_ns->flbas |= 0x10;
-        }
+    if (ms && ns->params.mset) {
+        id_ns->flbas |= 0x10;
+    }
 
-        id_ns->dpc = 0x1f;
-        id_ns->dps = ((ns->params.pil & 0x1) << 3) | ns->params.pi;
-
-        NvmeLBAF lbaf[16] = {
-            [0] = { .ds =  9           },
-            [1] = { .ds =  9, .ms =  8 },
-            [2] = { .ds =  9, .ms = 16 },
-            [3] = { .ds =  9, .ms = 64 },
-            [4] = { .ds = 12           },
-            [5] = { .ds = 12, .ms =  8 },
-            [6] = { .ds = 12, .ms = 16 },
-            [7] = { .ds = 12, .ms = 64 },
-        };
-
-        memcpy(&id_ns->lbaf, &lbaf, sizeof(lbaf));
-        id_ns->nlbaf = 7;
-    } else {
-        NvmeLBAF lbaf[16] = {
-            [0] = { .ds =  9 },
-            [1] = { .ds = 12 },
-        };
+    id_ns->dpc = 0x1f;
+    id_ns->dps = ((ns->params.pil & 0x1) << 3) | ns->params.pi;
 
-        memcpy(&id_ns->lbaf, &lbaf, sizeof(lbaf));
-        id_ns->nlbaf = 1;
-    }
+    NvmeLBAF lbaf[16] = {
+        [0] = { .ds =  9           },
+        [1] = { .ds =  9, .ms =  8 },
+        [2] = { .ds =  9, .ms = 16 },
+        [3] = { .ds =  9, .ms = 64 },
+        [4] = { .ds = 12           },
+        [5] = { .ds = 12, .ms =  8 },
+        [6] = { .ds = 12, .ms = 16 },
+        [7] = { .ds = 12, .ms = 64 },
+    };
+
+    memcpy(&id_ns->lbaf, &lbaf, sizeof(lbaf));
+    id_ns->nlbaf = 7;
 
     for (i = 0; i <= id_ns->nlbaf; i++) {
         NvmeLBAF *lbaf = &id_ns->lbaf[i];


This part LGTM.


@@ -395,10 +385,12 @@ static int nvme_ns_check_constraints(NvmeCtrl *n, NvmeNamespace *ns,
         return -1;
     }
 
-    if (ns->params.pi && ns->params.ms < 8) {
-        error_setg(errp, "at least 8 bytes of metadata required to enable "
-                   "protection information");
-        return -1;
+    if (ns->params.ms < 8) {
+        if (ns->params.pi || ns->params.pil) {
+            error_setg(errp, "at least 8 bytes of metadata required to enable "
+                       "protection information, protection information location");
+            return -1;
+        }
     }



If you do this additional check, then you should maybe also check that 
pil is only set if pi is. But if pi is not enabled, then the value of 
pil is irrelevant (even though it ends up in FLBAS). In other words, 
if you want to validate all possible parameter configurations, then we 
have a lot more checking to do!


Currently, the approach taken by the parameter validation code is to 
error out on *invalid* configurations that cause invariants to not 
hold, and I'd prefer that we stick with that to keep the check logic 
as simple as possible.


So, (without this unnecessary check):
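
(For reference, the remaining check would then presumably be just the
pre-patch invariant -- pi needs at least 8 bytes of metadata, and pil is
simply ignored while pi is disabled:)

    if (ns->params.pi && ns->params.ms < 8) {
        error_setg(errp, "at least 8 bytes of metadata required to enable "
                   "protection information");
        return -1;
    }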



Sure, will remove this check and send v4


Reviewed-by: Klaus Jensen 





Re: [PATCH v3] hw/block/nvme: fix lbaf formats initialization

2021-04-21 Thread Philippe Mathieu-Daudé
On 4/16/21 1:59 PM, Gollu Appalanaidu wrote:
> Currently, LBAF formats are initialized based on metadata
> size if and only if the nvme-ns "ms" parameter has a non-zero
> value. Since the Format NVM command is supported, the device
> parameter "ms" may not be the right criterion for initializing
> the supported LBAFs.
> 
> Signed-off-by: Gollu Appalanaidu 
> ---
> -v3: Remove "mset" constraint  check if ms < 8, "mset" can be
>  set even when ms < 8 and non-zero.
> 
> -v2: Addressing review comments (Klaus)
>  Change the current "pi" and "ms" constraint check such that it
>  will throw the error if ms < 8 and if namespace protection info,
>  location and metadata settings are set.
>  Splitting this from compare fix patch series.
> 
>  hw/block/nvme-ns.c | 58 --
>  1 file changed, 25 insertions(+), 33 deletions(-)

> +    NvmeLBAF lbaf[16] = {

Unrelated to your change, but better to use a read-only array:

    static const NvmeLBAF lbaf[16] = {

> +        [0] = { .ds =  9           },
> +        [1] = { .ds =  9, .ms =  8 },
> +        [2] = { .ds =  9, .ms = 16 },
> +        [3] = { .ds =  9, .ms = 64 },
> +        [4] = { .ds = 12           },
> +        [5] = { .ds = 12, .ms =  8 },
> +        [6] = { .ds = 12, .ms = 16 },
> +        [7] = { .ds = 12, .ms = 64 },
> +    };
> +
> +    memcpy(&id_ns->lbaf, &lbaf, sizeof(lbaf));
> +    id_ns->nlbaf = 7;




Re: [PATCH v3] memory: Directly dispatch alias accesses on origin memory region

2021-04-21 Thread Mark Cave-Ayland

On 20/04/2021 21:59, Peter Xu wrote:


I agree with this sentiment: it has taken me a while to figure out what
was happening, and that was only because I spotted accesses being
rejected with -d guest_errors.

 From my perspective the names memory_region_dispatch_read() and
memory_region_dispatch_write() were the misleading part, although I
remember thinking it odd whilst trying to use them that I had to start
delving into sections etc. just to recurse a memory access.



I think it should always be a valid request to trigger memory access via the MR
layer, say, what if the caller has no address space context at all?


For these cases you can just use the global default address_space_memory, which is the
solution I used in the second version of my patch, e.g.


val = address_space_ldl_be(&address_space_memory, addr, attrs, &r);


From the
name of memory_region_dispatch_write|read I don't see why we should
not take care of alias MRs.  That's also the reason I'd even prefer this patch
rather than an assert.


The problem I see here is that this patch is breaking the abstraction between 
generating the flatview from the memory topology and dispatching a request to it.


If you look at the existing code then aliased memory regions are de-referenced at 
flatview construction time, so you end up with a flatview where each range points to 
a target (leaf or terminating) memory region plus offset. You can see this if you 
compare the output of "info mtree" with "info mtree -f" in the monitor.


This patch adds a "live" memory region alias de-reference at dispatch time when this 
should already have occurred as the flatview was constructed. I haven't had a chance 
to look at this patch in detail yet but requiring this special case just for 
de-referencing the alias at dispatch time seems wrong.


Given that the related patch "memory: Initialize MemoryRegionOps for RAM memory
regions" is also changing the default mr->ops for ram devices in a commit originally
from 2013, this is another hint that the dispatch API is being used in a way in
which it wasn't intended.



ATB,

Mark.



Re: [PATCH v3 1/2] iotests/231: Update expected deprecation message

2021-04-21 Thread Stefano Garzarella

On Fri, Apr 09, 2021 at 09:38:53AM -0500, Connor Kuehl wrote:

The deprecation message in the expected output has technically been
wrong since the wrong version of a patch was applied to it. Because of
this, the test fails. Correct the expected output so that it passes.

Signed-off-by: Connor Kuehl 
Reviewed-by: Max Reitz 
---
tests/qemu-iotests/231.out | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)


Reviewed-by: Stefano Garzarella 



diff --git a/tests/qemu-iotests/231.out b/tests/qemu-iotests/231.out
index 579ba11c16..747dd221bb 100644
--- a/tests/qemu-iotests/231.out
+++ b/tests/qemu-iotests/231.out
@@ -1,9 +1,7 @@
QA output created by 231
-qemu-img: RBD options encoded in the filename as keyvalue pairs is deprecated. Future versions may cease to parse these options in the future.
+qemu-img: warning: RBD options encoded in the filename as keyvalue pairs is deprecated
 unable to get monitor info from DNS SRV with service name: ceph-mon
-no monitors specified to connect to.
 qemu-img: Could not open 'json:{'file.driver':'rbd','file.filename':'rbd:rbd/bogus:conf=BOGUS_CONF'}': error connecting: No such file or directory
 unable to get monitor info from DNS SRV with service name: ceph-mon
-no monitors specified to connect to.
 qemu-img: Could not open 'json:{'file.driver':'rbd','file.pool':'rbd','file.image':'bogus','file.conf':'BOGUS_CONF'}': error connecting: No such file or directory
 *** done
--
2.30.2







Re: [PATCH v2 02/12] virtio-gpu: Add udmabuf helpers

2021-04-21 Thread Gerd Hoffmann
  Hi,

> --- /dev/null
> +++ b/include/standard-headers/linux/udmabuf.h

> --- a/scripts/update-linux-headers.sh
> +++ b/scripts/update-linux-headers.sh

Separate patch please.

thanks,
  Gerd




Re: [PATCH v3 2/2] block/rbd: Add an escape-aware strchr helper

2021-04-21 Thread Stefano Garzarella

On Fri, Apr 09, 2021 at 09:38:54AM -0500, Connor Kuehl wrote:

Sometimes the parser needs to further split a token it has collected
from the token input stream. Right now, it does a cursory check to see
if the relevant characters appear in the token to determine if it should
break it down further.

However, qemu_rbd_next_tok() will escape characters as it removes tokens
from the token stream and plain strchr() won't. This can make the
initial strchr() check slightly misleading since it implies
qemu_rbd_next_tok() will find the token and split on it, except the
reality is that qemu_rbd_next_tok() will pass over it if it is escaped.

Use a custom strchr to avoid mixing escaped and unescaped string
operations.
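
To make the behavioural difference concrete, a small illustration (not part
of the patch) using the helper added below:

#include <assert.h>
#include <string.h>

static void demo(void)
{
    char buf[] = "img\\:with\\:colons:snap";

    assert(strchr(buf, ':') == &buf[4]);           /* hits the escaped ':' */
    assert(qemu_rbd_strchr(buf, ':') == &buf[17]); /* first unescaped ':' */
}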

Reported-by: Han Han 
Fixes: https://bugzilla.redhat.com/1873913
Signed-off-by: Connor Kuehl 
---
 v2 -> v3:
   * Update qemu_rbd_strchr to only skip if there's a delimiter AND the
 next character is not the NUL terminator

block/rbd.c| 20 ++--
tests/qemu-iotests/231 |  4 
tests/qemu-iotests/231.out |  3 +++
3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 9071a00e3f..291e3f09e1 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -134,6 +134,22 @@ static char *qemu_rbd_next_tok(char *src, char delim, char **p)
     return src;
 }
 
+static char *qemu_rbd_strchr(char *src, char delim)
+{
+    char *p;
+
+    for (p = src; *p; ++p) {
+        if (*p == delim) {
+            return p;
+        }
+        if (*p == '\\' && p[1] != '\0') {
+            ++p;
+        }
+    }
+
+    return NULL;
+}
+


IIUC this is similar to the code used in qemu_rbd_next_tok().
To avoid code duplication can we use this new function inside it?

I mean something like this (not tested):

diff --git a/block/rbd.c b/block/rbd.c
index f098a89c7b..eb6a839362 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -119,15 +119,8 @@ static char *qemu_rbd_next_tok(char *src, char delim, char **p)
 
     *p = NULL;
 
-    for (end = src; *end; ++end) {
-        if (*end == delim) {
-            break;
-        }
-        if (*end == '\\' && end[1] != '\0') {
-            end++;
-        }
-    }
-    if (*end == delim) {
+    end = qemu_rbd_strchr(src, delim);
+    if (end && *end == delim) {
         *p = end + 1;
         *end = '\0';
     }


The rest LGTM!

Thanks for fixing this issue,
Stefano




Re: [PATCH v3] memory: Directly dispatch alias accesses on origin memory region

2021-04-21 Thread Peter Maydell
On Tue, 20 Apr 2021 at 21:59, Peter Xu  wrote:
> I think it should always be a valid request to trigger memory access via the 
> MR
> layer, say, what if the caller has no address space context at all? From the
> name of memory_region_dispatch_write|read I don't see why we should
> not take care of alias MRs.  That's also the reason I'd even prefer this patch
> rather than an assert.

It's a bit of an odd case to need to do direct accesses on an MR
(so if you think you need to you should probably look for whether there's
a better way to do it), but there are sometimes reasons it's necessary or
expedient -- we have about half a dozen uses which aren't part of the core
memory system using memory_region_dispatch_* as an internal function.
So I think I agree that while we have this as an API exposed to
the rest of the system it would be nice if it Just Worked for
alias MRs.
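
Presumably making it Just Work amounts to walking the alias chain before
dispatching, something like the following sketch (it uses the alias fields
of MemoryRegion from memory.c, but is not necessarily what the patch does):

static MemoryRegion *memory_region_resolve_alias(MemoryRegion *mr,
                                                 hwaddr *addr)
{
    while (mr->alias) {
        *addr += mr->alias_offset;  /* accumulate the alias offset */
        mr = mr->alias;
    }
    return mr;
}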

thanks
-- PMM



[PATCH v2 0/5] mptcp support

2021-04-21 Thread Dr. David Alan Gilbert (git)
From: "Dr. David Alan Gilbert" 

Hi,
  This set adds support for multipath TCP (mptcp), and has
been tested for migration and (lightly) for NBD.

  Multipath-tcp is a bit like bonding, but at L3; you can use
it to handle failure, but can also use it to split traffic across
multiple interfaces.

  Using a pair of 10Gb interfaces, I've managed to get 19Gbps
(with the only tuning being using huge pages and turning the MTU up).

  It needs a bleeding-edge Linux kernel (in some older ones you get
false accept messages for the subflows), and a C lib that has the
constants defined (as current glibc does).

  To use it you just need to append ,mptcp to an address; for migration:

  -incoming tcp:0:,mptcp
  migrate -d tcp:192.168.11.20:,mptcp

For nbd:

  (qemu) nbd_server_start 0.0.0.0:,mptcp=on

  -blockdev driver=nbd,server.type=inet,server.host=192.168.11.20,server.port=,server.mptcp=on,node-name=nbddisk,export=mydisk \
  -device virtio-blk,drive=nbddisk,id=disk0

(Many of the other NBD address parsers/forms would need extra work)
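
For anyone wanting to experiment outside QEMU, requesting MPTCP directly
looks like this (assumes a kernel built with mptcp support; the fallback
value 262 comes from linux/in.h, for libcs that don't define it yet):

#include <sys/socket.h>
#include <netinet/in.h>
#include <stdio.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262   /* from linux/in.h; fallback for older libcs */
#endif

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
    if (fd < 0) {
        perror("socket(IPPROTO_MPTCP)");  /* e.g. kernel without mptcp */
        return 1;
    }
    /* ...bind/listen or connect exactly as with plain TCP... */
    return 0;
}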

  All comments welcome.

Dave

v2
  Use of if defined(...) in the json file based on feedback
  A few missing ifdef's (from a bsd build test)
  Added nbd example.


Dr. David Alan Gilbert (5):
  channel-socket: Only set CLOEXEC if we have space for fds
  io/net-listener: Call the notifier during finalize
  migration: Add cleanup hook for inwards migration
  migration/socket: Close the listener at the end
  sockets: Support multipath TCP

 io/channel-socket.c   |  8 
 io/dns-resolver.c |  4 
 io/net-listener.c |  3 +++
 migration/migration.c |  3 +++
 migration/migration.h |  4 
 migration/multifd.c   |  5 +
 migration/socket.c| 24 ++--
 qapi/sockets.json |  5 -
 util/qemu-sockets.c   | 23 +++
 9 files changed, 68 insertions(+), 11 deletions(-)

-- 
2.31.1




[PATCH v2 1/5] channel-socket: Only set CLOEXEC if we have space for fds

2021-04-21 Thread Dr. David Alan Gilbert (git)
From: "Dr. David Alan Gilbert" 

MSG_CMSG_CLOEXEC cleans up received fd's; it's really only for Unix
sockets, but currently we enable it for everything; some socket types
(IP_MPTCP) don't like this.

Only enable it when we're giving the recvmsg room to receive fd's
anyway.

Signed-off-by: Dr. David Alan Gilbert 
Reviewed-by: Daniel P. Berrangé 
---
 io/channel-socket.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/io/channel-socket.c b/io/channel-socket.c
index de259f7eed..606ec97cf7 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -487,15 +487,15 @@ static ssize_t qio_channel_socket_readv(QIOChannel *ioc,
 
     memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
 
-#ifdef MSG_CMSG_CLOEXEC
-    sflags |= MSG_CMSG_CLOEXEC;
-#endif
-
     msg.msg_iov = (struct iovec *)iov;
     msg.msg_iovlen = niov;
     if (fds && nfds) {
         msg.msg_control = control;
         msg.msg_controllen = sizeof(control);
+#ifdef MSG_CMSG_CLOEXEC
+        sflags |= MSG_CMSG_CLOEXEC;
+#endif
+
     }
 
  retry:
-- 
2.31.1




[PATCH v2 4/5] migration/socket: Close the listener at the end

2021-04-21 Thread Dr. David Alan Gilbert (git)
From: "Dr. David Alan Gilbert" 

Delay closing the listener until the cleanup hook at the end; mptcp
needs the listener to stay open while the other paths come in.

Signed-off-by: Dr. David Alan Gilbert 
Reviewed-by: Daniel P. Berrangé 
---
 migration/multifd.c |  5 +
 migration/socket.c  | 24 ++--
 2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index a6677c45c8..cebd9029b9 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1165,6 +1165,11 @@ bool multifd_recv_all_channels_created(void)
         return true;
     }
 
+    if (!multifd_recv_state) {
+        /* Called before any connections created */
+        return false;
+    }
+
     return thread_count == qatomic_read(&multifd_recv_state->count);
 }
 
diff --git a/migration/socket.c b/migration/socket.c
index 6016642e04..05705a32d8 100644
--- a/migration/socket.c
+++ b/migration/socket.c
@@ -126,22 +126,31 @@ static void socket_accept_incoming_migration(QIONetListener *listener,
 {
     trace_migration_socket_incoming_accepted();
 
-    qio_channel_set_name(QIO_CHANNEL(cioc), "migration-socket-incoming");
-    migration_channel_process_incoming(QIO_CHANNEL(cioc));
-
     if (migration_has_all_channels()) {
-        /* Close listening socket as its no longer needed */
-        qio_net_listener_disconnect(listener);
-        object_unref(OBJECT(listener));
+        error_report("%s: Extra incoming migration connection; ignoring",
+                     __func__);
+        return;
     }
+
+    qio_channel_set_name(QIO_CHANNEL(cioc), "migration-socket-incoming");
+    migration_channel_process_incoming(QIO_CHANNEL(cioc));
 }
 
+static void
+socket_incoming_migration_end(void *opaque)
+{
+    QIONetListener *listener = opaque;
+
+    qio_net_listener_disconnect(listener);
+    object_unref(OBJECT(listener));
+}
 
 static void
 socket_start_incoming_migration_internal(SocketAddress *saddr,
                                          Error **errp)
 {
     QIONetListener *listener = qio_net_listener_new();
+    MigrationIncomingState *mis = migration_incoming_get_current();
     size_t i;
     int num = 1;
 
@@ -156,6 +165,9 @@ socket_start_incoming_migration_internal(SocketAddress *saddr,
         return;
     }
 
+    mis->transport_data = listener;
+    mis->transport_cleanup = socket_incoming_migration_end;
+
     qio_net_listener_set_client_func_full(listener,
                                           socket_accept_incoming_migration,
                                           NULL, NULL,
-- 
2.31.1




[PATCH v2 5/5] sockets: Support multipath TCP

2021-04-21 Thread Dr. David Alan Gilbert (git)
From: "Dr. David Alan Gilbert" 

Multipath TCP allows combining multiple interfaces/routes into a single
socket, with very little work for the user/admin.

It's enabled by 'mptcp' on most socket addresses:

   ./qemu-system-x86_64 -nographic -incoming tcp:0:,mptcp

Signed-off-by: Dr. David Alan Gilbert 
---
 io/dns-resolver.c   |  4 
 qapi/sockets.json   |  5 -
 util/qemu-sockets.c | 23 +++
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/io/dns-resolver.c b/io/dns-resolver.c
index 743a0efc87..a5946a93bf 100644
--- a/io/dns-resolver.c
+++ b/io/dns-resolver.c
@@ -122,6 +122,10 @@ static int qio_dns_resolver_lookup_sync_inet(QIODNSResolver *resolver,
             .ipv4 = iaddr->ipv4,
             .has_ipv6 = iaddr->has_ipv6,
             .ipv6 = iaddr->ipv6,
+#ifdef IPPROTO_MPTCP
+            .has_mptcp = iaddr->has_mptcp,
+            .mptcp = iaddr->mptcp,
+#endif
         };
 
         (*addrs)[i] = newaddr;
diff --git a/qapi/sockets.json b/qapi/sockets.json
index 2e83452797..735eb4abb5 100644
--- a/qapi/sockets.json
+++ b/qapi/sockets.json
@@ -57,6 +57,8 @@
 # @keep-alive: enable keep-alive when connecting to this socket. Not supported
 #              for passive sockets. (Since 4.2)
 #
+# @mptcp: enable multi-path TCP. (Since 6.1)
+#
 # Since: 1.3
 ##
 { 'struct': 'InetSocketAddress',
@@ -66,7 +68,8 @@
     '*to': 'uint16',
     '*ipv4': 'bool',
     '*ipv6': 'bool',
-    '*keep-alive': 'bool' } }
+    '*keep-alive': 'bool',
+    '*mptcp': { 'type': 'bool', 'if': 'defined(IPPROTO_MPTCP)' } } }
 
 ##
 # @UnixSocketAddress:
diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
index 8af0278f15..ba7cb1ec4f 100644
--- a/util/qemu-sockets.c
+++ b/util/qemu-sockets.c
@@ -278,6 +278,11 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 
     /* create socket + bind/listen */
     for (e = res; e != NULL; e = e->ai_next) {
+#ifdef IPPROTO_MPTCP
+        if (saddr->has_mptcp && saddr->mptcp) {
+            e->ai_protocol = IPPROTO_MPTCP;
+        }
+#endif
         getnameinfo((struct sockaddr*)e->ai_addr,e->ai_addrlen,
                     uaddr,INET6_ADDRSTRLEN,uport,32,
                     NI_NUMERICHOST | NI_NUMERICSERV);
@@ -456,6 +461,13 @@ int inet_connect_saddr(InetSocketAddress *saddr, Error **errp)
     for (e = res; e != NULL; e = e->ai_next) {
         error_free(local_err);
         local_err = NULL;
+
+#ifdef IPPROTO_MPTCP
+        if (saddr->has_mptcp && saddr->mptcp) {
+            e->ai_protocol = IPPROTO_MPTCP;
+        }
+#endif
+
         sock = inet_connect_addr(saddr, e, &local_err);
         if (sock >= 0) {
             break;
@@ -687,6 +699,17 @@ int inet_parse(InetSocketAddress *addr, const char *str, Error **errp)
         }
         addr->has_keep_alive = true;
     }
+#ifdef IPPROTO_MPTCP
+    begin = strstr(optstr, ",mptcp");
+    if (begin) {
+        if (inet_parse_flag("mptcp", begin + strlen(",mptcp"),
+                            &addr->mptcp, errp) < 0)
+        {
+            return -1;
+        }
+        addr->has_mptcp = true;
+    }
+#endif
     return 0;
 }
 
-- 
2.31.1




[PATCH v2 2/5] io/net-listener: Call the notifier during finalize

2021-04-21 Thread Dr. David Alan Gilbert (git)
From: "Dr. David Alan Gilbert" 

Call the notifier during finalize; it's currently only called
if we change it, which is not the intent.

Signed-off-by: Dr. David Alan Gilbert 
Reviewed-by: Daniel P. Berrangé 
---
 io/net-listener.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/io/net-listener.c b/io/net-listener.c
index 46c2643d00..1c984d69c6 100644
--- a/io/net-listener.c
+++ b/io/net-listener.c
@@ -292,6 +292,9 @@ static void qio_net_listener_finalize(Object *obj)
     QIONetListener *listener = QIO_NET_LISTENER(obj);
     size_t i;
 
+    if (listener->io_notify) {
+        listener->io_notify(listener->io_data);
+    }
     qio_net_listener_disconnect(listener);
 
     for (i = 0; i < listener->nsioc; i++) {
-- 
2.31.1




[PATCH v2 3/5] migration: Add cleanup hook for inwards migration

2021-04-21 Thread Dr. David Alan Gilbert (git)
From: "Dr. David Alan Gilbert" 

Add a cleanup hook for incoming migration that gets called
at the end as a way for a transport to allow cleanup.

Signed-off-by: Dr. David Alan Gilbert 
Reviewed-by: Daniel P. Berrangé 
---
 migration/migration.c | 3 +++
 migration/migration.h | 4 
 2 files changed, 7 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index 8ca034136b..d48986fbbb 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -279,6 +279,9 @@ void migration_incoming_state_destroy(void)
         g_array_free(mis->postcopy_remote_fds, TRUE);
         mis->postcopy_remote_fds = NULL;
     }
+    if (mis->transport_cleanup) {
+        mis->transport_cleanup(mis->transport_data);
+    }
 
     qemu_event_reset(&mis->main_thread_load_event);
 
diff --git a/migration/migration.h b/migration/migration.h
index db6708326b..1b4c5da917 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -49,6 +49,10 @@ struct PostcopyBlocktimeContext;
 struct MigrationIncomingState {
     QEMUFile *from_src_file;
 
+    /* A hook to allow cleanup at the end of incoming migration */
+    void *transport_data;
+    void (*transport_cleanup)(void *data);
+
     /*
      * Free at the start of the main state load, set as the main thread finishes
      * loading state.
-- 
2.31.1




[Bug 1759522] Re: windows qemu-img create vpc/vhdx error

2021-04-21 Thread Albert Kao
** Changed in: qemu
   Status: Incomplete => New

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1759522

Title:
  windows qemu-img create vpc/vhdx error

Status in QEMU:
  New

Bug description:
  On Windows, using qemu-img (version 2.11.90) to create a vpc/vhdx
  virtual disk tends to fail. Here's the way to reproduce:

  1. Install qemu-w64-setup-20180321.exe

  2. Use `qemu-img create -f vhdx -o subformat=fixed disk.vhdx 512M` to create 
a vhdx:
 Formatting 'disk.vhdx', fmt=vhdx size=536870912 log_size=1048576 
block_size=0 subformat=fixed

  3. Execute `qemu-img info disk.vhdx` gives the result, (note the `disk size` 
is incorrect):
 image: disk.vhdx
 file format: vhdx
 virtual size: 512M (536870912 bytes)
 disk size: 1.4M
 cluster_size: 8388608

  4. On Windows 10 (V1709), double click disk.vhdx gives an error:
 Make sure the file is in an NTFS volume and isn't in a compressed folder 
or volume.

 Using Disk Management -> Action -> Attach VHD gives an error:
 The requested operation could not be completed due to a virtual disk 
system limitation. Virtual hard disk files must be uncompressed and uneccrypted 
and must not be sparse.

  Comparison with Windows 10 created VHDX:

  1. Using Disk Management -> Action -> Create VHD:
 File name: win.vhdx
 Virtual hard disk size: 512MB
 Virtual hard disk format: VHDX
 Virtual hard disk type: Fixed size

  2. Detach VHDX

  3. Execute `qemu-img info win.vhdx` gives the result:
 image: win.vhdx
 file format: vhdx
 virtual size: 512M (536870912 bytes)
 disk size: 516M
 cluster_size: 33554432

  Comparison with qemu-img under Ubuntu:

  1. Version: qemu-img version 2.5.0 (Debian 1:2.5+dfsg-5ubuntu10.16),
  Copyright (c) 2004-2008 Fabrice Bellard

  2. qemu-img create -f vhdx -o subformat=fixed lin.vhdx 512M
 Formatting 'lin.vhdx', fmt=vhdx size=536870912 log_size=1048576 
block_size=0 subformat=fixed

  3. qemu-img info lin.vhdx
 image: lin.vhdx
 file format: vhdx
 virtual size: 512M (536870912 bytes)
 disk size: 520M
 cluster_size: 8388608

  4. Load lin.vhdx under Windows 10 is ok

  The same thing happens with the `vpc` format, with or without
  `oformat=fixed`; it seems the Windows version of qemu-img performs some
  incorrect operation. My guess is that the Windows version of qemu-img
  doesn't handle the description field of vpc/vhdx, which leads to an
  incorrect `disk size` field.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1759522/+subscriptions



Re: firmware selection for SEV-ES

2021-04-21 Thread Pavel Hrdina
On Wed, Apr 21, 2021 at 11:54:24AM +0200, Laszlo Ersek wrote:
> Hi Brijesh, Tom,
> 
> in QEMU's "docs/interop/firmware.json", the @FirmwareFeature enumeration
> has a constant called @amd-sev. We should introduce an @amd-sev-es
> constant as well, minimally for the following reason:
> 
> AMD document #56421 ("SEV-ES Guest-Hypervisor Communication Block
> Standardization") revision 1.40 says in "4.6 System Management Mode
> (SMM)" that "SMM will not be supported in this version of the
> specification". This is reflected in OVMF, so an OVMF binary that's
> supposed to run in a SEV-ES guest must be built without "-D
> SMM_REQUIRE". (As a consequence, such a binary should be built also
> without "-D SECURE_BOOT_ENABLE".)
> 
> At the level of "docs/interop/firmware.json", this means that management
> applications should be enabled to look for the @amd-sev-es feature (and
> it also means, for OS distributors, that any firmware descriptor
> exposing @amd-sev-es will currently have to lack all three of:
> @requires-smm, @secure-boot, @enrolled-keys).
> 
> I have three questions:
> 
> 
> (1) According to
> , SEV-ES is
> explicitly requested in the domain XML via setting bit#2 in the "policy"
> element.
> 
> Can this setting be used by libvirt to look for such a firmware
> descriptor that exposes @amd-sev-es?

Hi Laszlo and all,

Currently we use only  when selecting
firmware to make sure that it supports @amd-sev. Since we already have a
place in the VM XML where users can configure amd-sev-es, we can use that
information when selecting the correct firmware that should be used for the
VM.
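
For illustration, a firmware descriptor exposing the proposed feature might
look roughly like this (a sketch only: @amd-sev-es is the constant under
discussion, not yet in the schema, and the paths are made up):

    {
        "description": "OVMF without SMM, with SEV-ES support (sketch)",
        "interface-types": [ "uefi" ],
        "mapping": {
            "device": "flash",
            "executable": {
                "filename": "/usr/share/OVMF/OVMF_CODE.sev.fd",
                "format": "raw"
            },
            "nvram-template": {
                "filename": "/usr/share/OVMF/OVMF_VARS.fd",
                "format": "raw"
            }
        },
        "targets": [
            { "architecture": "x86_64", "machines": [ "pc-q35-*" ] }
        ],
        "features": [ "amd-sev", "amd-sev-es" ],
        "tags": []
    }

Per Laszlo's point above, such a descriptor would have to lack
@requires-smm, @secure-boot and @enrolled-keys.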

Pavel

> (2) "docs/interop/firmware.json" documents @amd-sev as follows:
> 
> # @amd-sev: The firmware supports running under AMD Secure Encrypted
> #   Virtualization, as specified in the AMD64 Architecture
> #   Programmer's Manual. QEMU command line options related to
> #   this feature are documented in
> #   "docs/amd-memory-encryption.txt".
> 
> Documenting the new @amd-sev-es enum constant with very slight
> customizations for the same text should be possible, I reckon. However,
> "docs/amd-memory-encryption.txt" (nor
> "docs/confidential-guest-support.txt") seem to mention SEV-ES.
> 
> Can you guys propose a patch for "docs/amd-memory-encryption.txt"?
> 
> I guess that would be next to this snippet:
> 
> > # ${QEMU} \
> >sev-guest,id=sev0,policy=0x1...\
> 
> 
> (3) Is the "AMD64 Architecture Programmer's Manual" the specification
> that we should reference under @amd-sev-es as well (i.e., same as with
> @amd-sev), or is there a more specific document?
> 
> Thanks,
> Laszlo
> 


signature.asc
Description: PGP signature


Re: [RFC PATCH 0/3] block-copy: lock tasks and calls list

2021-04-21 Thread Paolo Bonzini

On 21/04/21 10:53, Vladimir Sementsov-Ogievskiy wrote:


Good point. Emanuele, can you work on ProgressMeter and 
SharedResource? AioTaskPool can also be converted to just use CoQueue 
instead of manually waking up coroutines.




That would be great.

I have one more question in mind:

Is it effective to use CoMutex here? We are only protecting some fast 
manipulations of data, not the I/O path or anything like that. Would a 
simple QemuMutex work better? Even if CoMutex doesn't have any overhead, I 
don't think that if thread A wants to modify the task list while the mutex 
is held by thread B (for a similar thing), there is any reason for thread A 
to yield and do other work: it can just wait a few moments on the mutex 
while B is modifying the task list.


Indeed even CoQueue primitives count as simple manipulation of data, 
because they unlock/lock the mutex while the coroutine sleeps.  So 
you're right that it would be okay to use QemuMutex as well.


The block copy code that Emanuele has touched so far is all coroutine 
based.  I like using CoMutex when that is the case, because it says 
implicitly "the monitor is not involved".  But we need to see what it 
will be like when the patches are complete.


Rate limiting ends up being called by the monitor, but it will have its 
own QemuMutex so it's fine.  What's left is cancellation and 
block_copy_kick; I think that we can make qemu_co_sleep thread-safe with 
an API similar to Linux's prepare_to_wait, so a QemuMutex wouldn't be 
needed there either.
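
To make the difference concrete, the QemuMutex variant would look roughly
like this (a minimal sketch, assuming ProgressMeter keeps its current
fields and simply grows a lock; needs "qemu/thread.h"):

    /* Sketch only: guard ProgressMeter updates with a plain QemuMutex. */
    typedef struct ProgressMeter {
        uint64_t current;
        uint64_t total;
        QemuMutex lock;     /* hypothetical addition */
    } ProgressMeter;

    static inline void progress_work_done(ProgressMeter *pm, uint64_t done)
    {
        qemu_mutex_lock(&pm->lock);
        pm->current += done;
        qemu_mutex_unlock(&pm->lock);
    }

Nothing yields while the lock is held, so the critical section is a few
instructions whether the caller runs in a coroutine or in the monitor.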


Paolo




Re: [PATCH v2 8/8] block: do not take AioContext around reopen

2021-04-21 Thread Paolo Bonzini

On 19/04/21 10:55, Emanuele Giuseppe Esposito wrote:

Reopen needs to handle AioContext carefully due to calling
bdrv_drain_all_begin/end.  By not taking AioContext around calls to
bdrv_reopen_multiple, we can drop the function's release/acquire
pair and the AioContext argument too.


So... I wrote this commit message and I cannot parse it anymore---much 
less relate it to the code in the patch.  This is a problem, but it 
doesn't mean that the patch is wrong.


bdrv_reopen_multiple does not have the AioContext argument anymore. 
It's not doing release/acquire either.  The relevant commit is commit 
1a63a90750 ("block: Keep nodes drained between reopen_queue/multiple", 
2017-12-22).  You're basically cleaning up after that code in the same 
way as patch 7: reopen functions take care of keeping the BDS quiescent, 
so there's nothing to synchronize on.


For the future, the important step you missed was to check your diff 
against the one that you cherry-picked from.  Then you would have 
noticed that 1) it's much smaller 2) one thing that is mentioned in the 
commit message ("drop the function's release/acquire pair and argument") 
is not needed anymore.


Paolo


Signed-off-by: Paolo Bonzini 
Signed-off-by: Emanuele Giuseppe Esposito 
---
  block/block-backend.c |  4 
  block/mirror.c|  9 -
  blockdev.c| 19 ++-
  3 files changed, 6 insertions(+), 26 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 413af51f3b..6fdc698e9e 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2291,20 +2291,16 @@ int blk_commit_all(void)
  BlockBackend *blk = NULL;
  
  while ((blk = blk_all_next(blk)) != NULL) {

-AioContext *aio_context = blk_get_aio_context(blk);
  BlockDriverState *unfiltered_bs = bdrv_skip_filters(blk_bs(blk));
  
-aio_context_acquire(aio_context);

  if (blk_is_inserted(blk) && bdrv_cow_child(unfiltered_bs)) {
  int ret;
  
  ret = bdrv_commit(unfiltered_bs);

  if (ret < 0) {
-aio_context_release(aio_context);
  return ret;
  }
  }
-aio_context_release(aio_context);
  }
  return 0;
  }
diff --git a/block/mirror.c b/block/mirror.c
index 5a71bd8bbc..43174bbc6b 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -631,7 +631,6 @@ static int mirror_exit_common(Job *job)
  MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
  BlockJob *bjob = &s->common;
  MirrorBDSOpaque *bs_opaque;
-AioContext *replace_aio_context = NULL;
  BlockDriverState *src;
  BlockDriverState *target_bs;
  BlockDriverState *mirror_top_bs;
@@ -699,11 +698,6 @@ static int mirror_exit_common(Job *job)
  }
  }
  
-if (s->to_replace) {

-replace_aio_context = bdrv_get_aio_context(s->to_replace);
-aio_context_acquire(replace_aio_context);
-}
-
  if (s->should_complete && !abort) {
  BlockDriverState *to_replace = s->to_replace ?: src;
  bool ro = bdrv_is_read_only(to_replace);
@@ -740,9 +734,6 @@ static int mirror_exit_common(Job *job)
  error_free(s->replace_blocker);
  bdrv_unref(s->to_replace);
  }
-if (replace_aio_context) {
-aio_context_release(replace_aio_context);
-}
  g_free(s->replaces);
  bdrv_unref(target_bs);
  
diff --git a/blockdev.c b/blockdev.c

index e901107344..1672ef756e 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3469,7 +3469,6 @@ void qmp_change_backing_file(const char *device,
   Error **errp)
  {
  BlockDriverState *bs = NULL;
-AioContext *aio_context;
  BlockDriverState *image_bs = NULL;
  Error *local_err = NULL;
  bool ro;
@@ -3480,37 +3479,34 @@ void qmp_change_backing_file(const char *device,
  return;
  }
  
-aio_context = bdrv_get_aio_context(bs);

-aio_context_acquire(aio_context);
-
  image_bs = bdrv_lookup_bs(NULL, image_node_name, &local_err);
  if (local_err) {
  error_propagate(errp, local_err);
-goto out;
+return;
  }
  
  if (!image_bs) {

  error_setg(errp, "image file not found");
-goto out;
+return;
  }
  
  if (bdrv_find_base(image_bs) == image_bs) {

  error_setg(errp, "not allowing backing file change on an image "
   "without a backing file");
-goto out;
+return;
  }
  
  /* even though we are not necessarily operating on bs, we need it to

   * determine if block ops are currently prohibited on the chain */
  if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_CHANGE, errp)) {
-goto out;
+return;
  }
  
  /* final sanity check */

  if (!bdrv_chain_contains(bs, image_bs)) {
  error_setg(errp, "'%s' and image file are not in the same chain",
 device);
-goto out;
+

Re: [PATCH v2 0/8] Block layer thread-safety, continued

2021-04-21 Thread Paolo Bonzini

On 19/04/21 10:55, Emanuele Giuseppe Esposito wrote:

This and the following series of patches are based on Paolo's
v1 patches sent in 2017[*]. They have been ported to the current QEMU
version, but the goal remains the same:
- make the block layer thread-safe (patches 1-5), and
- remove aio_context_acquire/release (patches 6-8).

[*] = https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg01398.html

Signed-off-by: Emanuele Giuseppe Esposito 


This looks good to me, though the commit message of patch 8 needs to be 
rewritten.


Paolo


---
v1 (2017) -> v2 (2021):
- v1 Patch "block-backup: add reqs_lock" has been dropped, because now
   is completely different from the old version and all functions
   that were affected by it have been moved or deleted.
   It will be replaced by another serie that aims to thread safety to
   block/block-copy.c
- remaining v1 patches will be integrated in next serie.
- Patch "block: do not acquire AioContext in check_to_replace_node"
   moves part of the logic of check_to_replace_node to the caller,
   so that the function can be included in the aio_context_acquire/release
   block that follows.

Emanuele Giuseppe Esposito (8):
   block: prepare write threshold code for thread safety
   block: protect write threshold QMP commands from concurrent requests
   util: use RCU accessors for notifiers
   block: make before-write notifiers thread-safe
   block: add a few more notes on locking
   block: do not acquire AioContext in check_to_replace_node
   block/replication: do not acquire AioContext
   block: do not take AioContext around reopen

  block.c   | 28 ++--
  block/block-backend.c |  4 ---
  block/io.c| 12 +
  block/mirror.c|  9 ---
  block/replication.c   | 54 +--
  block/write-threshold.c   | 39 ++--
  blockdev.c| 26 +--
  include/block/block.h |  1 +
  include/block/block_int.h | 42 +-
  util/notify.c | 13 +-
  10 files changed, 113 insertions(+), 115 deletions(-)






[PATCH v6 00/15] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property

2021-04-21 Thread David Hildenbrand
Based-on: 20210406080126.24010-1-da...@redhat.com

Some cleanups previously sent in another context (resizeable allocations),
followed by RAM_NORESERVE, implementing it under Linux using MAP_NORESERVE,
and letting users configure it for memory backends using the "reserve"
property (default: true).

MAP_NORESERVE under Linux has in the context of QEMU an effect on
1) Private/shared anonymous memory
-> memory-backend-ram,id=mem0,size=10G
2) Private fd-based mappings
-> memory-backend-file,id=mem0,size=10G,mem-path=/dev/shm/0
-> memory-backend-memfd,id=mem0,size=10G
3) Private/shared hugetlb mappings
-> memory-backend-memfd,id=mem0,size=10G,hugetlb=on,hugetlbsize=2M

With MAP_NORESERVE/"reserve=off", we won't be reserving swap space (1/2) or
huge pages (3) for the whole memory region.

The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM. MAP_NORESERVE tells the OS
"this mapping might be very sparse". This essentially allows
avoiding having to set "/proc/sys/vm/overcommit_memory == 1" when using
virtio-mem and also supporting hugetlbfs in the future.
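
With the full series applied, this is requested per memory backend, e.g.
(a sketch; the same property applies to the file and memfd backends):

    qemu-system-x86_64 ... \
        -object memory-backend-ram,id=mem0,size=10G,reserve=off \
        -device virtio-mem-pci,id=vmem0,memdev=mem0,...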

v5 -> v6:
- "softmmu/memory: Pass ram_flags to memory_region_init ..."
-- Split up into two patches
---> "softmmu/memory: Pass ram_flags to memory_region.."
---> "softmmu/memory: Pass ram_flags to qemu_ram_alloc() ..."
-- Also set RAM_PREALLOC from qemu_ram_alloc_from_ptr()
- Collected acks/rbs

v4 -> v5:
- Sent out shared anonymous RAM fixes separately
- Rebased
- "hostmem: Wire up RAM_NORESERVE via "reserve" property"
-- Adjusted/simplified description of new "reserve" property
-- Properly add it to qapi/qom.json
- "qmp: Clarify memory backend properties returned via query-memdev"
-- Added
- "qmp: Include "share" property of memory backends"
-- Added
- "hmp: Print "share" property of memory backends with "info memdev""
- Added
- "qmp: Include "reserve" property of memory backends"
-- Adjust description of new "reserve" property

v3 -> v4:
- Minor comment/description updates
- "softmmu/physmem: Fix ram_block_discard_range() to handle shared ..."
-- Extended description
- "util/mmap-alloc: Pass flags instead of separate bools to ..."
-- Move flags to include/qemu/osdep.h and rename to "QEMU_MAP_*"
- "memory: Introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()"
-- Adjust to new flags. Handle errors in mmap_activate() for now.
- "util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux"
-- Restrict support to Linux only for now
- "qmp: Include "reserve" property of memory backends"
-- Added
- "hmp: Print "reserve" property of memory backends with ..."
-- Added

v2 -> v3:
- Renamed "softmmu/physmem: Drop "shared" parameter from ram_block_add()"
  to "softmmu/physmem: Mark shared anonymous memory RAM_SHARED" and
  adjusted the description
- Added "softmmu/physmem: Fix ram_block_discard_range() to handle shared
  anonymous memory"
- Added "softmmu/physmem: Fix qemu_ram_remap() to handle shared anonymous
  memory"
- Added "util/mmap-alloc: Pass flags instead of separate bools to
  qemu_ram_mmap()"
- "util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE"
-- Further tweak code comments
-- Handle shared anonymous memory

v1 -> v2:
- Rebased to upstream and phys_mem_alloc simplifications
-- Upstream added the "map_offset" parameter to many RAM allocation
   interfaces.
- "softmmu/physmem: Drop "shared" parameter from ram_block_add()"
-- Use local variable "shared"
- "memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()"
-- Simplify due to phys_mem_alloc changes
- "util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE"
-- Add a whole bunch of comments.
-- Exclude shared anonymous memory that QEMU doesn't use
-- Special-case readonly mappings

Cc: Peter Xu 
Cc: "Michael S. Tsirkin" 
Cc: Eduardo Habkost 
Cc: "Dr. David Alan Gilbert" 
Cc: Richard Henderson 
Cc: Paolo Bonzini 
Cc: Igor Mammedov 
Cc: "Philippe Mathieu-Daudé" 
Cc: Stefan Hajnoczi 
Cc: Murilo Opsfelder Araujo 
Cc: Greg Kurz 
Cc: Liam Merwick 
Cc: Marcel Apfelbaum 

David Hildenbrand (15):
  util/mmap-alloc: Factor out calculation of the pagesize for the guard
page
  util/mmap-alloc: Factor out reserving of a memory region to
mmap_reserve()
  util/mmap-alloc: Factor out activating of memory to mmap_activate()
  softmmu/memory: Pass ram_flags to qemu_ram_alloc_from_fd()
  softmmu/memory: Pass ram_flags to
memory_region_init_ram_shared_nomigrate()
  softmmu/memory: Pass ram_flags to qemu_ram_alloc() and
qemu_ram_alloc_internal()
  util/mmap-alloc: Pass flags instead of separate bools to
qemu_ram_mmap()
  memory: Introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
  util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux
  hostmem: Wire up RAM_NORESERVE via "reserve" property
  qmp: Clarify memory backend properties returned via query-memdev
  qmp: Include "share" property of memory backends
  hmp: Print "share" property of memory backends with "info memdev"
  qmp: Include "reserve" prop

Re: [RFC PATCH] vfio-ccw: Permit missing IRQs

2021-04-21 Thread Eric Farman
On Wed, 2021-04-21 at 12:01 +0200, Cornelia Huck wrote:
> On Mon, 19 Apr 2021 20:49:06 +0200
> Eric Farman  wrote:
> 
> > Commit 690e29b91102 ("vfio-ccw: Refactor ccw irq handler") changed
> > one of the checks for the IRQ notifier registration from saying
> > "the host needs to recognize the only IRQ that exists" to saying
> > "the host needs to recognize ANY IRQ that exists."
> > 
> > And this worked fine, because the subsequent change to support the
> > CRW IRQ notifier doesn't get into this code when running on an
> > older
> > kernel, thanks to a guard by a capability region. The later
> > addition
> > of the REQ(uest) IRQ by commit b2f96f9e4f5f ("vfio-ccw: Connect the
> > device request notifier") broke this assumption because there is no
> > matching capability region. Thus, running new QEMU on an older
> > kernel fails with:
> > 
> >   vfio: unexpected number of irqs 2
> > 
> > Let's simply remove the check (and the less-than-helpful message),
> > and make the VFIO_DEVICE_GET_IRQ_INFO ioctl request for the IRQ
> > being processed. If it returns with EINVAL, we can treat it as
> > an unfortunate mismatch but not a fatal error for the guest.
> > 
> > Fixes: 690e29b91102 ("vfio-ccw: Refactor ccw irq handler")
> > Fixes: b2f96f9e4f5f ("vfio-ccw: Connect the device request
> > notifier")
> > Signed-off-by: Eric Farman 
> > ---
> >  hw/vfio/ccw.c | 15 +++
> >  1 file changed, 7 insertions(+), 8 deletions(-)
> > 
> > diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
> > index b2df708e4b..cfbfc3d1a2 100644
> > --- a/hw/vfio/ccw.c
> > +++ b/hw/vfio/ccw.c
> > @@ -411,20 +411,19 @@ static void
> > vfio_ccw_register_irq_notifier(VFIOCCWDevice *vcdev,
> >  return;
> >  }
> >  
> > -if (vdev->num_irqs < irq + 1) {
> > -error_setg(errp, "vfio: unexpected number of irqs %u",
> > -   vdev->num_irqs);
> 
> Alternative proposal: Change this message to
> 
> "vfio: IRQ %u not available (number of irqs %u)"

> and still fail this function, while treating a failure of
> vfio_ccw_register_irq_notifier(vcdev, VFIO_CCW_REQ_IRQ_INDEX, &err);
> in
> vfio_ccw_realize() as a non-fatal error (maybe log a message).

This all sounds fine to me. I'll send a v2 as such.

> 
> This allows us to skip doing an ioctl call that we already know
> would fail. 

True, though as this is at configuration time it's not as critical.

> Still, we can catch cases where a broken kernel e.g.
> provides the crw region, but not the matching irq (I believe
> something
> like that should indeed be a fatal error.)

Well they shouldn't do THAT. :)

> 
> > -return;
> > -}
> > -
> >  argsz = sizeof(*irq_info);
> >  irq_info = g_malloc0(argsz);
> >  irq_info->index = irq;
> >  irq_info->argsz = argsz;
> >  if (ioctl(vdev->fd, VFIO_DEVICE_GET_IRQ_INFO,
> >irq_info) < 0 || irq_info->count < 1) {
> > -error_setg_errno(errp, errno, "vfio: Error getting irq
> > info");
> > -goto out_free_info;
> > +if (errno == EINVAL) {
> > +warn_report("Unable to get information about IRQ %u",
> > irq);
> > +goto out_free_info;
> > +} else {
> > +error_setg_errno(errp, errno, "vfio: Error getting irq
> > info");
> > +goto out_free_info;
> > +}
> >  }
> >  
> >  if (event_notifier_init(notifier, 0)) {




[PATCH v6 09/15] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux

2021-04-21 Thread David Hildenbrand
Let's support RAM_NORESERVE via MAP_NORESERVE on Linux. The flag has no
effect on most shared mappings - except for hugetlbfs and anonymous memory.

Linux man page:
  "MAP_NORESERVE: Do not reserve swap space for this mapping. When swap
  space is reserved, one has the guarantee that it is possible to modify
  the mapping. When swap space is not reserved one might get SIGSEGV
  upon a write if no physical memory is available. See also the discussion
  of the file /proc/sys/vm/overcommit_memory in proc(5). In kernels before
  2.6, this flag had effect only for private writable mappings."

Note that the "guarantee" part is wrong with memory overcommit in Linux.

Also, in Linux hugetlbfs is treated differently - we configure reservation
of huge pages from the pool, not reservation of swap space (huge pages
cannot be swapped).

The rough behavior is [1]:
a) !Hugetlbfs:

  1) Without MAP_NORESERVE *or* with memory overcommit under Linux
 disabled ("/proc/sys/vm/overcommit_memory == 2"), the following
 accounting/reservation happens:
  For a file backed map
   SHARED or READ-only - 0 cost (the file is the map not swap)
   PRIVATE WRITABLE - size of mapping per instance

  For an anonymous or /dev/zero map
   SHARED   - size of mapping
   PRIVATE READ-only - 0 cost (but of little use)
   PRIVATE WRITABLE - size of mapping per instance

  2) With MAP_NORESERVE, no accounting/reservation happens.

b) Hugetlbfs:

  1) Without MAP_NORESERVE, huge pages are reserved.

  2) With MAP_NORESERVE, no huge pages are reserved.

Note: With "/proc/sys/vm/overcommit_memory == 0", we were already able
to configure it for !hugetlbfs globally; this toggle now allows
configuring it more fine-grained, not for the whole system.

The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM.

[1] https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
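
For the !hugetlbfs case, the effect is easy to demonstrate with a tiny
standalone program (illustrative only, not part of the patch):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        size_t size = 1UL << 30; /* 1 GiB that will stay mostly untouched */
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        /*
         * No swap space was accounted for the full 1 GiB; with
         * "/proc/sys/vm/overcommit_memory == 2" the same mmap() without
         * MAP_NORESERVE can fail once the commit limit is exhausted.
         */
        munmap(p, size);
        return 0;
    }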

Reviewed-by: Peter Xu 
Signed-off-by: David Hildenbrand 
---
 include/qemu/osdep.h |  3 ++
 softmmu/physmem.c|  1 +
 util/mmap-alloc.c| 69 ++--
 3 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index af65b36698..0a7384d15c 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -195,6 +195,9 @@ extern "C" {
 #ifndef MAP_FIXED_NOREPLACE
 #define MAP_FIXED_NOREPLACE 0
 #endif
+#ifndef MAP_NORESERVE
+#define MAP_NORESERVE 0
+#endif
 #ifndef ENOMEDIUM
 #define ENOMEDIUM ENODEV
 #endif
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 1efb1d5193..ccc5985324 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -2230,6 +2230,7 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
 flags = MAP_FIXED;
 flags |= block->flags & RAM_SHARED ?
  MAP_SHARED : MAP_PRIVATE;
+flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
 if (block->fd >= 0) {
 area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
 flags, block->fd, offset);
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index d0cf4aaee5..838e286ce5 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -20,6 +20,7 @@
 #include "qemu/osdep.h"
 #include "qemu/mmap-alloc.h"
 #include "qemu/host-utils.h"
+#include "qemu/cutils.h"
 #include "qemu/error-report.h"
 
 #define HUGETLBFS_MAGIC   0x958458f6
@@ -83,6 +84,70 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
 return qemu_real_host_page_size;
 }
 
+#define OVERCOMMIT_MEMORY_PATH "/proc/sys/vm/overcommit_memory"
+static bool map_noreserve_effective(int fd, uint32_t qemu_map_flags)
+{
+#if defined(__linux__)
+const bool readonly = qemu_map_flags & QEMU_MAP_READONLY;
+const bool shared = qemu_map_flags & QEMU_MAP_SHARED;
+gchar *content = NULL;
+const char *endptr;
+unsigned int tmp;
+
+/*
+ * hugetlb accounting is different from ordinary swap reservation:
+ * a) Hugetlb pages from the pool are reserved for both private and
+ *shared mappings. For shared mappings, all mappers have to specify
+ *MAP_NORESERVE.
+ * b) MAP_NORESERVE is not affected by /proc/sys/vm/overcommit_memory.
+ */
+if (qemu_fd_getpagesize(fd) != qemu_real_host_page_size) {
+return true;
+}
+
+/*
+ * Accountable mappings in the kernel that can be affected by MAP_NORESERVE
+ * are private writable mappings (see mm/mmap.c:accountable_mapping() in
+ * Linux). For all shared or readonly mappings, MAP_NORESERVE is always
+ * implicitly active -- no reservation; this includes shmem. The only
+ * exception is shared anonymous memory, it is accounted like private
+ * anonymous memory.
+ */
+if (readonly || (shared && fd >= 0)) {
+return true;
+}
+
+/*
+ * MAP_NORESERVE is globally ignored for applicable !hugetlb mappings whe

[PATCH v6 01/15] util/mmap-alloc: Factor out calculation of the pagesize for the guard page

2021-04-21 Thread David Hildenbrand
Let's factor out calculating the size of the guard page and rename the
variable to make it clearer that this pagesize only applies to the
guard page.

Reviewed-by: Peter Xu 
Acked-by: Murilo Opsfelder Araujo 
Cc: Igor Kotrasinski 
Signed-off-by: David Hildenbrand 
---
 util/mmap-alloc.c | 31 ---
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index e6fa8b598b..24854064b4 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -82,6 +82,16 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
 return qemu_real_host_page_size;
 }
 
+static inline size_t mmap_guard_pagesize(int fd)
+{
+#if defined(__powerpc64__) && defined(__linux__)
+/* Mappings in the same segment must share the same page size */
+return qemu_fd_getpagesize(fd);
+#else
+return qemu_real_host_page_size;
+#endif
+}
+
 void *qemu_ram_mmap(int fd,
 size_t size,
 size_t align,
@@ -90,12 +100,12 @@ void *qemu_ram_mmap(int fd,
 bool is_pmem,
 off_t map_offset)
 {
+const size_t guard_pagesize = mmap_guard_pagesize(fd);
 int prot;
 int flags;
 int map_sync_flags = 0;
 int guardfd;
 size_t offset;
-size_t pagesize;
 size_t total;
 void *guardptr;
 void *ptr;
@@ -116,8 +126,7 @@ void *qemu_ram_mmap(int fd,
  * anonymous memory is OK.
  */
 flags = MAP_PRIVATE;
-pagesize = qemu_fd_getpagesize(fd);
-if (fd == -1 || pagesize == qemu_real_host_page_size) {
+if (fd == -1 || guard_pagesize == qemu_real_host_page_size) {
 guardfd = -1;
 flags |= MAP_ANONYMOUS;
 } else {
@@ -126,7 +135,6 @@ void *qemu_ram_mmap(int fd,
 }
 #else
 guardfd = -1;
-pagesize = qemu_real_host_page_size;
 flags = MAP_PRIVATE | MAP_ANONYMOUS;
 #endif
 
@@ -138,7 +146,7 @@ void *qemu_ram_mmap(int fd,
 
 assert(is_power_of_2(align));
 /* Always align to host page size */
-assert(align >= pagesize);
+assert(align >= guard_pagesize);
 
 flags = MAP_FIXED;
 flags |= fd == -1 ? MAP_ANONYMOUS : 0;
@@ -193,8 +201,8 @@ void *qemu_ram_mmap(int fd,
  * a guard page guarding against potential buffer overflows.
  */
 total -= offset;
-if (total > size + pagesize) {
-munmap(ptr + size + pagesize, total - size - pagesize);
+if (total > size + guard_pagesize) {
+munmap(ptr + size + guard_pagesize, total - size - guard_pagesize);
 }
 
 return ptr;
@@ -202,15 +210,8 @@ void *qemu_ram_mmap(int fd,
 
 void qemu_ram_munmap(int fd, void *ptr, size_t size)
 {
-size_t pagesize;
-
 if (ptr) {
 /* Unmap both the RAM block and the guard page */
-#if defined(__powerpc64__) && defined(__linux__)
-pagesize = qemu_fd_getpagesize(fd);
-#else
-pagesize = qemu_real_host_page_size;
-#endif
-munmap(ptr, size + pagesize);
+munmap(ptr, size + mmap_guard_pagesize(fd));
 }
 }
-- 
2.30.2




[PATCH v6 11/15] qmp: Clarify memory backend properties returned via query-memdev

2021-04-21 Thread David Hildenbrand
We return information on the currently configured memory backends and
don't configure them, so describe what the currently set properties
express.

Reviewed-by: Philippe Mathieu-Daudé 
Suggested-by: Markus Armbruster 
Cc: Eric Blake 
Cc: Markus Armbruster 
Cc: Igor Mammedov 
Signed-off-by: David Hildenbrand 
---
 qapi/machine.json | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/qapi/machine.json b/qapi/machine.json
index 6e90d463fc..758b901185 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -790,11 +790,11 @@
 #
 # @size: memory backend size
 #
-# @merge: enables or disables memory merge support
+# @merge: whether memory merge support is enabled
 #
-# @dump: includes memory backend's memory in a core dump or not
+# @dump: whether memory backend's memory is included in a core dump
 #
-# @prealloc: enables or disables memory preallocation
+# @prealloc: whether memory was preallocated
 #
 # @host-nodes: host nodes for its memory policy
 #
-- 
2.30.2




[PATCH v6 05/15] softmmu/memory: Pass ram_flags to memory_region_init_ram_shared_nomigrate()

2021-04-21 Thread David Hildenbrand
Let's forward ram_flags instead, renaming
memory_region_init_ram_shared_nomigrate() into
memory_region_init_ram_flags_nomigrate().

Reviewed-by: Peter Xu 
Signed-off-by: David Hildenbrand 
---
 backends/hostmem-ram.c|  6 +++--
 hw/m68k/next-cube.c   |  4 ++--
 include/exec/memory.h | 24 +--
 .../memory-region-housekeeping.cocci  |  8 +++
 softmmu/memory.c  | 18 +++---
 5 files changed, 31 insertions(+), 29 deletions(-)

diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
index 5cc53e76c9..741e701062 100644
--- a/backends/hostmem-ram.c
+++ b/backends/hostmem-ram.c
@@ -19,6 +19,7 @@
 static void
 ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 {
+uint32_t ram_flags;
 char *name;
 
 if (!backend->size) {
@@ -27,8 +28,9 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error 
**errp)
 }
 
 name = host_memory_backend_get_name(backend);
-memory_region_init_ram_shared_nomigrate(&backend->mr, OBJECT(backend), 
name,
-   backend->size, backend->share, errp);
+ram_flags = backend->share ? RAM_SHARED : 0;
+memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name,
+   backend->size, ram_flags, errp);
 g_free(name);
 }
 
diff --git a/hw/m68k/next-cube.c b/hw/m68k/next-cube.c
index 92b45d760f..59ccae0d5e 100644
--- a/hw/m68k/next-cube.c
+++ b/hw/m68k/next-cube.c
@@ -986,8 +986,8 @@ static void next_cube_init(MachineState *machine)
 sysbus_mmio_map(SYS_BUS_DEVICE(pcdev), 1, 0x0210);
 
 /* BMAP memory */
-memory_region_init_ram_shared_nomigrate(bmapm1, NULL, "next.bmapmem", 64,
-true, &error_fatal);
+memory_region_init_ram_flags_nomigrate(bmapm1, NULL, "next.bmapmem", 64,
+   RAM_SHARED, &error_fatal);
 memory_region_add_subregion(sysmem, 0x020c, bmapm1);
 /* The Rev_2.5_v66.bin firmware accesses it at 0x820c0020, too */
 memory_region_init_alias(bmapm2, NULL, "next.bmapmem2", bmapm1, 0x0, 64);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 8ad280e532..10179c6695 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -928,27 +928,27 @@ void memory_region_init_ram_nomigrate(MemoryRegion *mr,
   Error **errp);
 
 /**
- * memory_region_init_ram_shared_nomigrate:  Initialize RAM memory region.
- *   Accesses into the region will
- *   modify memory directly.
+ * memory_region_init_ram_flags_nomigrate:  Initialize RAM memory region.
+ *  Accesses into the region will
+ *  modify memory directly.
  *
  * @mr: the #MemoryRegion to be initialized.
  * @owner: the object that tracks the region's reference count
  * @name: Region name, becomes part of RAMBlock name used in migration stream
  *must be unique within any device
  * @size: size of the region.
- * @share: allow remapping RAM to different addresses
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED.
  * @errp: pointer to Error*, to store an error if it happens.
  *
- * Note that this function is similar to memory_region_init_ram_nomigrate.
- * The only difference is part of the RAM region can be remapped.
+ * Note that this function does not do anything to cause the data in the
+ * RAM memory region to be migrated; that is the responsibility of the caller.
  */
-void memory_region_init_ram_shared_nomigrate(MemoryRegion *mr,
- Object *owner,
- const char *name,
- uint64_t size,
- bool share,
- Error **errp);
+void memory_region_init_ram_flags_nomigrate(MemoryRegion *mr,
+Object *owner,
+const char *name,
+uint64_t size,
+uint32_t ram_flags,
+Error **errp);
 
 /**
  * memory_region_init_resizeable_ram:  Initialize memory region with resizeable
diff --git a/scripts/coccinelle/memory-region-housekeeping.cocci 
b/scripts/coccinelle/memory-region-housekeeping.cocci
index c768d8140a..29651ebde9 100644
--- a/scripts/coccinelle/memory-region-housekeeping.cocci
+++ b/scripts/coccinelle/memory-region-housekeeping.cocci
@@ -127,8 +127,8 @@ static void device_fn(DeviceState *dev, ...)
 - memory_region_init_rom(E1, NULL, E2, E3, E4);
 + memory_region_init_rom(E1, obj, E2, E3, E4);
 |
-- memory_region_init_ram_share

[PATCH v6 10/15] hostmem: Wire up RAM_NORESERVE via "reserve" property

2021-04-21 Thread David Hildenbrand
Let's provide a way to control the use of RAM_NORESERVE via memory
backends using the "reserve" property which defaults to true (old
behavior).

Only Linux currently supports clearing the flag (and support is checked at
runtime, depending on the setting of "/proc/sys/vm/overcommit_memory").
Windows and other POSIX systems will bail out with "reserve=false".

The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM. This essentially allows
avoiding having to set "/proc/sys/vm/overcommit_memory == 1" when using
virtio-mem and also supporting hugetlbfs in the future.
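
For example (sketch), combined with hugetlb backing, so that no huge pages
are reserved from the pool upfront:

    -object memory-backend-memfd,id=mem0,size=10G,hugetlb=on,hugetlbsize=2M,reserve=off

Note that, as enforced in the hostmem.c hunk below, "prealloc=on" and
"reserve=off" are rejected in combination.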

Reviewed-by: Peter Xu 
Reviewed-by: Eduardo Habkost 
Cc: Markus Armbruster 
Cc: Eric Blake 
Cc: Igor Mammedov 
Signed-off-by: David Hildenbrand 
---
 backends/hostmem-file.c  | 11 ++-
 backends/hostmem-memfd.c |  1 +
 backends/hostmem-ram.c   |  1 +
 backends/hostmem.c   | 32 
 include/sysemu/hostmem.h |  2 +-
 qapi/qom.json|  4 
 6 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index b683da9daf..9d550e53d4 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -40,6 +40,7 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error 
**errp)
object_get_typename(OBJECT(backend)));
 #else
 HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(backend);
+uint32_t ram_flags;
 gchar *name;
 
 if (!backend->size) {
@@ -52,11 +53,11 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error 
**errp)
 }
 
 name = host_memory_backend_get_name(backend);
-memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
- name,
- backend->size, fb->align,
- (backend->share ? RAM_SHARED : 0) |
- (fb->is_pmem ? RAM_PMEM : 0),
+ram_flags = backend->share ? RAM_SHARED : 0;
+ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
+ram_flags |= fb->is_pmem ? RAM_PMEM : 0;
+memory_region_init_ram_from_file(&backend->mr, OBJECT(backend), name,
+ backend->size, fb->align, ram_flags,
  fb->mem_path, fb->readonly, errp);
 g_free(name);
 #endif
diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
index 93b5d1a4cf..f3436b623d 100644
--- a/backends/hostmem-memfd.c
+++ b/backends/hostmem-memfd.c
@@ -55,6 +55,7 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error 
**errp)
 
 name = host_memory_backend_get_name(backend);
 ram_flags = backend->share ? RAM_SHARED : 0;
+ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
 memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
backend->size, ram_flags, fd, 0, errp);
 g_free(name);
diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
index 741e701062..b8e55cdbd0 100644
--- a/backends/hostmem-ram.c
+++ b/backends/hostmem-ram.c
@@ -29,6 +29,7 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error 
**errp)
 
 name = host_memory_backend_get_name(backend);
 ram_flags = backend->share ? RAM_SHARED : 0;
+ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
 memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name,
backend->size, ram_flags, errp);
 g_free(name);
diff --git a/backends/hostmem.c b/backends/hostmem.c
index c6c1ff5b99..58fdc1b658 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -217,6 +217,11 @@ static void host_memory_backend_set_prealloc(Object *obj, 
bool value,
 Error *local_err = NULL;
 HostMemoryBackend *backend = MEMORY_BACKEND(obj);
 
+if (!backend->reserve && value) {
+error_setg(errp, "'prealloc=on' and 'reserve=off' are incompatible");
+return;
+}
+
 if (!host_memory_backend_mr_inited(backend)) {
 backend->prealloc = value;
 return;
@@ -268,6 +273,7 @@ static void host_memory_backend_init(Object *obj)
 /* TODO: convert access to globals to compat properties */
 backend->merge = machine_mem_merge(machine);
 backend->dump = machine_dump_guest_core(machine);
+backend->reserve = true;
 backend->prealloc_threads = 1;
 }
 
@@ -426,6 +432,28 @@ static void host_memory_backend_set_share(Object *o, bool 
value, Error **errp)
 backend->share = value;
 }
 
+static bool host_memory_backend_get_reserve(Object *o, Error **errp)
+{
+HostMemoryBackend *backend = MEMORY_BACKEND(o);
+
+return backend->reserve;
+}
+
+static void host_memory_backend_set_reserve(Object *o, bool value, Error 
**errp)
+{
+HostMemoryBackend *backend = MEMORY_BACKEND(o);
+
+if (host_memory_backend_mr_inited(backend)) {
+error_setg(errp, "cannot change property value");
+return;
+}
+if (backen

[PATCH v6 02/15] util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve()

2021-04-21 Thread David Hildenbrand
We want to reserve a memory region without actually populating memory.
Let's factor that out.

Reviewed-by: Igor Kotrasinski 
Acked-by: Murilo Opsfelder Araujo 
Reviewed-by: Richard Henderson 
Reviewed-by: Peter Xu 
Signed-off-by: David Hildenbrand 
---
 util/mmap-alloc.c | 58 +++
 1 file changed, 33 insertions(+), 25 deletions(-)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 24854064b4..223d66219c 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -82,6 +82,38 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
 return qemu_real_host_page_size;
 }
 
+/*
+ * Reserve a new memory region of the requested size to be used for mapping
+ * from the given fd (if any).
+ */
+static void *mmap_reserve(size_t size, int fd)
+{
+int flags = MAP_PRIVATE;
+
+#if defined(__powerpc64__) && defined(__linux__)
+/*
+ * On ppc64 mappings in the same segment (aka slice) must share the same
+ * page size. Since we will be re-allocating part of this segment
+ * from the supplied fd, we should make sure to use the same page size, to
+ * this end we mmap the supplied fd.  In this case, set MAP_NORESERVE to
+ * avoid allocating backing store memory.
+ * We do this unless we are using the system page size, in which case
+ * anonymous memory is OK.
+ */
+if (fd == -1 || qemu_fd_getpagesize(fd) == qemu_real_host_page_size) {
+fd = -1;
+flags |= MAP_ANONYMOUS;
+} else {
+flags |= MAP_NORESERVE;
+}
+#else
+fd = -1;
+flags |= MAP_ANONYMOUS;
+#endif
+
+return mmap(0, size, PROT_NONE, flags, fd, 0);
+}
+
 static inline size_t mmap_guard_pagesize(int fd)
 {
 #if defined(__powerpc64__) && defined(__linux__)
@@ -104,7 +136,6 @@ void *qemu_ram_mmap(int fd,
 int prot;
 int flags;
 int map_sync_flags = 0;
-int guardfd;
 size_t offset;
 size_t total;
 void *guardptr;
@@ -116,30 +147,7 @@ void *qemu_ram_mmap(int fd,
  */
 total = size + align;
 
-#if defined(__powerpc64__) && defined(__linux__)
-/* On ppc64 mappings in the same segment (aka slice) must share the same
- * page size. Since we will be re-allocating part of this segment
- * from the supplied fd, we should make sure to use the same page size, to
- * this end we mmap the supplied fd.  In this case, set MAP_NORESERVE to
- * avoid allocating backing store memory.
- * We do this unless we are using the system page size, in which case
- * anonymous memory is OK.
- */
-flags = MAP_PRIVATE;
-if (fd == -1 || guard_pagesize == qemu_real_host_page_size) {
-guardfd = -1;
-flags |= MAP_ANONYMOUS;
-} else {
-guardfd = fd;
-flags |= MAP_NORESERVE;
-}
-#else
-guardfd = -1;
-flags = MAP_PRIVATE | MAP_ANONYMOUS;
-#endif
-
-guardptr = mmap(0, total, PROT_NONE, flags, guardfd, 0);
-
+guardptr = mmap_reserve(total, fd);
 if (guardptr == MAP_FAILED) {
 return MAP_FAILED;
 }
-- 
2.30.2




Re: [PATCH v6 05/15] softmmu/memory: Pass ram_flags to memory_region_init_ram_shared_nomigrate()

2021-04-21 Thread Philippe Mathieu-Daudé
On 4/21/21 2:26 PM, David Hildenbrand wrote:
> Let's forward ram_flags instead, renaming
> memory_region_init_ram_shared_nomigrate() into
> memory_region_init_ram_flags_nomigrate().
> 
> Reviewed-by: Peter Xu 
> Signed-off-by: David Hildenbrand 
> ---
>  backends/hostmem-ram.c|  6 +++--
>  hw/m68k/next-cube.c   |  4 ++--
>  include/exec/memory.h | 24 +--
>  .../memory-region-housekeeping.cocci  |  8 +++
>  softmmu/memory.c  | 18 +++---
>  5 files changed, 31 insertions(+), 29 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé 




[PATCH v6 12/15] qmp: Include "share" property of memory backends

2021-04-21 Thread David Hildenbrand
Let's include the property, which can be helpful when debugging:
for example, to spot misuse of MAP_PRIVATE, which can result in some ugly
corner cases (e.g., double memory consumption on shmem).

Use the same description we also use for describing the property.
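
With this applied, the field shows up in the reply, roughly like this
(a sketch; values made up):

    -> { "execute": "query-memdev" }
    <- { "return": [
           { "id": "mem0", "size": 1073741824, "merge": true, "dump": true,
             "prealloc": false, "share": true, "host-nodes": [],
             "policy": "default" } ] }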

Reviewed-by: Philippe Mathieu-Daudé 
Cc: Eric Blake 
Cc: Markus Armbruster 
Cc: Igor Mammedov 
Signed-off-by: David Hildenbrand 
---
 hw/core/machine-qmp-cmds.c | 1 +
 qapi/machine.json  | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/hw/core/machine-qmp-cmds.c b/hw/core/machine-qmp-cmds.c
index 68a942595a..d41db5b93b 100644
--- a/hw/core/machine-qmp-cmds.c
+++ b/hw/core/machine-qmp-cmds.c
@@ -174,6 +174,7 @@ static int query_memdev(Object *obj, void *opaque)
 m->merge = object_property_get_bool(obj, "merge", &error_abort);
 m->dump = object_property_get_bool(obj, "dump", &error_abort);
 m->prealloc = object_property_get_bool(obj, "prealloc", &error_abort);
+m->share = object_property_get_bool(obj, "share", &error_abort);
 m->policy = object_property_get_enum(obj, "policy", "HostMemPolicy",
  &error_abort);
 host_nodes = object_property_get_qobject(obj,
diff --git a/qapi/machine.json b/qapi/machine.json
index 758b901185..32650bfe9e 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -796,6 +796,8 @@
 #
 # @prealloc: whether memory was preallocated
 #
+# @share: whether memory is private to QEMU or shared (since 6.1)
+#
 # @host-nodes: host nodes for its memory policy
 #
 # @policy: memory policy of memory backend
@@ -809,6 +811,7 @@
 'merge':  'bool',
 'dump':   'bool',
 'prealloc':   'bool',
+'share':  'bool',
 'host-nodes': ['uint16'],
 'policy': 'HostMemPolicy' }}
 
-- 
2.30.2




[PATCH v6 08/15] memory: Introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()

2021-04-21 Thread David Hildenbrand
Let's introduce RAM_NORESERVE, allowing mmap'ing with MAP_NORESERVE. The
new flag has the following semantics:

"
RAM is mmap-ed with MAP_NORESERVE. When set, reserving swap space (or huge
pages if applicable) is skipped: will bail out if not supported. When not
set, the OS will do the reservation, if supported for the memory type.
"

Allow passing it into:
- memory_region_init_ram_nomigrate()
- memory_region_init_resizeable_ram()
- memory_region_init_ram_from_file()

... and teach qemu_ram_mmap() and qemu_anon_ram_alloc() about the flag.
Bail out if the flag is not supported, which is the case right now for
both, POSIX and win32. We will add Linux support next and allow specifying
RAM_NORESERVE via memory backends.

The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM.

Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Peter Xu 
Signed-off-by: David Hildenbrand 
---
 include/exec/cpu-common.h |  1 +
 include/exec/memory.h | 15 ---
 include/exec/ram_addr.h   |  3 ++-
 include/qemu/osdep.h  |  9 -
 migration/ram.c   |  3 +--
 softmmu/physmem.c | 15 ---
 util/mmap-alloc.c |  7 +++
 util/oslib-posix.c|  6 --
 util/oslib-win32.c| 13 -
 9 files changed, 59 insertions(+), 13 deletions(-)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 5a0a2d93e0..38a47ad4ac 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -58,6 +58,7 @@ void *qemu_ram_get_host_addr(RAMBlock *rb);
 ram_addr_t qemu_ram_get_offset(RAMBlock *rb);
 ram_addr_t qemu_ram_get_used_length(RAMBlock *rb);
 bool qemu_ram_is_shared(RAMBlock *rb);
+bool qemu_ram_is_noreserve(RAMBlock *rb);
 bool qemu_ram_is_uf_zeroable(RAMBlock *rb);
 void qemu_ram_set_uf_zeroable(RAMBlock *rb);
 bool qemu_ram_is_migratable(RAMBlock *rb);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 10179c6695..8d77819bcd 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -155,6 +155,13 @@ typedef struct IOMMUTLBEvent {
  */
 #define RAM_UF_WRITEPROTECT (1 << 6)
 
+/*
+ * RAM is mmap-ed with MAP_NORESERVE. When set, reserving swap space (or huge
+ * pages if applicable) is skipped: will bail out if not supported. When not
+ * set, the OS will do the reservation, if supported for the memory type.
+ */
+#define RAM_NORESERVE (1 << 7)
+
 static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
IOMMUNotifierFlag flags,
hwaddr start, hwaddr end,
@@ -937,7 +944,7 @@ void memory_region_init_ram_nomigrate(MemoryRegion *mr,
  * @name: Region name, becomes part of RAMBlock name used in migration stream
  *must be unique within any device
  * @size: size of the region.
- * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_NORESERVE.
  * @errp: pointer to Error*, to store an error if it happens.
  *
  * Note that this function does not do anything to cause the data in the
@@ -991,7 +998,8 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
  * @size: size of the region.
  * @align: alignment of the region base address; if 0, the default alignment
  * (getpagesize()) will be used.
- * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
+ * RAM_NORESERVE,
  * @path: the path in which to allocate the RAM.
  * @readonly: true to open @path for reading, false for read/write.
  * @errp: pointer to Error*, to store an error if it happens.
@@ -1017,7 +1025,8 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
  * @owner: the object that tracks the region's reference count
  * @name: the name of the region.
  * @size: size of the region.
- * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
+ * RAM_NORESERVE.
  * @fd: the fd to mmap.
  * @offset: offset within the file referenced by fd
  * @errp: pointer to Error*, to store an error if it happens.
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 6d4513f8e2..551876bed0 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -104,7 +104,8 @@ long qemu_maxrampagesize(void);
  * Parameters:
  *  @size: the size in bytes of the ram block
  *  @mr: the memory region where the ram block is
- *  @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
+ *  @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM,
+ *  RAM_NORESERVE.
  *  @mem_path or @fd: specify the backing file or device
  *  @readonly: true to open @path for reading, false for read/write.
  *  @errp: pointer to Error*, to store an error if it happens
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
i

[PATCH v6 04/15] softmmu/memory: Pass ram_flags to qemu_ram_alloc_from_fd()

2021-04-21 Thread David Hildenbrand
Let's pass in ram flags just like we do with qemu_ram_alloc_from_file(),
to clean up and prepare for more flags.

Simplify the documentation of passed ram flags: Looking at our
documentation of RAM_SHARED and RAM_PMEM is sufficient, no need to be
repetitive.

Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Peter Xu 
Signed-off-by: David Hildenbrand 
---
 backends/hostmem-memfd.c | 7 ---
 hw/misc/ivshmem.c| 5 ++---
 include/exec/memory.h| 9 +++--
 include/exec/ram_addr.h  | 6 +-
 softmmu/memory.c | 7 +++
 5 files changed, 13 insertions(+), 21 deletions(-)

diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
index 69b0ae30bb..93b5d1a4cf 100644
--- a/backends/hostmem-memfd.c
+++ b/backends/hostmem-memfd.c
@@ -36,6 +36,7 @@ static void
 memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 {
 HostMemoryBackendMemfd *m = MEMORY_BACKEND_MEMFD(backend);
+uint32_t ram_flags;
 char *name;
 int fd;
 
@@ -53,9 +54,9 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error 
**errp)
 }
 
 name = host_memory_backend_get_name(backend);
-memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend),
-   name, backend->size,
-   backend->share, fd, 0, errp);
+ram_flags = backend->share ? RAM_SHARED : 0;
+memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
+   backend->size, ram_flags, fd, 0, errp);
 g_free(name);
 }
 
diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index a1fa4878be..1ba4a98377 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -493,9 +493,8 @@ static void process_msg_shmem(IVShmemState *s, int fd, 
Error **errp)
 size = buf.st_size;
 
 /* mmap the region and map into the BAR2 */
-memory_region_init_ram_from_fd(&s->server_bar2, OBJECT(s),
-   "ivshmem.bar2", size, true, fd, 0,
-   &local_err);
+memory_region_init_ram_from_fd(&s->server_bar2, OBJECT(s), "ivshmem.bar2",
+   size, RAM_SHARED, fd, 0, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
 return;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 5728a681b2..8ad280e532 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -991,10 +991,7 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr,
  * @size: size of the region.
  * @align: alignment of the region base address; if 0, the default alignment
  * (getpagesize()) will be used.
- * @ram_flags: Memory region features:
- * - RAM_SHARED: memory must be mmaped with the MAP_SHARED flag
- * - RAM_PMEM: the memory is persistent memory
- * Other bits are ignored now.
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
  * @path: the path in which to allocate the RAM.
  * @readonly: true to open @path for reading, false for read/write.
  * @errp: pointer to Error*, to store an error if it happens.
@@ -1020,7 +1017,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
  * @owner: the object that tracks the region's reference count
  * @name: the name of the region.
  * @size: size of the region.
- * @share: %true if memory must be mmaped with the MAP_SHARED flag
+ * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
  * @fd: the fd to mmap.
  * @offset: offset within the file referenced by fd
  * @errp: pointer to Error*, to store an error if it happens.
@@ -1032,7 +1029,7 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr,
 Object *owner,
 const char *name,
 uint64_t size,
-bool share,
+uint32_t ram_flags,
 int fd,
 ram_addr_t offset,
 Error **errp);
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 3cb9791df3..a7e3378340 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -104,11 +104,7 @@ long qemu_maxrampagesize(void);
  * Parameters:
  *  @size: the size in bytes of the ram block
  *  @mr: the memory region where the ram block is
- *  @ram_flags: specify the properties of the ram block, which can be one
- *  or bit-or of following values
- *  - RAM_SHARED: mmap the backing file or device with MAP_SHARED
- *  - RAM_PMEM: the backend @mem_path or @fd is persistent memory
- *  Other bits are ignored.
+ *  @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM.
  *  @mem_path or @fd: specify the backing file or device
  *  @readonly: true to open @path for reading, false for read/write.
  *  @errp: pointer to Error*, to store an erro

[PATCH v6 13/15] hmp: Print "share" property of memory backends with "info memdev"

2021-04-21 Thread David Hildenbrand
Let's print the property.
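
Resulting output then looks roughly like this (a sketch; see
hmp_info_memdev() for the exact formatting):

    (qemu) info memdev
    memory backend: mem0
      size:  1073741824
      merge: true
      dump: true
      prealloc: false
      share: true
      policy: default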

Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Dr. David Alan Gilbert 
Cc: Markus Armbruster 
Cc: Eric Blake 
Cc: Igor Mammedov 
Signed-off-by: David Hildenbrand 
---
 hw/core/machine-hmp-cmds.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/core/machine-hmp-cmds.c b/hw/core/machine-hmp-cmds.c
index 58248cffa3..004a92b3d6 100644
--- a/hw/core/machine-hmp-cmds.c
+++ b/hw/core/machine-hmp-cmds.c
@@ -110,6 +110,8 @@ void hmp_info_memdev(Monitor *mon, const QDict *qdict)
m->value->dump ? "true" : "false");
 monitor_printf(mon, "  prealloc: %s\n",
m->value->prealloc ? "true" : "false");
+monitor_printf(mon, "  share: %s\n",
+   m->value->share ? "true" : "false");
 monitor_printf(mon, "  policy: %s\n",
HostMemPolicy_str(m->value->policy));
 visit_complete(v, &str);
-- 
2.30.2




[PATCH v6 03/15] util/mmap-alloc: Factor out activating of memory to mmap_activate()

2021-04-21 Thread David Hildenbrand
We want to activate memory within a reserved memory region, to make it
accessible. Let's factor that out.

Reviewed-by: Richard Henderson 
Acked-by: Murilo Opsfelder Araujo 
Reviewed-by: Peter Xu 
Signed-off-by: David Hildenbrand 
---
 util/mmap-alloc.c | 94 +--
 1 file changed, 50 insertions(+), 44 deletions(-)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 223d66219c..0e2bd7bc0e 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -114,6 +114,52 @@ static void *mmap_reserve(size_t size, int fd)
 return mmap(0, size, PROT_NONE, flags, fd, 0);
 }
 
+/*
+ * Activate memory in a reserved region from the given fd (if any), to make
+ * it accessible.
+ */
+static void *mmap_activate(void *ptr, size_t size, int fd, bool readonly,
+   bool shared, bool is_pmem, off_t map_offset)
+{
+const int prot = PROT_READ | (readonly ? 0 : PROT_WRITE);
+int map_sync_flags = 0;
+int flags = MAP_FIXED;
+void *activated_ptr;
+
+flags |= fd == -1 ? MAP_ANONYMOUS : 0;
+flags |= shared ? MAP_SHARED : MAP_PRIVATE;
+if (shared && is_pmem) {
+map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
+}
+
+activated_ptr = mmap(ptr, size, prot, flags | map_sync_flags, fd,
+ map_offset);
+if (activated_ptr == MAP_FAILED && map_sync_flags) {
+if (errno == ENOTSUP) {
+char *proc_link = g_strdup_printf("/proc/self/fd/%d", fd);
+char *file_name = g_malloc0(PATH_MAX);
+int len = readlink(proc_link, file_name, PATH_MAX - 1);
+
+if (len < 0) {
+len = 0;
+}
+file_name[len] = '\0';
+fprintf(stderr, "Warning: requesting persistence across crashes "
+"for backend file %s failed. Proceeding without "
+"persistence, data might become corrupted in case of host "
+"crash.\n", file_name);
+g_free(proc_link);
+g_free(file_name);
+}
+/*
+ * If mmap failed with MAP_SHARED_VALIDATE | MAP_SYNC, we will try
+ * again without these flags to handle backwards compatibility.
+ */
+activated_ptr = mmap(ptr, size, prot, flags, fd, map_offset);
+}
+return activated_ptr;
+}
+
 static inline size_t mmap_guard_pagesize(int fd)
 {
 #if defined(__powerpc64__) && defined(__linux__)
@@ -133,13 +179,8 @@ void *qemu_ram_mmap(int fd,
 off_t map_offset)
 {
 const size_t guard_pagesize = mmap_guard_pagesize(fd);
-int prot;
-int flags;
-int map_sync_flags = 0;
-size_t offset;
-size_t total;
-void *guardptr;
-void *ptr;
+size_t offset, total;
+void *ptr, *guardptr;
 
 /*
  * Note: this always allocates at least one extra page of virtual address
@@ -156,45 +197,10 @@ void *qemu_ram_mmap(int fd,
 /* Always align to host page size */
 assert(align >= guard_pagesize);
 
-flags = MAP_FIXED;
-flags |= fd == -1 ? MAP_ANONYMOUS : 0;
-flags |= shared ? MAP_SHARED : MAP_PRIVATE;
-if (shared && is_pmem) {
-map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
-}
-
 offset = QEMU_ALIGN_UP((uintptr_t)guardptr, align) - (uintptr_t)guardptr;
 
-prot = PROT_READ | (readonly ? 0 : PROT_WRITE);
-
-ptr = mmap(guardptr + offset, size, prot,
-   flags | map_sync_flags, fd, map_offset);
-
-if (ptr == MAP_FAILED && map_sync_flags) {
-if (errno == ENOTSUP) {
-char *proc_link, *file_name;
-int len;
-proc_link = g_strdup_printf("/proc/self/fd/%d", fd);
-file_name = g_malloc0(PATH_MAX);
-len = readlink(proc_link, file_name, PATH_MAX - 1);
-if (len < 0) {
-len = 0;
-}
-file_name[len] = '\0';
-fprintf(stderr, "Warning: requesting persistence across crashes "
-"for backend file %s failed. Proceeding without "
-"persistence, data might become corrupted in case of host "
-"crash.\n", file_name);
-g_free(proc_link);
-g_free(file_name);
-}
-/*
- * if map failed with MAP_SHARED_VALIDATE | MAP_SYNC,
- * we will remove these flags to handle compatibility.
- */
-ptr = mmap(guardptr + offset, size, prot, flags, fd, map_offset);
-}
-
+ptr = mmap_activate(guardptr + offset, size, fd, readonly, shared, is_pmem,
+map_offset);
 if (ptr == MAP_FAILED) {
 munmap(guardptr, total);
 return MAP_FAILED;
-- 
2.30.2




[PATCH v6 15/15] hmp: Print "reserve" property of memory backends with "info memdev"

2021-04-21 Thread David Hildenbrand
Let's print the new property.

Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Dr. David Alan Gilbert 
Cc: Markus Armbruster 
Cc: Eric Blake 
Cc: Igor Mammedov 
Signed-off-by: David Hildenbrand 
---
 hw/core/machine-hmp-cmds.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/core/machine-hmp-cmds.c b/hw/core/machine-hmp-cmds.c
index 004a92b3d6..9bedc77bb4 100644
--- a/hw/core/machine-hmp-cmds.c
+++ b/hw/core/machine-hmp-cmds.c
@@ -112,6 +112,8 @@ void hmp_info_memdev(Monitor *mon, const QDict *qdict)
m->value->prealloc ? "true" : "false");
 monitor_printf(mon, "  share: %s\n",
m->value->share ? "true" : "false");
+monitor_printf(mon, "  reserve: %s\n",
+   m->value->reserve ? "true" : "false");
 monitor_printf(mon, "  policy: %s\n",
HostMemPolicy_str(m->value->policy));
 visit_complete(v, &str);
-- 
2.30.2




[PATCH v6 06/15] softmmu/memory: Pass ram_flags to qemu_ram_alloc() and qemu_ram_alloc_internal()

2021-04-21 Thread David Hildenbrand
Let's pass ram_flags to qemu_ram_alloc() and qemu_ram_alloc_internal(),
preparing for passing additional flags.

Signed-off-by: David Hildenbrand 
---
 include/exec/ram_addr.h |  2 +-
 softmmu/memory.c|  4 ++--
 softmmu/physmem.c   | 29 -
 3 files changed, 15 insertions(+), 20 deletions(-)

diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index a7e3378340..6d4513f8e2 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -122,7 +122,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, 
MemoryRegion *mr,
 
 RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
   MemoryRegion *mr, Error **errp);
-RAMBlock *qemu_ram_alloc(ram_addr_t size, bool share, MemoryRegion *mr,
+RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags, MemoryRegion *mr,
  Error **errp);
 RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t max_size,
 void (*resized)(const char*,
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 67be0aa152..580d2c06f5 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1548,7 +1548,7 @@ void memory_region_init_ram_flags_nomigrate(MemoryRegion 
*mr,
 mr->ram = true;
 mr->terminates = true;
 mr->destructor = memory_region_destructor_ram;
-mr->ram_block = qemu_ram_alloc(size, ram_flags & RAM_SHARED, mr, &err);
+mr->ram_block = qemu_ram_alloc(size, ram_flags, mr, &err);
 if (err) {
 mr->size = int128_zero();
 object_unparent(OBJECT(mr));
@@ -1704,7 +1704,7 @@ void memory_region_init_rom_device_nomigrate(MemoryRegion 
*mr,
 mr->terminates = true;
 mr->rom_device = true;
 mr->destructor = memory_region_destructor_ram;
-mr->ram_block = qemu_ram_alloc(size, false,  mr, &err);
+mr->ram_block = qemu_ram_alloc(size, 0, mr, &err);
 if (err) {
 mr->size = int128_zero();
 object_unparent(OBJECT(mr));
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index cc59f05593..11b45be271 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -2108,12 +2108,15 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, 
ram_addr_t max_size,
   void (*resized)(const char*,
   uint64_t length,
   void *host),
-  void *host, bool resizeable, bool share,
+  void *host, uint32_t ram_flags,
   MemoryRegion *mr, Error **errp)
 {
 RAMBlock *new_block;
 Error *local_err = NULL;
 
+assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE | RAM_PREALLOC)) == 0);
+assert(!host ^ (ram_flags & RAM_PREALLOC));
+
 size = HOST_PAGE_ALIGN(size);
 max_size = HOST_PAGE_ALIGN(max_size);
 new_block = g_malloc0(sizeof(*new_block));
@@ -2125,15 +2128,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, 
ram_addr_t max_size,
 new_block->fd = -1;
 new_block->page_size = qemu_real_host_page_size;
 new_block->host = host;
-if (host) {
-new_block->flags |= RAM_PREALLOC;
-}
-if (share) {
-new_block->flags |= RAM_SHARED;
-}
-if (resizeable) {
-new_block->flags |= RAM_RESIZEABLE;
-}
+new_block->flags = ram_flags;
 ram_block_add(new_block, &local_err);
 if (local_err) {
 g_free(new_block);
@@ -2146,15 +2141,15 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, 
ram_addr_t max_size,
 RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
MemoryRegion *mr, Error **errp)
 {
-return qemu_ram_alloc_internal(size, size, NULL, host, false,
-   false, mr, errp);
+return qemu_ram_alloc_internal(size, size, NULL, host, RAM_PREALLOC, mr,
+   errp);
 }
 
-RAMBlock *qemu_ram_alloc(ram_addr_t size, bool share,
+RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags,
  MemoryRegion *mr, Error **errp)
 {
-return qemu_ram_alloc_internal(size, size, NULL, NULL, false,
-   share, mr, errp);
+assert((ram_flags & ~RAM_SHARED) == 0);
+return qemu_ram_alloc_internal(size, size, NULL, NULL, ram_flags, mr, 
errp);
 }
 
 RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t maxsz,
@@ -2163,8 +2158,8 @@ RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, 
ram_addr_t maxsz,
  void *host),
  MemoryRegion *mr, Error **errp)
 {
-return qemu_ram_alloc_internal(size, maxsz, resized, NULL, true,
-   false, mr, errp);
+return qemu_ram_alloc_internal(size, maxsz, resized, NULL,
+   RAM_RESIZEABLE, mr, errp);
 }
 
 static void reclaim_ramblock(RAM

Re: [PATCH v6 06/15] softmmu/memory: Pass ram_flags to qemu_ram_alloc() and qemu_ram_alloc_internal()

2021-04-21 Thread Philippe Mathieu-Daudé
On 4/21/21 2:26 PM, David Hildenbrand wrote:
> Let's pass ram_flags to qemu_ram_alloc() and qemu_ram_alloc_internal(),
> preparing for passing additional flags.
> 
> Signed-off-by: David Hildenbrand 
> ---
>  include/exec/ram_addr.h |  2 +-
>  softmmu/memory.c|  4 ++--
>  softmmu/physmem.c   | 29 -
>  3 files changed, 15 insertions(+), 20 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé 




[PATCH v6 07/15] util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap()

2021-04-21 Thread David Hildenbrand
Let's pass flags instead of bools to prepare for passing other flags and
update the documentation of qemu_ram_mmap(). Introduce new QEMU_MAP_
flags that abstract the mmap() PROT_ and MAP_ flag handling and simplify
it.

We expose only flags that are currently supported by qemu_ram_mmap().
Maybe we'll also see a qemu_mmap() in the future that can implement these
flags.

Note: We don't use MAP_ flags as some flags (e.g., MAP_SYNC) are only
defined for some systems and we want to always be able to identify
these flags reliably inside qemu_ram_mmap() -- for example, to properly
warn when some future flags are not available or effective on a system.
Also, this way we can simplify PROT_ handling as well.

Reviewed-by: Peter Xu 
Signed-off-by: David Hildenbrand 
---
 include/qemu/mmap-alloc.h | 16 +---
 include/qemu/osdep.h  | 18 ++
 softmmu/physmem.c |  8 +---
 util/mmap-alloc.c | 15 ---
 util/oslib-posix.c|  3 ++-
 5 files changed, 42 insertions(+), 18 deletions(-)

diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
index 456ff87df1..90d0eee705 100644
--- a/include/qemu/mmap-alloc.h
+++ b/include/qemu/mmap-alloc.h
@@ -7,18 +7,22 @@ size_t qemu_fd_getpagesize(int fd);
 size_t qemu_mempath_getpagesize(const char *mem_path);
 
 /**
- * qemu_ram_mmap: mmap the specified file or device.
+ * qemu_ram_mmap: mmap anonymous memory, the specified file or device.
+ *
+ * mmap() abstraction to map guest RAM, simplifying flag handling, taking
+ * care of alignment requirements and installing guard pages.
  *
  * Parameters:
  *  @fd: the file or the device to mmap
  *  @size: the number of bytes to be mmaped
  *  @align: if not zero, specify the alignment of the starting mapping address;
  *  otherwise, the alignment in use will be determined by QEMU.
- *  @readonly: true for a read-only mapping, false for read/write.
- *  @shared: map has RAM_SHARED flag.
- *  @is_pmem: map has RAM_PMEM flag.
+ *  @qemu_map_flags: QEMU_MAP_* flags
  *  @map_offset: map starts at offset of map_offset from the start of fd
  *
+ * Internally, MAP_PRIVATE, MAP_ANONYMOUS and MAP_SHARED_VALIDATE are set
+ * implicitly based on other parameters.
+ *
  * Return:
  *  On success, return a pointer to the mapped area.
  *  On failure, return MAP_FAILED.
@@ -26,9 +30,7 @@ size_t qemu_mempath_getpagesize(const char *mem_path);
 void *qemu_ram_mmap(int fd,
 size_t size,
 size_t align,
-bool readonly,
-bool shared,
-bool is_pmem,
+uint32_t qemu_map_flags,
 off_t map_offset);
 
 void qemu_ram_munmap(int fd, void *ptr, size_t size);
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index cb2a07e472..a96d6cb7ac 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -366,6 +366,24 @@ void *qemu_anon_ram_alloc(size_t size, uint64_t *align, 
bool shared);
 void qemu_vfree(void *ptr);
 void qemu_anon_ram_free(void *ptr, size_t size);
 
+/*
+ * Abstraction of PROT_ and MAP_ flags as passed to mmap(), for example,
+ * consumed by qemu_ram_mmap().
+ */
+
+/* Map PROT_READ instead of PROT_READ | PROT_WRITE. */
+#define QEMU_MAP_READONLY   (1 << 0)
+
+/* Use MAP_SHARED instead of MAP_PRIVATE. */
+#define QEMU_MAP_SHARED (1 << 1)
+
+/*
+ * Use MAP_SYNC | MAP_SHARED_VALIDATE if supported. Ignored without
+ * QEMU_MAP_SHARED. If mapping fails, warn and fallback to !QEMU_MAP_SYNC.
+ */
+#define QEMU_MAP_SYNC   (1 << 2)
+
+
 #define QEMU_MADV_INVALID -1
 
 #if defined(CONFIG_MADVISE)
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 11b45be271..2e8d1f47f0 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1533,6 +1533,7 @@ static void *file_ram_alloc(RAMBlock *block,
 off_t offset,
 Error **errp)
 {
+uint32_t qemu_map_flags;
 void *area;
 
 block->page_size = qemu_fd_getpagesize(fd);
@@ -1580,9 +1581,10 @@ static void *file_ram_alloc(RAMBlock *block,
 perror("ftruncate");
 }
 
-area = qemu_ram_mmap(fd, memory, block->mr->align, readonly,
- block->flags & RAM_SHARED, block->flags & RAM_PMEM,
- offset);
+qemu_map_flags = readonly ? QEMU_MAP_READONLY : 0;
+qemu_map_flags |= (block->flags & RAM_SHARED) ? QEMU_MAP_SHARED : 0;
+qemu_map_flags |= (block->flags & RAM_PMEM) ? QEMU_MAP_SYNC : 0;
+area = qemu_ram_mmap(fd, memory, block->mr->align, qemu_map_flags, offset);
 if (area == MAP_FAILED) {
 error_setg_errno(errp, errno,
  "unable to map backing store for guest RAM");
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 0e2bd7bc0e..1ddc0e2a1e 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -118,9 +118,12 @@ static void *mmap_reserve(size_t size, int fd)
  * Activate memory in a reserved region from the given

[PATCH v6 14/15] qmp: Include "reserve" property of memory backends

2021-04-21 Thread David Hildenbrand
Let's include the new property.

Reviewed-by: Philippe Mathieu-Daudé 
Cc: Eric Blake 
Cc: Markus Armbruster 
Cc: Igor Mammedov 
Signed-off-by: David Hildenbrand 
---
 hw/core/machine-qmp-cmds.c | 1 +
 qapi/machine.json  | 4 
 2 files changed, 5 insertions(+)

diff --git a/hw/core/machine-qmp-cmds.c b/hw/core/machine-qmp-cmds.c
index d41db5b93b..2d135ecdd0 100644
--- a/hw/core/machine-qmp-cmds.c
+++ b/hw/core/machine-qmp-cmds.c
@@ -175,6 +175,7 @@ static int query_memdev(Object *obj, void *opaque)
 m->dump = object_property_get_bool(obj, "dump", &error_abort);
 m->prealloc = object_property_get_bool(obj, "prealloc", &error_abort);
 m->share = object_property_get_bool(obj, "share", &error_abort);
+m->reserve = object_property_get_bool(obj, "reserve", &error_abort);
 m->policy = object_property_get_enum(obj, "policy", "HostMemPolicy",
  &error_abort);
 host_nodes = object_property_get_qobject(obj,
diff --git a/qapi/machine.json b/qapi/machine.json
index 32650bfe9e..5932139d20 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -798,6 +798,9 @@
 #
 # @share: whether memory is private to QEMU or shared (since 6.1)
 #
+# @reserve: whether swap space (or huge pages) was reserved if applicable
+#   (since 6.1)
+#
 # @host-nodes: host nodes for its memory policy
 #
 # @policy: memory policy of memory backend
@@ -812,6 +815,7 @@
 'dump':   'bool',
 'prealloc':   'bool',
 'share':  'bool',
+'reserve':'bool',
 'host-nodes': ['uint16'],
 'policy': 'HostMemPolicy' }}
 
-- 
2.30.2




Re: [PATCH v6 07/15] util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap()

2021-04-21 Thread Philippe Mathieu-Daudé
On 4/21/21 2:26 PM, David Hildenbrand wrote:
> Let's pass flags instead of bools to prepare for passing other flags and
> update the documentation of qemu_ram_mmap(). Introduce new QEMU_MAP_
> flags that abstract the mmap() PROT_ and MAP_ flag handling and simplify
> it.
> 
> We expose only flags that are currently supported by qemu_ram_mmap().
> Maybe we'll also see a qemu_mmap() in the future that can implement these
> flags.
> 
> Note: We don't use MAP_ flags as some flags (e.g., MAP_SYNC) are only
> defined for some systems and we want to always be able to identify
> these flags reliably inside qemu_ram_mmap() -- for example, to properly
> warn when some future flags are not available or effective on a system.
> Also, this way we can simplify PROT_ handling as well.
> 
> Reviewed-by: Peter Xu 
> Signed-off-by: David Hildenbrand 
> ---
>  include/qemu/mmap-alloc.h | 16 +---
>  include/qemu/osdep.h  | 18 ++
>  softmmu/physmem.c |  8 +---
>  util/mmap-alloc.c | 15 ---
>  util/oslib-posix.c|  3 ++-
>  5 files changed, 42 insertions(+), 18 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH v2 01/11] hw/arm/aspeed: Do not directly map ram container onto main address bus

2021-04-21 Thread Philippe Mathieu-Daudé
On 4/21/21 7:53 AM, Cédric Le Goater wrote:
> On 4/20/21 8:28 PM, Peter Xu wrote:
>> On Sat, Apr 17, 2021 at 12:30:18PM +0200, Philippe Mathieu-Daudé wrote:
>>> The RAM container is exposed as an AddressSpace.
>>
>> I didn't see where did ram_container got exposed as an address space.

I guess I used the wrong base to git-publish and skipped the first patch =)

>> I see it's added as one subregion of get_system_memory(), which looks okay? 
> my version of this patch took a simpler approach. See below.
> 
> Thanks,
> 
> C.
> 
> --- a/hw/arm/aspeed.c
> +++ b/hw/arm/aspeed.c
> @@ -327,7 +327,7 @@ static void aspeed_machine_init(MachineState *machine)
>  object_property_set_int(OBJECT(&bmc->soc), "num-cs", amc->num_cs,
>  &error_abort);
>  object_property_set_link(OBJECT(&bmc->soc), "dram",
> - OBJECT(&bmc->ram_container), &error_abort);
> + OBJECT(machine->ram), &error_abort);

This will work as long as no board maps the main memory elsewhere than
0x0. Using the alias makes it more robust (and is also a good API example
for the usual "use the API via copy/paste" style when adding a new board)
IMHO.

>  if (machine->kernel_filename) {
>  /*
>   * When booting with a -kernel command line there is no u-boot
> 
> 



[RFC PATCH] tests/tcg: add a multiarch signals test to stress test signal delivery

2021-04-21 Thread Alex Bennée
This adds a simple signal test that combines the POSIX timer_create
with signal delivery across multiple threads.

[AJB: So I wrote this in an attempt to flush out issues with the
s390x-linux-user handling. However I suspect I've done something wrong
or opened a can of signal handling worms.

Nominally this runs fine on real hardware but I variously get failures
when running it under translation and while debugging QEMU running the
test. I've also exposed a shortcoming with the gdb stub when dealing
with guest TLS data so yay ;-). So I post this as an RFC in case
anyone else can offer insight or can verify they are seeing the same
strange behaviour?]

Signed-off-by: Alex Bennée 
---
 tests/tcg/multiarch/signals.c   | 149 
 tests/tcg/multiarch/Makefile.target |   2 +
 2 files changed, 151 insertions(+)
 create mode 100644 tests/tcg/multiarch/signals.c

diff --git a/tests/tcg/multiarch/signals.c b/tests/tcg/multiarch/signals.c
new file mode 100644
index 00..998c8fdefd
--- /dev/null
+++ b/tests/tcg/multiarch/signals.c
@@ -0,0 +1,149 @@
+/*
+ * linux-user signal handling tests.
+ *
+ * Copyright (c) 2021 Linaro Ltd
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static void error1(const char *filename, int line, const char *fmt, ...)
+{
+va_list ap;
+va_start(ap, fmt);
+fprintf(stderr, "%s:%d: ", filename, line);
+vfprintf(stderr, fmt, ap);
+fprintf(stderr, "\n");
+va_end(ap);
+exit(1);
+}
+
+static int __chk_error(const char *filename, int line, int ret)
+{
+if (ret < 0) {
+error1(filename, line, "%m (ret=%d, errno=%d/%s)",
+   ret, errno, strerror(errno));
+}
+return ret;
+}
+
+#define error(fmt, ...) error1(__FILE__, __LINE__, fmt, ## __VA_ARGS__)
+
+#define chk_error(ret) __chk_error(__FILE__, __LINE__, (ret))
+
+/*
+ * Thread handling
+ */
+typedef struct ThreadJob ThreadJob;
+
+struct ThreadJob {
+int number;
+int sleep;
+int count;
+};
+
+static pthread_t *threads;
+static int max_threads = 10;
+__thread int signal_count;
+int total_signal_count;
+
+static void *background_thread_func(void *arg)
+{
+ThreadJob *job = (ThreadJob *) arg;
+
+printf("thread%d: started\n", job->number);
+while (total_signal_count < job->count) {
+usleep(job->sleep);
+}
+printf("thread%d: saw %d alarms from %d\n", job->number,
+   signal_count, total_signal_count);
+return NULL;
+}
+
+static void spawn_threads(void)
+{
+int i;
+threads = calloc(sizeof(pthread_t), max_threads);
+
+for (i = 0; i < max_threads; i++) {
+ThreadJob *job = calloc(sizeof(ThreadJob), 1);
+job->number = i;
+job->sleep = i * 1000;
+job->count = i * 100;
+pthread_create(threads + i, NULL, background_thread_func, job);
+}
+}
+
+static void close_threads(void)
+{
+int i;
+for (i = 0; i < max_threads; i++) {
+pthread_join(threads[i], NULL);
+}
+free(threads);
+threads = NULL;
+}
+
+static void sig_alarm(int sig, siginfo_t *info, void *puc)
+{
+if (sig != SIGRTMIN) {
+error("unexpected signal");
+}
+signal_count++;
+__atomic_fetch_add(&total_signal_count, 1, __ATOMIC_SEQ_CST);
+}
+
+static void test_signals(void)
+{
+struct sigaction act;
+struct itimerspec it;
+timer_t tid;
+struct sigevent sev;
+
+/* Set up SIG handler */
+act.sa_sigaction = sig_alarm;
+sigemptyset(&act.sa_mask);
+act.sa_flags = SA_SIGINFO;
+chk_error(sigaction(SIGRTMIN, &act, NULL));
+
+/* Create POSIX timer */
+sev.sigev_notify = SIGEV_SIGNAL;
+sev.sigev_signo = SIGRTMIN;
+sev.sigev_value.sival_ptr = &tid;
+chk_error(timer_create(CLOCK_REALTIME, &sev, &tid));
+
+it.it_interval.tv_sec = 0;
+it.it_interval.tv_nsec = 100;
+it.it_value.tv_sec = 0;
+it.it_value.tv_nsec = 100;
+chk_error(timer_settime(tid, 0, &it, NULL));
+
+spawn_threads();
+
+do {
+usleep(1000);
+} while (total_signal_count < 2000);
+
+printf("shutting down after: %d signals\n", total_signal_count);
+
+close_threads();
+
+chk_error(timer_delete(tid));
+}
+
+int main(int argc, char **argv)
+{
+test_signals();
+return 0;
+}
diff --git a/tests/tcg/multiarch/Makefile.target 
b/tests/tcg/multiarch/Makefile.target
index a3a751723d..3f283eabe6 100644
--- a/tests/tcg/multiarch/Makefile.target
+++ b/tests/tcg/multiarch/Makefile.target
@@ -30,6 +30,8 @@ testthread: LDFLAGS+=-lpthread
 
 threadcount: LDFLAGS+=-lpthread
 
+signals: LDFLAGS+=-lrt -lpthread
+
 # We define the runner for test-mmap after the individual
 # architectures have defined their supported pages sizes. If no
 # additional page sizes are defined we only run the default test.
-- 
2.20.1




[PATCH] target/riscv: fix a typo with interrupt names

2021-04-21 Thread Emmanuel Blot
Interrupt names have been swapped in 205377f8 and do not follow
IRQ_*_EXT definition order.

Signed-off-by: Emmanuel Blot 
---
 target/riscv/cpu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index 7d6ed80f6b6..c79503ce967 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -88,8 +88,8 @@ const char * const riscv_intr_names[] = {
 "vs_timer",
 "m_timer",
 "u_external",
+"s_external",
 "vs_external",
-"h_external",
 "m_external",
 "reserved",
 "reserved",
-- 
2.31.1




Re: [RFC PATCH] tests/tcg: add a multiarch signals test to stress test signal delivery

2021-04-21 Thread Philippe Mathieu-Daudé
+Laurent

On 4/21/21 3:29 PM, Alex Bennée wrote:
> This adds a simple signal test that combines the POSIX timer_create
> with signal delivery across multiple threads.
> 
> [AJB: So I wrote this in an attempt to flush out issues with the
> s390x-linux-user handling. However I suspect I've done something wrong
> or opened a can of signal handling worms.
> 
> Nominally this runs fine on real hardware but I variously get failures
> when running it under translation and while debugging QEMU running the
> test. I've also exposed a shortcoming with the gdb stub when dealing
> with guest TLS data so yay ;-). So I post this as an RFC in case
> anyone else can offer insight or can verify they are seeing the same
> strange behaviour?]
> 
> Signed-off-by: Alex Bennée 
> ---
>  tests/tcg/multiarch/signals.c   | 149 
>  tests/tcg/multiarch/Makefile.target |   2 +
>  2 files changed, 151 insertions(+)
>  create mode 100644 tests/tcg/multiarch/signals.c
> 
> diff --git a/tests/tcg/multiarch/signals.c b/tests/tcg/multiarch/signals.c
> new file mode 100644
> index 00..998c8fdefd
> --- /dev/null
> +++ b/tests/tcg/multiarch/signals.c
> @@ -0,0 +1,149 @@
> +/*
> + * linux-user signal handling tests.
> + *
> + * Copyright (c) 2021 Linaro Ltd
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +static void error1(const char *filename, int line, const char *fmt, ...)
> +{
> +va_list ap;
> +va_start(ap, fmt);
> +fprintf(stderr, "%s:%d: ", filename, line);
> +vfprintf(stderr, fmt, ap);
> +fprintf(stderr, "\n");
> +va_end(ap);
> +exit(1);
> +}
> +
> +static int __chk_error(const char *filename, int line, int ret)
> +{
> +if (ret < 0) {
> +error1(filename, line, "%m (ret=%d, errno=%d/%s)",
> +   ret, errno, strerror(errno));
> +}
> +return ret;
> +}
> +
> +#define error(fmt, ...) error1(__FILE__, __LINE__, fmt, ## __VA_ARGS__)
> +
> +#define chk_error(ret) __chk_error(__FILE__, __LINE__, (ret))
> +
> +/*
> + * Thread handling
> + */
> +typedef struct ThreadJob ThreadJob;
> +
> +struct ThreadJob {
> +int number;
> +int sleep;
> +int count;
> +};
> +
> +static pthread_t *threads;
> +static int max_threads = 10;
> +__thread int signal_count;
> +int total_signal_count;
> +
> +static void *background_thread_func(void *arg)
> +{
> +ThreadJob *job = (ThreadJob *) arg;
> +
> +printf("thread%d: started\n", job->number);
> +while (total_signal_count < job->count) {
> +usleep(job->sleep);
> +}
> +printf("thread%d: saw %d alarms from %d\n", job->number,
> +   signal_count, total_signal_count);
> +return NULL;
> +}
> +
> +static void spawn_threads(void)
> +{
> +int i;
> +threads = calloc(sizeof(pthread_t), max_threads);
> +
> +for (i = 0; i < max_threads; i++) {
> +ThreadJob *job = calloc(sizeof(ThreadJob), 1);
> +job->number = i;
> +job->sleep = i * 1000;
> +job->count = i * 100;
> +pthread_create(threads + i, NULL, background_thread_func, job);
> +}
> +}
> +
> +static void close_threads(void)
> +{
> +int i;
> +for (i = 0; i < max_threads; i++) {
> +pthread_join(threads[i], NULL);
> +}
> +free(threads);
> +threads = NULL;
> +}
> +
> +static void sig_alarm(int sig, siginfo_t *info, void *puc)
> +{
> +if (sig != SIGRTMIN) {
> +error("unexpected signal");
> +}
> +signal_count++;
> +__atomic_fetch_add(&total_signal_count, 1, __ATOMIC_SEQ_CST);
> +}
> +
> +static void test_signals(void)
> +{
> +struct sigaction act;
> +struct itimerspec it;
> +timer_t tid;
> +struct sigevent sev;
> +
> +/* Set up SIG handler */
> +act.sa_sigaction = sig_alarm;
> +sigemptyset(&act.sa_mask);
> +act.sa_flags = SA_SIGINFO;
> +chk_error(sigaction(SIGRTMIN, &act, NULL));
> +
> +/* Create POSIX timer */
> +sev.sigev_notify = SIGEV_SIGNAL;
> +sev.sigev_signo = SIGRTMIN;
> +sev.sigev_value.sival_ptr = &tid;
> +chk_error(timer_create(CLOCK_REALTIME, &sev, &tid));
> +
> +it.it_interval.tv_sec = 0;
> +it.it_interval.tv_nsec = 100;
> +it.it_value.tv_sec = 0;
> +it.it_value.tv_nsec = 100;
> +chk_error(timer_settime(tid, 0, &it, NULL));
> +
> +spawn_threads();
> +
> +do {
> +usleep(1000);
> +} while (total_signal_count < 2000);
> +
> +printf("shutting down after: %d signals\n", total_signal_count);
> +
> +close_threads();
> +
> +chk_error(timer_delete(tid));
> +}
> +
> +int main(int argc, char **argv)
> +{
> +test_signals();
> +return 0;
> +}
> diff --git a/tests/tcg/multiarch/Makefile.target 
> b/tests/tcg/multiarch/Makefile.target
> index a3a751723d..3f283eabe6 100644
> --- a/tests/tcg/multiarch/Makefile.target
> +++ b/tests/tcg/multiarch/Make

Re: [PATCH] Fix typo in CFI build documentation

2021-04-21 Thread Philippe Mathieu-Daudé
Hi Serge,

Cc'ing qemu-trivial@

On 4/20/21 5:48 PM, serge-sans-paille wrote:
> Signed-off-by: serge-sans-paille 

It looks like your git-config is misconfigured... Maybe you used
an incorrect profile :) Can you repost please?

> ---
>  docs/devel/control-flow-integrity.rst | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

For the fix:
Reviewed-by: Philippe Mathieu-Daudé 

> 
> diff --git a/docs/devel/control-flow-integrity.rst 
> b/docs/devel/control-flow-integrity.rst
> index d89d707..e6b73a4 100644
> --- a/docs/devel/control-flow-integrity.rst
> +++ b/docs/devel/control-flow-integrity.rst
> @@ -39,7 +39,7 @@ later).
>  Given the use of LTO, a version of AR that supports LLVM IR is required.
>  The easies way of doing this is by selecting the AR provided by LLVM::
>  
> - AR=llvm-ar-9 CC=clang-9 CXX=lang++-9 /path/to/configure --enable-cfi
> + AR=llvm-ar-9 CC=clang-9 CXX=clang++-9 /path/to/configure --enable-cfi
>  
>  CFI is enabled on every binary produced.
>  
> @@ -131,7 +131,7 @@ lld with version 11+.
>  In other words, to compile with fuzzing and CFI, clang 11+ is required, and
>  lld needs to be used as a linker::
>  
> - AR=llvm-ar-11 CC=clang-11 CXX=lang++-11 /path/to/configure --enable-cfi \
> + AR=llvm-ar-11 CC=clang-11 CXX=clang++-11 /path/to/configure --enable-cfi \
> -enable-fuzzing --extra-ldflags="-fuse-ld=lld"
>  
>  and then, compile the fuzzers as usual.
> 




Re: [PATCH v4 16/19] qapi/expr.py: Add docstrings

2021-04-21 Thread Markus Armbruster
John Snow  writes:

[...]
> I've made a re-spin. Let's try something new, if you don't mind:
>
> I've pushed an "almost v5" copy onto my gitlab, where edits made against
> this patch are in their own commit so that all of the pending edits I've 
> made are easily visible.
>
> Here's the "merge request", which I made against my own fork of master:
> https://gitlab.com/jsnow/qemu/-/merge_requests/1/diffs
>
> (It's marked "WIP", so there's no risk of me accidentally merging it -- 
> and if I did, it would be to my own "master" branch, so no worries about 
> us goofing this up.)
>
> If you click "Commits (21)" at the top, underneath "WIP: 
> python-qapi-cleanup-pt3", you can see the list of commits in the re-spin.
>
> (Four of these commits are the DO-NOT-MERGE ones I carry around as a 
> testing pre-requisite.)
>
>  From here, you can see the "[RFC] docstring diff" patch which shows all 
> the edits I've made so far based on your feedback and my tinkering.
>
> https://gitlab.com/jsnow/qemu/-/merge_requests/1/diffs?commit_id=3f0e9fb71304edb381ce3b9bf0ff08624fb277bc
>
> I invite you to leave feedback here on this view (and anywhere else in 
> the series that still needs adjusting, if you are so willing to humor 
> me) by highlighting the line and clicking the comment box icon on the 
> left. If you left-click and drag the comment box, you can target a range 
> of lines.
>
> (You can even propose a diff directly using this method, which allows me 
> to just accept your proposal directly.)
>
> If you leave any comments here, I can resolve each individual nugget of 
> feedback by clicking "Resolve Thread" in my view, which will help me 
> keep track of which items I believe I have addressed and which items I 
> have not. This will help me make sure I don't miss any of your feedback, 
> and it helps me keep track of what edits I've made for the next changelog.
>
> Willing to try it out?
>
> Once we're both happy with it, I will send it back to the list for final 
> assessment using our traditional process. Anyone else who wants to come 
> comment on the gitlab draft is of course more than welcome to.

I have only a few minor remarks, and I'm too lazy to create a gitlab
account just for them.

* Commit 3f0e9fb713 qapi/expr: [RFC] docstring diff

  - You mixed up check_name_lower() and check_name_camel()

  - Nitpick: check_defn_name_str() has inconsistent function name
markup.

  - I'd like to suggest a tweak of check_defn_name_str() :param meta:

  That's all.  Converged quickly.  Nice!  Incremental diff appended.

* Old "[PATCH v4 17/19] qapi/expr.py: Use tuples instead of lists for
  static data" is gone.  I think this leaves commit 913e3fd6f8's "Later
  patches will make use of that" dangling.  Let's not drop old PATCH 17.
  Put it right after 913e3fd6f8 if that's trivial.  If not, put it
  wherever it creates the least work for you.


diff --git a/scripts/qapi/expr.py b/scripts/qapi/expr.py
index f2bb92ab79..5c9060cb1b 100644
--- a/scripts/qapi/expr.py
+++ b/scripts/qapi/expr.py
@@ -124,7 +124,7 @@ def check_name_lower(name: str, info: QAPISourceInfo, 
source: str,
  permit_upper: bool = False,
  permit_underscore: bool = False) -> None:
 """
-Ensure that ``name`` is a valid user defined type name.
+Ensure that ``name`` is a valid command or member name.
 
 This means it must be a valid QAPI name as checked by
 `check_name_str()`, but where the stem prohibits uppercase
@@ -147,7 +147,7 @@ def check_name_lower(name: str, info: QAPISourceInfo, 
source: str,
 
 def check_name_camel(name: str, info: QAPISourceInfo, source: str) -> None:
 """
-Ensure that ``name`` is a valid command or member name.
+Ensure that ``name`` is a valid user-defined type name.
 
 This means it must be a valid QAPI name as checked by
 `check_name_str()`, but where the stem must be in CamelCase.
@@ -168,14 +168,14 @@ def check_defn_name_str(name: str, info: QAPISourceInfo, 
meta: str) -> None:
 Ensure that ``name`` is a valid definition name.
 
 Based on the value of ``meta``, this means that:
-  - 'event' names adhere to `check_name_upper`.
-  - 'command' names adhere to `check_name_lower`.
+  - 'event' names adhere to `check_name_upper()`.
+  - 'command' names adhere to `check_name_lower()`.
   - Else, meta is a type, and must pass `check_name_camel()`.
 These names must not end with ``Kind`` nor ``List``.
 
 :param name: Name to check.
 :param info: QAPI schema source file information.
-:param meta: Type name of the QAPI expression.
+:param meta: Meta-type name of the QAPI expression.
 
 :raise QAPISemError: When ``name`` fails validation.
 """




Re: [PATCH v3 02/33] block/nbd: fix how state is cleared on nbd_open() failure paths

2021-04-21 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:40AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We have two "return error" paths in nbd_open() after
> nbd_process_options(). Actually we should call nbd_clear_bdrvstate()
> on these paths. Interesting that nbd_process_options() calls
> nbd_clear_bdrvstate() by itself.
> 
> Let's fix leaks and refactor things to be more obvious:
> 
> - intialize yank at top of nbd_open()
> - move yank cleanup to nbd_clear_bdrvstate()
> - refactor nbd_open() so that all failure paths except for
>   yank-register go through nbd_clear_bdrvstate()
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 36 ++--
>  1 file changed, 18 insertions(+), 18 deletions(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index 739ae2941f..a407a3814b 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -152,8 +152,12 @@ static void 
> nbd_co_establish_connection_cancel(BlockDriverState *bs,
>  static int nbd_client_handshake(BlockDriverState *bs, Error **errp);
>  static void nbd_yank(void *opaque);
>  
> -static void nbd_clear_bdrvstate(BDRVNBDState *s)
> +static void nbd_clear_bdrvstate(BlockDriverState *bs)
>  {
> +BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
> +
> +yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
> +
>  object_unref(OBJECT(s->tlscreds));
>  qapi_free_SocketAddress(s->saddr);
>  s->saddr = NULL;
> @@ -2279,9 +2283,6 @@ static int nbd_process_options(BlockDriverState *bs, 
> QDict *options,
>  ret = 0;
>  
>   error:
> -if (ret < 0) {
> -nbd_clear_bdrvstate(s);
> -}
>  qemu_opts_del(opts);
>  return ret;
>  }
> @@ -2292,11 +2293,6 @@ static int nbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  int ret;
>  BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
>  
> -ret = nbd_process_options(bs, options, errp);
> -if (ret < 0) {
> -return ret;
> -}
> -
>  s->bs = bs;
>  qemu_co_mutex_init(&s->send_mutex);
>  qemu_co_queue_init(&s->free_sema);
> @@ -2305,20 +2301,23 @@ static int nbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  return -EEXIST;
>  }
>  
> +ret = nbd_process_options(bs, options, errp);
> +if (ret < 0) {
> +goto fail;
> +}
> +
>  /*
>   * establish TCP connection, return error if it fails
>   * TODO: Configurable retry-until-timeout behaviour.
>   */
>  if (nbd_establish_connection(bs, s->saddr, errp) < 0) {
> -yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
> -return -ECONNREFUSED;
> +ret = -ECONNREFUSED;
> +goto fail;
>  }
>  
>  ret = nbd_client_handshake(bs, errp);

Not that this was introduced by this patch, but while you're at it:
AFAICT nbd_client_handshake() calls yank_unregister_instance() on some
error path(s); I assume this needs to go too, otherwise it's called
twice (and asserts).

Roman.

>  if (ret < 0) {
> -yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
> -nbd_clear_bdrvstate(s);
> -return ret;
> +goto fail;
>  }
>  /* successfully connected */
>  s->state = NBD_CLIENT_CONNECTED;
> @@ -2330,6 +2329,10 @@ static int nbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  aio_co_schedule(bdrv_get_aio_context(bs), s->connection_co);
>  
>  return 0;
> +
> +fail:
> +nbd_clear_bdrvstate(bs);
> +return ret;
>  }
>  
>  static int nbd_co_flush(BlockDriverState *bs)
> @@ -2373,11 +2376,8 @@ static void nbd_refresh_limits(BlockDriverState *bs, 
> Error **errp)
>  
>  static void nbd_close(BlockDriverState *bs)
>  {
> -BDRVNBDState *s = bs->opaque;
> -
>  nbd_client_close(bs);
> -yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
> -nbd_clear_bdrvstate(s);
> +nbd_clear_bdrvstate(bs);
>  }
>  
>  /*



[Bug 1368178] Re: Windows ME falsely detects qemu's videocards as Number Nine Imagine 128

2021-04-21 Thread Thomas Huth
** Tags removed: 128 edition millenium

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1368178

Title:
  Windows ME falsely detects qemu's videocards as Number Nine Imagine
  128

Status in QEMU:
  New

Bug description:
  A fresh installation of Windows Millennium Edition (Windows ME,
  WinME) as guest OS on QEMU interprets QEMU's video cards as a Number Nine
  Imagine 128, with the consequence that

  1. It is impossible to change color depth.
  2. WinME uses the i128.drv driver that is shipped with WinME.
  3. Forcing WinME to use other drivers has no effect.

  
  It also doesn't matter what option for -vga was given to QEMU on the command line:
  cirrus, std, vmware, qxl etc. all have no effect; the video card detected
by Windows ME stays a Number Nine Imagine 128.

  Even selecting another driver in WinME and forcing WinME to use
  drivers such as the Cirrus Logic 5446 PCI driver has no effect.

  I also want to mention that the BIOS isn't detected by WinME properly.
  The device manager of WinME shows errors with the Plug & Play BIOS driver 
BIOS.vxd.

  
  This is the QEMU version:

  # qemu-system-i386 --version  


  QEMU emulator version 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.3), Copyright (c) 
2003-2008 Fabrice Bellard  

  And this was the complete command line, that was given: 
  # sudo /usr/bin/qemu-system-i386 -hda WinME_QEMU.img -cdrom drivers.iso -boot 
c -no-acpi -no-hpet -soundhw sb16 -net nic -cpu pentium3 -m 256 -vga cirrus

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1368178/+subscriptions



[PATCH 0/2] plugins: Freeing allocated values in hash tables.

2021-04-21 Thread Mahmoud Mandour
A hash table made using ``g_hash_table_new`` requires manually
freeing any dynamically allocated keys/values. The two patches
in this series fix this issue in the hotblocks and hotpages plugins.
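
For illustration, here is a minimal standalone sketch of the difference
(my own example, not code from the plugins; it only assumes GLib):

#include <glib.h>

int main(void)
{
    /* g_hash_table_new() does not know how to free keys or values. */
    GHashTable *table = g_hash_table_new(g_direct_hash, g_direct_equal);
    int *value = g_new(int, 1);

    *value = 42;
    g_hash_table_insert(table, GINT_TO_POINTER(1), value);

    /* The dynamically allocated values must be freed by hand, e.g.
     * via the value list, which is what the two patches do: */
    GList *values = g_hash_table_get_values(table);
    g_list_free_full(values, g_free); /* frees the list and each value */

    g_hash_table_destroy(table);      /* frees only the table itself */
    return 0;
}

(g_hash_table_new_full() with a value destroy function would be the other
option, but then the values must not also be freed via the list.)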

Mahmoud Mandour (2):
  plugins/hotblocks: Properly freed the hash table values
  plugins/hotpages: Properly freed the hash table values

 contrib/plugins/hotblocks.c | 3 ++-
 contrib/plugins/hotpages.c  | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

-- 
2.25.1




[PATCH 2/2] plugins/hotpages: Properly freed the hash table values

2021-04-21 Thread Mahmoud Mandour
Freed the values stored in the hash table ``pages`` by freeing, with
``g_list_free_full()``, the sorted list returned by
``g_hash_table_get_values()``, and destroyed the hash table afterward.

Signed-off-by: Mahmoud Mandour 
---
 contrib/plugins/hotpages.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/contrib/plugins/hotpages.c b/contrib/plugins/hotpages.c
index bf53267532..2094ebd15f 100644
--- a/contrib/plugins/hotpages.c
+++ b/contrib/plugins/hotpages.c
@@ -94,9 +94,10 @@ static void plugin_exit(qemu_plugin_id_t id, void *p)
rec->cpu_read, rec->reads,
rec->cpu_write, rec->writes);
 }
-g_list_free(it);
+g_list_free_full(it, g_free);
 }
 
+g_hash_table_destroy(pages);
 qemu_plugin_outs(report->str);
 }
 
-- 
2.25.1




[PATCH 1/2] plugins/hotblocks: Properly freed the hash table values

2021-04-21 Thread Mahmoud Mandour
Freed the values stored in the hash table ``hotblocks`` by freeing, with
``g_list_free_full()``, the sorted list returned by
``g_hash_table_get_values()``, and destroyed the hash table afterward.

Signed-off-by: Mahmoud Mandour 
---
 contrib/plugins/hotblocks.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/contrib/plugins/hotblocks.c b/contrib/plugins/hotblocks.c
index 4b08340143..64692c0670 100644
--- a/contrib/plugins/hotblocks.c
+++ b/contrib/plugins/hotblocks.c
@@ -68,10 +68,11 @@ static void plugin_exit(qemu_plugin_id_t id, void *p)
rec->insns, rec->exec_count);
 }
 
-g_list_free(it);
+g_list_free_full(it, g_free);
 g_mutex_unlock(&lock);
 }
 
+g_hash_table_destroy(hotblocks);
 qemu_plugin_outs(report->str);
 }
 
-- 
2.25.1




Re: [PATCH v2] i386: Add ratelimit for bus locks acquired in guest

2021-04-21 Thread Eduardo Habkost
On Wed, Apr 21, 2021 at 02:26:42PM +0800, Chenyi Qiang wrote:
> Hi, Eduardo, thanks for your comments!
> 
> 
> On 4/21/2021 12:34 AM, Eduardo Habkost wrote:
> > Hello,
> > 
> > Thanks for the patch.  Comments below:
> > 
> > On Tue, Apr 20, 2021 at 05:37:36PM +0800, Chenyi Qiang wrote:
> > > Virtual Machines can exploit bus locks to degrade the performance of the
> > > system. To address this kind of performance DoS attack, bus lock VM exit
> > > is introduced in KVM and it will report the bus locks detected in the
> > > guest, which can help userspace to enforce throttling policies.
> > > 
> > 
> > Is there anything today that would protect the system from
> > similar attacks from userspace with access to /dev/kvm?
> > 
> 
> I can't fully understand what you mean by "similar attack with access to
> /dev/kvm", but there are some similar detection features on bare
> metal.

What I mean is: you say guests can make a performance DoS attack
on the host, and your patch mitigates that.

What would be the available methods to prevent untrusted
userspace running on the host with access to /dev/kvm from making
a similar DoS attack on the host?

> 
> 1. Split lock detection:
> https://lore.kernel.org/lkml/158031147976.396.8941798847364718785.tip-bot2@tip-bot2/
> Some CPUs can raise an #AC trap when a split lock is attempted.

Would split_lock_detect=fatal be enough to prevent the above attacks?

Is split_lock_detect=fatal the only available way to prevent them?

> 
> 2. Bus lock Debug Exception:
> https://lore.kernel.org/lkml/20210322135325.682257-1-fenghua...@intel.com/
> The kernel can be notified by an #DB trap after a user instruction acquires
> a bus lock and is executed.

I see a rate limit option mentioned at the above URL.  Would a
host kernel bus lock rate limit option make this QEMU patch
redundant?

-- 
Eduardo




Re: [PATCH v3] memory: Directly dispatch alias accesses on origin memory region

2021-04-21 Thread Peter Xu
On Wed, Apr 21, 2021 at 11:33:55AM +0100, Mark Cave-Ayland wrote:
> On 20/04/2021 21:59, Peter Xu wrote:
> 
> > > > I agree with this sentiment: it has taken me a while to figure out what
> > > > was happening, and that was only because I spotted accesses being
> > > > rejected with -d guest_errors.
> > > > 
> > > >  From my perspective the names memory_region_dispatch_read() and
> > > > memory_region_dispatch_write() were the misleading part, although I
> > > > remember thinking it odd whilst trying to use them that I had to start
> > > > delving into sections etc. just to recurse a memory access.
> 
> > I think it should always be a valid request to trigger memory access via 
> > the MR
> > layer, say, what if the caller has no address space context at all?
> 
> For these cases you can just use the global default address_space_memory
> which is the solution I used in the second version of my patch e.g.
> 
> val = address_space_ldl_be(&address_space_memory, addr, attrs, &r);

Yes, but what if it's an MR that does not belong to address_space_memory?  We
can still use the other AS; however, that's something we don't actually need to
worry about if we can directly operate on MRs.

The other thing is that if there are plenty of layers of a deeply hidden MR,
then we'll also need to cache all the offsets (e.g., mr A is a subregion of
mr B, B is a subregion of C, and C belongs to an AS; then we need to record
the offsets of A+B+C to finally be able to access this MR from the AS?),
which seems like overkill if we know exactly that we want to operate on this mr.
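
(For illustration only -- a sketch of what operating directly on the MR
could look like; mr->alias and mr->alias_offset are real MemoryRegion
fields, but this helper itself is made up:)

static MemoryRegion *mr_resolve_alias(MemoryRegion *mr, hwaddr *addr)
{
    /*
     * Fold the alias offsets into the address until we reach the
     * origin MR; the access can then be dispatched on the leaf
     * region directly, without going through an AS.
     */
    while (mr->alias) {
        *addr += mr->alias_offset;
        mr = mr->alias;
    }
    return mr;
}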

I randomly looked at memory_region_dispatch_write(), and I think most of its
callers indeed do not have an AS context.  As Peter Maydell mentioned in the
other thread, if there are plenty of users of this interface, maybe there's a
reason?  I'm wondering whether it's possible that they "worked" simply
because current users haven't used aliases that much.

> 
> > From the name of memory_region_dispatch_write|read I don't see either
> > why we should not take care of alias mrs.  That's also the reason I'd
> > even prefer this patch rather than an assert.
> 
> The problem I see here is that this patch is breaking the abstraction
> between generating the flatview from the memory topology and dispatching a
> request to it.
> 
> If you look at the existing code then aliased memory regions are
> de-referenced at flatview construction time, so you end up with a flatview
> where each range points to a target (leaf or terminating) memory region plus
> offset. You can see this if you compare the output of "info mtree" with
> "info mtree -f" in the monitor.

Yes, it's a fair point.  However, per my understanding, the address space
layer is solving some other problems rather than doing memory access on its
own.

Starting from the flatview: AS operations use the flatview indeed, and also
take care of translations (e.g. flatview_translate); however, all of them are
logic to route an AS memory access to a memory region only.  If we have the
mr in hand, I see nothing helping in there but extra (useless) work to
resolve the mr.

One thing I noticed that may be tricky is what we do in prepare_mmio_access()
before landing in e.g. memory_region_dispatch_write() in
flatview_write_continue(), as there are tricks there around taking the BQL or
flushing MMIO buffers. I'm not sure whether it means that, if we're going to
have an MR-layer memory access API, the user should be aware of them (e.g.,
should the BQL be taken before calling the MR APIs?).  Or we can simply move
prepare_mmio_access() into the new memory region API too?  In all cases,
that's still not an obvious reason not to have the memory region API on its
own.

We also calculate the size of memory ops (memory_access_size), but what if we
know it beforehand?  That could be a redundant calculation too.

Then we finally land in memory_region_dispatch_write().

So memory_region_dispatch_write() can be seen as the point where the MR
access really starts.  I don't know whether it works for RAM, but I think
that's not a major concern either.  Then there's a fair point to export it to
work for all general cases, including aliasing.

My point may not stand solid enough as I didn't really use the mr API, so I
could have overlooked things... so I think if the AS APIs will work for both
of you then why not. :) However, I just wanted to express my understanding
that the AS APIs mainly solve problems other than those of (if there's going
to be one) the memory region API, so if we're sure we don't have those AS
problems (routing to the mr not needed, as we have the mr pointer; offsetting
not needed, as we even know the direct offset in the mr to write to; we know
exactly the size to operate on, and so on..) then it's a valid request to ask
for a memory-region-layer API.

Thanks,

-- 
Peter Xu




Re: [PATCH v2] i386: Add ratelimit for bus locks acquired in guest

2021-04-21 Thread Xiaoyao Li

On 4/21/2021 10:12 PM, Eduardo Habkost wrote:

On Wed, Apr 21, 2021 at 02:26:42PM +0800, Chenyi Qiang wrote:

Hi, Eduardo, thanks for your comments!


On 4/21/2021 12:34 AM, Eduardo Habkost wrote:

Hello,

Thanks for the patch.  Comments below:

On Tue, Apr 20, 2021 at 05:37:36PM +0800, Chenyi Qiang wrote:

Virtual Machines can exploit bus locks to degrade the performance of the
system. To address this kind of performance DoS attack, bus lock VM exit
is introduced in KVM and it will report the bus locks detected in the
guest, which can help userspace to enforce throttling policies.



Is there anything today that would protect the system from
similar attacks from userspace with access to /dev/kvm?



I can't fully understand what you mean by "similar attack with access to
/dev/kvm", but there are some similar detection features on bare
metal.


What I mean is: you say guests can make a performance DoS attack
on the host, and your patch mitigates that.

What would be the available methods to prevent untrusted
userspace running on the host with access to /dev/kvm from making
a similar DoS attack on the host?



1. Split lock detection:
https://lore.kernel.org/lkml/158031147976.396.8941798847364718785.tip-bot2@tip-bot2/
Some CPUs can raise an #AC trap when a split lock is attempted.


Would split_lock_detect=fatal be enough to prevent the above attacks?


NO.

There are two types of bus lock:
1. split lock - a lock on cacheable memory where the operand spans two
cache lines.
2. non-wb lock - a lock on non-writeback memory (you can find it in
Intel ISE chapter 8,
https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html)


Split lock detection can only prevent 1).
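
For illustration, a minimal user-space sketch of 1) -- my own example,
not from any patch here; it assumes an x86 host with 64-byte cache lines:

#include <stdint.h>
#include <stdlib.h>

int main(void)
{
    /* 128 bytes aligned to an assumed 64-byte cache line. */
    uint8_t *buf = aligned_alloc(64, 128);
    /* A 4-byte operand at offset 62 straddles the line boundary. */
    uint32_t *p = (uint32_t *)(buf + 62);

    /*
     * The LOCK-prefixed read-modify-write on this misaligned operand
     * is a split lock (the misaligned access is deliberate -- it is
     * exactly what the detection features catch).
     */
    __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST);

    free(buf);
    return 0;
}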


Is split_lock_detect=fatal the only available way to prevent them?


As above, 2) non-wb lock can be prevented by the "non-wb lock disable" feature.





2. Bus lock Debug Exception:
https://lore.kernel.org/lkml/20210322135325.682257-1-fenghua...@intel.com/
The kernel can be notified by an #DB trap after a user instruction acquires
a bus lock and is executed.


I see a rate limit option mentioned at the above URL.  Would a
host kernel bus lock rate limit option make this QEMU patch
redundant?



No. The bus lock debug exception cannot be used to detect bus locks that
happen in the guest (VMX non-root mode).


We have a patch to virtualize this feature for the guest:
https://lore.kernel.org/kvm/20210202090433.13441-1-chenyi.qi...@intel.com/


so that the guest will have its own setting of the bus lock debug exception, on or off.

What's more important is that, even if we force set
MSR_DEBUGCTL.BUS_LOCK_DETECT for the guest, the guest can still escape it,
because the bus lock #DB is a trap, which is delivered after the instruction
completes. If the instruction that acquires the bus lock subsequently
faults, e.g. with a #PF, then no bus lock #DB is generated. But the bus
lock does happen.


But with bus lock VM exit, even if the instruction faults, it will cause a
BUS LOCK VM exit.






Transferring bugs from Launchpad to Gitlab (was: Re: Upstream bug comments on the mailing list and Launchpad)

2021-04-21 Thread Thomas Huth

On 19/04/2021 18.04, Peter Maydell wrote:

On Mon, 19 Apr 2021 at 16:52, Thomas Huth  wrote:


On 15/04/2021 11.49, Kashyap Chamarthy wrote:
...

PS: I recall there was discussion on the list of moving to a different
  GitLab tracker.  As Thomas Huth mentioned on IRC, more people seem
  to have an account on GitLab than on Launchpad.


I think we basically agreed that we want to migrate to gitlab, but e.g.
Peter suggested that we copy the open bugs on launchpad to the gitlab
tracker first, so that we don't have two active trackers at the same time...
thus this needs someone with enough spare time to work on such a conversion
script first...


In the meantime, can we disable bug reporting on gitlab? It's
confusing to users to have two places, one of which is not
checked by anybody. We already have two stray bugreports:
https://gitlab.com/qemu-project/qemu/-/issues


I'd really like to see us finally do the move. I'm very much ignorant of
Python, but since apparently nobody else has spare time to work on this at
the moment, I've now done some copy-n-paste coding and came up with a script
that can transfer bugs from Launchpad to Gitlab (see the attachment). You
need a "private token" passed to the script via the GITLAB_PRIVATE_TOKEN
environment variable so that it is able to access gitlab (such a token can
be generated in your private settings pages on gitlab). Some transferred bugs
can be seen here:


 https://gitlab.com/thuth/trackertest/-/issues

Notes:

- Not all information is transferred, e.g. no attachments yet.
  But considering that there is a link to the original launchpad
  bug, you can always look up the information there, so I think
  that should be fine.

- The script currently also adds the "tags" from the launchpad
  bug as "labels" in gitlab - do we want that, or do we rather
  want to restrict our labels in gitlab to a smaller set?

- Formatting of the bug description and comments is a little bit
  tricky, since the text is pre-formatted ASCII in launchpad and
  markdown on gitlab ... For now, I put the text from launchpad
  into "<pre>" HTML tags on gitlab, which should result in usable
  texts most of the time (see the small sketch after these notes).

- Biggest problem so far: After transferring 50 bugs or so,
  gitlab refuses to take more bugs automatically with the
  message: 'Your issue has been recognized as spam. Please,
  change the content or solve the reCAPTCHA to proceed.'
  ... and solving captchas in a script is certainly way out
  of scope here. So if we want to copy over all bugs from
  launchpad, we might need to do that in chunks.

- Do we want to auto-close the bugs in launchpad after we've
  transferred them? Or should they stay open until the bugs
  have been addressed? I tend to close them (with an appropriate
  comment added to the bug, too), but we could also mark the
  transferred bugs with a tag instead.

- Do we want to transfer also old "Wishlist" bug tickets from
  Launchpad? I think it's quite unlikely that we will ever
  address them... maybe we should close them on Launchpad with
  an appropriate comment?
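
(A tiny sketch of the formatting step mentioned above -- illustrative
only, the helper name is made up:)

def lp_text_to_gitlab(text):
    # Keep Launchpad's pre-formatted ASCII intact in GitLab's
    # markdown renderer by wrapping it in <pre> HTML tags.
    return '<pre>' + text + '</pre>'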

 Thomas
#!/usr/bin/env python3

import argparse
import logging
import os
import re
import sys
import time
import gitlab

from launchpadlib.launchpad import Launchpad
import lazr.restfulclient.errors

LOG_FORMAT = "%(levelname)s - %(funcName)s - %(message)s"
logging.basicConfig(format=LOG_FORMAT)
LOG = logging.getLogger(__name__)

parser = argparse.ArgumentParser(description="Transfer bugs from Launchpad to Gitlab.")
parser.add_argument('-l',
'--lp-project-name',
dest='lp_project_name',
default='qemu',
help='The Launchpad project name (qemu).')
parser.add_argument('-g',
'--gl-project-id',
dest='gl_project_id',
help='The Gitlab project ID.')
parser.add_argument('--verbose', '-v',
help='Enable debug logging.',
action="store_true")
parser.add_argument('--search-text', '-s',
dest='search_text',
help='Look up bugs by searching for text.')
args = parser.parse_args()
LOG.info(args)

LOG.setLevel(logging.DEBUG if args.verbose else logging.INFO)


def get_launchpad():
	cache_dir = os.path.expanduser("~/.launchpadlib/cache/")
	if not os.path.exists(cache_dir):
		os.makedirs(cache_dir, 0o700)
	launchpad = Launchpad.login_anonymously(args.lp_project_name + '-bugs',
	'production', cache_dir)
	return launchpad


def show_bug_task(bug_task):
	print('Bug #%d - %s' % (bug_task.bug.id, str(bug_task.bug.title)))
	if args.verbose:
		print(' - Description: %s' % bug_task.bug.description)
		print(' - Tags: %s' % bug_task.bug.tags)
		print(' - Status: %s' % bug_task.status)
		print(' - Assignee: %s' % bug_task.assignee)
		for message in bug_task.bug.messages:
			print('  - Message: %s' % message.content)


def transfer_to_gitlab(launchpad, project, bug_task):
	bug = bug

[PATCH v2] vfio-ccw: Permit missing IRQs

2021-04-21 Thread Eric Farman
Commit 690e29b91102 ("vfio-ccw: Refactor ccw irq handler") changed
one of the checks for the IRQ notifier registration from saying
"the host needs to recognize the only IRQ that exists" to saying
"the host needs to recognize ANY IRQ that exists."

And this worked fine, because the subsequent change to support the
CRW IRQ notifier doesn't get into this code when running on an older
kernel, thanks to a guard by a capability region. The later addition
of the REQ(uest) IRQ by commit b2f96f9e4f5f ("vfio-ccw: Connect the
device request notifier") broke this assumption because there is no
matching capability region. Thus, running new QEMU on an older
kernel fails with:

  vfio: unexpected number of irqs 2

Let's adapt the message here so that it gives a better clue about
which IRQ is missing.

Furthermore, let's make a failure to register the REQ(uest) IRQ
non-fatal, to permit running vfio-ccw on a newer QEMU with an
older kernel.

Fixes: b2f96f9e4f5f ("vfio-ccw: Connect the device request notifier")
Signed-off-by: Eric Farman 
---

Notes:
v1->v2:
 - Keep existing "invalid number of IRQs" message with adapted text [CH]
 - Move the "is this an error" test to the registration of the IRQ in
   question, rather than making it allowable for any IRQ mismatch [CH]
 - Drop Fixes tag for initial commit [EF]

v1: 
https://lore.kernel.org/qemu-devel/20210419184906.2847283-1-far...@linux.ibm.com/

 hw/vfio/ccw.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
index b2df708e4b..400bc07fe2 100644
--- a/hw/vfio/ccw.c
+++ b/hw/vfio/ccw.c
@@ -412,8 +412,8 @@ static void vfio_ccw_register_irq_notifier(VFIOCCWDevice 
*vcdev,
 }
 
 if (vdev->num_irqs < irq + 1) {
-error_setg(errp, "vfio: unexpected number of irqs %u",
-   vdev->num_irqs);
+error_setg(errp, "vfio: IRQ %u not available (number of irqs %u)",
+   irq, vdev->num_irqs);
 return;
 }
 
@@ -696,13 +696,15 @@ static void vfio_ccw_realize(DeviceState *dev, Error **errp)
 
 vfio_ccw_register_irq_notifier(vcdev, VFIO_CCW_REQ_IRQ_INDEX, &err);
 if (err) {
-goto out_req_notifier_err;
+/*
+ * Report this error, but do not make it a failing condition.
+ * Lack of this IRQ in the host does not prevent normal operation.
+ */
+error_report_err(err);
 }
 
 return;
 
-out_req_notifier_err:
-vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_CRW_IRQ_INDEX);
 out_crw_notifier_err:
 vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_IO_IRQ_INDEX);
 out_io_notifier_err:
-- 
2.25.1




Re: [PATCH v2] i386: Add ratelimit for bus locks acquired in guest

2021-04-21 Thread Eduardo Habkost
On Wed, Apr 21, 2021 at 10:50:10PM +0800, Xiaoyao Li wrote:
> On 4/21/2021 10:12 PM, Eduardo Habkost wrote:
> > On Wed, Apr 21, 2021 at 02:26:42PM +0800, Chenyi Qiang wrote:
> > > Hi, Eduardo, thanks for your comments!
> > > 
> > > 
> > > On 4/21/2021 12:34 AM, Eduardo Habkost wrote:
> > > > Hello,
> > > > 
> > > > Thanks for the patch.  Comments below:
> > > > 
> > > > On Tue, Apr 20, 2021 at 05:37:36PM +0800, Chenyi Qiang wrote:
> > > > > > Virtual Machines can exploit bus locks to degrade the performance of
> > > > > > the system. To address this kind of performance DoS attack, bus lock
> > > > > > VM exit is introduced in KVM and it will report the bus locks detected
> > > > > > in the guest, which can help userspace to enforce throttling policies.
> > > > > 
> > > > 
> > > > Is there anything today that would protect the system from
> > > > similar attacks from userspace with access to /dev/kvm?
> > > > 
> > > 
> > > I can't fully understand your meaning for "similar attack with access to
> > > /dev/kvm". But there are some similar associated detection features on 
> > > bare
> > > metal.
> > 
> > What I mean is: you say guests can make a performance DoS attack
> > on the host, and your patch mitigates that.
> > 
> > What would be the available methods to prevent untrusted
> > userspace running on the host with access to /dev/kvm from making
> > a similar DoS attack on the host?

Thanks for all the clarifications below.  Considering them,
what's the answer to the question above?

> > 
> > > 
> > > 1. Split lock 
> > > detection:https://lore.kernel.org/lkml/158031147976.396.8941798847364718785.tip-bot2@tip-bot2/
> > > Some CPUs can raise an #AC trap when a split lock is attempted.
> > 
> > Would split_lock_detect=fatal be enough to prevent the above attacks?
> 
> NO.
> 
> There are two types of bus lock:
> 1. split lock - a lock on cacheable memory where the operand spans two cache
> lines.
> 2. non-wb lock - a lock on non-writeback memory (you can find it in Intel
> ISE chapter 8, 
> https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html)
> 
> split lock detection can only prevent 1)
> 
> > Is split_lock_detect=fatal the only available way to prevent them?
> 
> as above, 2) non-wb lock can be prevented by "non-wb lock disable" feature

Bus lock VM exit applies to both 1 and 2, correct?

> 
> > 
> > > 
> > > 2. Bus lock Debug Exception:
> > > https://lore.kernel.org/lkml/20210322135325.682257-1-fenghua...@intel.com/
> > > The kernel can be notified by an #DB trap after a user instruction 
> > > acquires
> > > a bus lock and is executed.
> > 
> > I see a rate limit option mentioned at the above URL.  Would a
> > host kernel bus lock rate limit option make this QEMU patch
> > redundant?
> > 
> 
> No. Bus lock Debug exception cannot be used to detect bus locks that happen
> in the guest (vmx non-root mode).
> 
> We have a patch to virtualize this feature for the guest
> https://lore.kernel.org/kvm/20210202090433.13441-1-chenyi.qi...@intel.com/
> 
> so that the guest will have its own setting of bus lock debug exception on or off.
> 
> What's more important is that, even if we force-set
> MSR_DEBUGCTL.BUS_LOCK_DETECT for the guest, the guest can still escape from it,
> because bus lock #DB is a trap which is delivered after the instruction
> completes. If the instruction that acquires the bus lock subsequently faults,
> e.g. with a #PF, then no bus lock #DB is generated. But the bus lock does happen.
> 
> But with bus lock VM exit, even if the instruction faults, it will cause a BUS
> LOCK VM exit.
> 
> 

-- 
Eduardo




Re: [PATCH v3 2/3] vhost-user-blk: perform immediate cleanup if disconnect on initialization

2021-04-21 Thread Kevin Wolf
Am 25.03.2021 um 16:12 hat Denis Plotnikov geschrieben:
> Commit 4bcad76f4c39 ("vhost-user-blk: delay vhost_user_blk_disconnect")
> introduced postponing vhost_dev cleanup aiming to eliminate qemu aborts
> because of connection problems with vhost-blk daemon.
> 
> However, it introduces a new problem. Now, any communication errors
> during execution of vhost_dev_init() called by vhost_user_blk_device_realize()
> lead to qemu abort on assert in vhost_dev_get_config().
> 
> This happens because vhost_user_blk_disconnect() is postponed but
> it should have dropped s->connected flag by the time
> vhost_user_blk_device_realize() performs a new connection opening.
> On the connection opening, vhost_dev initialization in
> vhost_user_blk_connect() relies on the s->connected flag and
> if it's not dropped, it skips vhost_dev initialization and returns
> with success. Then, vhost_user_blk_device_realize()'s execution flow
> goes to vhost_dev_get_config() where it's aborted on the assert.
> 
> To fix the problem this patch adds immediate cleanup on device
> initialization (in vhost_user_blk_device_realize()) using different
> event handlers for initialization and operation introduced in the
> previous patch.
> On initialization (in vhost_user_blk_device_realize()) we fully
> control the initialization process. At that point, nobody can use the
> device since it isn't initialized and we don't need to postpone any
> cleanups, so we can do cleanup right away when there is a communication
> problem with the vhost-blk daemon.
> On operation we leave it as is, since the disconnect may happen when
> the device is in use, so the device users may want to use vhost_dev's data
> to do rollback before vhost_dev is re-initialized (e.g. in 
> vhost_dev_set_log()).
> 
> Signed-off-by: Denis Plotnikov 
> Reviewed-by: Raphael Norwitz 

I think there is something wrong with this patch.

I'm debugging an error case, specifically num-queues being larger in
QEMU than in the vhost-user-blk export. Before this patch, it has just
an unfriendly error message:

qemu-system-x86_64: -device 
vhost-user-blk-pci,chardev=vhost1,id=blk1,iommu_platform=off,disable-legacy=on,num-queues=4:
 Unexpected end-of-file before all data were read
qemu-system-x86_64: -device 
vhost-user-blk-pci,chardev=vhost1,id=blk1,iommu_platform=off,disable-legacy=on,num-queues=4:
 Failed to read msg header. Read 0 instead of 12. Original request 24.
qemu-system-x86_64: -device 
vhost-user-blk-pci,chardev=vhost1,id=blk1,iommu_platform=off,disable-legacy=on,num-queues=4:
 vhost-user-blk: get block config failed
qemu-system-x86_64: Failed to set msg fds.
qemu-system-x86_64: vhost VQ 0 ring restore failed: -1: Resource temporarily 
unavailable (11)

After the patch, it crashes:

#0  0x55d0a4bd in vhost_user_read_cb (source=0x568f4690, 
condition=(G_IO_IN | G_IO_HUP), opaque=0x7fffcbf0) at 
../hw/virtio/vhost-user.c:313
#1  0x55d950d3 in qio_channel_fd_source_dispatch 
(source=0x57c3f750, callback=0x55d0a478 , 
user_data=0x7fffcbf0) at ../io/channel-watch.c:84
#2  0x77b32a9f in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#3  0x77b84a98 in g_main_context_iterate.constprop () at 
/lib64/libglib-2.0.so.0
#4  0x77b32163 in g_main_loop_run () at /lib64/libglib-2.0.so.0
#5  0x55d0a724 in vhost_user_read (dev=0x57bc62f8, 
msg=0x7fffcc50) at ../hw/virtio/vhost-user.c:402
#6  0x55d0ee6b in vhost_user_get_config (dev=0x57bc62f8, 
config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost-user.c:2133
#7  0x55d56d46 in vhost_dev_get_config (hdev=0x57bc62f8, 
config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost.c:1566
#8  0x55cdd150 in vhost_user_blk_device_realize (dev=0x57bc60b0, 
errp=0x7fffcf90) at ../hw/block/vhost-user-blk.c:510
#9  0x55d08f6d in virtio_device_realize (dev=0x57bc60b0, 
errp=0x7fffcff0) at ../hw/virtio/virtio.c:3660

The problem is that vhost_user_read_cb() still accesses dev->opaque even
though the device has been cleaned up meanwhile when the connection was
closed (the vhost_user_blk_disconnect() added by this patch), so it's
NULL now. This problem was actually mentioned in the comment that is
removed by this patch.

I tried to fix this by making vhost_user_read() cope with the fact that
the device might have been cleaned up meanwhile, but then I'm running
into the next set of problems.

The first is that retrying is pointless, the error condition is in the
configuration, it will never change.

The other is that after many repetitions of the same error message, I
got a crash where the device is cleaned up a second time in
vhost_dev_init() and the virtqueues are already NULL.

So it seems to me that erroring out during the initialisation phase
makes a lot more sense than retrying.

Kevin

>  hw/block/vhost-user-blk.c | 48 +++
>  1 file changed, 24 insertions(+), 24 deletions(-)
> 

[PATCH-for-6.0] net: tap: fix crash on hotplug

2021-04-21 Thread Cole Robinson
Attempting to hotplug a tap nic with libvirt will crash qemu:

$ sudo virsh attach-interface f32 network default
error: Failed to attach interface
error: Unable to read from monitor: Connection reset by peer

0x55875b7f3a99 in tap_send (opaque=0x55875e39eae0) at ../net/tap.c:206
206 if (!s->nc.peer->do_not_pad) {
gdb$ bt

s->nc.peer may not be set at this point. This seems to be an
expected case, as qemu_send_packet_* explicitly checks for NULL
s->nc.peer later.

Fix it by checking for s->nc.peer here too. Padding is applied if
s->nc.peer is not set.

https://bugzilla.redhat.com/show_bug.cgi?id=1949786
Fixes: 969e50b61a2

Signed-off-by: Cole Robinson 
---
* Or should we skip padding if nc.peer is unset? I didn't dig into it
* tap-win32.c and slirp.c may need a similar fix, but the slirp case
  didn't crash in a simple test.

 net/tap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tap.c b/net/tap.c
index dd42ac6134..937559dbb8 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -203,7 +203,7 @@ static void tap_send(void *opaque)
 size -= s->host_vnet_hdr_len;
 }
 
-if (!s->nc.peer->do_not_pad) {
+if (!s->nc.peer || !s->nc.peer->do_not_pad) {
 if (eth_pad_short_frame(min_pkt, &min_pktsz, buf, size)) {
 buf = min_pkt;
 size = min_pktsz;
-- 
2.31.1




Re: [PATCH v2] i386: Add ratelimit for bus locks acquired in guest

2021-04-21 Thread Xiaoyao Li

On 4/21/2021 11:18 PM, Eduardo Habkost wrote:

On Wed, Apr 21, 2021 at 10:50:10PM +0800, Xiaoyao Li wrote:

On 4/21/2021 10:12 PM, Eduardo Habkost wrote:

On Wed, Apr 21, 2021 at 02:26:42PM +0800, Chenyi Qiang wrote:

Hi, Eduardo, thanks for your comments!


On 4/21/2021 12:34 AM, Eduardo Habkost wrote:

Hello,

Thanks for the patch.  Comments below:

On Tue, Apr 20, 2021 at 05:37:36PM +0800, Chenyi Qiang wrote:

Virtual Machines can exploit bus locks to degrade the performance of the
system. To address this kind of performance DoS attack, bus lock VM exit
is introduced in KVM and it will report the bus locks detected in the guest,
which can help userspace to enforce throttling policies.



Is there anything today that would protect the system from
similar attacks from userspace with access to /dev/kvm?



I can't fully understand your meaning for "similar attack with access to
/dev/kvm". But there are some similar associated detection features on bare
metal.


What I mean is: you say guests can make a performance DoS attack
on the host, and your patch mitigates that.

What would be the available methods to prevent untrusted
userspace running on the host with access to /dev/kvm from making
a similar DoS attack on the host?


Thanks for all the clarifications below.  Considering them,
what's the answer to the question above?


One choice would be enabling BUS LOCK VM exit by default/unconditionally 
in KVM. (Our original attempt was to enable it unconditionally, but 
people wanted it to be opted in by the user.)


Or we can use split_lock_detect=fatal to prevent all split locks in the 
guest. For non-wb locks, all the memory should be cacheable as long as no 
device is passed through to the guest. More aggressively, we can enable 
"non-wb lock disable" for the guest in KVM.






1. Split lock 
detection:https://lore.kernel.org/lkml/158031147976.396.8941798847364718785.tip-bot2@tip-bot2/
Some CPUs can raise an #AC trap when a split lock is attempted.


Would split_lock_detect=fatal be enough to prevent the above attacks?


NO.

There are two types of bus lock:
1. split lock - a lock on cacheable memory where the operand spans two cache
lines.
2. non-wb lock - a lock on non-writeback memory (you can find it in Intel
ISE chapter 8, 
https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html)

split lock detection can only prevent 1)


Is split_lock_detect=fatal the only available way to prevent them?


as above, 2) non-wb lock can be prevented by "non-wb lock disable" feature


Bus lock VM exit applies to both 1 and 2, correct?


yes!







2. Bus lock Debug Exception:
https://lore.kernel.org/lkml/20210322135325.682257-1-fenghua...@intel.com/
The kernel can be notified by an #DB trap after a user instruction acquires
a bus lock and is executed.


I see a rate limit option mentioned at the above URL.  Would a
host kernel bus lock rate limit option make this QEMU patch
redundant?



No. Bus lock Debug exception cannot be used to detect bus locks that happen
in the guest (vmx non-root mode).

We have a patch to virtualize this feature for the guest
https://lore.kernel.org/kvm/20210202090433.13441-1-chenyi.qi...@intel.com/

so that the guest will have its own setting of bus lock debug exception on or off.

What's more important is that, even if we force-set
MSR_DEBUGCTL.BUS_LOCK_DETECT for the guest, the guest can still escape from it,
because bus lock #DB is a trap which is delivered after the instruction
completes. If the instruction that acquires the bus lock subsequently faults,
e.g. with a #PF, then no bus lock #DB is generated. But the bus lock does happen.

But with bus lock VM exit, even if the instruction faults, it will cause a BUS
LOCK VM exit.
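
(For concreteness, the first flavour above, a split lock, is easy to
provoke from user space on x86-64: any LOCK-prefixed read-modify-write
whose operand straddles a cache-line boundary forces the CPU to assert a
bus lock. A minimal sketch, assuming GCC or Clang and C11 on x86-64;
illustrative only, not part of the patch under discussion:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* 128 bytes, aligned to the start of a 64-byte cache line */
        _Alignas(64) static char buf[128];
        /* 4-byte operand placed 2 bytes before the line boundary,
         * so it spans two cache lines (deliberately misaligned) */
        uint32_t *p = (uint32_t *)(buf + 62);

        __asm__ volatile("lock addl $1, %0" : "+m"(*p)); /* split lock */
        printf("value = %u\n", *p);
        return 0;
    }

With split_lock_detect=fatal the host kernel kills such a process with
SIGBUS; the point made above is that the same access performed inside a
guest, or a locked access to non-WB memory, needs bus lock VM exit to be
detected and throttled.)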









Re: firmware selection for SEV-ES

2021-04-21 Thread Tom Lendacky
On 4/21/21 4:54 AM, Laszlo Ersek wrote:
> Hi Brijesh, Tom,

Hi Laszlo,

> 
> in QEMU's "docs/interop/firmware.json", the @FirmwareFeature enumeration
> has a constant called @amd-sev. We should introduce an @amd-sev-es
> constant as well, minimally for the following reason:
> 
> AMD document #56421 ("SEV-ES Guest-Hypervisor Communication Block
> Standardization") revision 1.40 says in "4.6 System Management Mode
> (SMM)" that "SMM will not be supported in this version of the
> specification". This is reflected in OVMF, so an OVMF binary that's
> supposed to run in a SEV-ES guest must be built without "-D
> SMM_REQUIRE". (As a consequence, such a binary should be built also
> without "-D SECURE_BOOT_ENABLE".)
> 
> At the level of "docs/interop/firmware.json", this means that management
> applications should be enabled to look for the @amd-sev-es feature (and
> it also means, for OS distributors, that any firmware descriptor
> exposing @amd-sev-es will currently have to lack all three of:
> @requires-smm, @secure-boot, @enrolled-keys).
> 
> I have three questions:
> 
> 
> (1) According to
> ,
>  SEV-ES is
> explicitly requested in the domain XML via setting bit#2 in the "policy"
> element.
> 
> Can this setting be used by libvirt to look for such a firmware
> descriptor that exposes @amd-sev-es?
> 
> 
> (2) "docs/interop/firmware.json" documents @amd-sev as follows:
> 
> # @amd-sev: The firmware supports running under AMD Secure Encrypted
> #   Virtualization, as specified in the AMD64 Architecture
> #   Programmer's Manual. QEMU command line options related to
> #   this feature are documented in
> #   "docs/amd-memory-encryption.txt".
> 
> Documenting the new @amd-sev-es enum constant with very slight
> customizations for the same text should be possible, I reckon. However,
> "docs/amd-memory-encryption.txt" (nor
> "docs/confidential-guest-support.txt") seem to mention SEV-ES.
> 
> Can you guys propose a patch for "docs/amd-memory-encryption.txt"?

Yes, I can submit a patch to update the documentation.

> 
> I guess that would be next to this snippet:
> 
>> # ${QEMU} \
>>sev-guest,id=sev0,policy=0x1...\
> 
> 
> (3) Is the "AMD64 Architecture Programmer's Manual" the specification
> that we should reference under @amd-sev-es as well (i.e., same as with
> @amd-sev), or is there a more specific document?

Yes, the same specification applies to SEV-ES.

Thanks,
Tom

> 
> Thanks,
> Laszlo
> 
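
(As a concrete illustration of question (1): in the SEV guest policy
encoding, bit 2 (value 0x4) is the bit that requests SEV-ES, so on top of
the policy=0x1 example quoted above an SEV-ES guest would be launched with
something like

    # ${QEMU} \
       -object sev-guest,id=sev0,policy=0x5,... \

where 0x5 = 0x1 (base SEV policy) | 0x4 (SEV-ES). The exact value is an
illustration, not taken from this thread.)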



Re: [PATCH v3 2/3] vhost-user-blk: perform immediate cleanup if disconnect on initialization

2021-04-21 Thread Denis Plotnikov



On 21.04.2021 18:24, Kevin Wolf wrote:

Am 25.03.2021 um 16:12 hat Denis Plotnikov geschrieben:

Commit 4bcad76f4c39 ("vhost-user-blk: delay vhost_user_blk_disconnect")
introduced postponing vhost_dev cleanup aiming to eliminate qemu aborts
because of connection problems with vhost-blk daemon.

However, it introduces a new problem. Now, any communication errors
during execution of vhost_dev_init() called by vhost_user_blk_device_realize()
lead to qemu abort on assert in vhost_dev_get_config().

This happens because vhost_user_blk_disconnect() is postponed but
it should have dropped s->connected flag by the time
vhost_user_blk_device_realize() performs a new connection opening.
On the connection opening, vhost_dev initialization in
vhost_user_blk_connect() relies on the s->connected flag and
if it's not dropped, it skips vhost_dev initialization and returns
with success. Then, vhost_user_blk_device_realize()'s execution flow
goes to vhost_dev_get_config() where it's aborted on the assert.

To fix the problem this patch adds immediate cleanup on device
initialization (in vhost_user_blk_device_realize()) using different
event handlers for initialization and operation introduced in the
previous patch.
On initialization (in vhost_user_blk_device_realize()) we fully
control the initialization process. At that point, nobody can use the
device since it isn't initialized and we don't need to postpone any
cleanups, so we can do cleanup right away when there is a communication
problem with the vhost-blk daemon.
On operation we leave it as is, since the disconnect may happen when
the device is in use, so the device users may want to use vhost_dev's data
to do rollback before vhost_dev is re-initialized (e.g. in vhost_dev_set_log()).

Signed-off-by: Denis Plotnikov 
Reviewed-by: Raphael Norwitz 

I think there is something wrong with this patch.

I'm debugging an error case, specifically num-queues being larger in
QEMU than in the vhost-user-blk export. Before this patch, it has just
an unfriendly error message:

qemu-system-x86_64: -device 
vhost-user-blk-pci,chardev=vhost1,id=blk1,iommu_platform=off,disable-legacy=on,num-queues=4:
 Unexpected end-of-file before all data were read
qemu-system-x86_64: -device 
vhost-user-blk-pci,chardev=vhost1,id=blk1,iommu_platform=off,disable-legacy=on,num-queues=4:
 Failed to read msg header. Read 0 instead of 12. Original request 24.
qemu-system-x86_64: -device 
vhost-user-blk-pci,chardev=vhost1,id=blk1,iommu_platform=off,disable-legacy=on,num-queues=4:
 vhost-user-blk: get block config failed
qemu-system-x86_64: Failed to set msg fds.
qemu-system-x86_64: vhost VQ 0 ring restore failed: -1: Resource temporarily 
unavailable (11)

After the patch, it crashes:

#0  0x55d0a4bd in vhost_user_read_cb (source=0x568f4690, 
condition=(G_IO_IN | G_IO_HUP), opaque=0x7fffcbf0) at 
../hw/virtio/vhost-user.c:313
#1  0x55d950d3 in qio_channel_fd_source_dispatch (source=0x57c3f750, 
callback=0x55d0a478 , user_data=0x7fffcbf0) at 
../io/channel-watch.c:84
#2  0x77b32a9f in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#3  0x77b84a98 in g_main_context_iterate.constprop () at 
/lib64/libglib-2.0.so.0
#4  0x77b32163 in g_main_loop_run () at /lib64/libglib-2.0.so.0
#5  0x55d0a724 in vhost_user_read (dev=0x57bc62f8, 
msg=0x7fffcc50) at ../hw/virtio/vhost-user.c:402
#6  0x55d0ee6b in vhost_user_get_config (dev=0x57bc62f8, 
config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost-user.c:2133
#7  0x55d56d46 in vhost_dev_get_config (hdev=0x57bc62f8, 
config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost.c:1566
#8  0x55cdd150 in vhost_user_blk_device_realize (dev=0x57bc60b0, 
errp=0x7fffcf90) at ../hw/block/vhost-user-blk.c:510
#9  0x55d08f6d in virtio_device_realize (dev=0x57bc60b0, 
errp=0x7fffcff0) at ../hw/virtio/virtio.c:3660

The problem is that vhost_user_read_cb() still accesses dev->opaque even
though the device has been cleaned up meanwhile when the connection was
closed (the vhost_user_blk_disconnect() added by this patch), so it's
NULL now. This problem was actually mentioned in the comment that is
removed by this patch.

I tried to fix this by making vhost_user_read() cope with the fact that
the device might have been cleaned up meanwhile, but then I'm running
into the next set of problems.

The first is that retrying is pointless, the error condition is in the
configuration, it will never change.

The other is that after many repetitions of the same error message, I
got a crash where the device is cleaned up a second time in
vhost_dev_init() and the virtqueues are already NULL.

So it seems to me that erroring out during the initialisation phase
makes a lot more sense than retrying.

Kevin


But without the patch there will be another problem which the patch 
actually addresses.


It seems to me that there is a case when the retrying is useless

Re: [RFC PATCH] tests/tcg: add a multiarch signals test to stress test signal delivery

2021-04-21 Thread Alex Bennée


Alex Bennée  writes:

> This adds a simple signal test that combines the POSIX timer_create
> with signal delivery across multiple threads.
>
> [AJB: So I wrote this in an attempt to flush out issues with the
> s390x-linux-user handling. However I suspect I've done something wrong
> or opened a can of signal handling worms.
>
> Nominally this runs fine on real hardware but I variously get failures
> when running it under translation and while debugging QEMU running the
> test. I've also exposed a shortcoming with the gdb stub when dealing
> with guest TLS data so yay ;-). So I post this as an RFC in case
> anyone else can offer insight or can verify they are seeing the same
> strange behaviour?]
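
(For context, the pattern being exercised, one process-wide POSIX timer
whose signal races several busy threads, boils down to roughly the
following. This is an illustrative sketch, not the actual tests/tcg
source; build with "cc -pthread", plus "-lrt" on older glibc:

    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>
    #include <time.h>

    #define NTHREADS 4

    /* one counter per thread, bumped by whichever thread the kernel
     * picks to run the signal handler */
    static __thread volatile sig_atomic_t alarms;

    static void handler(int sig)
    {
        alarms++;                      /* async-signal-safe */
    }

    static void *worker(void *arg)
    {
        for (volatile long i = 0; i < 100000000; i++) {
            /* spin so this thread is runnable when the signal fires */
        }
        printf("thread%ld: saw %d alarms\n", (long)arg, (int)alarms);
        return NULL;
    }

    int main(void)
    {
        struct sigaction sa = { .sa_handler = handler };
        struct sigevent sev = { .sigev_notify = SIGEV_SIGNAL,
                                .sigev_signo = SIGRTMIN };
        struct itimerspec its = { .it_value.tv_nsec = 1000000,
                                  .it_interval.tv_nsec = 1000000 };
        pthread_t th[NTHREADS];
        timer_t timer;

        sigaction(SIGRTMIN, &sa, NULL);
        timer_create(CLOCK_MONOTONIC, &sev, &timer);
        timer_settime(timer, 0, &its, NULL);   /* fire every 1ms */

        for (long i = 0; i < NTHREADS; i++) {
            pthread_create(&th[i], NULL, worker, (void *)i);
        }
        for (int i = 0; i < NTHREADS; i++) {
            pthread_join(th[i], NULL);
        }
        return 0;
    }

With SIGEV_SIGNAL the signal is directed at the process as a whole, so
which thread actually receives it is the kernel's choice, which is
exactly the behaviour that varies between hosts and QEMU targets below.)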

To further document my confusion:

  gdb --args $QEMU ./tests/tcg/$ARCH/signals

will SEGV in generated code for every target I've run. This seems to be
some sort of change of behaviour by running inside a debug environment.

Architectures that fail running normally:

./qemu-alpha tests/tcg/alpha-linux-user/signals
fish: “./qemu-alpha tests/tcg/alpha-li…” terminated by signal SIGILL (Illegal 
instruction)

./qemu-sparc64 tests/tcg/sparc64-linux-user/signals
thread0: started
thread1: started
thread2: started
thread3: started
thread4: started
thread5: started
thread6: started
thread7: started
thread8: started
thread9: started
thread0: saw 0 alarms from 0
...
(and hangs)

./qemu-s390x ./tests/tcg/s390x-linux-user/signals
fish: “./qemu-s390x ./tests/tcg/s390x-…” terminated by signal SIGSEGV (Address 
boundary error)

./qemu-sh4 ./tests/tcg/sh4-linux-user/signals
thread0: saw 87 alarms from 238
thread1: started
thread1: saw 0 alarms from 331
thread2: started
thread2: saw 0 alarms from 17088
thread3: started
thread3: saw 0 alarms from 17093
thread4: started
thread4: saw 0 alarms from 17098
thread5: started
thread5: saw 2 alarms from 17106
thread6: started
thread6: saw 0 alarms from 17108
thread7: started
thread7: saw 1 alarms from 17114
thread8: started
thread8: saw 0 alarms from 17118
thread9: started
thread9: saw 0 alarms from 17122
qemu: uncaught target signal 11 (Segmentation fault) - core dumped
fish: “./qemu-sh4 ./tests/tcg/sh4-linu…” terminated by signal SIGSEGV (Address 
boundary error)

And another completely random data point: while most arches see most
signals delivered to the main thread, qemu-i386 actually sees quite a few
delivered to the other threads, which is weird because I thought the
signal delivery would be more of a host feature than anything else.

./qemu-i386 ./tests/tcg/i386-linux-user/signals
thread0: started
thread0: saw 134 alarms from 177
thread1: started
thread1: saw 0 alarms from 254
thread2: started
thread2: saw 1 alarms from 300
thread3: started
thread3: saw 1 alarms from 305
thread4: started
thread5: started
thread6: started
thread7: started
thread8: started
thread9: started
thread4: saw 80 alarms from 423
thread5: saw 7 alarms from 525
thread6: saw 4 alarms from 631
thread7: saw 6 alarms from 758
thread8: saw 4 alarms from 822
thread9: saw 635 alarms from 978


-- 
Alex Bennée



[PATCH] pc-bios/s390-ccw/bootmap: Silence compiler warning from Clang

2021-04-21 Thread Thomas Huth
When compiling the s390-ccw bios with Clang, the compiler complains:

 pc-bios/s390-ccw/bootmap.c:302:9: warning: logical not is only applied
  to the left hand side of this comparison [-Wlogical-not-parentheses]
if (!mbr->dev_type == DEV_TYPE_ECKD) {
^  ~~

The code works (more or less by accident), since dev_type can only be
0 or 1, but it's better of course to use the intended != operator here
instead.
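
As a minimal illustration of the precedence trap (the values here are made
up, not taken from bootmap.h):

    #include <stdio.h>

    int main(void)
    {
        int dev_type = 0;   /* hypothetical: not an ECKD device */
        int eckd = 4;       /* hypothetical DEV_TYPE_ECKD value */

        /* ! binds tighter than ==, so this is (!dev_type) == eckd */
        printf("!dev_type == eckd -> %d\n", !dev_type == eckd); /* 0 */
        printf("dev_type != eckd  -> %d\n", dev_type != eckd);  /* 1 */
        return 0;
    }

The two expressions only happen to agree while both operands stay within
{0, 1}, which is how the original check got away with it.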

Fixes: 5dc739f343 ("Allow booting in case the first virtio-blk disk is bad")
Signed-off-by: Thomas Huth 
---
 pc-bios/s390-ccw/bootmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pc-bios/s390-ccw/bootmap.c b/pc-bios/s390-ccw/bootmap.c
index 44df7d16af..e4b2e5a1b0 100644
--- a/pc-bios/s390-ccw/bootmap.c
+++ b/pc-bios/s390-ccw/bootmap.c
@@ -299,7 +299,7 @@ static void ipl_eckd_cdl(void)
 sclp_print("Bad block size in zIPL section of IPL2 record.\n");
 return;
 }
-if (!mbr->dev_type == DEV_TYPE_ECKD) {
+if (mbr->dev_type != DEV_TYPE_ECKD) {
 sclp_print("Non-ECKD device type in zIPL section of IPL2 record.\n");
 return;
 }
-- 
2.27.0




Re: [PATCH-for-6.0] net: tap: fix crash on hotplug

2021-04-21 Thread Philippe Mathieu-Daudé
Cc'ing Bin.

On 4/21/21 5:22 PM, Cole Robinson wrote:
> Attempting to hotplug a tap nic with libvirt will crash qemu:
> 
> $ sudo virsh attach-interface f32 network default
> error: Failed to attach interface
> error: Unable to read from monitor: Connection reset by peer
> 
> 0x55875b7f3a99 in tap_send (opaque=0x55875e39eae0) at ../net/tap.c:206
> 206   if (!s->nc.peer->do_not_pad) {
> gdb$ bt
> 
> s->nc.peer may not be set at this point. This seems to be an
> expected case, as qemu_send_packet_* explicitly checks for NULL
> s->nc.peer later.
> 
> Fix it by checking for s->nc.peer here too. Padding is applied if
> s->nc.peer is not set.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1949786
> Fixes: 969e50b61a2
> 
> Signed-off-by: Cole Robinson 
> ---
> * Or should we skip padding if nc.peer is unset? I didn't dig into it
> * tap-win3.c and slirp.c may need a similar fix, but the slirp case
>   didn't crash in a simple test.
> 
>  net/tap.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/tap.c b/net/tap.c
> index dd42ac6134..937559dbb8 100644
> --- a/net/tap.c
> +++ b/net/tap.c
> @@ -203,7 +203,7 @@ static void tap_send(void *opaque)
>  size -= s->host_vnet_hdr_len;
>  }
>  
> -if (!s->nc.peer->do_not_pad) {
> +if (!s->nc.peer || !s->nc.peer->do_not_pad) {
>  if (eth_pad_short_frame(min_pkt, &min_pktsz, buf, size)) {
>  buf = min_pkt;
>  size = min_pktsz;
> 




Re: [PATCH] pc-bios/s390-ccw/bootmap: Silence compiler warning from Clang

2021-04-21 Thread Christian Borntraeger




On 21.04.21 18:33, Thomas Huth wrote:

When compiling the s390-ccw bios with Clang, the compiler complains:

  pc-bios/s390-ccw/bootmap.c:302:9: warning: logical not is only applied
   to the left hand side of this comparison [-Wlogical-not-parentheses]
 if (!mbr->dev_type == DEV_TYPE_ECKD) {
 ^  ~~

The code works (more or less by accident), since dev_type can only be
0 or 1, but it's better of course to use the intended != operator here
instead.

Fixes: 5dc739f343 ("Allow booting in case the first virtio-blk disk is bad")
Signed-off-by: Thomas Huth 


Reviewed-by: Christian Borntraeger 


---
  pc-bios/s390-ccw/bootmap.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pc-bios/s390-ccw/bootmap.c b/pc-bios/s390-ccw/bootmap.c
index 44df7d16af..e4b2e5a1b0 100644
--- a/pc-bios/s390-ccw/bootmap.c
+++ b/pc-bios/s390-ccw/bootmap.c
@@ -299,7 +299,7 @@ static void ipl_eckd_cdl(void)
  sclp_print("Bad block size in zIPL section of IPL2 record.\n");
  return;
  }
-if (!mbr->dev_type == DEV_TYPE_ECKD) {
+if (mbr->dev_type != DEV_TYPE_ECKD) {
  sclp_print("Non-ECKD device type in zIPL section of IPL2 record.\n");
  return;
  }





Re: [PATCH] pc-bios/s390-ccw/bootmap: Silence compiler warning from Clang

2021-04-21 Thread Philippe Mathieu-Daudé
On 4/21/21 6:33 PM, Thomas Huth wrote:
> When compiling the s390-ccw bios with Clang, the compiler complains:
> 
>  pc-bios/s390-ccw/bootmap.c:302:9: warning: logical not is only applied
>   to the left hand side of this comparison [-Wlogical-not-parentheses]
> if (!mbr->dev_type == DEV_TYPE_ECKD) {
> ^  ~~
> 
> The code works (more or less by accident), since dev_type can only be
> 0 or 1, but it's better of course to use the intended != operator here
> instead.
> 
> Fixes: 5dc739f343 ("Allow booting in case the first virtio-blk disk is bad")
> Signed-off-by: Thomas Huth 
> ---
>  pc-bios/s390-ccw/bootmap.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Philippe Mathieu-Daudé 




[PATCH] hw/arm/smmuv3: Another range invalidation fix

2021-04-21 Thread Eric Auger
6d9cd115b9 ("hw/arm/smmuv3: Enforce invalidation on a power of two range")
failed to completely fix misalignment issues with range
invalidation. For instance, invalidation patterns like "invalidate 32
4kB pages starting from 0xff395000" are not correctly handled, due
to the fact the previous fix only made sure the number of invalidated
pages was a power of 2 but did not properly handle a start
address that is not aligned with the range. This can be noticed when
booting a fedora 33 with protected virtio-blk-pci.

Signed-off-by: Eric Auger 
Fixes: 6d9cd115b9 ("hw/arm/smmuv3: Enforce invalidation on a power of two range")

---

This bug was found with SMMU RIL avocado-qemu acceptance tests
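
To make the intended splitting concrete, the stand-alone sketch below
mimics the new loop for the example from the commit message, using a
plain bit trick in place of QEMU's dma_aligned_pow2_mask() (illustrative
only, not the patched code):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t addr = 0xff395000ULL;            /* example start address */
        uint64_t end  = addr + (32ULL << 12) - 1; /* 32 x 4kB pages */

        while (addr < end + 1) {
            /* largest power-of-two chunk that is aligned on addr and
             * does not run past end */
            uint64_t chunk = addr & -addr;
            while (addr + chunk - 1 > end) {
                chunk >>= 1;
            }
            printf("invalidate %2llu page(s) at 0x%llx\n",
                   (unsigned long long)(chunk >> 12),
                   (unsigned long long)addr);
            addr += chunk;
        }
        return 0;
    }

This yields invalidations of 1, 2, 8, 16, 4 and 1 pages; each range is a
power of two in size and naturally aligned on its start address, matching
what the patch computes via dma_aligned_pow2_mask().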
---
 hw/arm/smmuv3.c | 49 +
 1 file changed, 25 insertions(+), 24 deletions(-)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 8705612535..16f285a566 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -856,43 +856,44 @@ static void smmuv3_inv_notifiers_iova(SMMUState *s, int asid, dma_addr_t iova,
 
 static void smmuv3_s1_range_inval(SMMUState *s, Cmd *cmd)
 {
-uint8_t scale = 0, num = 0, ttl = 0;
-dma_addr_t addr = CMD_ADDR(cmd);
+dma_addr_t end, addr = CMD_ADDR(cmd);
 uint8_t type = CMD_TYPE(cmd);
 uint16_t vmid = CMD_VMID(cmd);
+uint8_t scale = CMD_SCALE(cmd);
+uint8_t num = CMD_NUM(cmd);
+uint8_t ttl = CMD_TTL(cmd);
 bool leaf = CMD_LEAF(cmd);
 uint8_t tg = CMD_TG(cmd);
-uint64_t first_page = 0, last_page;
-uint64_t num_pages = 1;
+uint64_t num_pages;
+uint8_t granule;
 int asid = -1;
 
-if (tg) {
-scale = CMD_SCALE(cmd);
-num = CMD_NUM(cmd);
-ttl = CMD_TTL(cmd);
-num_pages = (num + 1) * BIT_ULL(scale);
-}
-
 if (type == SMMU_CMD_TLBI_NH_VA) {
 asid = CMD_ASID(cmd);
 }
 
+if (!tg) {
+trace_smmuv3_s1_range_inval(vmid, asid, addr, tg, 1, ttl, leaf);
+smmuv3_inv_notifiers_iova(s, asid, addr, tg, 1);
+smmu_iotlb_inv_iova(s, asid, addr, tg, 1, ttl);
+return;
+}
+
+/* RIL in use */
+
+num_pages = (num + 1) * BIT_ULL(scale);
+granule = tg * 2 + 10;
+
 /* Split invalidations into ^2 range invalidations */
-last_page = num_pages - 1;
-while (num_pages) {
-uint8_t granule = tg * 2 + 10;
-uint64_t mask, count;
+end = addr + (num_pages << granule) - 1;
 
-mask = dma_aligned_pow2_mask(first_page, last_page, 64 - granule);
-count = mask + 1;
+while (addr != end + 1) {
+uint64_t mask = dma_aligned_pow2_mask(addr, end, 64);
 
-trace_smmuv3_s1_range_inval(vmid, asid, addr, tg, count, ttl, leaf);
-smmuv3_inv_notifiers_iova(s, asid, addr, tg, count);
-smmu_iotlb_inv_iova(s, asid, addr, tg, count, ttl);
-
-num_pages -= count;
-first_page += count;
-addr += count * BIT_ULL(granule);
+num_pages = (mask + 1) >> granule;
+trace_smmuv3_s1_range_inval(vmid, asid, addr, tg, num_pages, ttl, leaf);
+smmuv3_inv_notifiers_iova(s, asid, addr, tg, num_pages);
+smmu_iotlb_inv_iova(s, asid, addr, tg, num_pages, ttl);
+addr += mask + 1;
 }
 }
 
-- 
2.26.3




Re: [Virtio-fs] [PATCH v2 01/25] DAX: vhost-user: Rework slave return values

2021-04-21 Thread Dr. David Alan Gilbert
* Greg Kurz (gr...@kaod.org) wrote:
> On Wed, 14 Apr 2021 16:51:13 +0100
> "Dr. David Alan Gilbert (git)"  wrote:
> 
> > From: "Dr. David Alan Gilbert" 
> > 
> > All the current slave handlers on the qemu side generate an 'int'
> > return value that's squashed down to a bool (!!ret) and stuffed into
> > a uint64_t (field of a union) to be returned.
> > 
> > Move the uint64_t type back up through the individual handlers so
> > that we can make one actually return a full uint64_t.
> > 
> > Note that the definition in the interop spec says most of these
> > cases are defined as returning 0 on success and non-0 for failure,
> > so it's OK to change from a bool to another non-0.
> > 
> > Vivek:
> > This is needed because upcoming patches in series will add new functions
> > which want to return full error code. Existing functions continue to
> > return true/false so, it should not lead to change of behavior for
> > existing users.
> > 
> > Signed-off-by: Dr. David Alan Gilbert 
> > ---
> 
> LGTM
> 
> Just an indentation nit...
> 
> >  hw/virtio/vhost-backend.c |  6 +++---
> >  hw/virtio/vhost-user.c| 31 ---
> >  include/hw/virtio/vhost-backend.h |  2 +-
> >  3 files changed, 20 insertions(+), 19 deletions(-)
> > 
> > diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
> > index 31b33bde37..1686c94767 100644
> > --- a/hw/virtio/vhost-backend.c
> > +++ b/hw/virtio/vhost-backend.c
> > @@ -401,8 +401,8 @@ int vhost_backend_invalidate_device_iotlb(struct 
> > vhost_dev *dev,
> >  return -ENODEV;
> >  }
> >  
> > -int vhost_backend_handle_iotlb_msg(struct vhost_dev *dev,
> > -  struct vhost_iotlb_msg *imsg)
> > +uint64_t vhost_backend_handle_iotlb_msg(struct vhost_dev *dev,
> > +struct vhost_iotlb_msg *imsg)
> >  {
> >  int ret = 0;
> >  
> > @@ -429,5 +429,5 @@ int vhost_backend_handle_iotlb_msg(struct vhost_dev 
> > *dev,
> >  break;
> >  }
> >  
> > -return ret;
> > +return !!ret;
> >  }
> > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > index ded0c10453..3e4a25e108 100644
> > --- a/hw/virtio/vhost-user.c
> > +++ b/hw/virtio/vhost-user.c
> > @@ -1405,24 +1405,25 @@ static int vhost_user_reset_device(struct vhost_dev 
> > *dev)
> >  return 0;
> >  }
> >  
> > -static int vhost_user_slave_handle_config_change(struct vhost_dev *dev)
> > +static uint64_t vhost_user_slave_handle_config_change(struct vhost_dev 
> > *dev)
> >  {
> >  int ret = -1;
> >  
> >  if (!dev->config_ops) {
> > -return -1;
> > +return true;
> >  }
> >  
> >  if (dev->config_ops->vhost_dev_config_notifier) {
> >  ret = dev->config_ops->vhost_dev_config_notifier(dev);
> >  }
> >  
> > -return ret;
> > +return !!ret;
> >  }
> >  
> > -static int vhost_user_slave_handle_vring_host_notifier(struct vhost_dev 
> > *dev,
> > -   VhostUserVringArea 
> > *area,
> > -   int fd)
> > +static uint64_t vhost_user_slave_handle_vring_host_notifier(
> > +struct vhost_dev *dev,
> > +   VhostUserVringArea *area,
> > +   int fd)
> 
> ... here.

I've reworked that to:
+static uint64_t vhost_user_slave_handle_vring_host_notifier(
+struct vhost_dev *dev,
+VhostUserVringArea *area,
+int fd)

the function name is getting too hideously long to actually put the
arguments in line with the bracket.

> Anyway,
> 
> Reviewed-by: Greg Kurz 

Thanks.

Dave

> 
> >  {
> >  int queue_idx = area->u64 & VHOST_USER_VRING_IDX_MASK;
> >  size_t page_size = qemu_real_host_page_size;
> > @@ -1436,7 +1437,7 @@ static int 
> > vhost_user_slave_handle_vring_host_notifier(struct vhost_dev *dev,
> >  if (!virtio_has_feature(dev->protocol_features,
> >  VHOST_USER_PROTOCOL_F_HOST_NOTIFIER) ||
> >  vdev == NULL || queue_idx >= virtio_get_num_queues(vdev)) {
> > -return -1;
> > +return true;
> >  }
> >  
> >  n = &user->notifier[queue_idx];
> > @@ -1449,18 +1450,18 @@ static int 
> > vhost_user_slave_handle_vring_host_notifier(struct vhost_dev *dev,
> >  }
> >  
> >  if (area->u64 & VHOST_USER_VRING_NOFD_MASK) {
> > -return 0;
> > +return false;
> >  }
> >  
> >  /* Sanity check. */
> >  if (area->size != page_size) {
> > -return -1;
> > +return true;
> >  }
> >  
> >  addr = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
> >  fd, area->offset);
> >  if (addr == MAP_FAILED) {
> > -return -1;
> > +return true;
> >  }
> >  
> >  name = g_strdup_printf("vhost-user/host-notifier@%p mmaps[%d]",
> > @@ -1471,13 +1472,13 @@ static int 
> > vhost_user_slave_handle_vring_host_notif

[PATCH v2 0/2] avocado-qemu: New SMMUv3 tests

2021-04-21 Thread Eric Auger
This series adds SMMU functional tests using Fedora cloud-init
images. Compared to v1, guests with and without RIL
(range invalidation support) are tested (resp. fedora 33 and 31).
For each, we test the protection of virtio-net-pci and
virtio-block-pci devices. Also strict=no and passthrough
modes are tested. So there is a total of 6 tests.

Note this allowed us to identify yet another RIL issue:
[PATCH] hw/arm/smmuv3: Another range invalidation fix

This small series applies on top of Cleber's series:
- [PATCH 0/3] Acceptance Tests: support choosing specific
  distro and version
- [PATCH v3 00/11] Acceptance Test: introduce base class for
  Linux based tests.

Special thanks to Cleber for his support and for the series
this patch set depends on.

Best Regards

Eric

The series, its dependencies and the SMMU fix can be found at
https://github.com/eauger/qemu/tree/smmu_acceptance_v2


Eric Auger (2):
  Acceptance Tests: Add default kernel params and pxeboot url to the
KNOWN_DISTROS collection
  avocado_qemu: Add SMMUv3 tests

 tests/acceptance/avocado_qemu/__init__.py |  46 +++-
 tests/acceptance/smmu.py  | 133 ++
 2 files changed, 175 insertions(+), 4 deletions(-)
 create mode 100644 tests/acceptance/smmu.py
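
(Once the series is applied on top of its dependencies, the new tests can
be selected by tag, e.g. "avocado run --filter-by-tags=smmu
tests/acceptance/smmu.py", or narrowed down with the per-group tags such
as smmu_noril_tests / smmu_ril_tests, assuming an avocado setup as
described in docs/devel/testing.rst.)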

-- 
2.26.3




[PATCH v2 2/2] avocado_qemu: Add SMMUv3 tests

2021-04-21 Thread Eric Auger
Add new tests checking the good behavior of the SMMUv3 protecting
2 virtio pci devices (block and net). We check the guest boots and
we are able to install a package. Different guest configs are tested:
standard, passthrough and strict=0. This is tested with both fedora 31 and
33. The former uses a 5.3 kernel without range invalidation whereas the
latter uses a 5.8 kernel that features range invalidation.

Signed-off-by: Eric Auger 

---

v1 -> v2:
- removed ssh import
- combined add_command_args() and common_vm_setup()
- moved tags in class' docstring and added tags=arch:aarch64
- use self.get_default_kernel_params()
- added RIL tests with fed33 + introduce new tags
---
 tests/acceptance/smmu.py | 133 +++
 1 file changed, 133 insertions(+)
 create mode 100644 tests/acceptance/smmu.py

diff --git a/tests/acceptance/smmu.py b/tests/acceptance/smmu.py
new file mode 100644
index 00..bcb5416a56
--- /dev/null
+++ b/tests/acceptance/smmu.py
@@ -0,0 +1,133 @@
+# SMMUv3 Functional tests
+#
+# Copyright (c) 2021 Red Hat, Inc.
+#
+# Author:
+#  Eric Auger 
+#
+# This work is licensed under the terms of the GNU GPL, version 2 or
+# later.  See the COPYING file in the top-level directory.
+
+import os
+
+from avocado_qemu import LinuxTest, BUILD_DIR
+
+class SMMU(LinuxTest):
+"""
+:avocado: tags=accel:kvm
+:avocado: tags=cpu:host
+:avocado: tags=arch:aarch64
+:avocado: tags=smmu
+"""
+
+IOMMU_ADDON = ',iommu_platform=on,disable-modern=off,disable-legacy=on'
+kernel_path = None
+initrd_path = None
+kernel_params = None
+
+def set_up_boot(self):
+path = self.download_boot()
+self.vm.add_args('-device', 'virtio-blk-pci,bus=pcie.0,scsi=off,' +
+ 'drive=drv0,id=virtio-disk0,bootindex=1,'
+ 'werror=stop,rerror=stop' + self.IOMMU_ADDON)
+self.vm.add_args('-drive',
+ 'file=%s,if=none,cache=writethrough,id=drv0' % path)
+
+def setUp(self):
+super(SMMU, self).setUp(None, 'virtio-net-pci' + self.IOMMU_ADDON)
+
+def common_vm_setup(self, custom_kernel=None):
+self.require_accelerator("kvm")
+self.vm.add_args("-machine", "virt")
+self.vm.add_args("-accel", "kvm")
+self.vm.add_args("-cpu", "host")
+self.vm.add_args("-smp", "8")
+self.vm.add_args("-m", "4096")
+self.vm.add_args("-machine", "iommu=smmuv3")
+self.vm.add_args("-d", "guest_errors")
+self.vm.add_args('-bios', os.path.join(BUILD_DIR, 'pc-bios',
+ 'edk2-aarch64-code.fd'))
+self.vm.add_args('-device', 'virtio-rng-pci,rng=rng0')
+self.vm.add_args('-object',
+ 'rng-random,id=rng0,filename=/dev/urandom')
+
+if custom_kernel is None:
+return
+
+kernel_url = self.get_pxeboot_url() + 'vmlinuz'
+initrd_url = self.get_pxeboot_url() + 'initrd.img'
+self.kernel_path = self.fetch_asset(kernel_url)
+self.initrd_path = self.fetch_asset(initrd_url)
+
+def run_and_check(self):
+if self.kernel_path:
+self.vm.add_args('-kernel', self.kernel_path,
+ '-append', self.kernel_params,
+ '-initrd', self.initrd_path)
+self.launch_and_wait()
+self.ssh_command('cat /proc/cmdline')
+self.ssh_command('dnf -y install numactl-devel')
+
+
+# 5.3 kernel without RIL #
+
+def test_smmu_noril(self):
+"""
+:avocado: tags=smmu_noril
+:avocado: tags=smmu_noril_tests
+:avocado: tags=distro_version:31
+"""
+self.common_vm_setup()
+self.run_and_check()
+
+def test_smmu_noril_passthrough(self):
+"""
+:avocado: tags=smmu_noril_passthrough
+:avocado: tags=smmu_noril_tests
+:avocado: tags=distro_version:31
+"""
+self.common_vm_setup(True)
+self.kernel_params = self.get_default_kernel_params() + ' iommu.passthrough=on'
+self.run_and_check()
+
+def test_smmu_noril_nostrict(self):
+"""
+:avocado: tags=smmu_noril_nostrict
+:avocado: tags=smmu_noril_tests
+:avocado: tags=distro_version:31
+"""
+self.common_vm_setup(True)
+self.kernel_params = self.get_default_kernel_params() + ' iommu.strict=0'
+self.run_and_check()
+
+# 5.8 kernel featuring range invalidation
+# >= v5.7 kernel
+
+def test_smmu_ril(self):
+"""
+:avocado: tags=smmu_ril
+:avocado: tags=smmu_ril_tests
+:avocado: tags=distro_version:33
+"""
+self.common_vm_setup()
+self.run_and_check()
+
+def test_smmu_ril_passthrough(self):
+"""
+:avocado: tags=smmu_ril_passthrough
+:avocado: tags=smmu_ril_tests
+:avocado: tags=distro_version:33
+"""
+self.common_vm_setup(True)
+

  1   2   3   4   >