Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes

2022-12-07 Thread Yuan Yao
On Fri, Dec 02, 2022 at 02:13:44PM +0800, Chao Peng wrote:
> Unmap the existing guest mappings when a memory attribute is changed
> between shared and private. This is needed because shared pages and
> private pages come from different backends; unmapping the existing
> mappings gives the page fault handler a chance to re-populate them
> according to the new attribute.
>
> Only architectures with private memory support need this, and such an
> architecture is expected to override the weak
> kvm_arch_has_private_mem().
>
> Also, while the attribute is being changed and the mappings are being
> unmapped, a page fault may hit the same memory range and leave the page
> in an incorrect state. Invoke the kvm_mmu_invalidate_* helpers so that
> the page fault handler retries during this window.
>
> Signed-off-by: Chao Peng 
> ---
>  include/linux/kvm_host.h |   7 +-
>  virt/kvm/kvm_main.c  | 168 ++-
>  2 files changed, 116 insertions(+), 59 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3d69484d2704..3331c0c92838 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t 
> cr2_or_gpa,
>  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #endif
>
> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
>  struct kvm_gfn_range {
>   struct kvm_memory_slot *slot;
>   gfn_t start;
> @@ -264,6 +263,8 @@ struct kvm_gfn_range {
>   bool may_block;
>  };
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> +
> +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> @@ -785,11 +786,12 @@ struct kvm {
>
>  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>   struct mmu_notifier mmu_notifier;
> +#endif
>   unsigned long mmu_invalidate_seq;
>   long mmu_invalidate_in_progress;
>   gfn_t mmu_invalidate_range_start;
>   gfn_t mmu_invalidate_range_end;
> -#endif
> +
>   struct list_head devices;
>   u64 manual_dirty_log_protect;
>   struct dentry *debugfs_dentry;
> @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu 
> *vcpu);
>  int kvm_arch_post_init_vm(struct kvm *kvm);
>  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
>  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_has_private_mem(struct kvm *kvm);
>
>  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
>  /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ad55dfbc75d7..4e1e1e113bf0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
>  }
>  EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
>
> +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> +{
> + /*
> +  * The count increase must become visible at unlock time as no
> +  * spte can be established without taking the mmu_lock and
> +  * count is also read inside the mmu_lock critical section.
> +  */
> + kvm->mmu_invalidate_in_progress++;
> +
> + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> + kvm->mmu_invalidate_range_start = INVALID_GPA;
> + kvm->mmu_invalidate_range_end = INVALID_GPA;
> + }
> +}
> +
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> +
> + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> + kvm->mmu_invalidate_range_start = start;
> + kvm->mmu_invalidate_range_end = end;
> + } else {
> + /*
> +  * Fully tracking multiple concurrent ranges has diminishing
> +  * returns. Keep things simple and just find the minimal range
> +  * which includes the current and new ranges. As there won't be
> +  * enough information to subtract a range after its invalidate
> +  * completes, any ranges invalidated concurrently will
> +  * accumulate and persist until all outstanding invalidates
> +  * complete.
> +  */
> + kvm->mmu_invalidate_range_start =
> + min(kvm->mmu_invalidate_range_start, start);
> + kvm->mmu_invalidate_range_end =
> + max(kvm->mmu_invalidate_range_end, end);
> + }
> +}
> +
> +void kvm_mmu_invalidate_end(struct kvm *kvm)
> +{
> + /*
> +  * This sequence increase will notify the kvm page fault that
> +  * the page that is going to be mapped in the spte could have
> +  * been freed.
> +  */
> + kvm->mmu_invalidate_seq++;
> + smp_wmb();
> + /*
> +  * The above sequence increase must be visible before the
> +  * below count decrease, which is ensured by the smp_wmb above
> +
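
For illustration only, the interplay between the invalidate helpers quoted above and a retrying page fault handler can be modelled in user space. This is a minimal sketch, not the kernel code: the struct, the `model_*` names and `MODEL_INVALID_GFN` are hypothetical, the fault-side check is not part of the quoted excerpt and is an assumption about how a consumer would use the counters, and the memory barriers and mmu_lock that the real code depends on are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; field names mirror the quoted patch. */
struct inval_model {
    unsigned long seq;      /* bumped each time an invalidation ends */
    long in_progress;       /* number of invalidations in flight */
    uint64_t range_start;   /* covering range of in-flight invalidations */
    uint64_t range_end;
};

#define MODEL_INVALID_GFN UINT64_MAX

static void model_invalidate_begin(struct inval_model *m)
{
    m->in_progress++;
    if (m->in_progress == 1) {
        m->range_start = MODEL_INVALID_GFN;
        m->range_end = MODEL_INVALID_GFN;
    }
}

static void model_invalidate_range_add(struct inval_model *m,
                                       uint64_t start, uint64_t end)
{
    assert(m->in_progress > 0);
    if (m->in_progress == 1) {
        m->range_start = start;
        m->range_end = end;
    } else {
        /* Concurrent invalidations: widen to the minimal covering range. */
        if (start < m->range_start)
            m->range_start = start;
        if (end > m->range_end)
            m->range_end = end;
    }
}

static void model_invalidate_end(struct inval_model *m)
{
    m->seq++;               /* the kernel orders this with smp_wmb() */
    m->in_progress--;
}

/*
 * Fault side (assumed usage): snapshot seq before resolving the page,
 * then retry if an invalidation is running or has completed since.
 */
static bool model_fault_must_retry(const struct inval_model *m,
                                   unsigned long seq_snapshot)
{
    if (m->in_progress)
        return true;
    return m->seq != seq_snapshot;
}
```

A fault that took its snapshot before `model_invalidate_begin()` keeps retrying until every outstanding invalidation has ended and it has taken a fresh snapshot, which is the window the commit message describes.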

[PATCH] configure: Fix check-tcg not executing any tests

2022-12-07 Thread Mukilan Thiyagarajan
After configuring with --target-list=hexagon-linux-user
running `make check-tcg` just prints the following:

```
make: Nothing to be done for 'check-tcg'
```

In the probe_target_compiler function, the 'break' command is
used incorrectly: no loop lexically encloses it, and POSIX leaves
the behaviour of such a 'break' unspecified.

The dash shell implementation aborts the loop that is currently
executing in the caller. Here that skips the rest of the loop
body at line 2490, so no Makefiles are generated for the tcg
target tests.

Fixes: c3b570b5a9a24d25 (configure: don't enable
cross compilers unless in target_list)

Signed-off-by: Mukilan Thiyagarajan 
---
 configure | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/configure b/configure
index 26c7bc5154..7a804fb657 100755
--- a/configure
+++ b/configure
@@ -1881,9 +1881,7 @@ probe_target_compiler() {
   # We shall skip configuring the target compiler if the user didn't
   # bother enabling an appropriate guest. This avoids building
   # extraneous firmware images and tests.
-  if test "${target_list#*$1}" != "$1"; then
-  break;
-  else
+  if test "${target_list#*$1}" = "$1"; then
   return 1
   fi
 
-- 
2.17.1




Re: [PATCH 09/15] hw/riscv: microchip_pfsoc: Fix the number of interrupt sources of PLIC

2022-12-07 Thread Conor Dooley
On Thu, Dec 01, 2022 at 10:08:05PM +0800, Bin Meng wrote:
> Per chapter 6.5.2 in [1], the number of interrupt sources including
> interrupt source 0 should be 187.
> 
> [1] PolarFire SoC MSS TRM:
> https://ww1.microchip.com/downloads/aemDocuments/documents/FPGA/ProductDocuments/ReferenceManuals/PolarFire_SoC_FPGA_MSS_Technical_Reference_Manual_VC.pdf

Reviewed-by: Conor Dooley 
Thanks!

> 
> Fixes: 56f6e31e7b7e ("hw/riscv: Initial support for Microchip PolarFire SoC 
> Icicle Kit board")
> Signed-off-by: Bin Meng 
> ---
> 
>  include/hw/riscv/microchip_pfsoc.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/hw/riscv/microchip_pfsoc.h 
> b/include/hw/riscv/microchip_pfsoc.h
> index a757b240e0..9720bac2d5 100644
> --- a/include/hw/riscv/microchip_pfsoc.h
> +++ b/include/hw/riscv/microchip_pfsoc.h
> @@ -150,7 +150,7 @@ enum {
>  #define MICROCHIP_PFSOC_MANAGEMENT_CPU_COUNT1
>  #define MICROCHIP_PFSOC_COMPUTE_CPU_COUNT   4
>  
> -#define MICROCHIP_PFSOC_PLIC_NUM_SOURCES185
> +#define MICROCHIP_PFSOC_PLIC_NUM_SOURCES187
>  #define MICROCHIP_PFSOC_PLIC_NUM_PRIORITIES 7
>  #define MICROCHIP_PFSOC_PLIC_PRIORITY_BASE  0x04
>  #define MICROCHIP_PFSOC_PLIC_PENDING_BASE   0x1000
> -- 
> 2.34.1
> 
> 
> 




Re: [PATCH] blockdev: add 'media=cdrom' argument to support usb cdrom emulated as cdrom

2022-12-07 Thread Paolo Bonzini
It should be like this:

-device usb-bot,id=bot0
-device scsi-{cd,hd},bus=bot0.0,drive=drive0

Libvirt has the code to generate the options for SCSI controllers, but
usb-bot only allows one disk attached to it so it's easier to make it a
 element.

Paolo

On Sat, 3 Dec 2022, 13:52 Zhipeng Lu  wrote:

> Could you give the detail qemu cmdline about usb-bot?
>
> On 2022/12/2 17:40, Paolo Bonzini wrote:
> > On 12/2/22 03:26, Zhipeng Lu wrote:
> >> NAME          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
> >> sda             8:0    0  100M  1 disk
> >> vda           252:0    0   10G  0 disk
> >> ├─vda1        252:1    0    1G  0 part /boot
> >> └─vda2        252:2    0    9G  0 part
> >>   ├─rhel-root 253:0    0    8G  0 lvm  /
> >>   └─rhel-swap 253:1    0    1G  0 lvm  [SWAP]
> >> lshw -short|grep cdrom -i
> >> No cdrom.
> >>
> >> My patch is to solve this problem, usb cdrom emulated as cdrom.
> >
> > This is a libvirt bug, it should use usb-bot instead of usb-storage
> > together with -blockdev.  Then it can add a scsi-cd device below usb-bot.
> >
> > Paolo
> >
> >>
> >>
> >> On 2022/12/1 23:35, Markus Armbruster wrote:
> >>> luzhipeng  writes:
> >>>
> >>>> From: zhipeng Lu 
> >>>>
> >>>> The drive interface supports media=cdrom so that the usb cdrom
> >>>> can be emulated as cdrom in qemu, but libvirt deprived the drive
> >>>> interface, so media=cdrom is added to the blockdev interface to
> >>>> support usb cdrom emulated as cdrom
> >>>>
> >>>> Signed-off-by: zhipeng Lu 
> >>>
> >>> What problem are you trying to solve?
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >
> >
> >
>
>
>


Re: [RFC PATCH for 8.0 10/13] virtio-net: Migrate vhost inflight descriptors

2022-12-07 Thread Eugenio Perez Martin
On Mon, Dec 5, 2022 at 9:52 PM Parav Pandit  wrote:
>
>
> > From: Eugenio Pérez 
> > Sent: Monday, December 5, 2022 12:05 PM
> >
> > There is currently no data to be migrated, since nothing populates or reads
> > the fields on virtio-net.
> >
> > The migration of in-flight descriptors is modelled after the migration of
> > requests in virtio-blk. With some differences:
> > * virtio-blk migrates queue number on each request. Here we only add a
> >   vq if it has descriptors to migrate, and then we make all descriptors
> >   in an array.
> > * Use of QTAILQ since it works similar to signal the end of the inflight
> >   descriptors: 1 for more data, 0 if end. But do it for each vq instead
> >   of for each descriptor.
> > * Usage of VMState macros.
> >
> > The fields of descriptors would be way more complicated if we use the
> > VirtQueueElements directly, since there would be a few levels of
> > indirections. Using VirtQueueElementOld for the moment, and migrate to
> > VirtQueueElement for the final patch.
> >
> > TODO: Proper migration versioning
> > TODO: Do not embed vhost-vdpa structs
> > TODO: Migrate the VirtQueueElement, not VirtQueueElementOld.
> >
> > Signed-off-by: Eugenio Pérez 
> > ---
> >  include/hw/virtio/virtio-net.h |   2 +
> >  include/migration/vmstate.h|  11 +++
> >  hw/net/virtio-net.c| 129 +
> >  3 files changed, 142 insertions(+)
> >
> > diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
> > index ef234ffe7e..ae7c017ef0 100644
> > --- a/include/hw/virtio/virtio-net.h
> > +++ b/include/hw/virtio/virtio-net.h
> > @@ -151,9 +151,11 @@ typedef struct VirtIONetQueue {
> >  QEMUTimer *tx_timer;
> >  QEMUBH *tx_bh;
> >  uint32_t tx_waiting;
> > +uint32_t tx_inflight_num, rx_inflight_num;
> >  struct {
> >  VirtQueueElement *elem;
> >  } async_tx;
> > +VirtQueueElement **tx_inflight, **rx_inflight;
> >  struct VirtIONet *n;
> >  } VirtIONetQueue;
> >
> > diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> > index 9726d2d09e..9e0dfef9ee 100644
> > --- a/include/migration/vmstate.h
> > +++ b/include/migration/vmstate.h
> > @@ -626,6 +626,17 @@ extern const VMStateInfo vmstate_info_qlist;
> >  .offset = vmstate_offset_varray(_state, _field, _type),  \
> >  }
> >
> > +#define VMSTATE_STRUCT_VARRAY_ALLOC_UINT16(_field, _state, _field_num,       \
> > +                                           _version, _vmsd, _type) {         \
> > +    .name       = (stringify(_field)),                                       \
> > +    .version_id = (_version),                                                \
> > +    .vmsd       = &(_vmsd),                                                  \
> > +    .num_offset = vmstate_offset_value(_state, _field_num, uint16_t),        \
> > +    .size       = sizeof(_type),                                             \
> > +    .flags      = VMS_STRUCT | VMS_VARRAY_UINT16 | VMS_ALLOC | VMS_POINTER,  \
> > +    .offset     = vmstate_offset_pointer(_state, _field, _type),             \
> > +}
> > +
> >  #define VMSTATE_STRUCT_VARRAY_ALLOC(_field, _state, _field_num,
> > _version, _vmsd, _type) {\
> >  .name   = (stringify(_field)),   \
> >  .version_id = (_version),\
> > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > index aba12759d5..ffd7bf1fc7 100644
> > --- a/hw/net/virtio-net.c
> > +++ b/hw/net/virtio-net.c
> > @@ -3077,6 +3077,13 @@ static bool mac_table_doesnt_fit(void *opaque, int version_id)
> >  return !mac_table_fits(opaque, version_id);
> >  }
> >
> > +typedef struct VirtIONetInflightQueue {
> > +uint16_t idx;
> > +uint16_t num;
> > +QTAILQ_ENTRY(VirtIONetInflightQueue) entry;
> > +VirtQueueElementOld *elems;
> > +} VirtIONetInflightQueue;
> > +
> >  /* This temporary type is shared by all the WITH_TMP methods
> >   * although only some fields are used by each.
> >   */
> > @@ -3086,6 +3093,7 @@ struct VirtIONetMigTmp {
> >  uint16_tcurr_queue_pairs_1;
> >  uint8_t has_ufo;
> >  uint32_thas_vnet_hdr;
> > +QTAILQ_HEAD(, VirtIONetInflightQueue) queues_inflight;
> >  };
> >
> >  /* The 2nd and subsequent tx_waiting flags are loaded later than
> > @@ -3231,6 +3239,124 @@ static const VMStateDescription vmstate_virtio_net_rss = {
> >  },
> >  };
> >
> > +static const VMStateDescription vmstate_virtio_net_inflight_queue = {
> > +.name  = "virtio-net-device/inflight/queue",
> > +.fields = (VMStateField[]) {
> > +VMSTATE_UINT16(idx, VirtIONetInflightQueue),
> > +VMSTATE_UINT16(num, VirtIONetInflightQueue),
> > +
> > +VMSTATE_STRUCT_VARRAY_ALLOC_UINT16(elems,
> > VirtIONetInflightQueue, num,
> > +   0, 
> > vmstate_virtqueue_element_old,
> > +   
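
The "1 for more data, 0 if end" framing described in the commit message, applied once per virtqueue, can be sketched as a small stand-alone encoder and decoder. This is only an illustration under stated assumptions: `struct inflight_queue`, `encode_inflight()` and `decode_inflight_total()` are hypothetical names, each descriptor is reduced to a 16-bit head index, and the byte layout is not claimed to match the stream the patch produces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one virtqueue's in-flight state. */
struct inflight_queue {
    uint16_t idx;               /* virtqueue index */
    uint16_t num;               /* number of in-flight descriptors */
    const uint16_t *heads;      /* simplified: descriptor head indices only */
};

static size_t put_u16(uint8_t *buf, size_t pos, uint16_t v)
{
    buf[pos] = (uint8_t)(v >> 8);
    buf[pos + 1] = (uint8_t)(v & 0xff);
    return pos + 2;
}

static uint16_t get_u16(const uint8_t *buf, size_t pos)
{
    return (uint16_t)((buf[pos] << 8) | buf[pos + 1]);
}

/* Writes only queues that have something in flight; returns bytes used. */
static size_t encode_inflight(const struct inflight_queue *qs, size_t nqs,
                              uint8_t *buf, size_t cap)
{
    size_t pos = 0;

    for (size_t i = 0; i < nqs; i++) {
        if (qs[i].num == 0)
            continue;                       /* nothing to migrate */
        /* Record (5 + 2 * num bytes) plus room for the end marker. */
        assert(pos + 5 + 2 * (size_t)qs[i].num < cap);
        buf[pos++] = 1;                     /* marker: a queue follows */
        pos = put_u16(buf, pos, qs[i].idx);
        pos = put_u16(buf, pos, qs[i].num);
        for (uint16_t j = 0; j < qs[i].num; j++)
            pos = put_u16(buf, pos, qs[i].heads[j]);
    }
    assert(pos < cap);
    buf[pos++] = 0;                         /* marker: end of list */
    return pos;
}

/*
 * Walks a well-formed stream and returns the total number of in-flight
 * descriptors; *nqueues receives the number of queue records seen.
 */
static size_t decode_inflight_total(const uint8_t *buf, size_t len,
                                    size_t *nqueues)
{
    size_t pos = 0, total = 0;

    *nqueues = 0;
    while (pos < len && buf[pos++] == 1) {
        uint16_t num = get_u16(buf, pos + 2);   /* skip the idx field */
        pos += 4 + 2 * (size_t)num;
        total += num;
        (*nqueues)++;
    }
    return total;
}
```

Putting the marker in front of each queue rather than each descriptor means an idle virtqueue costs nothing in the stream, and the receiver learns the descriptor count from `num` before allocating the array.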

Re: [PATCH for-8.0] hw/rtc/mc146818rtc: Make this rtc device target independent

2022-12-07 Thread Thomas Huth

On 07/12/2022 00.38, Bernhard Beschow wrote:



On 6 December 2022 20:06:41 UTC, Thomas Huth wrote:

The only code that is really, really target dependent is the apic-related
code in rtc_policy_slew_deliver_irq(). By moving this code into the hw/i386/
folder (renamed to rtc_apic_policy_slew_deliver_irq()) and passing this
function as parameter to mc146818_rtc_init(), we can make the RTC completely
target-independent.

Signed-off-by: Thomas Huth 
---
include/hw/rtc/mc146818rtc.h |  7 +--
hw/alpha/dp264.c |  2 +-
hw/hppa/machine.c|  2 +-
hw/i386/microvm.c|  3 ++-
hw/i386/pc.c | 10 +-
hw/mips/jazz.c   |  2 +-
hw/ppc/pnv.c |  2 +-
hw/rtc/mc146818rtc.c | 34 +++---
hw/rtc/meson.build   |  3 +--
9 files changed, 32 insertions(+), 33 deletions(-)

diff --git a/include/hw/rtc/mc146818rtc.h b/include/hw/rtc/mc146818rtc.h
index 1db0fcee92..c687953cc4 100644
--- a/include/hw/rtc/mc146818rtc.h
+++ b/include/hw/rtc/mc146818rtc.h
@@ -46,14 +46,17 @@ struct RTCState {
 Notifier clock_reset_notifier;
 LostTickPolicy lost_tick_policy;
 Notifier suspend_notifier;
+bool (*policy_slew_deliver_irq)(RTCState *s);
 QLIST_ENTRY(RTCState) link;
};

#define RTC_ISA_IRQ 8

-ISADevice *mc146818_rtc_init(ISABus *bus, int base_year,
- qemu_irq intercept_irq);
+ISADevice *mc146818_rtc_init(ISABus *bus, int base_year, qemu_irq 
intercept_irq,
+ bool (*policy_slew_deliver_irq)(RTCState *s));
void rtc_set_memory(ISADevice *dev, int addr, int val);
int rtc_get_memory(ISADevice *dev, int addr);
+bool rtc_apic_policy_slew_deliver_irq(RTCState *s);
+void qmp_rtc_reset_reinjection(Error **errp);

#endif /* HW_RTC_MC146818RTC_H */
diff --git a/hw/alpha/dp264.c b/hw/alpha/dp264.c
index c502c8c62a..8723942b52 100644
--- a/hw/alpha/dp264.c
+++ b/hw/alpha/dp264.c
@@ -118,7 +118,7 @@ static void clipper_init(MachineState *machine)
 qdev_connect_gpio_out(i82378_dev, 0, isa_irq);

 /* Since we have an SRM-compatible PALcode, use the SRM epoch.  */
-mc146818_rtc_init(isa_bus, 1900, rtc_irq);
+mc146818_rtc_init(isa_bus, 1900, rtc_irq, NULL);

 /* VGA setup.  Don't bother loading the bios.  */
 pci_vga_init(pci_bus);
diff --git a/hw/hppa/machine.c b/hw/hppa/machine.c
index de1cc7ab71..311031714a 100644
--- a/hw/hppa/machine.c
+++ b/hw/hppa/machine.c
@@ -232,7 +232,7 @@ static void machine_hppa_init(MachineState *machine)
 assert(isa_bus);

 /* Realtime clock, used by firmware for PDC_TOD call. */
-mc146818_rtc_init(isa_bus, 2000, NULL);
+mc146818_rtc_init(isa_bus, 2000, NULL, NULL);

 /* Serial ports: Lasi and Dino use a 7.272727 MHz clock. */
 serial_mm_init(addr_space, LASI_UART_HPA + 0x800, 0,
diff --git a/hw/i386/microvm.c b/hw/i386/microvm.c
index 170a331e3f..d0ed4dca50 100644
--- a/hw/i386/microvm.c
+++ b/hw/i386/microvm.c
@@ -267,7 +267,8 @@ static void microvm_devices_init(MicrovmMachineState *mms)

 if (mms->rtc == ON_OFF_AUTO_ON ||
 (mms->rtc == ON_OFF_AUTO_AUTO && !kvm_enabled())) {
-rtc_state = mc146818_rtc_init(isa_bus, 2000, NULL);
+rtc_state = mc146818_rtc_init(isa_bus, 2000, NULL,
+  rtc_apic_policy_slew_deliver_irq);
 microvm_set_rtc(mms, rtc_state);
 }

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 546b703cb4..650e7bc199 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1244,6 +1244,13 @@ static void pc_superio_init(ISABus *isa_bus, bool 
create_fdctrl,
 g_free(a20_line);
}

+bool rtc_apic_policy_slew_deliver_irq(RTCState *s)
+{
+apic_reset_irq_delivered();
+qemu_irq_raise(s->irq);
+return apic_get_irq_delivered();
+}
+
void pc_basic_device_init(struct PCMachineState *pcms,
   ISABus *isa_bus, qemu_irq *gsi,
   ISADevice **rtc_state,
@@ -1299,7 +1306,8 @@ void pc_basic_device_init(struct PCMachineState *pcms,
 pit_alt_irq = qdev_get_gpio_in(hpet, HPET_LEGACY_PIT_INT);
 rtc_irq = qdev_get_gpio_in(hpet, HPET_LEGACY_RTC_INT);
 }
-*rtc_state = mc146818_rtc_init(isa_bus, 2000, rtc_irq);
+*rtc_state = mc146818_rtc_init(isa_bus, 2000, rtc_irq,
+   rtc_apic_policy_slew_deliver_irq);


In my PIIX consolidation series [1] I'm instantiating the RTC in the south 
bridges since embedding the struct in the host device is the preferred new way. 
In the end there is one initialization shared by both PIIX3 and -4. While PIIX3 
(PC) will require rtc_apic_policy_slew_deliver_irq, PIIX4 (Malta) won't. 
Furthermore, my goal is to reuse PIIX4 in the PC machine to eliminate today's 
Frankenstein PIIX4 ACPI controller. Any idea how to solve this?


I assume that you could ignore this in the shared initialization code and 
just add the pointer in the code that sets up the x86 boards. It's a little 
bit ugly
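
One way to read the suggestion above is that the shared initialization leaves the hook unset and only the x86 board code fills it in afterwards. The following is a rough stand-alone sketch of that arrangement, not QEMU code: `MiniRTC` and every `mini_*` function are hypothetical, and the assumption that the slew path is only reachable when a hook is present is taken from the discussion, not verified against the device model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of the device state; not QEMU's RTCState. */
typedef struct MiniRTC MiniRTC;
struct MiniRTC {
    int base_year;
    int irq_raised;
    bool (*policy_slew_deliver_irq)(MiniRTC *s);  /* NULL on most boards */
};

/* Shared, target-independent init: leaves the hook unset. */
static void mini_rtc_init(MiniRTC *s, int base_year)
{
    s->base_year = base_year;
    s->irq_raised = 0;
    s->policy_slew_deliver_irq = NULL;
}

/* Stand-in for the x86-only delivery routine. */
static bool mini_x86_deliver_irq(MiniRTC *s)
{
    s->irq_raised++;
    return true;    /* pretend the interrupt was delivered */
}

/* x86 board code adds the hook after the shared init has run. */
static void mini_x86_board_setup(MiniRTC *s)
{
    mini_rtc_init(s, 2000);
    s->policy_slew_deliver_irq = mini_x86_deliver_irq;
}

/* Assumed guard: the slew path is only taken when a hook exists. */
static bool mini_rtc_can_slew(const MiniRTC *s)
{
    return s->policy_slew_deliver_irq != NULL;
}

/* Wrapper that keeps an assert, so a missing hook fails loudly. */
static bool mini_rtc_slew_deliver(MiniRTC *s)
{
    assert(s->policy_slew_deliver_irq);
    return s->policy_slew_deliver_irq(s);
}
```

With this split, a board that shares the same south-bridge setup but has no APIC simply never assigns the pointer, and the assert in the wrapper documents that calling the slew path there would be a bug.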

Re: [PATCH v15 1/6] qmp: add QMP command x-query-virtio

2022-12-07 Thread Jonah Palmer


On 12/2/22 10:21, Markus Armbruster wrote:

Philippe Mathieu-Daudé  writes:


On 2/12/22 13:23, Jonah Palmer wrote:

On 11/30/22 11:16, Philippe Mathieu-Daudé wrote:

Hi,

On 11/8/22 14:24, Jonah Palmer wrote:

From: Laurent Vivier

This new command lists all the instances of VirtIODevices with
their canonical QOM path and name.

[Jonah: @virtio_list duplicates information that already exists in
   the QOM composition tree. However, extracting necessary information
   from this tree seems to be a bit convoluted.

   Instead, we still create our own list of realized virtio devices
   but use @qmp_qom_get with the device's canonical QOM path to confirm
   that the device exists and is realized. If the device exists but
   is actually not realized, then we remove it from our list (for
   synchronicity to the QOM composition tree).

How could this happen?


   Also, the QMP command @x-query-virtio is redundant as @qom-list
   and @qom-get are sufficient to search '/machine/' for realized
   virtio devices. However, @x-query-virtio is much more convenient
   in listing realized virtio devices.]

Signed-off-by: Laurent Vivier
Signed-off-by: Jonah Palmer
---
   hw/virtio/meson.build  |  2 ++
   hw/virtio/virtio-stub.c    | 14 
   hw/virtio/virtio.c | 44 
   include/hw/virtio/virtio.h |  1 +
   qapi/meson.build   |  1 +
   qapi/qapi-schema.json  |  1 +
   qapi/virtio.json   | 68 ++
   tests/qtest/qmp-cmd-test.c |  1 +
   8 files changed, 132 insertions(+)
   create mode 100644 hw/virtio/virtio-stub.c
   create mode 100644 qapi/virtio.json
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 5d607aeaa0..bdfa82e9c0 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -13,12 +13,18 @@
     #include "qemu/osdep.h"
   #include "qapi/error.h"
+#include "qapi/qmp/qdict.h"
+#include "qapi/qapi-commands-virtio.h"
+#include "qapi/qapi-commands-qom.h"
+#include "qapi/qapi-visit-virtio.h"
+#include "qapi/qmp/qjson.h"
   #include "cpu.h"
   #include "trace.h"
   #include "qemu/error-report.h"
   #include "qemu/log.h"
   #include "qemu/main-loop.h"
   #include "qemu/module.h"
+#include "qom/object_interfaces.h"
   #include "hw/virtio/virtio.h"
   #include "migration/qemu-file-types.h"
   #include "qemu/atomic.h"
@@ -29,6 +35,9 @@
   #include "sysemu/runstate.h"
   #include "standard-headers/linux/virtio_ids.h"
   +/* QAPI list of realized VirtIODevices */
+static QTAILQ_HEAD(, VirtIODevice) virtio_list;
+
   /*
    * The alignment to use between consumer and producer parts of vring.
    * x86 pagesize again. This is the default, used by transports like PCI
@@ -3698,6 +3707,7 @@ static void virtio_device_realize(DeviceState *dev, Error 
**errp)
   vdev->listener.commit = virtio_memory_listener_commit;
   vdev->listener.name = "virtio";
   memory_listener_register(&vdev->listener, vdev->dma_as);
+    QTAILQ_INSERT_TAIL(&virtio_list, vdev, next);
   }
     static void virtio_device_unrealize(DeviceState *dev)
@@ -3712,6 +3722,7 @@ static void virtio_device_unrealize(DeviceState *dev)
   vdc->unrealize(dev);
   }
   +    QTAILQ_REMOVE(&virtio_list, vdev, next);
   g_free(vdev->bus_name);
   vdev->bus_name = NULL;
   }
@@ -3885,6 +3896,8 @@ static void virtio_device_class_init(ObjectClass *klass, 
void *data)
   vdc->stop_ioeventfd = virtio_device_stop_ioeventfd_impl;
     vdc->legacy_features |= VIRTIO_LEGACY_FEATURES;
+
+    QTAILQ_INIT(&virtio_list);
   }
     bool virtio_device_ioeventfd_enabled(VirtIODevice *vdev)
@@ -3895,6 +3908,37 @@ bool virtio_device_ioeventfd_enabled(VirtIODevice *vdev)
   return virtio_bus_ioeventfd_enabled(vbus);
   }
   +VirtioInfoList *qmp_x_query_virtio(Error **errp)
+{
+    VirtioInfoList *list = NULL;
+    VirtioInfoList *node;
+    VirtIODevice *vdev;
+
+    QTAILQ_FOREACH(vdev, &virtio_list, next) {
+    DeviceState *dev = DEVICE(vdev);
+    Error *err = NULL;
+    QObject *obj = qmp_qom_get(dev->canonical_path, "realized", &err);
+
+    if (err == NULL) {
+    GString *is_realized = qobject_to_json_pretty(obj, true);
+    /* virtio device is NOT realized, remove it from list */

Why not check dev->realized instead of calling qmp_qom_get() & 
qobject_to_json_pretty()?

This check queries the QOM composition tree to check that the device actually 
exists and is
realized. In other words, we just want to confirm with the QOM composition tree 
for the device.

Again, how could this happen?

If @virtio_list isn't reliable, why have it in the first place?


Honestly, I'm not sure how this even could happen, since the @virtio_list is 
managed at the realization
and unrealization of a virtio device. Given this, I do feel as though the list 
is reliable, although
this might just be naïve of me to say. After giving this a second look, the @virtio_list is 
only really needed to provide a nice list of all realized virtio 

Re: [RFC PATCH for 8.0 10/13] virtio-net: Migrate vhost inflight descriptors

2022-12-07 Thread Eugenio Perez Martin
On Tue, Dec 6, 2022 at 4:24 AM Jason Wang  wrote:
>
> On Tue, Dec 6, 2022 at 1:05 AM Eugenio Pérez  wrote:
> >
> > There is currently no data to be migrated, since nothing populates or
> > reads the fields on virtio-net.
> >
> > The migration of in-flight descriptors is modelled after the migration
> > of requests in virtio-blk. With some differences:
> > * virtio-blk migrates queue number on each request. Here we only add a
> >   vq if it has descriptors to migrate, and then we make all descriptors
> >   in an array.
> > * Use of QTAILQ since it works similar to signal the end of the inflight
> >   descriptors: 1 for more data, 0 if end. But do it for each vq instead
> >   of for each descriptor.
> > * Usage of VMState macros.
> >
> > The fields of descriptors would be way more complicated if we use the
> > VirtQueueElements directly, since there would be a few levels of
> > indirections. Using VirtQueueElementOld for the moment, and migrate to
> > VirtQueueElement for the final patch.
> >
> > TODO: Proper migration versioning
> > TODO: Do not embed vhost-vdpa structs
> > TODO: Migrate the VirtQueueElement, not VirtQueueElementOld.
> >
> > Signed-off-by: Eugenio Pérez 
> > ---
> >  include/hw/virtio/virtio-net.h |   2 +
> >  include/migration/vmstate.h|  11 +++
> >  hw/net/virtio-net.c| 129 +
> >  3 files changed, 142 insertions(+)
> >
> > diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
> > index ef234ffe7e..ae7c017ef0 100644
> > --- a/include/hw/virtio/virtio-net.h
> > +++ b/include/hw/virtio/virtio-net.h
> > @@ -151,9 +151,11 @@ typedef struct VirtIONetQueue {
> >  QEMUTimer *tx_timer;
> >  QEMUBH *tx_bh;
> >  uint32_t tx_waiting;
> > +uint32_t tx_inflight_num, rx_inflight_num;
> >  struct {
> >  VirtQueueElement *elem;
> >  } async_tx;
> > +VirtQueueElement **tx_inflight, **rx_inflight;
> >  struct VirtIONet *n;
> >  } VirtIONetQueue;
> >
> > diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> > index 9726d2d09e..9e0dfef9ee 100644
> > --- a/include/migration/vmstate.h
> > +++ b/include/migration/vmstate.h
> > @@ -626,6 +626,17 @@ extern const VMStateInfo vmstate_info_qlist;
> >  .offset = vmstate_offset_varray(_state, _field, _type),  \
> >  }
> >
> > +#define VMSTATE_STRUCT_VARRAY_ALLOC_UINT16(_field, _state, _field_num,       \
> > +                                           _version, _vmsd, _type) {         \
> > +    .name       = (stringify(_field)),                                       \
> > +    .version_id = (_version),                                                \
> > +    .vmsd       = &(_vmsd),                                                  \
> > +    .num_offset = vmstate_offset_value(_state, _field_num, uint16_t),        \
> > +    .size       = sizeof(_type),                                             \
> > +    .flags      = VMS_STRUCT | VMS_VARRAY_UINT16 | VMS_ALLOC | VMS_POINTER,  \
> > +    .offset     = vmstate_offset_pointer(_state, _field, _type),             \
> > +}
> > +
> >  #define VMSTATE_STRUCT_VARRAY_ALLOC(_field, _state, _field_num, _version, 
> > _vmsd, _type) {\
> >  .name   = (stringify(_field)),   \
> >  .version_id = (_version),\
> > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > index aba12759d5..ffd7bf1fc7 100644
> > --- a/hw/net/virtio-net.c
> > +++ b/hw/net/virtio-net.c
> > @@ -3077,6 +3077,13 @@ static bool mac_table_doesnt_fit(void *opaque, int 
> > version_id)
> >  return !mac_table_fits(opaque, version_id);
> >  }
> >
> > +typedef struct VirtIONetInflightQueue {
> > +uint16_t idx;
> > +uint16_t num;
> > +QTAILQ_ENTRY(VirtIONetInflightQueue) entry;
> > +VirtQueueElementOld *elems;
> > +} VirtIONetInflightQueue;
> > +
> >  /* This temporary type is shared by all the WITH_TMP methods
> >   * although only some fields are used by each.
> >   */
> > @@ -3086,6 +3093,7 @@ struct VirtIONetMigTmp {
> >  uint16_tcurr_queue_pairs_1;
> >  uint8_t has_ufo;
> >  uint32_thas_vnet_hdr;
> > +QTAILQ_HEAD(, VirtIONetInflightQueue) queues_inflight;
> >  };
> >
> >  /* The 2nd and subsequent tx_waiting flags are loaded later than
> > @@ -3231,6 +3239,124 @@ static const VMStateDescription 
> > vmstate_virtio_net_rss = {
> >  },
> >  };
> >
> > +static const VMStateDescription vmstate_virtio_net_inflight_queue = {
> > +.name  = "virtio-net-device/inflight/queue",
> > +.fields = (VMStateField[]) {
> > +VMSTATE_UINT16(idx, VirtIONetInflightQueue),
> > +VMSTATE_UINT16(num, VirtIONetInflightQueue),
> > +
> > +VMSTATE_STRUCT_VARRAY_ALLOC_UINT16(elems, VirtIONetInflightQueue, 
> > num,
> > +   0, 
> > vmstate_virtqueue_element_old,
> > +

Re: [PATCH for-8.0] hw/rtc/mc146818rtc: Make this rtc device target independent

2022-12-07 Thread Thomas Huth

On 07/12/2022 00.12, BALATON Zoltan wrote:

On Tue, 6 Dec 2022, Thomas Huth wrote:

The only code that is really, really target dependent is the apic-related
code in rtc_policy_slew_deliver_irq(). By moving this code into the hw/i386/
folder (renamed to rtc_apic_policy_slew_deliver_irq()) and passing this
function as parameter to mc146818_rtc_init(), we can make the RTC completely
target-independent.

Signed-off-by: Thomas Huth 
---

...

@@ -124,9 +118,8 @@ void qmp_rtc_reset_reinjection(Error **errp)

static bool rtc_policy_slew_deliver_irq(RTCState *s)
{
-    apic_reset_irq_delivered();
-    qemu_irq_raise(s->irq);
-    return apic_get_irq_delivered();
+    assert(s->policy_slew_deliver_irq);


Is this assert necessary here? Since it seems that creating the timer that 
would call this is testing for s->policy_slew_deliver_irq being non-NULL 
there should be no way to call this without policy_slew_deliver_irq set.


There was an assert(0) in the original code on non-x86 targets, too, see 
below. I would like to keep that logic here.


If 
you drop the assert then this function also become redundant and 
s->policy_slew_deliver_irq() can be used directly instead simplifying this a 
bit more.


I'd agree, but I really would like to keep the assert(). (additionally, the 
patch stays smaller this way)



+    return s->policy_slew_deliver_irq(s);
}

static void rtc_coalesced_timer(void *opaque)
@@ -145,13 +138,6 @@ static void rtc_coalesced_timer(void *opaque)

    rtc_coalesced_timer_update(s);
}
-#else
-static bool rtc_policy_slew_deliver_irq(RTCState *s)
-{
-    assert(0);


This ^ is the assert() I was talking about.


-    return false;
-}
-#endif


 Thomas




Re: [RFC PATCH for 8.0 00/13] vDPA-net inflight descriptors migration with SVQ

2022-12-07 Thread Eugenio Perez Martin
On Tue, Dec 6, 2022 at 8:08 AM Jason Wang  wrote:
>
> On Tue, Dec 6, 2022 at 1:04 AM Eugenio Pérez  wrote:
> >
> > The state of the descriptors (avail or used) may not be recoverable just
> > looking at the guest memory.  Out of order used descriptor may override
> > previous avail ones in the descriptor table or avail vring.
> >
> > Currently we're not migrating this status in net devices because virtio-net,
> > vhost-kernel etc use the descriptors in order,
>
> Note that this might not be the truth (when zerocopy is enabled).
>

Good point. So will virtio-net wait for those to complete then? How
does qemu handle if there are still inflight descriptors?

> > so the information is always
> > recoverable from guest's memory.  However, vDPA devices may use them out of
> > order, and other kind of devices like block need this support.
> >
> > Shadow virtqueue is able to track these and resend them at the destination.
>
> As discussed, there's a bootstrap issue here:
>
> When SVQ needs to be enabled on demand, do we still need another way
> to get inflight ones without the help of SVQ?
>

Sending and retrieving the descriptors without SVQ needs to be developed
on top of this. I should have made that clearer in the cover
letter.

Thanks!

> Thanks
>
> > Add them to the virtio-net migration description so they are not lost in the
> > process.
> >
> > This is a very early RFC just to validate the first draft so expect 
> > leftovers.
> > Fetching and requesting the descriptors from a device without SVQ needs to be
> > implemented on top. Some other notable pending items are:
> > * Do not send the descriptors actually recoverable from the guest memory.
> > * Properly version the migrate data.
> > * Properly abstract the descriptors access from virtio-net to SVQ.
> > * Do not use VirtQueueElementOld but migrate directly VirtQueueElement.
> > * Replace lots of assertions with runtime conditionals.
> > * Other TODOs in the patch message or code changes.
> >
> > Thanks.
> >
> > Eugenio Pérez (13):
> >   vhost: add available descriptor list in SVQ
> >   vhost: iterate only available descriptors at SVQ stop
> >   vhost: merge avail list and next avail descriptors detach
> >   vhost: add vhost_svq_save_inflight
> >   virtio: Specify uint32_t as VirtQueueElementOld members type
> >   virtio: refactor qemu_get_virtqueue_element
> >   virtio: refactor qemu_put_virtqueue_element
> >   virtio: expose VirtQueueElementOld
> >   virtio: add vmstate_virtqueue_element_old
> >   virtio-net: Migrate vhost inflight descriptors
> >   virtio-net: save inflight descriptors at vhost shutdown
> >   vhost: expose vhost_svq_add_element
> >   vdpa: Recover inflight descriptors
> >
> >  hw/virtio/vhost-shadow-virtqueue.h |   9 ++
> >  include/hw/virtio/virtio-net.h |   2 +
> >  include/hw/virtio/virtio.h |  32 ++
> >  include/migration/vmstate.h|  22 
> >  hw/net/vhost_net.c |  56 ++
> >  hw/net/virtio-net.c| 129 +++
> >  hw/virtio/vhost-shadow-virtqueue.c |  52 +++--
> >  hw/virtio/vhost-vdpa.c |  11 --
> >  hw/virtio/virtio.c | 162 ++---
> >  9 files changed, 392 insertions(+), 83 deletions(-)
> >
> > --
> > 2.31.1
> >
> >
>




[PATCH 2/2] target/riscv: Clear mstatus.MPRV when leaving M-mode for priv spec 1.12+

2022-12-07 Thread Bin Meng
Since priv spec v1.12, MRET and SRET now clear mstatus.MPRV when
leaving M-mode.

Signed-off-by: Bin Meng 

---

 target/riscv/op_helper.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/target/riscv/op_helper.c b/target/riscv/op_helper.c
index a047d38152..878bcb03b8 100644
--- a/target/riscv/op_helper.c
+++ b/target/riscv/op_helper.c
@@ -154,6 +154,9 @@ target_ulong helper_sret(CPURISCVState *env)
 get_field(mstatus, MSTATUS_SPIE));
 mstatus = set_field(mstatus, MSTATUS_SPIE, 1);
 mstatus = set_field(mstatus, MSTATUS_SPP, PRV_U);
+if (env->priv_ver >= PRIV_VERSION_1_12_0) {
+mstatus = set_field(mstatus, MSTATUS_MPRV, 0);
+}
 env->mstatus = mstatus;
 
 if (riscv_has_ext(env, RVH) && !riscv_cpu_virt_enabled(env)) {
@@ -203,6 +206,9 @@ target_ulong helper_mret(CPURISCVState *env)
 mstatus = set_field(mstatus, MSTATUS_MPIE, 1);
 mstatus = set_field(mstatus, MSTATUS_MPP, PRV_U);
 mstatus = set_field(mstatus, MSTATUS_MPV, 0);
+if ((env->priv_ver >= PRIV_VERSION_1_12_0) && (prev_priv != PRV_M)) {
+mstatus = set_field(mstatus, MSTATUS_MPRV, 0);
+}
 env->mstatus = mstatus;
 riscv_cpu_set_mode(env, prev_priv);
 
-- 
2.34.1
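
For readers skimming the archive, the rule this patch implements can be
sketched in isolation. This is only an illustration, not QEMU code:
xret_update_mprv() and its spec_1_12 flag are hypothetical names, and the
bit position of MPRV (bit 17) is taken from the privileged spec. Note that
the patch clears MPRV unconditionally in helper_sret(), since SRET never
lands in M-mode, and checks prev_priv only in helper_mret().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Privilege levels as encoded in mstatus.MPP / mstatus.SPP. */
enum { PRV_U = 0, PRV_S = 1, PRV_M = 3 };

/* mstatus.MPRV is bit 17 in the privileged spec. */
#define MSTATUS_MPRV (UINT64_C(1) << 17)

/*
 * Hypothetical helper (not a QEMU function): return the mstatus value
 * after an xRET that lands in privilege level new_priv.
 * spec_1_12 selects the priv spec 1.12+ behaviour.
 */
static uint64_t xret_update_mprv(uint64_t mstatus, int new_priv,
                                 bool spec_1_12)
{
    assert(new_priv == PRV_U || new_priv == PRV_S || new_priv == PRV_M);

    /* Only a return to a mode less privileged than M clears MPRV. */
    if (spec_1_12 && new_priv != PRV_M) {
        mstatus &= ~MSTATUS_MPRV;
    }
    return mstatus;
}
```

With these definitions, an MRET back into M-mode leaves MPRV untouched,
while an MRET or SRET into S-mode or U-mode clears it on 1.12+ only.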




[PATCH 1/2] target/riscv: Simplify helper_sret() a little bit

2022-12-07 Thread Bin Meng
There are 2 paths in helper_sret() and the same mstatus update code
is duplicated in both. Extract the common parts to simplify it a little bit.

Signed-off-by: Bin Meng 
---

 target/riscv/op_helper.c | 20 ++--
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/target/riscv/op_helper.c b/target/riscv/op_helper.c
index d7af7f056b..a047d38152 100644
--- a/target/riscv/op_helper.c
+++ b/target/riscv/op_helper.c
@@ -149,21 +149,21 @@ target_ulong helper_sret(CPURISCVState *env)
 }
 
 mstatus = env->mstatus;
+prev_priv = get_field(mstatus, MSTATUS_SPP);
+mstatus = set_field(mstatus, MSTATUS_SIE,
+get_field(mstatus, MSTATUS_SPIE));
+mstatus = set_field(mstatus, MSTATUS_SPIE, 1);
+mstatus = set_field(mstatus, MSTATUS_SPP, PRV_U);
+env->mstatus = mstatus;
 
 if (riscv_has_ext(env, RVH) && !riscv_cpu_virt_enabled(env)) {
 /* We support Hypervisor extensions and virtulisation is disabled */
 target_ulong hstatus = env->hstatus;
 
-prev_priv = get_field(mstatus, MSTATUS_SPP);
 prev_virt = get_field(hstatus, HSTATUS_SPV);
 
 hstatus = set_field(hstatus, HSTATUS_SPV, 0);
-mstatus = set_field(mstatus, MSTATUS_SPP, 0);
-mstatus = set_field(mstatus, SSTATUS_SIE,
-get_field(mstatus, SSTATUS_SPIE));
-mstatus = set_field(mstatus, SSTATUS_SPIE, 1);
 
-env->mstatus = mstatus;
 env->hstatus = hstatus;
 
 if (prev_virt) {
@@ -171,14 +171,6 @@ target_ulong helper_sret(CPURISCVState *env)
 }
 
 riscv_cpu_set_virt_enabled(env, prev_virt);
-} else {
-prev_priv = get_field(mstatus, MSTATUS_SPP);
-
-mstatus = set_field(mstatus, MSTATUS_SIE,
-get_field(mstatus, MSTATUS_SPIE));
-mstatus = set_field(mstatus, MSTATUS_SPIE, 1);
-mstatus = set_field(mstatus, MSTATUS_SPP, PRV_U);
-env->mstatus = mstatus;
 }
 
 riscv_cpu_set_mode(env, prev_priv);
-- 
2.34.1




Re: How to best make include/hw/pci/pcie_sriov.h self-contained

2022-12-07 Thread Philippe Mathieu-Daudé

On 7/12/22 07:25, Markus Armbruster wrote:

pcie_sriov.h needs PCI_NUM_REGIONS from pci.h, but doesn't include it.
pci.h must be included before pcie_sriov.h or else compile fails.

Adding #include "pci/pci.h" to pcie_sriov would be wrong, because it
would close an inclusion loop: pci.h includes pcie.h (for
PCIExpressDevice) includes pcie_sriov.h (for PCIESriovPF) includes pci.h
(for PCI_NUM_REGIONS).

The obvious solution is to move PCI_NUM_REGIONS from pci.h to somewhere
pcie_sriov.h can include without creating a loop.

We already have a few headers that don't include anything: pci_ids.h,
pci_regs.h (includes include/standard-headers/linux/pci_regs.h, which
doesn't count), pcie_regs.h.  Moving PCI_NUM_REGIONS to one of these
would work, but it doesn't feel right.

We could create a new one, say pci_defs.h.  Just for PCI_NUM_REGIONS
feels silly.  So, what else should move there?


Sounds good to me. Maybe name it pci_standard_defs.h?

We can move the first 100 lines of pci.h there: PCI_ROM_SLOT,
PCI_NUM_REGIONS, PCI HEADER_TYPE, PCI_NUM_PINS, cap_present, and
possibly PCIINTxRoute & PCIReqIDType.




Any other ideas?

In case you wonder why I bother you with this...

Back in 2016, we discussed[1] rules for headers, and these were
generally liked:

1. Have a carefully curated header that's included everywhere first.  We
got that already thanks to Peter: osdep.h.

2. Headers should normally include everything they need beyond osdep.h.
If exceptions are needed for some reason, they must be documented in
the header.  If all that's needed from a header is typedefs, put
those into qemu/typedefs.h instead of including the header.

3. Cyclic inclusion is forbidden.

I'm working on patches to get include/ closer to obeying 2.

[1] Message-ID: <87h9g8j57d@blackfin.pond.sub.org>
 https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg03345.html
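
A rough sketch of the split discussed in this thread, collapsed into one
translation unit so it compiles on its own. The header name pci_defs.h is
the one floated above; SriovPFSketch is a hypothetical stand-in for the
real PCIESriovPF, and the two constant values are quoted from memory of
pci.h, so treat them as assumptions to be checked against the tree.

```c
#include <assert.h>
#include <stdint.h>

/* --- would live in the new, include-free pci_defs.h (name assumed) --- */
#ifndef HW_PCI_DEFS_H
#define HW_PCI_DEFS_H

#define PCI_ROM_SLOT    6   /* assumed value, check pci.h */
#define PCI_NUM_REGIONS 7   /* assumed value, check pci.h */

#endif /* HW_PCI_DEFS_H */

/*
 * --- would live in pcie_sriov.h, which now includes only pci_defs.h ---
 * No include of pci.h is needed any more, so the loop
 * pci.h -> pcie.h -> pcie_sriov.h -> pci.h cannot form.
 */
typedef struct SriovPFSketch {
    uint8_t vf_bar_type[PCI_NUM_REGIONS];   /* one entry per BAR/region */
} SriovPFSketch;

/* Compile-time sanity check on the two constants. */
static_assert(PCI_ROM_SLOT < PCI_NUM_REGIONS,
              "ROM slot must be a valid region index");
```

pci.h would then include pci_defs.h as well, so existing users of
PCI_NUM_REGIONS keep compiling unchanged.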







Re: [PATCH] configure: Fix check-tcg not executing any tests

2022-12-07 Thread Philippe Mathieu-Daudé

Hi Mukilan,

On 7/12/22 09:23, Mukilan Thiyagarajan wrote:

After configuring with --target-list=hexagon-linux-user
running `make check-tcg` just prints the following:

```
make: Nothing to be done for 'check-tcg'
```

In the probe_target_compiler function, the 'break'
command is used incorrectly. There are no lexically
enclosing loops associated with that break command, which
is unspecified behaviour in the POSIX standard.

The dash shell implementation aborts the currently executing
loop, in this case, causing the rest of the logic for the loop
in line 2490 to be skipped, which means no Makefiles are
generated for the tcg target tests.

Fixes: c3b570b5a9a24d25 (configure: don't enable
cross compilers unless in target_list)


When posting a patch that fixes an issue introduced by another one,
you'll get more feedback if you Cc the author/reviewers of that
patch.

Cc'ing the maintainers also helps in getting your patch picked
up :) See:

https://www.qemu.org/docs/master/devel/submitting-a-patch.html#cc-the-relevant-maintainer

I've Cc'ed the corresponding developers for you.

Regards,

Phil.


Signed-off-by: Mukilan Thiyagarajan 
---
  configure | 4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/configure b/configure
index 26c7bc5154..7a804fb657 100755
--- a/configure
+++ b/configure
@@ -1881,9 +1881,7 @@ probe_target_compiler() {
# We shall skip configuring the target compiler if the user didn't
# bother enabling an appropriate guest. This avoids building
# extraneous firmware images and tests.
-  if test "${target_list#*$1}" != "$1"; then
-  break;
-  else
+  if test "${target_list#*$1}" = "$1"; then
return 1
fi
  





Re: [PATCH v12 2/7] s390x/cpu topology: reporting the CPU topology to the guest

2022-12-07 Thread Cédric Le Goater

On 11/29/22 18:42, Pierre Morel wrote:

The guest uses the STSI instruction to get information on the
CPU topology.

Let us implement the STSI instruction for the basis CPU topology
level, level 2.

Signed-off-by: Pierre Morel 
---
  target/s390x/cpu.h  |  77 +++
  hw/s390x/s390-virtio-ccw.c  |  12 +--
  target/s390x/cpu_topology.c | 186 
  target/s390x/kvm/kvm.c  |   6 +-
  target/s390x/meson.build|   1 +
  5 files changed, 274 insertions(+), 8 deletions(-)
  create mode 100644 target/s390x/cpu_topology.c

diff --git a/target/s390x/cpu.h b/target/s390x/cpu.h
index 7d6d01325b..dd878ac916 100644
--- a/target/s390x/cpu.h
+++ b/target/s390x/cpu.h
@@ -175,6 +175,7 @@ struct ArchCPU {
  /* needed for live migration */
  void *irqstate;
  uint32_t irqstate_saved_size;
+void *machine_data;
  };
  
  
@@ -565,6 +566,80 @@ typedef union SysIB {

  } SysIB;
  QEMU_BUILD_BUG_ON(sizeof(SysIB) != 4096);
  
+/*

+ * CPU Topology List provided by STSI with fc=15 provides a list
+ * of two different Topology List Entries (TLE) types to specify
+ * the topology hierarchy.
+ *
+ * - Container Topology List Entry
+ *   Defines a container to contain other Topology List Entries
+ *   of any type, nested containers or CPU.
+ * - CPU Topology List Entry
+ *   Specifies the CPUs position, type, entitlement and polarization
+ *   of the CPUs contained in the last Container TLE.
+ *
+ * There can be theoretically up to five levels of containers, QEMU
+ * uses only one level, the socket level.
+ *
+ * A container of with a nesting level (NL) greater than 1 can only
+ * contain another container of nesting level NL-1.
+ *
+ * A container of nesting level 1 (socket), contains as many CPU TLE
+ * as needed to describe the position and qualities of all CPUs inside
+ * the container.
+ * The qualities of a CPU are polarization, entitlement and type.
+ *
+ * The CPU TLE defines the position of the CPUs of identical qualities
+ * using a 64bits mask which first bit has its offset defined by
+ * the CPU address orgin field of the CPU TLE like in:
+ * CPU address = origin * 64 + bit position within the mask
+ *
+ */
+/* Container type Topology List Entry */
+typedef struct SysIBTl_container {
+uint8_t nl;
+uint8_t reserved[6];
+uint8_t id;
+} QEMU_PACKED QEMU_ALIGNED(8) SysIBTl_container;
+QEMU_BUILD_BUG_ON(sizeof(SysIBTl_container) != 8);
+
+/* CPU type Topology List Entry */
+typedef struct SysIBTl_cpu {
+uint8_t nl;
+uint8_t reserved0[3];
+uint8_t reserved1:5;
+uint8_t dedicated:1;
+uint8_t polarity:2;
+uint8_t type;
+uint16_t origin;
+uint64_t mask;
+} QEMU_PACKED QEMU_ALIGNED(8) SysIBTl_cpu;
+QEMU_BUILD_BUG_ON(sizeof(SysIBTl_cpu) != 16);
+
+#define S390_TOPOLOGY_MAG  6
+#define S390_TOPOLOGY_MAG6 0
+#define S390_TOPOLOGY_MAG5 1
+#define S390_TOPOLOGY_MAG4 2
+#define S390_TOPOLOGY_MAG3 3
+#define S390_TOPOLOGY_MAG2 4
+#define S390_TOPOLOGY_MAG1 5
+/* Configuration topology */
+typedef struct SysIB_151x {
+uint8_t  reserved0[2];
+uint16_t length;
+uint8_t  mag[S390_TOPOLOGY_MAG];
+uint8_t  reserved1;
+uint8_t  mnest;
+uint32_t reserved2;
+char tle[0];
+} QEMU_PACKED QEMU_ALIGNED(8) SysIB_151x;
+QEMU_BUILD_BUG_ON(sizeof(SysIB_151x) != 16);
+
+/* Max size of a SYSIB structure is when all CPU are alone in a container */
+#define S390_TOPOLOGY_SYSIB_SIZE (sizeof(SysIB_151x) + \
+  S390_MAX_CPUS * (sizeof(SysIBTl_container) + \
+   sizeof(SysIBTl_cpu)))
+
  /* MMU defines */
  #define ASCE_ORIGIN   (~0xfffULL) /* segment table origin */
  #define ASCE_SUBSPACE 0x200   /* subspace group control   */
@@ -843,4 +918,6 @@ S390CPU *s390_cpu_addr2state(uint16_t cpu_addr);
  
  #include "exec/cpu-all.h"
  
+void insert_stsi_15_1_x(S390CPU *cpu, int sel2, __u64 addr, uint8_t ar);

+
  #endif
diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
index 973bbdd36e..4be07959fd 100644
--- a/hw/s390x/s390-virtio-ccw.c
+++ b/hw/s390x/s390-virtio-ccw.c
@@ -64,11 +64,10 @@ S390CPU *s390_cpu_addr2state(uint16_t cpu_addr)
  return S390_CPU(ms->possible_cpus->cpus[cpu_addr].cpu);
  }
  
-static S390CPU *s390x_new_cpu(const char *typename, uint32_t core_id,

-  Error **errp)
+static void s390x_new_cpu(MachineState *ms, uint32_t core_id, Error **errp)
  {
-S390CPU *cpu = S390_CPU(object_new(typename));
-S390CPU *ret = NULL;
+S390CcwMachineState *s390ms = S390_CCW_MACHINE(ms);
+S390CPU *cpu = S390_CPU(object_new(ms->cpu_type));
  
  if (!object_property_set_int(OBJECT(cpu), "core-id", core_id, errp)) {

  goto out;
@@ -76,11 +75,10 @@ static S390CPU *s390x_new_cpu(const char *typename, 
uint32_t core_id,
  if (!qdev_realize(DEVICE(cpu), NULL, errp))

Re: REG: TTC Timer

2022-12-07 Thread Gowri Shankar
Hello Konrad,
Could you please help me to solve it?
Thanks & Regards,
P.Gowrishankar

On Mon, Dec 5, 2022 at 4:26 PM Gowri Shankar  wrote:

> Hi Konrad,
>
> Thanks for your quick response.
>
> Now I want to increment the TTC counter value to enable the system tick.
> How to configure the TTC register to increment it in QEMU.
>
> I found the steps to enable the TTC counter which is below. But not able
> to increment. If possible could you please share the example source code?
>
>   1. Select clock input source, set prescaler value (slcr.MIO_MUX_SEL
> registers, TTC Clock Control register). Ensure TTC is disabled
> (ttc.Counter_Control_x [DIS] = 1) before proceeding with this step.
> 2. Set interval value (Interval register). This step is optional, for
> interval mode only.
> 3. Set match value (Match registers). This step is optional, if matching
> is to be enabled.
> 4. Enable interrupt (Interrupt Enable register). This step is optional, if
> interrupt is to be enabled.
> 5. Enable/disable waveform output, enable/disable matching, set counting
> direction, set mode, enable counter (TTC Counter Control register). This
> step starts the counter.
>
> Thanks & Regards,
> P.Gowrishankar
>
> On Mon, Dec 5, 2022 at 4:07 PM Konrad, Frederic 
> wrote:
>
>> Hi Philippe,
>> Hi Gowri,
>>
>> The zcu102 has a zynqmp soc object (hw/arm/xlnx-zcu102.c:125):
>>
>> static void xlnx_zcu102_init(MachineState *machine)
>> {
>> ...
>> object_initialize_child(OBJECT(machine), "soc", &s->soc,
>> TYPE_XLNX_ZYNQMP);
>>
>> So the TTCs should work in the ZCU102.
>>
>> Best Regards,
>> Fred
>>
>> -Original Message-
>> From: Philippe Mathieu-Daudé 
>> Sent: 05 December 2022 09:24
>> To: Gowri Shankar ; QEMU Developers <
>> qemu-devel@nongnu.org>; qemu-arm 
>> Cc: qemu-disc...@nongnu.org; Konrad, Frederic ;
>> Iglesias, Francisco ; Alistair Francis <
>> alistair.fran...@wdc.com>
>> Subject: Re: REG: TTC Timer
>>
>> On 22/11/22 12:27, Gowri Shankar wrote:
>> > Hi Team,
>> >
>> > Advance Thanks for Your support.
>> >
>> > Could you please clarify one point here?
>> > I am using a Xilinx ZCU102 machine with QEMU7.1.0.
>> >
>> > I have seen QEMU 7.1.0 release has TTC timers for the Xilinx-zynqmp
>> > SoC model.
>> > url: https://wiki.qemu.org/ChangeLog/7.1
>> > 
>> >
>> > In this case, can the ZCU102 machine also use the TTC feature?
>> > If yes and possible, Could you please share the example code snippet?
>> > --
>> > Thanks & Regards,
>> > P. Gowrishankar.
>> > +919944802490
>>
>> Cc'ing qemu-arm@ mailing list and Xilinx ZCU102 machine developers.
>>
>>
>
> --
> Thanks & Regards,
> P. Gowrishankar.
> +919944802490
>
>
>
>

-- 
Thanks & Regards,
P. Gowrishankar.
+919944802490
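
Since example code was asked for in this thread, here is a minimal sketch
of the five steps quoted above for one timer/counter in interval mode. It
is not taken from any Xilinx driver: the register offsets and bit
positions are assumptions based on my reading of the TTC register map and
must be checked against the TRM, ttc_start_interval() is a hypothetical
helper, and the match and waveform bits are simply left at zero. The base
pointer is a parameter so the sequence can also be tried against a plain
array on the host.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets for timer/counter 1 of one TTC block (assumed, check the TRM). */
#define TTC_CLOCK_CTRL     0x00
#define TTC_COUNTER_CTRL   0x0C
#define TTC_COUNTER_VALUE  0x18
#define TTC_INTERVAL       0x24
#define TTC_IRQ_ENABLE     0x60

/* Bit fields (assumed, check the TRM). */
#define CLOCK_CTRL_PS_EN    (1u << 0)                      /* prescaler on   */
#define CLOCK_CTRL_PS_V(n)  (((uint32_t)(n) & 0xFu) << 1)  /* prescale value */
#define COUNTER_CTRL_DIS    (1u << 0)                      /* counter off    */
#define COUNTER_CTRL_INT    (1u << 1)                      /* interval mode  */
#define COUNTER_CTRL_RST    (1u << 4)                      /* restart count  */
#define IRQ_INTERVAL        (1u << 0)                      /* interval IRQ   */

static void ttc_write(volatile uint32_t *base, uint32_t off, uint32_t val)
{
    base[off / 4] = val;
}

/* Hypothetical helper: program and start one counter in interval mode. */
static void ttc_start_interval(volatile uint32_t *base, uint32_t prescale,
                               uint32_t interval)
{
    assert(prescale <= 0xFu);

    /* Step 1: make sure the counter is disabled, then set the prescaler. */
    ttc_write(base, TTC_COUNTER_CTRL, COUNTER_CTRL_DIS);
    ttc_write(base, TTC_CLOCK_CTRL,
              CLOCK_CTRL_PS_EN | CLOCK_CTRL_PS_V(prescale));

    /* Step 2: interval value (interval mode only). */
    ttc_write(base, TTC_INTERVAL, interval);

    /* Step 3 (match value) is optional and skipped in this sketch. */

    /* Step 4: enable the interval interrupt. */
    ttc_write(base, TTC_IRQ_ENABLE, IRQ_INTERVAL);

    /* Step 5: interval mode, restart, DIS cleared: the counter runs. */
    ttc_write(base, TTC_COUNTER_CTRL, COUNTER_CTRL_INT | COUNTER_CTRL_RST);
}
```

On the guest, base would be the TTC block's address from the ZynqMP memory
map, and the running count could then be polled at TTC_COUNTER_VALUE.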


Re: [PATCH v3] intel-iommu: Document iova_tree

2022-12-07 Thread Eric Auger
Hi Peter,

On 12/6/22 23:13, Peter Xu wrote:
> It seems not super clear on when iova_tree is used, and why.  Add a rich
> comment above iova_tree to track why we needed the iova_tree, and when we
> need it.
>
> Also comment for the map/unmap messages, on how they're used and
> implications (e.g. unmap can be larger than the mapped ranges).
>
> Suggested-by: Jason Wang 
> Signed-off-by: Peter Xu 

Reviewed-by: Eric Auger 

Thanks

Eric
> ---
> v3:
> - Adjust according to Eric's comment
> ---
>  include/exec/memory.h | 28 ++
>  include/hw/i386/intel_iommu.h | 38 ++-
>  2 files changed, 65 insertions(+), 1 deletion(-)
>
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 91f8a2395a..269ecb873b 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -129,6 +129,34 @@ struct IOMMUTLBEntry {
>  /*
>   * Bitmap for different IOMMUNotifier capabilities. Each notifier can
>   * register with one or multiple IOMMU Notifier capability bit(s).
> + *
> + * Normally there're two use cases for the notifiers:
> + *
> + *   (1) When the device needs accurate synchronizations of the vIOMMU page
> + *   tables, it needs to register with both MAP|UNMAP notifies (which
> + *   is defined as IOMMU_NOTIFIER_IOTLB_EVENTS below).
> + *
> + *   Regarding to accurate synchronization, it's when the notified
> + *   device maintains a shadow page table and must be notified on each
> + *   guest MAP (page table entry creation) and UNMAP (invalidation)
> + *   events (e.g. VFIO). Both notifications must be accurate so that
> + *   the shadow page table is fully in sync with the guest view.
> + *
> + *   (2) When the device doesn't need accurate synchronizations of the
> + *   vIOMMU page tables, it needs to register only with UNMAP or
> + *   DEVIOTLB_UNMAP notifies.
> + *
> + *   It's when the device maintains a cache of IOMMU translations
> + *   (IOTLB) and is able to fill that cache by requesting translations
> + *   from the vIOMMU through a protocol similar to ATS (Address
> + *   Translation Service).
> + *
> + *   Note that in this mode the vIOMMU will not maintain a shadowed
> + *   page table for the address space, and the UNMAP messages can be
> + *   actually larger than the real invalidations (just like how the
> + *   Linux IOMMU driver normally works, where an invalidation can be
> + *   enlarged as long as it still covers the target range).  The IOMMU
> + *   notifiee should be able to take care of over-sized invalidations.
>   */
>  typedef enum {
>  IOMMU_NOTIFIER_NONE = 0,
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 46d973e629..89dcbc5e1e 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -109,7 +109,43 @@ struct VTDAddressSpace {
>  QLIST_ENTRY(VTDAddressSpace) next;
>  /* Superset of notifier flags that this address space has */
>  IOMMUNotifierFlag notifier_flags;
> -IOVATree *iova_tree;  /* Traces mapped IOVA ranges */
> +/*
> + * @iova_tree traces mapped IOVA ranges.
> + *
> + * The tree is not needed if no MAP notifier is registered with current
> + * VTD address space, because all guest invalidate commands can be
> + * directly passed to the IOMMU UNMAP notifiers without any further
> + * reshuffling.
> + *
> + * The tree OTOH is required for MAP typed iommu notifiers for a few
> + * reasons.
> + *
> + * Firstly, there's no way to identify whether an PSI (Page Selective
> + * Invalidations) or DSI (Domain Selective Invalidations) event is an
> + * MAP or UNMAP event within the message itself.  Without having prior
> + * knowledge of existing state vIOMMU doesn't know whether it should
> + * notify MAP or UNMAP for a PSI message it received when caching mode
> + * is enabled (for MAP notifiers).
> + *
> + * Secondly, PSI messages received from guest driver can be enlarged in
> + * range, covers but not limited to what the guest driver wanted to
> + * invalidate.  When the range to invalidates gets bigger than the
> + * limit of a PSI message, it can even become a DSI which will
> + * invalidate the whole domain.  If the vIOMMU directly notifies the
> + * registered device with the unmodified range, it may confuse the
> + * registered drivers (e.g. vfio-pci) on either:
> + *
> + *   (1) Trying to map the same region more than once (for
> + *   VFIO_IOMMU_MAP_DMA, -EEXIST will trigger), or,
> + *
> + *   (2) Trying to UNMAP a range that is still partially mapped.
> + *
> + * That accuracy is not required for UNMAP-only notifiers, but it is a
> + * must-to-have for notifiers registered with MAP events, because the
> + * vIOMMU needs to make sure the shadow page table is always in sync
> + * with the 

Re: [PATCH v12 2/7] s390x/cpu topology: reporting the CPU topology to the guest

2022-12-07 Thread Pierre Morel




On 12/7/22 10:12, Cédric Le Goater wrote:

On 11/29/22 18:42, Pierre Morel wrote:

The guest uses the STSI instruction to get information on the
CPU topology.

Let us implement the STSI instruction for the basis CPU topology
level, level 2.

Signed-off-by: Pierre Morel 
---


...snip...


+
+#define S390_TOPOLOGY_MAX_MNEST 2
+
+void insert_stsi_15_1_x(S390CPU *cpu, int sel2, __u64 addr, uint8_t ar)
+{
+    union {
+    char place_holder[S390_TOPOLOGY_SYSIB_SIZE];
+    SysIB_151x sysib;
+    } buffer QEMU_ALIGNED(8);
+    int len;
+
+    if (s390_is_pv() || !s390_has_topology() ||
+    sel2 < 2 || sel2 > S390_TOPOLOGY_MAX_MNEST) {
+    setcc(cpu, 3);
+    return;
+    }

+    s390_prepare_topology(S390_CCW_MACHINE(cpu->machine_data));
+
+    len = setup_stsi(cpu, &buffer.sysib, sel2);



The S390_CPU_TOPOLOGY object is created by the machine at init time
and the two above routines are the only users of this object.


That is true at the moment, but the object will be used in the following
patches to implement reset (patch 3) and migration (patch 4).





The first loops on all possible CPUs to populate the bitmask array
'socket' under S390_CPU_TOPOLOGY and the second uses the result to
populate the buffer returned to the guest OS.

I don't understand why the S390_CPU_TOPOLOGY object is needed at all.
AFAICT, this is just adding extra complexity.


I used an object because I thought it could be cleaner for the 
implementation of reset and migration.



Is the patchset preparing
ground for some more features?


Yes, it is. I removed the book and drawer topology containers from this
patch series in version 10 in order to postpone their
implementation.


The next series on topology implementation will also add, besides the
implementation of drawers and books, the possibility to modify the
topology during the life of a guest.


Books, drawers and topology modifications will all need to be migrated.

Is there a good alternative to facilitate the implementation of the 
migration ?


Of course we can put everything together inside the CcwMachineState, but
wouldn't the use of a dedicated object make it all cleaner?


Regards,
Pierre

If so, it should be explained in the

commit log.

As for now, I see no good justification for S390_CPU_TOPOLOGY and we
could add support with a simple routine called from insert_stsi_15_1_x().

Thanks,

C.


--
Pierre Morel
IBM Lab Boeblingen



Re: [PATCH v12 1/7] s390x/cpu topology: Creating CPU topology device

2022-12-07 Thread Pierre Morel




On 12/6/22 22:06, Janis Schoetterl-Glausch wrote:

On Tue, 2022-12-06 at 15:35 +0100, Pierre Morel wrote:


On 12/6/22 14:35, Janis Schoetterl-Glausch wrote:

On Tue, 2022-12-06 at 11:32 +0100, Pierre Morel wrote:


On 12/6/22 10:31, Janis Schoetterl-Glausch wrote:

On Tue, 2022-11-29 at 18:42 +0100, Pierre Morel wrote:

We will need a Topology device to transfer the topology
during migration and to implement machine reset.

The device creation is fenced by s390_has_topology().

Signed-off-by: Pierre Morel 
---
include/hw/s390x/cpu-topology.h| 44 +++
include/hw/s390x/s390-virtio-ccw.h |  1 +
hw/s390x/cpu-topology.c| 87 ++
hw/s390x/s390-virtio-ccw.c | 25 +
hw/s390x/meson.build   |  1 +
5 files changed, 158 insertions(+)
create mode 100644 include/hw/s390x/cpu-topology.h
create mode 100644 hw/s390x/cpu-topology.c



[...]

+static DeviceState *s390_init_topology(MachineState *machine, Error **errp)

+{
+DeviceState *dev;
+
+dev = qdev_new(TYPE_S390_CPU_TOPOLOGY);
+
+object_property_add_child(&machine->parent_obj,
+  TYPE_S390_CPU_TOPOLOGY, OBJECT(dev));


Why set this property, and why on the machine parent?


As I understand it, setting num_cores and num_sockets as
properties of the CPU Topology object integrates them better
into the QEMU object framework.


That I understand.


The topology is added to the S390CcwmachineState, it is the parent of
the machine.


But why? And is it added to the S390CcwMachineState, or its parent?


it is added to the S390CcwMachineState.
We receive the MachineState as the "machine" parameter here and it is
added to the "machine->parent_obj" which is the S390CcwMachineState.


Oh, I was confused. &machine->parent_obj is just a cast of MachineState* to
Object*.
It's the very same object.
And what is the reason to add the topology as child property?
Just so it shows up in the qtree? Wouldn't it anyway under the sysbus?


Yes it would appear on the info qtree but not in the qom-tree












+object_property_set_int(OBJECT(dev), "num-cores",
+machine->smp.cores * machine->smp.threads, errp);
+object_property_set_int(OBJECT(dev), "num-sockets",
+machine->smp.sockets, errp);
+
+sysbus_realize_and_unref(SYS_BUS_DEVICE(dev), errp);


I must admit that I haven't fully grokked qemu's memory management yet.
Is the topology device now owned by the sysbus?


Yes it is so we see it on the qtree with its properties.



If so, is it fine to have a pointer to it in S390CcwMachineState?


Why not?


If it's owned by the sysbus and the object is not explicitly referenced
for the pointer, it might be deallocated and then you'd have a dangling pointer.


Why would it be deallocated ?


That's beside the point, if you transfer ownership, you have no control over
when the deallocation happens.
It's going to be fine in practice, but I don't think you should rely on it.
I think you could just do sysbus_realize instead of ..._and_unref,
but like I said, I haven't fully understood qemu memory management.
(It would also leak in a sense, but since the machine exists forever that
should be fine)


If I understand correctly:

- qdev_new adds a reference count to the newly created object, dev.

- object_property_add_child also adds a reference count to the child,
here the newly created device dev, so the ref count of dev is 2.

- After the unref on dev, the ref count of dev goes down to 1.

Then it seems OK. Did I miss something?

Regards,
Pierre




As long as it is not unrealized it belongs to the sysbus, doesn't it?

Regards,
Pierre





--
Pierre Morel
IBM Lab Boeblingen
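
The reference-count walk-through in the message above can be replayed with
a toy model. This is only a sketch under stated assumptions: ToyObject and
the toy_* functions are hypothetical and stand in for qdev_new(),
object_property_add_child() and the unref half of
sysbus_realize_and_unref(); any references taken by the bus during realize
are deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a QOM object: just a counter and a parent link. */
typedef struct ToyObject {
    int ref;
    struct ToyObject *parent;
} ToyObject;

/* Models qdev_new(): the caller receives the first reference. */
static void toy_new(ToyObject *o)
{
    o->ref = 1;
    o->parent = NULL;
}

/* Models object_property_add_child(): the parent takes its own reference. */
static void toy_add_child(ToyObject *parent, ToyObject *child)
{
    child->parent = parent;
    child->ref++;
}

/* Models the "_and_unref" part: the caller's reference is dropped. */
static int toy_realize_and_unref(ToyObject *o)
{
    assert(o->ref > 0);
    return --o->ref;
}
```

Run in the order new, add_child, realize_and_unref, the child ends with a
count of 1, held by the parent through the child property, which matches
the reasoning in the message.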



[PATCH v2 06/16] hw/intc: sifive_plic: Drop PLICMode_H

2022-12-07 Thread Bin Meng
H-mode has been removed since priv spec 1.10. Drop it.

Signed-off-by: Bin Meng 
Reviewed-by: Wilfred Mallawa 
Reviewed-by: Alistair Francis 
---

(no changes since v1)

 include/hw/intc/sifive_plic.h | 1 -
 hw/intc/sifive_plic.c | 1 -
 2 files changed, 2 deletions(-)

diff --git a/include/hw/intc/sifive_plic.h b/include/hw/intc/sifive_plic.h
index 134cf39a96..d3f45ec248 100644
--- a/include/hw/intc/sifive_plic.h
+++ b/include/hw/intc/sifive_plic.h
@@ -33,7 +33,6 @@ DECLARE_INSTANCE_CHECKER(SiFivePLICState, SIFIVE_PLIC,
 typedef enum PLICMode {
 PLICMode_U,
 PLICMode_S,
-PLICMode_H,
 PLICMode_M
 } PLICMode;
 
diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
index 0c7696520d..936dcf74bc 100644
--- a/hw/intc/sifive_plic.c
+++ b/hw/intc/sifive_plic.c
@@ -42,7 +42,6 @@ static PLICMode char_to_mode(char c)
 switch (c) {
 case 'U': return PLICMode_U;
 case 'S': return PLICMode_S;
-case 'H': return PLICMode_H;
 case 'M': return PLICMode_M;
 default:
 error_report("plic: invalid mode '%c'", c);
-- 
2.34.1




[PATCH v2 10/16] hw/riscv: microchip_pfsoc: Fix the number of interrupt sources of PLIC

2022-12-07 Thread Bin Meng
Per chapter 6.5.2 in [1], the number of interrupt sources including
interrupt source 0 should be 187.

[1] PolarFire SoC MSS TRM:
https://ww1.microchip.com/downloads/aemDocuments/documents/FPGA/ProductDocuments/ReferenceManuals/PolarFire_SoC_FPGA_MSS_Technical_Reference_Manual_VC.pdf

Fixes: 56f6e31e7b7e ("hw/riscv: Initial support for Microchip PolarFire SoC Icicle Kit board")
Signed-off-by: Bin Meng 
Reviewed-by: Wilfred Mallawa 
Reviewed-by: Alistair Francis 
Reviewed-by: Conor Dooley 
---

(no changes since v1)

 include/hw/riscv/microchip_pfsoc.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/hw/riscv/microchip_pfsoc.h 
b/include/hw/riscv/microchip_pfsoc.h
index 69a686b54a..577efad0c4 100644
--- a/include/hw/riscv/microchip_pfsoc.h
+++ b/include/hw/riscv/microchip_pfsoc.h
@@ -153,7 +153,7 @@ enum {
 #define MICROCHIP_PFSOC_MANAGEMENT_CPU_COUNT1
 #define MICROCHIP_PFSOC_COMPUTE_CPU_COUNT   4
 
-#define MICROCHIP_PFSOC_PLIC_NUM_SOURCES185
+#define MICROCHIP_PFSOC_PLIC_NUM_SOURCES187
 #define MICROCHIP_PFSOC_PLIC_NUM_PRIORITIES 7
 #define MICROCHIP_PFSOC_PLIC_PRIORITY_BASE  0x04
 #define MICROCHIP_PFSOC_PLIC_PENDING_BASE   0x1000
-- 
2.34.1




[PATCH v2 05/16] hw/riscv: spike: Remove misleading comments

2022-12-07 Thread Bin Meng
PLIC is not included in the 'spike' machine.

Signed-off-by: Bin Meng 
Reviewed-by: Wilfred Mallawa 
Reviewed-by: Alistair Francis 
---

(no changes since v1)

 hw/riscv/spike.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/hw/riscv/spike.c b/hw/riscv/spike.c
index 1e1d752c00..13946acf0d 100644
--- a/hw/riscv/spike.c
+++ b/hw/riscv/spike.c
@@ -8,7 +8,6 @@
  *
  * 0) HTIF Console and Poweroff
  * 1) CLINT (Timer and IPI)
- * 2) PLIC (Platform Level Interrupt Controller)
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms and conditions of the GNU General Public License,
-- 
2.34.1




[PATCH v2 08/16] hw/intc: sifive_plic: Use error_setg() to propagate the error up via errp in sifive_plic_realize()

2022-12-07 Thread Bin Meng
The realize() callback has an errp for us to propagate the error up.
While we are here, correct the wrong multi-line comment format.

Signed-off-by: Bin Meng 

---

Changes in v2:
- new patch: "hw/intc: sifive_plic: Use error_setg() to propagate the error up via errp in sifive_plic_realize()"

 hw/intc/sifive_plic.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
index c9af94a888..9cb4c6d6d4 100644
--- a/hw/intc/sifive_plic.c
+++ b/hw/intc/sifive_plic.c
@@ -379,7 +379,8 @@ static void sifive_plic_realize(DeviceState *dev, Error 
**errp)
 s->m_external_irqs = g_malloc(sizeof(qemu_irq) * s->num_harts);
 qdev_init_gpio_out(dev, s->m_external_irqs, s->num_harts);
 
-/* We can't allow the supervisor to control SEIP as this would allow the
+/*
+ * We can't allow the supervisor to control SEIP as this would allow the
  * supervisor to clear a pending external interrupt which will result in
  * lost a interrupt in the case a PLIC is attached. The SEIP bit must be
  * hardware controlled when a PLIC is attached.
@@ -387,8 +388,8 @@ static void sifive_plic_realize(DeviceState *dev, Error 
**errp)
 for (i = 0; i < s->num_harts; i++) {
 RISCVCPU *cpu = RISCV_CPU(qemu_get_cpu(s->hartid_base + i));
 if (riscv_cpu_claim_interrupts(cpu, MIP_SEIP) < 0) {
-error_report("SEIP already claimed");
-exit(1);
+error_setg(errp, "SEIP already claimed");
+return;
 }
 }
 
-- 
2.34.1




[PATCH v2 16/16] hw/intc: sifive_plic: Fix the pending register range check

2022-12-07 Thread Bin Meng
The pending register upper limit is currently set to
plic->num_sources >> 3, which is wrong, e.g.: considering
plic->num_sources is 7, the upper limit becomes 0 which fails
the range check if reading the pending register at pending_base.

Fixes: 1e24429e40df ("SiFive RISC-V PLIC Block")
Signed-off-by: Bin Meng 
Reviewed-by: Alistair Francis 

---

(no changes since v1)

 hw/intc/sifive_plic.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
index 1a792cc3f5..5522ede2cf 100644
--- a/hw/intc/sifive_plic.c
+++ b/hw/intc/sifive_plic.c
@@ -143,7 +143,8 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr addr, unsigned size)
 uint32_t irq = (addr - plic->priority_base) >> 2;
 
 return plic->source_priority[irq];
-} else if (addr_between(addr, plic->pending_base, plic->num_sources >> 3)) {
+} else if (addr_between(addr, plic->pending_base,
+(plic->num_sources + 31) >> 3)) {
 uint32_t word = (addr - plic->pending_base) >> 2;
 
 return plic->pending[word];
@@ -202,7 +203,7 @@ static void sifive_plic_write(void *opaque, hwaddr addr, 
uint64_t value,
 sifive_plic_update(plic);
 }
 } else if (addr_between(addr, plic->pending_base,
-plic->num_sources >> 3)) {
+(plic->num_sources + 31) >> 3)) {
 qemu_log_mask(LOG_GUEST_ERROR,
   "%s: invalid pending write: 0x%" HWADDR_PRIx "",
   __func__, addr);
-- 
2.34.1




[PATCH v2 14/16] hw/intc: sifive_plic: Change "priority-base" to start from interrupt source 0

2022-12-07 Thread Bin Meng
At present the SiFive PLIC model "priority-base" expects the interrupt
priority register base to start from source 1 instead of source 0,
which is why "priority-base" is set to 0x04 on most platforms, the
'opentitan' machine being the exception. 'opentitan' should have set
"priority-base" to 0x04 too.

Note the irq number calculation in sifive_plic_{read,write} is
correct as the code makes up for the irq number by adding 1.

Let's simply update "priority-base" to start from interrupt source
0 and add a comment to make it crystal clear.
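
To see that the two conventions address the same registers, here is a small sketch under stated assumptions: irq_old() and irq_new() are hypothetical helper names that only reproduce the index arithmetic from the diff, outside of QEMU.

```c
#include <assert.h>
#include <stdint.h>

/* Old scheme: "priority-base" points at source 1, so 1 is added back. */
static uint32_t irq_old(uint32_t addr, uint32_t priority_base)
{
    assert(addr >= priority_base);
    return ((addr - priority_base) >> 2) + 1;
}

/* New scheme: "priority-base" points at source 0, index is used directly. */
static uint32_t irq_new(uint32_t addr, uint32_t priority_base)
{
    assert(addr >= priority_base);
    return (addr - priority_base) >> 2;
}
```

For any register offset, irq_old(addr, 0x04) and irq_new(addr, 0x00) yield the same source number; the new scheme additionally makes offset 0x00 (source 0) addressable.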

Signed-off-by: Bin Meng 
Reviewed-by: Alistair Francis 
---

(no changes since v1)

 include/hw/riscv/microchip_pfsoc.h | 2 +-
 include/hw/riscv/shakti_c.h        | 2 +-
 include/hw/riscv/sifive_e.h        | 2 +-
 include/hw/riscv/sifive_u.h        | 2 +-
 include/hw/riscv/virt.h            | 2 +-
 hw/intc/sifive_plic.c              | 5 +++--
 6 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/hw/riscv/microchip_pfsoc.h b/include/hw/riscv/microchip_pfsoc.h
index 577efad0c4..e65ffeb02d 100644
--- a/include/hw/riscv/microchip_pfsoc.h
+++ b/include/hw/riscv/microchip_pfsoc.h
@@ -155,7 +155,7 @@ enum {
 
 #define MICROCHIP_PFSOC_PLIC_NUM_SOURCES187
 #define MICROCHIP_PFSOC_PLIC_NUM_PRIORITIES 7
-#define MICROCHIP_PFSOC_PLIC_PRIORITY_BASE  0x04
+#define MICROCHIP_PFSOC_PLIC_PRIORITY_BASE  0x00
 #define MICROCHIP_PFSOC_PLIC_PENDING_BASE   0x1000
 #define MICROCHIP_PFSOC_PLIC_ENABLE_BASE0x2000
 #define MICROCHIP_PFSOC_PLIC_ENABLE_STRIDE  0x80
diff --git a/include/hw/riscv/shakti_c.h b/include/hw/riscv/shakti_c.h
index daf0aae13f..539fe1156d 100644
--- a/include/hw/riscv/shakti_c.h
+++ b/include/hw/riscv/shakti_c.h
@@ -65,7 +65,7 @@ enum {
 #define SHAKTI_C_PLIC_NUM_SOURCES 28
 /* Excluding Priority 0 */
 #define SHAKTI_C_PLIC_NUM_PRIORITIES 2
-#define SHAKTI_C_PLIC_PRIORITY_BASE 0x04
+#define SHAKTI_C_PLIC_PRIORITY_BASE 0x00
 #define SHAKTI_C_PLIC_PENDING_BASE 0x1000
 #define SHAKTI_C_PLIC_ENABLE_BASE 0x2000
 #define SHAKTI_C_PLIC_ENABLE_STRIDE 0x80
diff --git a/include/hw/riscv/sifive_e.h b/include/hw/riscv/sifive_e.h
index 9e58247fd8..b824a79e2d 100644
--- a/include/hw/riscv/sifive_e.h
+++ b/include/hw/riscv/sifive_e.h
@@ -89,7 +89,7 @@ enum {
  */
 #define SIFIVE_E_PLIC_NUM_SOURCES 53
 #define SIFIVE_E_PLIC_NUM_PRIORITIES 7
-#define SIFIVE_E_PLIC_PRIORITY_BASE 0x04
+#define SIFIVE_E_PLIC_PRIORITY_BASE 0x00
 #define SIFIVE_E_PLIC_PENDING_BASE 0x1000
 #define SIFIVE_E_PLIC_ENABLE_BASE 0x2000
 #define SIFIVE_E_PLIC_ENABLE_STRIDE 0x80
diff --git a/include/hw/riscv/sifive_u.h b/include/hw/riscv/sifive_u.h
index 8f63a183c4..e680d61ece 100644
--- a/include/hw/riscv/sifive_u.h
+++ b/include/hw/riscv/sifive_u.h
@@ -158,7 +158,7 @@ enum {
 
 #define SIFIVE_U_PLIC_NUM_SOURCES 54
 #define SIFIVE_U_PLIC_NUM_PRIORITIES 7
-#define SIFIVE_U_PLIC_PRIORITY_BASE 0x04
+#define SIFIVE_U_PLIC_PRIORITY_BASE 0x00
 #define SIFIVE_U_PLIC_PENDING_BASE 0x1000
 #define SIFIVE_U_PLIC_ENABLE_BASE 0x2000
 #define SIFIVE_U_PLIC_ENABLE_STRIDE 0x80
diff --git a/include/hw/riscv/virt.h b/include/hw/riscv/virt.h
index e1ce0048af..3407c9e8dd 100644
--- a/include/hw/riscv/virt.h
+++ b/include/hw/riscv/virt.h
@@ -98,7 +98,7 @@ enum {
 #define VIRT_IRQCHIP_MAX_GUESTS_BITS 3
 #define VIRT_IRQCHIP_MAX_GUESTS ((1U << VIRT_IRQCHIP_MAX_GUESTS_BITS) - 1U)
 
-#define VIRT_PLIC_PRIORITY_BASE 0x04
+#define VIRT_PLIC_PRIORITY_BASE 0x00
 #define VIRT_PLIC_PENDING_BASE 0x1000
 #define VIRT_PLIC_ENABLE_BASE 0x2000
 #define VIRT_PLIC_ENABLE_STRIDE 0x80
diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
index 1edeb1e1ed..1a792cc3f5 100644
--- a/hw/intc/sifive_plic.c
+++ b/hw/intc/sifive_plic.c
@@ -140,7 +140,7 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr addr, unsigned size)
 SiFivePLICState *plic = opaque;
 
 if (addr_between(addr, plic->priority_base, plic->num_sources << 2)) {
-uint32_t irq = ((addr - plic->priority_base) >> 2) + 1;
+uint32_t irq = (addr - plic->priority_base) >> 2;
 
 return plic->source_priority[irq];
 } else if (addr_between(addr, plic->pending_base, plic->num_sources >> 3)) {
@@ -187,7 +187,7 @@ static void sifive_plic_write(void *opaque, hwaddr addr, uint64_t value,
 SiFivePLICState *plic = opaque;
 
 if (addr_between(addr, plic->priority_base, plic->num_sources << 2)) {
-uint32_t irq = ((addr - plic->priority_base) >> 2) + 1;
+uint32_t irq = (addr - plic->priority_base) >> 2;
 
 if (((plic->num_priorities + 1) & plic->num_priorities) == 0) {
 /*
@@ -428,6 +428,7 @@ static Property sifive_plic_properties[] = {
 /* number of interrupt sources including interrupt source 0 */
 DEFINE_PROP_UINT32("num-sources", SiFivePLICState, num_sources, 1),
 DEFINE_PROP_UINT32("num-priorities", SiFivePLICState, num_priorities, 0),
+/* interrupt priority register base starting from source 0 */
 DEFINE_PROP_UINT32("priority-base", SiFivePLICState, priority_base, 0),
 DEFIN

[PATCH v2 02/16] hw/intc: Select MSI_NONBROKEN in RISC-V AIA interrupt controllers

2022-12-07 Thread Bin Meng
hw/pci/Kconfig says MSI_NONBROKEN should be selected by interrupt
controllers regardless of how MSI is implemented. msi_nonbroken is
initialized to true in both riscv_aplic_realize() and
riscv_imsic_realize().

Select MSI_NONBROKEN in RISCV_APLIC and RISCV_IMSIC.

Signed-off-by: Bin Meng 
Reviewed-by: Alistair Francis 
---

(no changes since v1)

 hw/intc/Kconfig | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/intc/Kconfig b/hw/intc/Kconfig
index 1d4573e803..21441d0a0c 100644
--- a/hw/intc/Kconfig
+++ b/hw/intc/Kconfig
@@ -72,9 +72,11 @@ config RISCV_ACLINT
 
 config RISCV_APLIC
 bool
+select MSI_NONBROKEN
 
 config RISCV_IMSIC
 bool
+select MSI_NONBROKEN
 
 config SIFIVE_PLIC
 bool
-- 
2.34.1




[PATCH v2 01/16] hw/riscv: Select MSI_NONBROKEN in SIFIVE_PLIC

2022-12-07 Thread Bin Meng
hw/pci/Kconfig says MSI_NONBROKEN should be selected by interrupt
controllers regardless of how MSI is implemented. msi_nonbroken is
initialized to true in sifive_plic_realize().

Let SIFIVE_PLIC select MSI_NONBROKEN and drop the selection from
RISC-V machines.

Signed-off-by: Bin Meng 
Reviewed-by: Alistair Francis 
---

(no changes since v1)

 hw/intc/Kconfig  | 1 +
 hw/riscv/Kconfig | 5 -
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/hw/intc/Kconfig b/hw/intc/Kconfig
index ecd2883ceb..1d4573e803 100644
--- a/hw/intc/Kconfig
+++ b/hw/intc/Kconfig
@@ -78,6 +78,7 @@ config RISCV_IMSIC
 
 config SIFIVE_PLIC
 bool
+select MSI_NONBROKEN
 
 config GOLDFISH_PIC
 bool
diff --git a/hw/riscv/Kconfig b/hw/riscv/Kconfig
index 79ff61c464..167dc4cca6 100644
--- a/hw/riscv/Kconfig
+++ b/hw/riscv/Kconfig
@@ -11,7 +11,6 @@ config MICROCHIP_PFSOC
 select MCHP_PFSOC_IOSCB
 select MCHP_PFSOC_MMUART
 select MCHP_PFSOC_SYSREG
-select MSI_NONBROKEN
 select RISCV_ACLINT
 select SIFIVE_PDMA
 select SIFIVE_PLIC
@@ -37,7 +36,6 @@ config RISCV_VIRT
 imply TPM_TIS_SYSBUS
 select RISCV_NUMA
 select GOLDFISH_RTC
-select MSI_NONBROKEN
 select PCI
 select PCI_EXPRESS_GENERIC_BRIDGE
 select PFLASH_CFI01
@@ -53,7 +51,6 @@ config RISCV_VIRT
 
 config SIFIVE_E
 bool
-select MSI_NONBROKEN
 select RISCV_ACLINT
 select SIFIVE_GPIO
 select SIFIVE_PLIC
@@ -64,7 +61,6 @@ config SIFIVE_E
 config SIFIVE_U
 bool
 select CADENCE
-select MSI_NONBROKEN
 select RISCV_ACLINT
 select SIFIVE_GPIO
 select SIFIVE_PDMA
@@ -82,6 +78,5 @@ config SPIKE
 bool
 select RISCV_NUMA
 select HTIF
-select MSI_NONBROKEN
 select RISCV_ACLINT
 select SIFIVE_PLIC
-- 
2.34.1




[PATCH v2 04/16] hw/riscv: Sort machines Kconfig options in alphabetical order

2022-12-07 Thread Bin Meng
The SHAKTI_C machine Kconfig option was inserted out of order. Fix it.

Signed-off-by: Bin Meng 
Reviewed-by: Alistair Francis 
---

(no changes since v1)

 hw/riscv/Kconfig | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/hw/riscv/Kconfig b/hw/riscv/Kconfig
index 1e4b58024f..4550b3b938 100644
--- a/hw/riscv/Kconfig
+++ b/hw/riscv/Kconfig
@@ -4,6 +4,8 @@ config RISCV_NUMA
 config IBEX
 bool
 
+# RISC-V machines in alphabetical order
+
 config MICROCHIP_PFSOC
 bool
 select CADENCE_SDHCI
@@ -22,13 +24,6 @@ config OPENTITAN
 select SIFIVE_PLIC
 select UNIMP
 
-config SHAKTI_C
-bool
-select UNIMP
-select SHAKTI_UART
-select RISCV_ACLINT
-select SIFIVE_PLIC
-
 config RISCV_VIRT
 bool
 imply PCI_DEVICES
@@ -50,6 +45,13 @@ config RISCV_VIRT
 select FW_CFG_DMA
 select PLATFORM_BUS
 
+config SHAKTI_C
+bool
+select RISCV_ACLINT
+select SHAKTI_UART
+select SIFIVE_PLIC
+select UNIMP
+
 config SIFIVE_E
 bool
 select RISCV_ACLINT
-- 
2.34.1




[PATCH v2 15/16] hw/riscv: opentitan: Drop "hartid-base" and "priority-base" initialization

2022-12-07 Thread Bin Meng
"hartid-base" and "priority-base" are zero by default. There is no
need to initialize them to zero again.

Signed-off-by: Bin Meng 
Reviewed-by: Wilfred Mallawa 
---

(no changes since v1)

 hw/riscv/opentitan.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/hw/riscv/opentitan.c b/hw/riscv/opentitan.c
index 78f895d773..85ffdac5be 100644
--- a/hw/riscv/opentitan.c
+++ b/hw/riscv/opentitan.c
@@ -173,10 +173,8 @@ static void lowrisc_ibex_soc_realize(DeviceState *dev_soc, Error **errp)
 
 /* PLIC */
 qdev_prop_set_string(DEVICE(&s->plic), "hart-config", "M");
-qdev_prop_set_uint32(DEVICE(&s->plic), "hartid-base", 0);
 qdev_prop_set_uint32(DEVICE(&s->plic), "num-sources", 180);
 qdev_prop_set_uint32(DEVICE(&s->plic), "num-priorities", 3);
-qdev_prop_set_uint32(DEVICE(&s->plic), "priority-base", 0x00);
 qdev_prop_set_uint32(DEVICE(&s->plic), "pending-base", 0x1000);
 qdev_prop_set_uint32(DEVICE(&s->plic), "enable-base", 0x2000);
 qdev_prop_set_uint32(DEVICE(&s->plic), "enable-stride", 32);
-- 
2.34.1




[PATCH v2 03/16] hw/riscv: Fix opentitan dependency to SIFIVE_PLIC

2022-12-07 Thread Bin Meng
Since commit ef6310064820 ("hw/riscv: opentitan: Update to the latest build")
the IBEX PLIC model was replaced with the SiFive PLIC model in the
'opentitan' machine but we forgot to add the dependency there.

Signed-off-by: Bin Meng 
Reviewed-by: Wilfred Mallawa 
Reviewed-by: Alistair Francis 
---

(no changes since v1)

 hw/riscv/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/riscv/Kconfig b/hw/riscv/Kconfig
index 167dc4cca6..1e4b58024f 100644
--- a/hw/riscv/Kconfig
+++ b/hw/riscv/Kconfig
@@ -19,6 +19,7 @@ config MICROCHIP_PFSOC
 config OPENTITAN
 bool
 select IBEX
+select SIFIVE_PLIC
 select UNIMP
 
 config SHAKTI_C
-- 
2.34.1





[PATCH v2 12/16] hw/riscv: sifive_u: Avoid using magic number for "riscv, ndev"

2022-12-07 Thread Bin Meng
At present a magic number is used to create the "riscv,ndev" property
in the dtb. Let's use the macro SIFIVE_U_PLIC_NUM_SOURCES that
is used to instantiate the PLIC model instead.

Signed-off-by: Bin Meng 
Reviewed-by: Wilfred Mallawa 
Reviewed-by: Alistair Francis 
---

(no changes since v1)

 hw/riscv/sifive_u.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/riscv/sifive_u.c b/hw/riscv/sifive_u.c
index b139824aab..b40a4767e2 100644
--- a/hw/riscv/sifive_u.c
+++ b/hw/riscv/sifive_u.c
@@ -287,7 +287,8 @@ static void create_fdt(SiFiveUState *s, const MemMapEntry *memmap,
 qemu_fdt_setprop_cells(fdt, nodename, "reg",
 0x0, memmap[SIFIVE_U_DEV_PLIC].base,
 0x0, memmap[SIFIVE_U_DEV_PLIC].size);
-qemu_fdt_setprop_cell(fdt, nodename, "riscv,ndev", 0x35);
+qemu_fdt_setprop_cell(fdt, nodename, "riscv,ndev",
+  SIFIVE_U_PLIC_NUM_SOURCES - 1);
 qemu_fdt_setprop_cell(fdt, nodename, "phandle", plic_phandle);
 plic_phandle = qemu_fdt_get_phandle(fdt, nodename);
 g_free(cells);
-- 
2.34.1




[PATCH v2 13/16] hw/riscv: virt: Fix the value of "riscv, ndev" in the dtb

2022-12-07 Thread Bin Meng
Commit 28d8c281200f ("hw/riscv: virt: Add optional AIA IMSIC support to virt machine")
changed the value of VIRT_IRQCHIP_NUM_SOURCES from 127 to 53, which
is VIRTIO_NDEV and also used as the value of "riscv,ndev" property
in the dtb. Unfortunately this is wrong as VIRT_IRQCHIP_NUM_SOURCES
should include interrupt source 0 but "riscv,ndev" does not.
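
The off-by-one relationship can be written down as a sketch. plic_ndev() is a hypothetical helper named for this example only; the macro value is the one shown for sifive_u elsewhere in this series.

```c
#include <assert.h>

/* Value from include/hw/riscv/sifive_u.h as shown in this series. */
#define SIFIVE_U_PLIC_NUM_SOURCES 54

/*
 * "num-sources" counts interrupt source 0, while "riscv,ndev" does not,
 * so the dtb value is one less than the model property.
 */
static unsigned plic_ndev(unsigned num_sources)
{
    assert(num_sources >= 1); /* source 0 always exists */
    return num_sources - 1;
}
```

For sifive_u this reproduces the previous magic number 0x35, and for virt with 96 sources it gives 95.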

While we are here, we also fix the comments of platform bus irq range
which is now "64 to 96", but should be "64 to 95", introduced since
commit 1832b7cb3f64 ("hw/riscv: virt: Create a platform bus").

Fixes: 28d8c281200f ("hw/riscv: virt: Add optional AIA IMSIC support to virt machine")
Signed-off-by: Bin Meng 
Reviewed-by: Alistair Francis 
---

(no changes since v1)

 include/hw/riscv/virt.h | 5 ++---
 hw/riscv/virt.c | 3 ++-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/hw/riscv/virt.h b/include/hw/riscv/virt.h
index 62513e075c..e1ce0048af 100644
--- a/include/hw/riscv/virt.h
+++ b/include/hw/riscv/virt.h
@@ -87,14 +87,13 @@ enum {
 VIRTIO_IRQ = 1, /* 1 to 8 */
 VIRTIO_COUNT = 8,
 PCIE_IRQ = 0x20, /* 32 to 35 */
-VIRT_PLATFORM_BUS_IRQ = 64, /* 64 to 96 */
-VIRTIO_NDEV = 96 /* Arbitrary maximum number of interrupts */
+VIRT_PLATFORM_BUS_IRQ = 64, /* 64 to 95 */
 };
 
 #define VIRT_PLATFORM_BUS_NUM_IRQS 32
 
 #define VIRT_IRQCHIP_NUM_MSIS 255
-#define VIRT_IRQCHIP_NUM_SOURCES VIRTIO_NDEV
+#define VIRT_IRQCHIP_NUM_SOURCES 96
 #define VIRT_IRQCHIP_NUM_PRIO_BITS 3
 #define VIRT_IRQCHIP_MAX_GUESTS_BITS 3
 #define VIRT_IRQCHIP_MAX_GUESTS ((1U << VIRT_IRQCHIP_MAX_GUESTS_BITS) - 1U)
diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
index 6cf9355b99..94ff2a1584 100644
--- a/hw/riscv/virt.c
+++ b/hw/riscv/virt.c
@@ -468,7 +468,8 @@ static void create_fdt_socket_plic(RISCVVirtState *s,
 plic_cells, s->soc[socket].num_harts * sizeof(uint32_t) * 4);
 qemu_fdt_setprop_cells(mc->fdt, plic_name, "reg",
 0x0, plic_addr, 0x0, memmap[VIRT_PLIC].size);
-qemu_fdt_setprop_cell(mc->fdt, plic_name, "riscv,ndev", VIRTIO_NDEV);
+qemu_fdt_setprop_cell(mc->fdt, plic_name, "riscv,ndev",
+  VIRT_IRQCHIP_NUM_SOURCES - 1);
 riscv_socket_fdt_write_id(mc, mc->fdt, plic_name, socket);
 qemu_fdt_setprop_cell(mc->fdt, plic_name, "phandle",
 plic_phandles[socket]);
-- 
2.34.1




[PATCH v2 11/16] hw/riscv: sifive_e: Fix the number of interrupt sources of PLIC

2022-12-07 Thread Bin Meng
Per chapter 10 in Freedom E310 manuals [1][2][3], E310 G002 and G003
support 52 interrupt sources while G000 supports 51 interrupt sources.

We use the value of G002 and G003, so it is 53 (including source 0).

[1] G000 manual:
https://sifive.cdn.prismic.io/sifive/4faf3e34-4a42-4c2f-be9e-c77baa4928c7_fe310-g000-manual-v3p2.pdf

[2] G002 manual:
https://sifive.cdn.prismic.io/sifive/034760b5-ac6a-4b1c-911c-f4148bb2c4a5_fe310-g002-v1p5.pdf

[3] G003 manual:
https://sifive.cdn.prismic.io/sifive/3af39c59-6498-471e-9dab-5355a0d539eb_fe310-g003-manual.pdf

Fixes: eb637edb1241 ("SiFive Freedom E Series RISC-V Machine")
Signed-off-by: Bin Meng 
Reviewed-by: Wilfred Mallawa 
Reviewed-by: Alistair Francis 
---

(no changes since v1)

 include/hw/riscv/sifive_e.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/hw/riscv/sifive_e.h b/include/hw/riscv/sifive_e.h
index d738745925..9e58247fd8 100644
--- a/include/hw/riscv/sifive_e.h
+++ b/include/hw/riscv/sifive_e.h
@@ -82,7 +82,12 @@ enum {
 };
 
 #define SIFIVE_E_PLIC_HART_CONFIG "M"
-#define SIFIVE_E_PLIC_NUM_SOURCES 127
+/*
+ * Freedom E310 G002 and G003 supports 52 interrupt sources while
+ * Freedom E310 G000 supports 51 interrupt sources. We use the value
+ * of G002 and G003, so it is 53 (including interrupt source 0).
+ */
+#define SIFIVE_E_PLIC_NUM_SOURCES 53
 #define SIFIVE_E_PLIC_NUM_PRIORITIES 7
 #define SIFIVE_E_PLIC_PRIORITY_BASE 0x04
 #define SIFIVE_E_PLIC_PENDING_BASE 0x1000
-- 
2.34.1




[PATCH v2 09/16] hw/intc: sifive_plic: Update "num-sources" property default value

2022-12-07 Thread Bin Meng
At present the default value of "num-sources" property is zero,
which does not make a lot of sense, as in sifive_plic_realize()
we see s->bitfield_words is calculated by:

  s->bitfield_words = (s->num_sources + 31) >> 5;

If we don't configure the "num-sources" property, its default value
of zero makes s->bitfield_words zero too, which isn't right because
interrupt source 0 still occupies one word.

Let's change the default value to 1, meaning that only interrupt
source 0 is supported by default, and add a sanity check in realize().

While we are here, add a comment to describe the exact meaning of
this property that the number should include interrupt source 0.
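
As a quick illustration of why zero is a poor default, the word-count formula quoted above can be evaluated on its own. This is a sketch only; bitfield_words() is a free function written for the example, not the QEMU field of the same name.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 32-bit words needed to hold one bit per interrupt source. */
static uint32_t bitfield_words(uint32_t num_sources)
{
    assert(num_sources <= UINT32_MAX - 31); /* guard against wrap-around */
    return (num_sources + 31) >> 5;
}
```

bitfield_words(0) is 0, so no storage would exist even for source 0, whereas bitfield_words(1) is 1; 32 sources still fit in one word and 33 need two.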

Signed-off-by: Bin Meng 
Reviewed-by: Alistair Francis 

---

Changes in v2:
- use error_setg() to propagate the error up via errp instead

 hw/intc/sifive_plic.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
index 9cb4c6d6d4..1edeb1e1ed 100644
--- a/hw/intc/sifive_plic.c
+++ b/hw/intc/sifive_plic.c
@@ -363,6 +363,11 @@ static void sifive_plic_realize(DeviceState *dev, Error **errp)
 
 parse_hart_config(s);
 
+if (!s->num_sources) {
+error_setg(errp, "plic: invalid number of interrupt sources");
+return;
+}
+
 s->bitfield_words = (s->num_sources + 31) >> 5;
 s->num_enables = s->bitfield_words * s->num_addrs;
 s->source_priority = g_new0(uint32_t, s->num_sources);
@@ -420,7 +425,8 @@ static const VMStateDescription vmstate_sifive_plic = {
 static Property sifive_plic_properties[] = {
 DEFINE_PROP_STRING("hart-config", SiFivePLICState, hart_config),
 DEFINE_PROP_UINT32("hartid-base", SiFivePLICState, hartid_base, 0),
-DEFINE_PROP_UINT32("num-sources", SiFivePLICState, num_sources, 0),
+/* number of interrupt sources including interrupt source 0 */
+DEFINE_PROP_UINT32("num-sources", SiFivePLICState, num_sources, 1),
 DEFINE_PROP_UINT32("num-priorities", SiFivePLICState, num_priorities, 0),
 DEFINE_PROP_UINT32("priority-base", SiFivePLICState, priority_base, 0),
 DEFINE_PROP_UINT32("pending-base", SiFivePLICState, pending_base, 0),
-- 
2.34.1




[PATCH v2 07/16] hw/intc: sifive_plic: Improve robustness of the PLIC config parser

2022-12-07 Thread Bin Meng
At present the PLIC config parser can only handle a legal config string
such as "MS,MS". However, if a config string like ",MS,MS,,MS,MS,," is
given, the parser won't get the correct configuration.

This commit improves the config parser to make it more robust.
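
The intended behaviour can be sketched outside of QEMU. The code below is a simplified stand-in, not the patched parse_hart_config(): struct plic_counts and count_hart_config() are hypothetical names, and mode validation and duplicate detection are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Result of scanning a hart config string (names are illustrative). */
struct plic_counts {
    int num_harts;  /* non-empty comma-separated groups */
    int num_addrs;  /* total mode characters across all groups */
};

/*
 * Count harts and addresses in a config string such as "MS,MS".
 * Empty groups (leading, trailing or doubled commas) are skipped
 * instead of being counted as harts.
 */
static struct plic_counts count_hart_config(const char *cfg)
{
    struct plic_counts out = { 0, 0 };
    int modes_in_group = 0;
    const char *p;

    assert(cfg != NULL);

    for (p = cfg; *p; p++) {
        if (*p == ',') {
            /* A comma only closes a hart if the group had any modes. */
            if (modes_in_group) {
                out.num_harts++;
                modes_in_group = 0;
            }
        } else {
            out.num_addrs++;
            modes_in_group++;
        }
    }
    /* Close the last group when the string does not end in a comma. */
    if (modes_in_group) {
        out.num_harts++;
    }
    return out;
}
```

With this counting rule, ",MS,MS,,MS,MS,," describes 4 harts and 8 addresses, the same as "MS,MS,MS,MS".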

Signed-off-by: Bin Meng 
Acked-by: Alistair Francis 
---

(no changes since v1)

 hw/intc/sifive_plic.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
index 936dcf74bc..c9af94a888 100644
--- a/hw/intc/sifive_plic.c
+++ b/hw/intc/sifive_plic.c
@@ -290,7 +290,7 @@ static void sifive_plic_reset(DeviceState *dev)
  */
 static void parse_hart_config(SiFivePLICState *plic)
 {
-int addrid, hartid, modes;
+int addrid, hartid, modes, m;
 const char *p;
 char c;
 
@@ -299,11 +299,13 @@ static void parse_hart_config(SiFivePLICState *plic)
 p = plic->hart_config;
 while ((c = *p++)) {
 if (c == ',') {
-addrid += ctpop8(modes);
-modes = 0;
-hartid++;
+if (modes) {
+addrid += ctpop8(modes);
+hartid++;
+modes = 0;
+}
 } else {
-int m = 1 << char_to_mode(c);
+m = 1 << char_to_mode(c);
 if (modes == (modes | m)) {
 error_report("plic: duplicate mode '%c' in config: %s",
  c, plic->hart_config);
@@ -314,8 +316,9 @@ static void parse_hart_config(SiFivePLICState *plic)
 }
 if (modes) {
 addrid += ctpop8(modes);
+hartid++;
+modes = 0;
 }
-hartid++;
 
 plic->num_addrs = addrid;
 plic->num_harts = hartid;
@@ -326,11 +329,16 @@ static void parse_hart_config(SiFivePLICState *plic)
 p = plic->hart_config;
 while ((c = *p++)) {
 if (c == ',') {
-hartid++;
+if (modes) {
+hartid++;
+modes = 0;
+}
 } else {
+m = char_to_mode(c);
 plic->addr_config[addrid].addrid = addrid;
 plic->addr_config[addrid].hartid = hartid;
-plic->addr_config[addrid].mode = char_to_mode(c);
+plic->addr_config[addrid].mode = m;
+modes |= (1 << m);
 addrid++;
 }
 }
-- 
2.34.1





Re: [PATCH 02/11] exec: Restrict hwaddr.h to sysemu/

2022-12-07 Thread Claudio Fontana
On 12/6/22 18:09, Philippe Mathieu-Daudé wrote:
> On 6/12/22 16:38, Claudio Fontana wrote:
>> On 12/6/22 15:53, Claudio Fontana wrote:
>>> On 5/17/21 13:11, Philippe Mathieu-Daudé wrote:
>>>> Guard declarations within hwaddr.h against inclusion
>>>> from user-mode emulation.
>>>>
>>>> To make it clearer this header is sysemu specific,
>>>> move it to the sysemu/ directory.
>>>
>>> Hi Philippe,
>>>
>>> do we need include/exec/sysemu/... .h
>>>
>>> as opposed to just use the existing
>>>
>>> include/sysemu/
>>>
>>> ?
>>
>> ...and I would if anything go include/sysemu/exec/ not include/exec/sysemu ,
>>
>> to highlight first that it is part of the sysemu build, when trying to 
>> reason about what gets built for sysemu vs anything else.
> 
> While refreshing this series I moved these files directly in 
> include/sysemu/. Do you think the exec/ subdirectory {help|meaning}ful?
> 

Hi Philippe, not so much, I think include/sysemu/ is enough.

Ciao,

Claudio



Re: [PATCH 14/15] hw/riscv: opentitan: Drop "hartid-base" and "priority-base" initialization

2022-12-07 Thread Bin Meng
Hi Alistair,

On Wed, Dec 7, 2022 at 12:38 PM Alistair Francis  wrote:
>
> On Fri, Dec 2, 2022 at 12:09 AM Bin Meng  wrote:
> >
> > "hartid-base" and "priority-base" are zero by default. There is no
> > need to initialize them to zero again.
>
> What if the defaults change though? I feel like these are worth leaving in
>

If the defaults change we should review all code that uses this model
and make the necessary changes accordingly. I just see no need to
re-initialize them to the default value; that's why we have a default
one assigned. But I am fine with keeping this code if you think it's
worth it.

Regards,
Bin



Re: How to best make include/hw/pci/pcie_sriov.h self-contained

2022-12-07 Thread Michael S. Tsirkin
On Wed, Dec 07, 2022 at 07:25:49AM +0100, Markus Armbruster wrote:
> pcie_sriov.h needs PCI_NUM_REGIONS from pci.h, but doesn't include it.
> pci.h must be included before pcie_sriov.h or else compile fails.
> 
> Adding #include "pci/pci.h" to pcie_sriov would be wrong, because it
> would close an inclusion loop: pci.h includes pcie.h (for
> PCIExpressDevice) includes pcie_sriov.h (for PCIESriovPF) includes pci.h
> (for PCI_NUM_REGIONS).
> 
> The obvious solution is to move PCI_NUM_REGIONS pci.h somewhere
> pcie_sriov.h can include without creating a loop.
> 
> We already have a few headers that don't include anything: pci_ids.h,
> pci_regs.h (includes include/standard-headers/linux/pci_regs.h, which
> doesn't count), pcie_regs.h.  Moving PCI_NUM_REGIONS to one of these
> would work, but it doesn't feel right.
> 
> We could create a new one, say pci_defs.h.  Just for PCI_NUM_REGIONS
> feels silly.  So, what else should move there?

I'm ok with pci_defs.h.
However, I note that most headers including pci.h don't really
need it. Consider include/hw/virtio/virtio-iommu.h: all it needs is
the PCIBus typedef, which is available from qemu/typedefs.h.
So if you are poking at this, want to clean that area up generally?
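
To illustrate the typedef point, here is a self-contained sketch. Bus and Iommu are hypothetical stand-ins, not QEMU types: a header that only stores or passes a pointer needs just the forward typedef, not the full definition.

```c
#include <assert.h>
#include <stddef.h>

/* What a typedefs-only header would provide: an incomplete type. */
typedef struct Bus Bus;

/* A dependent header can use Bus pointers without the full definition. */
typedef struct Iommu {
    Bus *primary_bus;
} Iommu;

static void iommu_set_bus(Iommu *iommu, Bus *bus)
{
    assert(iommu != NULL);
    iommu->primary_bus = bus;
}

/* Only code that touches Bus fields needs the complete type. */
struct Bus {
    int nr;
};
```

Everything above the final struct definition compiles without knowing the layout of Bus, which is why such a header would not need to pull in the heavier include.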

> Any other ideas?
> 
> In case you wonder why I bother you with this...
> 
> Back in 2016, we discussed[1] rules for headers, and these were
> generally liked:
> 
> 1. Have a carefully curated header that's included everywhere first.  We
>got that already thanks to Peter: osdep.h.
> 
> 2. Headers should normally include everything they need beyond osdep.h.
>If exceptions are needed for some reason, they must be documented in
>the header.  If all that's needed from a header is typedefs, put
>those into qemu/typedefs.h instead of including the header.
> 
> 3. Cyclic inclusion is forbidden.
> 
> I'm working on patches to get include/ closer to obeying 2.
> 
> [1] Message-ID: <87h9g8j57d@blackfin.pond.sub.org>
> https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg03345.html




[PATCH] vhost-user-blk: Fix live migration crash during event handling

2022-12-07 Thread Yajun Wu
After live migration with a virtio block device, QEMU crashes at:

#0 0x7fe051e54269 in g_source_destroy () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#1 0x55cebaa5f37d in qio_net_listener_set_client_func_full (listener=0x55cebceab340, func=0x55cebab4f5f2 , data=0x55cebcdfcc00, notify=0x0, context=0x0) at ../io/net-listener.c:157
#2 0x55cebab4ea99 in tcp_chr_update_read_handler (chr=0x55cebcdfcc00) at ../chardev/char-socket.c:639
#3 0x55cebab529fa in qemu_chr_be_update_read_handlers (s=0x55cebcdfcc00, context=0x0) at ../chardev/char.c:226
#4 0x55cebab4a04e in qemu_chr_fe_set_handlers_full (b=0x55cebdf52120, fd_can_read=0x0, fd_read=0x0, fd_event=0x0, be_change=0x0, opaque=0x0, context=0x0, set_open=false, sync_state=true) at ../chardev/char-fe.c:279
#5 0x55cebab4a0f6 in qemu_chr_fe_set_handlers(b=0x55cebdf52120, fd_can_read=0x0, fd_read=0x0, fd_event=0x0, be_change=0x0, opaque=0x0, context=0x0, set_open=false) at ../chardev/char-fe.c:304
#6 0x55ceba8ec3c8 in vhost_user_blk_event (opaque=0x55cebdf51f40, event=CHR_EVENT_CLOSED) at ../hw/block/vhost-user-blk.c:412
#7 0x55cebab524a1 in chr_be_event (s=0x55cebcdfcc00, event=CHR_EVENT_CLOSED) at ../chardev/char.c:61
#8 0x55cebab52519 in qemu_chr_be_event (s=0x55cebcdfcc00, event=CHR_EVENT_CLOSED) at ../chardev/char.c:81
#9 0x55cebab4fce4 in char_socket_finalize (obj=0x55cebcdfcc00) at ../chardev/char-socket.c:1085
#10 0x55cebaa4cde5 in object_deinit (obj=0x55cebcdfcc00, type=0x55cebcc67160) at ../qom/object.c:675
#11 0x55cebaa4ce5b in object_finalize (data=0x55cebcdfcc00) at ../qom/object.c:689
#12 0x55cebaa4dcec in object_unref (objptr=0x55cebcdfcc00) at ../qom/object.c:1192
#13 0x55cebaa4f3ee in object_finalize_child_property (obj=0x55cebcc6df40, name=0x55cebcead490 "char0", opaque=0x55cebcdfcc00) at ../qom/object.c:1735
#14 0x55cebaa4cbe4 in object_property_del_all (obj=0x55cebcc6df40) at ../qom/object.c:627
#15 0x55cebaa4ce48 in object_finalize (data=0x55cebcc6df40) at ../qom/object.c:688
#16 0x55cebaa4dcec in object_unref (objptr=0x55cebcc6df40) at ../qom/object.c:1192
#17 0x55cebaa4f3ee in object_finalize_child_property (obj=0x55cebce96e00, name=0x55cebceab300 "chardevs", opaque=0x55cebcc6df40) at ../qom/object.c:1735
#18 0x55cebaa4ccd1 in object_property_del_child (obj=0x55cebce96e00, child=0x55cebcc6df40) at ../qom/object.c:649
#19 0x55cebaa4cdb0 in object_unparent (obj=0x55cebcc6df40) at ../qom/object.c:668
#20 0x55cebab55124 in qemu_chr_cleanup () at ../chardev/char.c:1222
#21 0x55ceba79a561 in qemu_cleanup () at ../softmmu/runstate.c:823
#22 0x55ceba53d65f in qemu_main (argc=78, argv=0x7ffc9440bd98, envp=0x0) at ../softmmu/main.c:37
#23 0x55ceba53d68f in main (argc=78, argv=0x7ffc9440bd98) at ../softmmu/main.c:45
Function qemu_chr_fe_set_handlers should not be called in qemu_chr_cleanup,
because the chardev has already been freed. A quick fix is to handle
RUN_STATE_POSTMIGRATE the same as RUN_STATE_SHUTDOWN.

A better solution is to add a block device cleanup function like net_cleanup
and call it in qemu_cleanup.

Signed-off-by: Yajun Wu 
Acked-by: Parav Pandit 
---
 hw/block/vhost-user-blk.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 0d5190accf..b323d5820b 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -110,7 +110,7 @@ static int vhost_user_blk_handle_config_change(struct vhost_dev *dev)
 }
 
 /* valid for resize only */
-if (blkcfg.capacity != s->blkcfg.capacity) {
+if (s && blkcfg.capacity != s->blkcfg.capacity) {
 s->blkcfg.capacity = blkcfg.capacity;
 memcpy(dev->vdev->config, &s->blkcfg, vdev->config_len);
 virtio_notify_config(dev->vdev);
@@ -398,7 +398,8 @@ static void vhost_user_blk_event(void *opaque, QEMUChrEvent event)
 }
 break;
 case CHR_EVENT_CLOSED:
-if (!runstate_check(RUN_STATE_SHUTDOWN)) {
+if (!runstate_check(RUN_STATE_SHUTDOWN) &&
+!runstate_check(RUN_STATE_POSTMIGRATE)) {
 /*
  * A close event may happen during a read/write, but vhost
  * code assumes the vhost_dev remains setup, so delay the
-- 
2.27.0




Re: How to best make include/hw/pci/pcie_sriov.h self-contained

2022-12-07 Thread Michael S. Tsirkin
On Wed, Dec 07, 2022 at 10:02:53AM +0100, Philippe Mathieu-Daudé wrote:
> On 7/12/22 07:25, Markus Armbruster wrote:
> > pcie_sriov.h needs PCI_NUM_REGIONS from pci.h, but doesn't include it.
> > pci.h must be included before pcie_sriov.h or else compile fails.
> > 
> > Adding #include "pci/pci.h" to pcie_sriov would be wrong, because it
> > would close an inclusion loop: pci.h includes pcie.h (for
> > PCIExpressDevice) includes pcie_sriov.h (for PCIESriovPF) includes pci.h
> > (for PCI_NUM_REGIONS).
> > 
> > The obvious solution is to move PCI_NUM_REGIONS pci.h somewhere
> > pcie_sriov.h can include without creating a loop.
> > 
> > We already have a few headers that don't include anything: pci_ids.h,
> > pci_regs.h (includes include/standard-headers/linux/pci_regs.h, which
> > doesn't count), pcie_regs.h.  Moving PCI_NUM_REGIONS to one of these
> > would work, but it doesn't feel right.
> > 
> > We could create a new one, say pci_defs.h.  Just for PCI_NUM_REGIONS
> > feels silly.  So, what else should move there?
> 
> Sounds good to me. Eventually name it pci_standard_defs.h?

standard is not a good name for PCI_NUM_REGIONS. It falls out of
how QEMU represents things, not directly out of the standard.
QEMU supports up to 6 BAR registers + 1 expansion ROM.
That's where the number comes from.
Same with PCI_ROM_SLOT - that's a QEMU convention.



> We can move the first 100 lines of pci.h there, PCI_ROM_SLOT,
> PCI_NUM_REGIONS, PCI HEADER_TYPE, PCI_NUM_PINS, cap_present, and eventually
> PCIINTxRoute & PCIReqIDType.

It's a good point that PCI_ROM_SLOT should live with PCI_NUM_REGIONS.

> > 
> > Any other ideas?
> > 
> > In case you wonder why I bother you with this...
> > 
> > Back in 2016, we discussed[1] rules for headers, and these were
> > generally liked:
> > 
> > 1. Have a carefully curated header that's included everywhere first.  We
> > got that already thanks to Peter: osdep.h.
> > 
> > 2. Headers should normally include everything they need beyond osdep.h.
> > If exceptions are needed for some reason, they must be documented in
> > the header.  If all that's needed from a header is typedefs, put
> > those into qemu/typedefs.h instead of including the header.
> > 
> > 3. Cyclic inclusion is forbidden.
> > 
> > I'm working on patches to get include/ closer to obeying 2.
> > 
> > [1] Message-ID: <87h9g8j57d@blackfin.pond.sub.org>
> >  https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg03345.html
> > 
> > 




Re: [PATCH v2] tests/stream-under-throttle: New test

2022-12-07 Thread Christian Borntraeger

On 14.11.22 at 10:52, Hanna Reitz wrote:
> Test streaming a base image into the top image underneath two throttle
> nodes.  This was reported to make qemu 7.1 hang
> (https://gitlab.com/qemu-project/qemu/-/issues/1215), so this serves as
> a regression test.
>
> Signed-off-by: Hanna Reitz 
> ---
> Based-on: <20221107151321.211175-1-hre...@redhat.com>
>
> v1: https://lists.nongnu.org/archive/html/qemu-block/2022-11/msg00368.html
>
> v2:
> - Replace `asyncio.exceptions.TimeoutError` by `asyncio.TimeoutError`:
>   Stefan reported that the CI does not recognize the former:
>   https://lists.nongnu.org/archive/html/qemu-block/2022-11/msg00424.html
>
>   As far as I understand, the latter was basically moved to become the
>   former in Python 3.11, and an alias remains, so both are basically
>   equivalent.  I only have 3.10, though, where the documentation says
>   that both are different, even though using either seems to work fine
>   (i.e. both catch the timeout there).  Not sure about previous
>   versions, but the CI seems pretty certain about not knowing
>   `asyncio.exceptions.TimeoutError`, so use `asyncio.TimeoutError`
>   instead.  (Even though that is deprecated in 3.11, but this is not the
>   first place in the tree to use it, so it should not be too bad.)
> ---
>   .../qemu-iotests/tests/stream-under-throttle  | 121 ++
>   .../tests/stream-under-throttle.out   |   5 +
>   2 files changed, 126 insertions(+)
>   create mode 100755 tests/qemu-iotests/tests/stream-under-throttle
>   create mode 100644 tests/qemu-iotests/tests/stream-under-throttle.out
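
For what it's worth, here is a minimal sketch of the pattern the v2 note settles on: catch `asyncio.TimeoutError`, and only probe for `asyncio.exceptions.TimeoutError` instead of relying on it. The names `slow`, `wait_briefly` and `alias_status` are made up for illustration and are not part of the iotests. As far as I know the `asyncio.exceptions` module only exists from Python 3.8 on, which might explain the CI complaint, but I have not checked which Python the CI runs.

```python
import asyncio


async def slow():
    # Stands in for a job that takes longer than we are willing to wait.
    await asyncio.sleep(1)


async def wait_briefly():
    try:
        await asyncio.wait_for(slow(), timeout=0.01)
    except asyncio.TimeoutError:   # this spelling is the one the patch uses
        return "timed out"
    return "completed"


def alias_status():
    # Probe for the newer module instead of assuming it is there.
    exc_mod = getattr(asyncio, "exceptions", None)
    if exc_mod is None:
        return "missing"
    if exc_mod.TimeoutError is asyncio.TimeoutError:
        return "same"
    return "different"


if __name__ == "__main__":
    print(asyncio.run(wait_briefly()), alias_status())
```

Running this on the interpreter in question shows whether the two names refer to the same class there, without the test itself depending on the newer one.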


As a heads up, I do get the following on s390. I have not yet looked into that:

+EE
+==
+ERROR: test_stream (__main__.TestStreamWithThrottle)
+Do a simple stream beneath the two throttle nodes.  Should complete
+--
+Traceback (most recent call last):
+  File "qemu/tests/qemu-iotests/tests/stream-under-throttle", line 110, in test_stream
+    self.vm.run_job('stream')
+  File "qemu/tests/qemu-iotests/iotests.py", line 986, in run_job
+    result = self.qmp('query-jobs')
+  File "qemu/python/qemu/machine/machine.py", line 646, in qmp
+    ret = self._qmp.cmd(cmd, args=qmp_args)
+  File "qemu/python/qemu/qmp/legacy.py", line 204, in cmd
+    return self.cmd_obj(qmp_cmd)
+  File "qemu/python/qemu/qmp/legacy.py", line 184, in cmd_obj
+    self._qmp._raw(qmp_cmd, assign_id=False),
+  File "qemu/python/qemu/qmp/protocol.py", line 154, in _wrapper
+    raise StateError(emsg, proto.runstate, required_state)
+qemu.qmp.protocol.StateError: QMPClient is disconnecting. Call disconnect() to return to IDLE state.
+
+==
+ERROR: test_stream (__main__.TestStreamWithThrottle)
+Do a simple stream beneath the two throttle nodes.  Should complete
+--
+Traceback (most recent call last):
+  File "qemu/python/qemu/machine/machine.py", line 533, in _soft_shutdown
+    self.qmp('quit')
+  File "qemu/python/qemu/machine/machine.py", line 646, in qmp
+    ret = self._qmp.cmd(cmd, args=qmp_args)
+  File "qemu/python/qemu/qmp/legacy.py", line 204, in cmd
+    return self.cmd_obj(qmp_cmd)
+  File "qemu/python/qemu/qmp/legacy.py", line 184, in cmd_obj
+    self._qmp._raw(qmp_cmd, assign_id=False),
+  File "qemu/python/qemu/qmp/protocol.py", line 154, in _wrapper
+    raise StateError(emsg, proto.runstate, required_state)
+qemu.qmp.protocol.StateError: QMPClient is disconnecting. Call disconnect() to return to IDLE state.
+
+During handling of the above exception, another exception occurred:
+
+Traceback (most recent call last):
+  File "qemu/python/qemu/machine/machine.py", line 554, in _do_shutdown
+    self._soft_shutdown(timeout)
+  File "qemu/python/qemu/machine/machine.py", line 536, in _soft_shutdown
+    self._close_qmp_connection()
+  File "qemu/python/qemu/machine/machine.py", line 476, in _close_qmp_connection
+    self._qmp.close()
+  File "qemu/python/qemu/qmp/legacy.py", line 277, in close
+    self._sync(
+  File "qemu/python/qemu/qmp/legacy.py", line 94, in _sync
+    return self._aloop.run_until_complete(
+  File "/usr/lib64/python3.10/asyncio/base_events.py", line 649, in run_until_complete
+    return future.result()
+  File "/usr/lib64/python3.10/asyncio/tasks.py", line 408, in wait_for
+    return await fut
+  File "qemu/python/qemu/qmp/protocol.py", line 398, in disconnect
+    await self._wait_disconnect()
+  File "qemu/python/qemu/qmp/protocol.py", line 710, in _wait_disconnect
+    await all_defined_tasks  # Raise Exceptions from the bottom half.
+  File "qemu/python/qemu/qmp/protocol.py", line 861, in _bh_loop_forever
+    await async_fn()
+  File "qemu/python/qemu/qmp/protocol.py", line 899, in _bh_recv_message
+    msg = await self._recv()
+  File "qemu/python/qemu/qmp/protocol.py", line 1000, in _recv
+ 

Re: [PATCH for-8.0] ui/vnc: fix bad address parsing

2022-12-07 Thread Vladimir Sementsov-Ogievskiy

On 12/6/22 23:12, Philippe Mathieu-Daudé wrote:

> On 6/12/22 20:23, Vladimir Sementsov-Ogievskiy wrote:
> > If addrstr == "[" and websocket is true, hostlen becomes 0 and we try
> > to access addrstr[hostlen-1], which is a bad idea.
> >
> > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > ---
> >   ui/vnc.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/ui/vnc.c b/ui/vnc.c
> > index 88f55cbf3c..8830bfe382 100644
> > --- a/ui/vnc.c
> > +++ b/ui/vnc.c
> > @@ -3765,7 +3765,7 @@ static int vnc_display_get_address(const char *addrstr,
> >   addr->type = SOCKET_ADDRESS_TYPE_INET;
> >   inet = &addr->u.inet;
> > -    if (addrstr[0] == '[' && addrstr[hostlen - 1] == ']') {
> > +    if (hostlen >= 2 && addrstr[0] == '[' && addrstr[hostlen - 1] == ']') {
> >   inet->host = g_strndup(addrstr + 1, hostlen - 2);
> >   } else {
> >   inet->host = g_strndup(addrstr, hostlen);
>
> If addrstr is "[" then inet->host ends up being "[" too now, right?
>
> I was pretty sure we had a helper for that, but can't find any.


That's all a bit strange; let's add a bit of debugging:
diff --git a/ui/vnc.c b/ui/vnc.c
index 88f55cbf3c..b1d463e67a 100644
--- a/ui/vnc.c
+++ b/ui/vnc.c
@@ -3770,6 +3770,7 @@ static int vnc_display_get_address(const char *addrstr,
 } else {
 inet->host = g_strndup(addrstr, hostlen);
 }
+printf("%s: websocket: %d, host: %s, port: %s\n", __func__, websocket, inet->host, port);
 /* plain VNC port is just an offset, for websocket
  * port is absolute */
 if (websocket) {


then:



./build/qemu-system-x86_64 -vnc [
qemu-system-x86_64: -vnc [: no vnc port specified


./build/qemu-system-x86_64 -vnc [,websocket
qemu-system-x86_64: -vnc [,websocket: warning: short-form boolean option 
'websocket' deprecated
Please use websocket=on instead
qemu-system-x86_64: -vnc [,websocket: no vnc port specified



./build/qemu-system-x86_64 -vnc [:0,websocket
qemu-system-x86_64: -vnc [:0,websocket: warning: short-form boolean option 
'websocket' deprecated
Please use websocket=on instead
vnc_display_get_address: websocket: 0, host: [, port: 0
vnc_display_get_address: websocket: 1, host: , port: on
qemu-system-x86_64: -vnc [:0,websocket: address resolution failed for [:5900: 
Name or service not known

./build/qemu-system-x86_64 -vnc [:0,websocket=on
vnc_display_get_address: websocket: 0, host: [, port: 0
vnc_display_get_address: websocket: 1, host: , port: on
qemu-system-x86_64: -vnc [:0,websocket=on: address resolution failed for 
[:5900: Name or service not known


So "on" is treated as the address string? (Aha, that's OK; it is parsed later
in the code.)

./build/qemu-system-x86_64 -vnc :0,websocket=[
vnc_display_get_address: websocket: 0, host: , port: 0
we are going to do bad thing!
vnc_display_get_address: websocket: 1, host: , port: [
qemu-system-x86_64: -vnc :0,websocket=[: address resolution failed for :[: 
Servname not supported for ai_socktype
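
To make the patched condition easier to poke at, here is a small Python model of just the bracket handling. It is a sketch only: `strip_brackets` is a made-up name, and `hostlen` is passed in by hand rather than derived the way ui/vnc.c derives it.

```python
def strip_brackets(addrstr: str, hostlen: int) -> str:
    """Toy model of the patched host extraction, not the real ui/vnc.c code."""
    # The hostlen >= 2 check runs first, so addrstr[hostlen - 1] is never
    # evaluated with hostlen == 0 (index -1, which is out of bounds in C).
    if hostlen >= 2 and addrstr[0] == "[" and addrstr[hostlen - 1] == "]":
        return addrstr[1:hostlen - 1]   # like g_strndup(addrstr + 1, hostlen - 2)
    return addrstr[:hostlen]            # like g_strndup(addrstr, hostlen)


if __name__ == "__main__":
    for addrstr, hostlen in [("[::1]:0", 5), ("[:0", 1), ("[", 0), (":0", 0)]:
        print(repr(addrstr), hostlen, "->", repr(strip_brackets(addrstr, hostlen)))
```

With `hostlen` 1 the host stays "[", and with `hostlen` 0 it is empty, which is consistent with the "host:" values printed by the debugging runs above.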


--
Best regards,
Vladimir




Re: [RFC PATCH 12/21] i386/xen: set shared_info page

2022-12-07 Thread David Woodhouse
On Tue, 2022-12-06 at 10:00 +, Dr. David Alan Gilbert wrote:
> * Philippe Mathieu-Daudé (phi...@linaro.org) wrote:
> > +Juan/David/Claudio.
> > 
> > On 6/12/22 03:20, David Woodhouse wrote:
> > > On Mon, 2022-12-05 at 23:17 +0100, Philippe Mathieu-Daudé wrote:
> > > > On 5/12/22 18:31, David Woodhouse wrote:
> > > > > > From: Joao Martins <joao.m.mart...@oracle.com>
> > > > > 
> > > > > This is done by implementing HYPERVISOR_memory_op specifically
> > > > > XENMEM_add_to_physmap with space XENMAPSPACE_shared_info. While
> > > > > Xen removes the page with its own, we instead use the gfn passed
> > > > > by the guest.
> > > > > 
> > > > > > Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
> > > > > > Signed-off-by: David Woodhouse <d...@amazon.co.uk>
> > > > > ---
> > > > >accel/kvm/kvm-all.c  |  6 
> > > > >include/hw/core/cpu.h|  2 ++
> > > > >include/sysemu/kvm.h |  2 ++
> > > > >include/sysemu/kvm_int.h |  3 ++
> > > > >target/i386/cpu.h|  8 ++
> > > > >target/i386/trace-events |  1 +
> > > > >target/i386/xen-proto.h  | 19 +
> > > > >target/i386/xen.c| 61 
> > > > > 
> > > > >8 files changed, 102 insertions(+)
> > > > >create mode 100644 target/i386/xen-proto.h
> > > > 
> > > > 
> > > > > diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
> > > > > index 8830546121..e57b693528 100644
> > > > > --- a/include/hw/core/cpu.h
> > > > > +++ b/include/hw/core/cpu.h
> > > > > @@ -443,6 +443,8 @@ struct CPUState {
> > > > >/* track IOMMUs whose translations we've cached in the TCG TLB 
> > > > > */
> > > > >GArray *iommu_notifiers;
> > > > > +
> > > > > +struct XenState *xen_state;
> > > > 
> > > > Since you define a type definition below, use it.
> > > 
> > > Ack.
> > > 
> > > More importantly though, some of that state needs to be persisted
> > > across live migration / live update.
> > > 
> > > There is per-vCPU state (the GPAs for vcpu_info etc., upcall vector,
> > > timer info). I think I see how I could add that to the vmstate_x86_cpu
> > > defined in target/i386/machine.c.
> > > 
> > > For the machine-wide state, where do I add that? Should I just
> > > instantiate a dummy device (a bit like TYPE_KVM_CLOCK, AFAICT) to hang
> > > that state off?
> > 
> > XenState in CPUState indeed is an anti-pattern.
> > 
> > As you said elsewhere (patch 2 maybe) your use is not a new accelerator
> > but a machine, so new state should go in a derived MachineState.
> 
> I *think* the vmstate tends to be attached to a device rather than the
> machinetype itself; eg a PCIe bridge that the Machine instantiates.
> But yes, a 'dummy' type is fine for hanging vmstate off.

Below is an attempt at that. It adds a 'xen-overlay' device which hosts
the memory regions corresponding to "xenheap" pages, which need to be
mapped over guest GPAs on demand.

There's plenty to heckle here, but it basically seems to be working.
I've dumped the state (migrate "exec:cat>foo") and I can see the
correct shinfo_gpa there when the guest was running.

I added the device under hw/xen covered by CONFIG_XEN_EMU, and will
amend the existing shinfo patch to call xen_overlay_map_page() instead
of just *assuming* that there'll already be RAM there... which is true
for Linux guests but Windows uses an empty GFN instead of wasting a
page of real RAM.

There are some target-specific things to be migrated too, so if this
approach is sane then I'll probably add a similar dummy device in
target/i386/xen.c for the system-wide state in *addition* to...

> > Migration is not my area of expertise, but since only the xenfv machine
> > will use this configuration, it seems simpler to store the vCPUs
> > migration states there...
> 
> As long as ordering issues don't bite; i.e. between loading the
> new Xen specific stuff and loading the main cpu;  you can force
> order using the MIG_PRI_ constants on the .priority field.
> 
> I was going to suggest maybe you could add it to vmstate_cpu_common
> as a subsection; but I don't *think* that's used for x86
> (I think that's vmstate_x86_cpu instead???)

... using vmstate_x86_cpu for the per-vCPU state, which seems fairly
straightforward.

---
From 6ac40ff7731bc2144aa7fa4015b9308c2eea8f3d Mon Sep 17 00:00:00 2001
From: David Woodhouse 
Date: Wed, 7 Dec 2022 09:19:31 +
Subject: [PATCH] hw/xen: Add xen_overlay device for emulating shared xenheap
 pages

For the shared info page and for grant tables, Xen shares its own pages
from the "Xen heap" to the guest. The guest requests that a given page
from a certain address space (XENMAPSPACE_shared_info, etc.) be mapped
to a given GPA using the XENMEM_add_to_physmap hypercall.

To support that in qemu when *emulating* Xen, create a memory region
(migratable) and allow it to be mapped as an overlay when requested.

Xen theoretically allows the same

Re: [PATCH v12 1/7] s390x/cpu topology: Creating CPU topology device

2022-12-07 Thread Janis Schoetterl-Glausch
On Wed, 2022-12-07 at 11:00 +0100, Pierre Morel wrote:
> 
> On 12/6/22 22:06, Janis Schoetterl-Glausch wrote:
> > On Tue, 2022-12-06 at 15:35 +0100, Pierre Morel wrote:
> > > 
> > > On 12/6/22 14:35, Janis Schoetterl-Glausch wrote:
> > > > On Tue, 2022-12-06 at 11:32 +0100, Pierre Morel wrote:
> > > > > 
> > > > > On 12/6/22 10:31, Janis Schoetterl-Glausch wrote:
> > > > > > On Tue, 2022-11-29 at 18:42 +0100, Pierre Morel wrote:
> > > > > > > We will need a Topology device to transfer the topology
> > > > > > > during migration and to implement machine reset.
> > > > > > > 
> > > > > > > The device creation is fenced by s390_has_topology().
> > > > > > > 
> > > > > > > Signed-off-by: Pierre Morel 
> > > > > > > ---
> > > > > > >  include/hw/s390x/cpu-topology.h | 44 +++
> > > > > > >  include/hw/s390x/s390-virtio-ccw.h | 1 +
> > > > > > >  hw/s390x/cpu-topology.c | 87 ++
> > > > > > >  hw/s390x/s390-virtio-ccw.c | 25 +
> > > > > > >  hw/s390x/meson.build | 1 +
> > > > > > >  5 files changed, 158 insertions(+)
> > > > > > >  create mode 100644 include/hw/s390x/cpu-topology.h
> > > > > > >  create mode 100644 hw/s390x/cpu-topology.c
> > > > > > 
[...]

> > > > > > > + object_property_set_int(OBJECT(dev), "num-cores",
> > > > > > > + machine->smp.cores * machine->smp.threads, errp);
> > > > > > > + object_property_set_int(OBJECT(dev), "num-sockets",
> > > > > > > + machine->smp.sockets, errp);
> > > > > > > +
> > > > > > > + sysbus_realize_and_unref(SYS_BUS_DEVICE(dev), errp);
> > > > > > 
> > > > > > I must admit that I haven't fully grokked qemu's memory management yet.
> > > > > > Is the topology devices now owned by the sysbus?
> > > > > 
> > > > > Yes it is so we see it on the qtree with its properties.
> > > > > 
> > > > > 
> > > > > > If so, is it fine to have a pointer to it S390CcwMachineState?
> > > > > 
> > > > > Why not?
> > > > 
> > > > If it's owned by the sysbus and the object is not explicitly referenced
> > > > for the pointer, it might be deallocated and then you'd have a dangling pointer.
> > > 
> > > Why would it be deallocated ?
> > 
> > That's beside the point, if you transfer ownership, you have no control over when
> > the deallocation happens.
> > It's going to be fine in practice, but I don't think you should rely on it.
> > I think you could just do sysbus_realize instead of ..._and_unref,
> > but like I said, I haven't fully understood qemu memory management.
> > (It would also leak in a sense, but since the machine exists forever that should be fine)
> 
> If I understand correctly:
> 
> - qdev_new adds a reference count to the new created object, dev.
> 
> - object_property_add_child adds a reference count to the child also 
> here the new created device dev so the ref count of dev is 2 .
> 
> after the unref on dev, the ref count of dev get down to 1
> 
> then it seems OK. Did I miss something?

The property's ref belongs to the property; if the property were removed,
it would be unref'ed. There is no extra ref for the pointer in S390CcwMachineState.
I'm coming from a clean code perspective; I don't think we'd run into this
problem in practice.
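
To spell out the sequence being discussed, here is a toy reference-count model. It is plain Python, not QOM: `RefCounted`, `Parent` and `walk_through` are invented for illustration and only mimic the counting described in the quoted text, and the step where the child property is removed is hypothetical.

```python
class RefCounted:
    """Toy object with a manual reference count (not a QOM Object)."""

    def __init__(self, name):
        self.name = name
        self.refs = 1            # like qdev_new(): the creator holds one ref
        self.finalized = False

    def ref(self):
        self.refs += 1

    def unref(self):
        assert self.refs > 0, "unref on a dead object"
        self.refs -= 1
        if self.refs == 0:
            self.finalized = True    # stands in for freeing the object


class Parent:
    """Toy owner whose child properties each hold one reference."""

    def __init__(self):
        self.children = {}

    def add_child(self, key, obj):
        obj.ref()
        self.children[key] = obj

    def remove_child(self, key):
        self.children.pop(key).unref()


def walk_through():
    counts = []
    bus = Parent()
    dev = RefCounted("topology")
    counts.append(dev.refs)          # creator's ref only
    bus.add_child("topology", dev)
    counts.append(dev.refs)          # plus the child property's ref
    borrowed = dev                   # plain pointer, takes no ref of its own
    dev.unref()                      # the "_and_unref" step drops the creator's ref
    counts.append(dev.refs)          # only the property's ref is left
    bus.remove_child("topology")     # hypothetical: the property goes away
    counts.append(borrowed.refs)     # nothing keeps the object alive any more
    return counts, borrowed.finalized


if __name__ == "__main__":
    print(walk_through())
```

The borrowed pointer never shows up in the count, so nothing stops the object from being finalized while that pointer is still in use; taking an explicit ref for the stored pointer would close that gap in this model.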
> 
> Regards,
> Pierre
> 
> > 
> > > as long it is not unrealized it belongs to the sysbus doesn't it?
> > > 
> > > Regards,
> > > Pierre
> > > 
> > 
> 




Re: [PATCH v12 1/7] s390x/cpu topology: Creating CPU topology device

2022-12-07 Thread Pierre Morel




On 12/7/22 12:38, Janis Schoetterl-Glausch wrote:

> On Wed, 2022-12-07 at 11:00 +0100, Pierre Morel wrote:
> >
> > On 12/6/22 22:06, Janis Schoetterl-Glausch wrote:
> > > On Tue, 2022-12-06 at 15:35 +0100, Pierre Morel wrote:
> > > >
> > > > On 12/6/22 14:35, Janis Schoetterl-Glausch wrote:
> > > > > On Tue, 2022-12-06 at 11:32 +0100, Pierre Morel wrote:
> > > > > >
> > > > > > On 12/6/22 10:31, Janis Schoetterl-Glausch wrote:
> > > > > > > On Tue, 2022-11-29 at 18:42 +0100, Pierre Morel wrote:
> > > > > > > > We will need a Topology device to transfer the topology
> > > > > > > > during migration and to implement machine reset.
> > > > > > > >
> > > > > > > > The device creation is fenced by s390_has_topology().
> > > > > > > >
> > > > > > > > Signed-off-by: Pierre Morel 
> > > > > > > > ---
> > > > > > > >  include/hw/s390x/cpu-topology.h | 44 +++
> > > > > > > >  include/hw/s390x/s390-virtio-ccw.h | 1 +
> > > > > > > >  hw/s390x/cpu-topology.c | 87 ++
> > > > > > > >  hw/s390x/s390-virtio-ccw.c | 25 +
> > > > > > > >  hw/s390x/meson.build | 1 +
> > > > > > > >  5 files changed, 158 insertions(+)
> > > > > > > >  create mode 100644 include/hw/s390x/cpu-topology.h
> > > > > > > >  create mode 100644 hw/s390x/cpu-topology.c
> > > > > > >
> [...]
>
> > > > > > > > + object_property_set_int(OBJECT(dev), "num-cores",
> > > > > > > > + machine->smp.cores * machine->smp.threads, errp);
> > > > > > > > + object_property_set_int(OBJECT(dev), "num-sockets",
> > > > > > > > + machine->smp.sockets, errp);
> > > > > > > > +
> > > > > > > > + sysbus_realize_and_unref(SYS_BUS_DEVICE(dev), errp);
> > > > > > >
> > > > > > > I must admit that I haven't fully grokked qemu's memory management yet.
> > > > > > > Is the topology devices now owned by the sysbus?
> > > > > >
> > > > > > Yes it is so we see it on the qtree with its properties.
> > > > > >
> > > > > >
> > > > > > > If so, is it fine to have a pointer to it S390CcwMachineState?
> > > > > >
> > > > > > Why not?
> > > > >
> > > > > If it's owned by the sysbus and the object is not explicitly referenced
> > > > > for the pointer, it might be deallocated and then you'd have a dangling pointer.
> > > >
> > > > Why would it be deallocated ?
> > >
> > > That's beside the point, if you transfer ownership, you have no control over when
> > > the deallocation happens.
> > > It's going to be fine in practice, but I don't think you should rely on it.
> > > I think you could just do sysbus_realize instead of ..._and_unref,
> > > but like I said, I haven't fully understood qemu memory management.
> > > (It would also leak in a sense, but since the machine exists forever that should be fine)
> >
> > If I understand correctly:
> >
> > - qdev_new adds a reference count to the new created object, dev.
> >
> > - object_property_add_child adds a reference count to the child also
> > here the new created device dev so the ref count of dev is 2 .
> >
> > after the unref on dev, the ref count of dev get down to 1
> >
> > then it seems OK. Did I miss something?
>
> The property's ref belongs to the property; if the property were removed,
> it would be unref'ed. There is no extra ref for the pointer in S390CcwMachineState.
> I'm coming from a clean code perspective; I don't think we'd run into this
> problem in practice.


OK, I understand, you are right.
My original code used object_resolve_path() to retrieve the object, which
made things cleaner, I think.


For performance reasons, Cedric proposed during the review of v10 to add
the pointer to the machine state instead.


I must say that I am not very comfortable arguing about this.
@Cedric, what do you think?


Regards,
Pierre

--
Pierre Morel
IBM Lab Boeblingen



Re: [PATCH v2] tests/stream-under-throttle: New test

2022-12-07 Thread Christian Borntraeger



On 07.12.22 at 11:31, Christian Borntraeger wrote:


Re: [PATCH v2] tests/stream-under-throttle: New test

2022-12-07 Thread Christian Borntraeger




On 07.12.22 at 13:56, Christian Borntraeger wrote:



[PATCH 1/1] qemu-iotests/stream-under-throttle: do not shutdown QEMU

2022-12-07 Thread Christian Borntraeger
Without a kernel or boot disk a QEMU on s390 will exit (usually with a
disabled wait state). This breaks the stream-under-throttle test case.
Do not exit qemu if on s390.

Signed-off-by: Christian Borntraeger 
---
 tests/qemu-iotests/tests/stream-under-throttle | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tests/qemu-iotests/tests/stream-under-throttle b/tests/qemu-iotests/tests/stream-under-throttle
index 8d2d9e16840d..c24dfbcaa2f2 100755
--- a/tests/qemu-iotests/tests/stream-under-throttle
+++ b/tests/qemu-iotests/tests/stream-under-throttle
@@ -88,6 +88,8 @@ class TestStreamWithThrottle(iotests.QMPTestCase):
'x-iops-total=1,x-bps-total=104857600')
         self.vm.add_blockdev(self.vm.qmp_to_opts(blockdev))
         self.vm.add_device('virtio-blk,iothread=iothr0,drive=throttled-node')
+        if iotests.qemu_default_machine == 's390-ccw-virtio':
+            self.vm.add_args('-no-shutdown')
         self.vm.launch()
 
     def tearDown(self) -> None:
-- 
2.38.1




[PATCH 11/18] block: wrlock in bdrv_replace_child_noperm

2022-12-07 Thread Kevin Wolf
From: Emanuele Giuseppe Esposito 

Protect the main function where graph is modified.

Signed-off-by: Emanuele Giuseppe Esposito 
Signed-off-by: Kevin Wolf 
---
 block.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/block.c b/block.c
index 44d59362d6..df52c6b012 100644
--- a/block.c
+++ b/block.c
@@ -2836,8 +2836,6 @@ uint64_t bdrv_qapi_perm_to_blk_perm(BlockPermission qapi_perm)
  *
  * If @new_bs is non-NULL, the parent of @child must already be drained through
  * @child.
- *
- * This function does not poll.
  */
 static void bdrv_replace_child_noperm(BdrvChild *child,
   BlockDriverState *new_bs)
@@ -2875,23 +2873,24 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
 assert(bdrv_get_aio_context(old_bs) == bdrv_get_aio_context(new_bs));
 }
 
+/* TODO Pull this up into the callers to avoid polling here */
+bdrv_graph_wrlock();
 if (old_bs) {
 if (child->klass->detach) {
 child->klass->detach(child);
 }
-assert_bdrv_graph_writable(old_bs);
 QLIST_REMOVE(child, next_parent);
 }
 
 child->bs = new_bs;
 
 if (new_bs) {
-assert_bdrv_graph_writable(new_bs);
 QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);
 if (child->klass->attach) {
 child->klass->attach(child);
 }
 }
+bdrv_graph_wrunlock();
 
 /*
  * If the parent was drained through this BdrvChild previously, but new_bs
-- 
2.38.1




[PATCH 04/18] async: Register/unregister aiocontext in graph lock list

2022-12-07 Thread Kevin Wolf
From: Emanuele Giuseppe Esposito 

Add/remove the AioContext in aio_context_list in graph-lock.c when it is
created/destroyed. This allows using the graph locking operations from
this AioContext.

In order to allow linking util/async.c with binaries that don't include
the block layer, introduce stubs for (un)register_aiocontext().

Signed-off-by: Emanuele Giuseppe Esposito 
Signed-off-by: Kevin Wolf 
---
 stubs/graph-lock.c | 10 ++
 util/async.c   |  4 
 stubs/meson.build  |  1 +
 3 files changed, 15 insertions(+)
 create mode 100644 stubs/graph-lock.c

diff --git a/stubs/graph-lock.c b/stubs/graph-lock.c
new file mode 100644
index 00..177aa0a8ba
--- /dev/null
+++ b/stubs/graph-lock.c
@@ -0,0 +1,10 @@
+#include "qemu/osdep.h"
+#include "block/graph-lock.h"
+
+void register_aiocontext(AioContext *ctx)
+{
+}
+
+void unregister_aiocontext(AioContext *ctx)
+{
+}
diff --git a/util/async.c b/util/async.c
index 63434ddae4..14d63b3091 100644
--- a/util/async.c
+++ b/util/async.c
@@ -27,6 +27,7 @@
 #include "qapi/error.h"
 #include "block/aio.h"
 #include "block/thread-pool.h"
+#include "block/graph-lock.h"
 #include "qemu/main-loop.h"
 #include "qemu/atomic.h"
 #include "qemu/rcu_queue.h"
@@ -376,6 +377,7 @@ aio_ctx_finalize(GSource *source)
 qemu_rec_mutex_destroy(&ctx->lock);
 qemu_lockcnt_destroy(&ctx->list_lock);
 timerlistgroup_deinit(&ctx->tlg);
+unregister_aiocontext(ctx);
 aio_context_destroy(ctx);
 }
 
@@ -574,6 +576,8 @@ AioContext *aio_context_new(Error **errp)
 ctx->thread_pool_min = 0;
 ctx->thread_pool_max = THREAD_POOL_MAX_THREADS_DEFAULT;
 
+register_aiocontext(ctx);
+
 return ctx;
 fail:
 g_source_destroy(&ctx->source);
diff --git a/stubs/meson.build b/stubs/meson.build
index c96a74f095..981585cbdf 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -13,6 +13,7 @@ stub_ss.add(files('error-printf.c'))
 stub_ss.add(files('fdset.c'))
 stub_ss.add(files('gdbstub.c'))
 stub_ss.add(files('get-vm-name.c'))
+stub_ss.add(files('graph-lock.c'))
 if linux_io_uring.found()
   stub_ss.add(files('io_uring.c'))
 endif
-- 
2.38.1




[PATCH 10/18] block: Fix locking in external_snapshot_prepare()

2022-12-07 Thread Kevin Wolf
bdrv_img_create() polls internally (when calling bdrv_create(), which is
a co_wrapper), so it can't be called while holding the lock of any
AioContext except the current one without causing deadlocks. Drop the
lock around the call in external_snapshot_prepare().
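
To illustrate the pattern outside of QEMU, here is a minimal
single-threaded sketch. All toy_* names are hypothetical and not part of
the patch; the "lock" only tracks its nesting depth, and the polling
function simply reports whether it could have made progress instead of
actually hanging.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an AioContext lock; it only tracks the nesting depth. */
typedef struct ToyLock {
    int depth;
} ToyLock;

static void toy_acquire(ToyLock *l)
{
    l->depth++;
}

static void toy_release(ToyLock *l)
{
    assert(l->depth > 0);
    l->depth--;
}

/*
 * Hypothetical polling operation. In this model it can only make progress
 * while the foreign lock is free; it returns false where real code would
 * deadlock.
 */
static bool toy_poll_create(const ToyLock *foreign)
{
    return foreign->depth == 0;
}

/* The caller holds the lock on entry and still holds it on return. */
static bool toy_snapshot_prepare(ToyLock *l, bool drop_lock)
{
    bool ok;

    if (drop_lock) {
        toy_release(l);
    }
    ok = toy_poll_create(l);
    if (drop_lock) {
        toy_acquire(l);
    }
    return ok;
}
```

With drop_lock set, the caller's view is unchanged (the lock is held
before and after), but the polling call runs without it.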

Signed-off-by: Kevin Wolf 
---
 block.c| 4 
 blockdev.c | 4 
 2 files changed, 8 insertions(+)

diff --git a/block.c b/block.c
index 6191ac1f44..44d59362d6 100644
--- a/block.c
+++ b/block.c
@@ -6924,6 +6924,10 @@ bool bdrv_op_blocker_is_empty(BlockDriverState *bs)
 return true;
 }
 
+/*
+ * Must not be called while holding the lock of an AioContext other than the
+ * current one.
+ */
 void bdrv_img_create(const char *filename, const char *fmt,
  const char *base_filename, const char *base_fmt,
  char *options, uint64_t img_size, int flags, bool quiet,
diff --git a/blockdev.c b/blockdev.c
index 8ffb3d9537..011e48df7b 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1524,10 +1524,14 @@ static void external_snapshot_prepare(BlkActionState *common,
 goto out;
 }
 bdrv_refresh_filename(state->old_bs);
+
+aio_context_release(aio_context);
 bdrv_img_create(new_image_file, format,
 state->old_bs->filename,
 state->old_bs->drv->format_name,
 NULL, size, flags, false, &local_err);
+aio_context_acquire(aio_context);
+
 if (local_err) {
 error_propagate(errp, local_err);
 goto out;
-- 
2.38.1




[PATCH 12/18] block: remove unnecessary assert_bdrv_graph_writable()

2022-12-07 Thread Kevin Wolf
From: Emanuele Giuseppe Esposito 

We don't protect bs->aio_context with the graph rwlock,
so these assertions are not needed.

Signed-off-by: Emanuele Giuseppe Esposito 
Signed-off-by: Kevin Wolf 
---
 block.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block.c b/block.c
index df52c6b012..bdffadcdaa 100644
--- a/block.c
+++ b/block.c
@@ -7214,7 +7214,6 @@ static void bdrv_detach_aio_context(BlockDriverState *bs)
 if (bs->quiesce_counter) {
 aio_enable_external(bs->aio_context);
 }
-assert_bdrv_graph_writable(bs);
 bs->aio_context = NULL;
 }
 
@@ -7228,7 +7227,6 @@ static void bdrv_attach_aio_context(BlockDriverState *bs,
 aio_disable_external(new_context);
 }
 
-assert_bdrv_graph_writable(bs);
 bs->aio_context = new_context;
 
 if (bs->drv && bs->drv->bdrv_attach_aio_context) {
@@ -7309,7 +7307,6 @@ static void bdrv_set_aio_context_commit(void *opaque)
 BlockDriverState *bs = (BlockDriverState *) state->bs;
 AioContext *new_context = state->new_ctx;
 AioContext *old_context = bdrv_get_aio_context(bs);
-assert_bdrv_graph_writable(bs);
 
 /*
  * Take the old AioContex when detaching it from bs.
-- 
2.38.1




[PATCH 09/18] test-bdrv-drain: Fix incorrect drain assumptions

2022-12-07 Thread Kevin Wolf
The test case assumes that a drain only happens in one specific place
where it drains explicitly. This assumption happened to hold true until
now, but block layer functions may drain internally (any graph
modifications are going to do that through bdrv_graph_wrlock()), so this
is incorrect. Make sure that the test code in .drained_begin only runs
where we actually want it to run.

When scheduling a BH from .drained_begin, we also need to increase the
in_flight counter to make sure that the operation is actually completed
in time before the node that it works on goes away.
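
The in_flight idea can be sketched in isolation as follows. This is only
a toy model with hypothetical toy_* names: a one-slot queue stands in
for the BH machinery, and the drain loop simply runs callbacks until the
counter drops to zero.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyNode {
    int in_flight;   /* pending operations that still reference the node */
    int detached;    /* set once the deferred callback has run */
} ToyNode;

typedef void ToyBHFunc(void *opaque);

/* One-slot queue standing in for a oneshot BH scheduler. */
static ToyBHFunc *toy_pending_cb;
static void *toy_pending_opaque;

static void toy_bh_schedule(ToyBHFunc *cb, void *opaque)
{
    toy_pending_cb = cb;
    toy_pending_opaque = opaque;
}

static void toy_bh_run(void)
{
    ToyBHFunc *cb = toy_pending_cb;

    toy_pending_cb = NULL;
    if (cb) {
        cb(toy_pending_opaque);
    }
}

/* Deferred callback: does its work and drops the reference taken below. */
static void toy_detach_bh(void *opaque)
{
    ToyNode *node = opaque;

    node->detached = 1;
    node->in_flight--;
}

/* Take the in_flight reference before scheduling, not in the callback. */
static void toy_drained_begin(ToyNode *node)
{
    node->in_flight++;
    toy_bh_schedule(toy_detach_bh, node);
}

/* Drain: keep running callbacks until nothing is in flight any more. */
static void toy_drain(ToyNode *node)
{
    while (node->in_flight > 0) {
        toy_bh_run();
    }
}
```

Because the counter is raised before the callback is queued, the drain
loop cannot finish (and the node cannot go away) until the callback has
actually run.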

Signed-off-by: Kevin Wolf 
---
 tests/unit/test-bdrv-drain.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index 2686a8acee..8cedea4959 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -1107,6 +1107,7 @@ struct detach_by_parent_data {
 BlockDriverState *c;
 BdrvChild *child_c;
 bool by_parent_cb;
+bool detach_on_drain;
 };
 static struct detach_by_parent_data detach_by_parent_data;
 
@@ -1114,6 +1115,7 @@ static void detach_indirect_bh(void *opaque)
 {
 struct detach_by_parent_data *data = opaque;
 
+bdrv_dec_in_flight(data->child_b->bs);
 bdrv_unref_child(data->parent_b, data->child_b);
 
 bdrv_ref(data->c);
@@ -1128,12 +1130,21 @@ static void detach_by_parent_aio_cb(void *opaque, int ret)
 
 g_assert_cmpint(ret, ==, 0);
 if (data->by_parent_cb) {
+bdrv_inc_in_flight(data->child_b->bs);
 detach_indirect_bh(data);
 }
 }
 
 static void detach_by_driver_cb_drained_begin(BdrvChild *child)
 {
+struct detach_by_parent_data *data = &detach_by_parent_data;
+
+if (!data->detach_on_drain) {
+return;
+}
+data->detach_on_drain = false;
+
+bdrv_inc_in_flight(data->child_b->bs);
 aio_bh_schedule_oneshot(qemu_get_current_aio_context(),
 detach_indirect_bh, &detach_by_parent_data);
 child_of_bds.drained_begin(child);
@@ -1174,8 +1185,14 @@ static void test_detach_indirect(bool by_parent_cb)
 detach_by_driver_cb_class = child_of_bds;
 detach_by_driver_cb_class.drained_begin =
 detach_by_driver_cb_drained_begin;
+detach_by_driver_cb_class.drained_end = NULL;
+detach_by_driver_cb_class.drained_poll = NULL;
 }
 
+detach_by_parent_data = (struct detach_by_parent_data) {
+.detach_on_drain = false,
+};
+
 /* Create all involved nodes */
 parent_a = bdrv_new_open_driver(&bdrv_test, "parent-a", BDRV_O_RDWR,
 &error_abort);
@@ -1227,6 +1244,7 @@ static void test_detach_indirect(bool by_parent_cb)
 .child_b = child_b,
 .c = c,
 .by_parent_cb = by_parent_cb,
+.detach_on_drain = true,
 };
 acb = blk_aio_preadv(blk, 0, &qiov, 0, detach_by_parent_aio_cb, NULL);
 g_assert(acb != NULL);
-- 
2.38.1




[PATCH 17/18] block: use co_wrapper_mixed_bdrv_rdlock in functions taking the rdlock

2022-12-07 Thread Kevin Wolf
From: Emanuele Giuseppe Esposito 

Take the rdlock already, before we add the assertions.

All these functions either read the graph recursively, or call
BlockDriver callbacks that will eventually need to be protected by the
graph rdlock.

Do it now to all functions together, because many of these recursively
call each other.

For example, bdrv_co_truncate calls BlockDriver->bdrv_co_truncate, and
some driver callbacks implement their own .bdrv_co_truncate by calling
bdrv_flush inside. So if bdrv_flush asserts but bdrv_truncate does not
take the rdlock yet, the assertion will always fail.

Signed-off-by: Emanuele Giuseppe Esposito 
Signed-off-by: Kevin Wolf 
---
 block/coroutines.h   |  2 +-
 include/block/block-io.h | 53 +++-
 2 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/block/coroutines.h b/block/coroutines.h
index 17da4db963..48e9081aa1 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -71,7 +71,7 @@ nbd_co_do_establish_connection(BlockDriverState *bs, bool blocking,
  * the "I/O or GS" API.
  */
 
-int co_wrapper_mixed
+int co_wrapper_mixed_bdrv_rdlock
 bdrv_common_block_status_above(BlockDriverState *bs,
BlockDriverState *base,
bool include_base,
diff --git a/include/block/block-io.h b/include/block/block-io.h
index 52869ea08e..2ed6214909 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -39,19 +39,24 @@
  * to catch when they are accidentally called by the wrong API.
  */
 
-int co_wrapper_mixed bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
-int64_t bytes,
-BdrvRequestFlags flags);
+int co_wrapper_mixed_bdrv_rdlock
+bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset, int64_t bytes,
+   BdrvRequestFlags flags);
+
 int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags);
-int co_wrapper_mixed bdrv_pread(BdrvChild *child, int64_t offset,
-int64_t bytes, void *buf,
-BdrvRequestFlags flags);
-int co_wrapper_mixed bdrv_pwrite(BdrvChild *child, int64_t offset,
- int64_t bytes, const void *buf,
- BdrvRequestFlags flags);
-int co_wrapper_mixed bdrv_pwrite_sync(BdrvChild *child, int64_t offset,
-  int64_t bytes, const void *buf,
-  BdrvRequestFlags flags);
+
+int co_wrapper_mixed_bdrv_rdlock
+bdrv_pread(BdrvChild *child, int64_t offset, int64_t bytes, void *buf,
+   BdrvRequestFlags flags);
+
+int co_wrapper_mixed_bdrv_rdlock
+bdrv_pwrite(BdrvChild *child, int64_t offset,int64_t bytes,
+const void *buf, BdrvRequestFlags flags);
+
+int co_wrapper_mixed_bdrv_rdlock
+bdrv_pwrite_sync(BdrvChild *child, int64_t offset, int64_t bytes,
+ const void *buf, BdrvRequestFlags flags);
+
 int coroutine_fn bdrv_co_pwrite_sync(BdrvChild *child, int64_t offset,
  int64_t bytes, const void *buf,
  BdrvRequestFlags flags);
@@ -287,22 +292,26 @@ int coroutine_fn bdrv_co_copy_range(BdrvChild *src, int64_t src_offset,
 
 void bdrv_drain(BlockDriverState *bs);
 
-int co_wrapper_mixed
+int co_wrapper_mixed_bdrv_rdlock
 bdrv_truncate(BdrvChild *child, int64_t offset, bool exact,
   PreallocMode prealloc, BdrvRequestFlags flags, Error **errp);
 
-int co_wrapper_mixed bdrv_check(BlockDriverState *bs, BdrvCheckResult *res,
-BdrvCheckMode fix);
+int co_wrapper_mixed_bdrv_rdlock
+bdrv_check(BlockDriverState *bs, BdrvCheckResult *res, BdrvCheckMode fix);
 
 /* Invalidate any cached metadata used by image formats */
-int co_wrapper_mixed bdrv_invalidate_cache(BlockDriverState *bs,
-   Error **errp);
-int co_wrapper_mixed bdrv_flush(BlockDriverState *bs);
-int co_wrapper_mixed bdrv_pdiscard(BdrvChild *child, int64_t offset,
-   int64_t bytes);
-int co_wrapper_mixed
+int co_wrapper_mixed_bdrv_rdlock
+bdrv_invalidate_cache(BlockDriverState *bs, Error **errp);
+
+int co_wrapper_mixed_bdrv_rdlock bdrv_flush(BlockDriverState *bs);
+
+int co_wrapper_mixed_bdrv_rdlock
+bdrv_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes);
+
+int co_wrapper_mixed_bdrv_rdlock
 bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
-int co_wrapper_mixed
+
+int co_wrapper_mixed_bdrv_rdlock
 bdrv_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
 
 /**
-- 
2.38.1




[PATCH 18/18] block: GRAPH_RDLOCK for functions only called by co_wrappers

2022-12-07 Thread Kevin Wolf
The generated coroutine wrappers already take care to take the lock in
the non-coroutine path, and assume that the lock is already taken in the
coroutine path.

The only thing we need to do for the wrapped function is adding the
GRAPH_RDLOCK annotation. Doing so also allows us to mark the
corresponding callbacks in BlockDriver as GRAPH_RDLOCK_PTR.
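
As a rough picture of that division of labour, consider the following
heavily simplified sketch. The toy_* names are hypothetical, and unlike
the real generated wrappers it does not create a coroutine; it only shows
who takes the lock and who merely asserts it.

```c
#include <assert.h>
#include <stdbool.h>

static int toy_readers;        /* stand-in for the graph reader count */
static bool toy_in_coroutine;  /* stand-in for a "running in coroutine" check */

/* Wrapped function: requires the read lock, so it only asserts. */
static int toy_co_flush(void)
{
    assert(toy_readers > 0);
    return 0;
}

/* Simplified shape of a mixed wrapper. */
static int toy_flush(void)
{
    int ret;

    if (toy_in_coroutine) {
        /* Coroutine path: the caller is assumed to hold the lock already. */
        return toy_co_flush();
    }

    /* Non-coroutine path: the wrapper takes and drops the lock itself. */
    toy_readers++;
    ret = toy_co_flush();
    toy_readers--;
    return ret;
}
```

In both paths the wrapped function sees the lock held, which is what
makes it valid to annotate it as requiring the lock.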

Signed-off-by: Kevin Wolf 
---
 block/coroutines.h   | 17 ++---
 include/block/block_int-common.h | 20 +---
 block.c  |  2 ++
 block/io.c   |  2 ++
 4 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/block/coroutines.h b/block/coroutines.h
index 48e9081aa1..2a1e0b3c9d 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -37,9 +37,11 @@
  * the I/O API.
  */
 
-int coroutine_fn bdrv_co_check(BlockDriverState *bs,
-   BdrvCheckResult *res, BdrvCheckMode fix);
-int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp);
+int coroutine_fn GRAPH_RDLOCK
+bdrv_co_check(BlockDriverState *bs, BdrvCheckResult *res, BdrvCheckMode fix);
+
+int coroutine_fn GRAPH_RDLOCK
+bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp);
 
 int coroutine_fn
 bdrv_co_common_block_status_above(BlockDriverState *bs,
@@ -53,10 +55,11 @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
   BlockDriverState **file,
   int *depth);
 
-int coroutine_fn bdrv_co_readv_vmstate(BlockDriverState *bs,
-   QEMUIOVector *qiov, int64_t pos);
-int coroutine_fn bdrv_co_writev_vmstate(BlockDriverState *bs,
-QEMUIOVector *qiov, int64_t pos);
+int coroutine_fn GRAPH_RDLOCK
+bdrv_co_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
+
+int coroutine_fn GRAPH_RDLOCK
+bdrv_co_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
 
 int coroutine_fn
 nbd_co_do_establish_connection(BlockDriverState *bs, bool blocking,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index b1f0d88307..c34c525fa6 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -641,8 +641,8 @@ struct BlockDriver {
 /*
  * Invalidate any cached meta-data.
  */
-void coroutine_fn (*bdrv_co_invalidate_cache)(BlockDriverState *bs,
-  Error **errp);
+void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_invalidate_cache)(
+BlockDriverState *bs, Error **errp);
 
 /*
  * Flushes all data for all layers by calling bdrv_co_flush for underlying
@@ -701,12 +701,11 @@ struct BlockDriver {
  Error **errp);
 BlockStatsSpecific *(*bdrv_get_specific_stats)(BlockDriverState *bs);
 
-int coroutine_fn (*bdrv_save_vmstate)(BlockDriverState *bs,
-  QEMUIOVector *qiov,
-  int64_t pos);
-int coroutine_fn (*bdrv_load_vmstate)(BlockDriverState *bs,
-  QEMUIOVector *qiov,
-  int64_t pos);
+int coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_save_vmstate)(
+BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
+
+int coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_load_vmstate)(
+BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
 
 /* removable device specific */
 bool (*bdrv_is_inserted)(BlockDriverState *bs);
@@ -724,9 +723,8 @@ struct BlockDriver {
  * Returns 0 for completed check, -errno for internal errors.
  * The check results are stored in result.
  */
-int coroutine_fn (*bdrv_co_check)(BlockDriverState *bs,
-  BdrvCheckResult *result,
-  BdrvCheckMode fix);
+int coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_check)(
+BlockDriverState *bs, BdrvCheckResult *result, BdrvCheckMode fix);
 
 void (*bdrv_debug_event)(BlockDriverState *bs, BlkdebugEvent event);
 
diff --git a/block.c b/block.c
index 1a82fd101a..9c2ac757e4 100644
--- a/block.c
+++ b/block.c
@@ -5402,6 +5402,7 @@ int coroutine_fn bdrv_co_check(BlockDriverState *bs,
BdrvCheckResult *res, BdrvCheckMode fix)
 {
 IO_CODE();
+assert_bdrv_graph_readable();
 if (bs->drv == NULL) {
 return -ENOMEDIUM;
 }
@@ -6617,6 +6618,7 @@ int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp)
 IO_CODE();
 
 assert(!(bs->open_flags & BDRV_O_INACTIVE));
+assert_bdrv_graph_readable();
 
 if (bs->drv->bdrv_co_invalidate_cache) {
 bs->drv->bdrv_co_invalidate_cache(bs, &local_err);
diff --git a/block/io.c b/block/io.c
index d160d2e273..d87788dfbb 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2697,6 +2697,7 @@ bdrv_co_readv_vmst

[PATCH 08/18] configure: Enable -Wthread-safety if present

2022-12-07 Thread Kevin Wolf
This enables clang's thread safety analysis (TSA), which we'll use to
statically check the block graph locking.

Signed-off-by: Kevin Wolf 
---
 configure | 1 +
 1 file changed, 1 insertion(+)

diff --git a/configure b/configure
index 26c7bc5154..45d693a241 100755
--- a/configure
+++ b/configure
@@ -1208,6 +1208,7 @@ add_to warn_flags -Wnested-externs
 add_to warn_flags -Wendif-labels
 add_to warn_flags -Wexpansion-to-defined
 add_to warn_flags -Wimplicit-fallthrough=2
+add_to warn_flags -Wthread-safety
 
 nowarn_flags=
 add_to nowarn_flags -Wno-initializer-overrides
-- 
2.38.1




[PATCH 02/18] graph-lock: Introduce a lock to protect block graph operations

2022-12-07 Thread Kevin Wolf
From: Paolo Bonzini 

Block layer graph operations are always run under BQL in the main loop.
This is proved by the assertion qemu_in_main_thread() and its wrapper
macro GLOBAL_STATE_CODE.

However, there are also concurrent coroutines running in other iothreads
that always try to traverse the graph. Currently this is protected
(among various other things) by the AioContext lock, but once this is
removed, we need to make sure that reads do not happen while modifying
the graph.

We distinguish between writer (main loop, under BQL) that modifies the
graph, and readers (all other coroutines running in various AioContext),
that go through the graph edges, reading ->parents and ->children.

The writer (main loop) has "exclusive" access, so it first waits for any
current read to finish, and then prevents incoming ones from entering
while it has the exclusive access.

The readers (coroutines in multiple AioContext) are free to access the
graph as long as the writer is not modifying the graph. In case it is, they
go in a CoQueue and sleep until the writer is done.

If a coroutine changes AioContext, the counters in the original and new
AioContext are left intact, since the writer does not care where the
reader is, but only if there is one.

As a result, some AioContexts might have a negative reader count, to
balance the positive count of the AioContext that took the lock.  This
also means that when an AioContext is deleted it may have a nonzero
reader count. In that case we transfer the count to a global shared
counter so that the writer is always aware of all readers.
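
The counter bookkeeping can be sketched as below. This is a
single-threaded toy with hypothetical toy_* names; it ignores atomics,
memory barriers and the writer side entirely and only shows why the sum
over all counters is what matters.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NUM_CTX 2

static int toy_reader_count[TOY_NUM_CTX];   /* one counter per context */
static int toy_orphaned;                    /* counts of deleted contexts */
static bool toy_ctx_alive[TOY_NUM_CTX] = { true, true };

/* A reader bumps the counter of whatever context it currently runs in. */
static void toy_rdlock(int ctx)
{
    toy_reader_count[ctx]++;
}

static void toy_rdunlock(int ctx)
{
    toy_reader_count[ctx]--;
}

/* Deleting a context moves its (possibly nonzero) count to a shared one. */
static void toy_ctx_delete(int ctx)
{
    toy_orphaned += toy_reader_count[ctx];
    toy_reader_count[ctx] = 0;
    toy_ctx_alive[ctx] = false;
}

/* What the writer looks at: the sum, not any individual counter. */
static int toy_total_readers(void)
{
    int sum = toy_orphaned;

    for (int i = 0; i < TOY_NUM_CTX; i++) {
        if (toy_ctx_alive[i]) {
            sum += toy_reader_count[i];
        }
    }
    return sum;
}
```

A reader that locks in context 0 and unlocks in context 1 leaves +1 in
one place and -1 in the other; the total is still zero, even if context
0 was deleted in between.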

Signed-off-by: Paolo Bonzini 
Signed-off-by: Emanuele Giuseppe Esposito 
Signed-off-by: Kevin Wolf 
---
 include/block/aio.h|   9 ++
 include/block/block_int.h  |   1 +
 include/block/graph-lock.h | 139 
 block/graph-lock.c | 261 +
 block/meson.build  |   1 +
 5 files changed, 411 insertions(+)
 create mode 100644 include/block/graph-lock.h
 create mode 100644 block/graph-lock.c

diff --git a/include/block/aio.h b/include/block/aio.h
index d128558f1d..0f65a3cc9e 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -22,6 +22,7 @@
 #include "qemu/event_notifier.h"
 #include "qemu/thread.h"
 #include "qemu/timer.h"
+#include "block/graph-lock.h"
 
 typedef struct BlockAIOCB BlockAIOCB;
 typedef void BlockCompletionFunc(void *opaque, int ret);
@@ -127,6 +128,14 @@ struct AioContext {
 /* Used by AioContext users to protect from multi-threaded access.  */
 QemuRecMutex lock;
 
+/*
+ * Keep track of readers and writers of the block layer graph.
+ * This is essential to avoid performing additions and removal
+ * of nodes and edges from block graph while some
+ * other thread is traversing it.
+ */
+BdrvGraphRWlock *bdrv_graph;
+
 /* The list of registered AIO handlers.  Protected by ctx->list_lock. */
 AioHandlerList aio_handlers;
 
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 7d50b6bbd1..b35b0138ed 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -26,6 +26,7 @@
 
 #include "block_int-global-state.h"
 #include "block_int-io.h"
+#include "block/graph-lock.h"
 
 /* DO NOT ADD ANYTHING IN HERE. USE ONE OF THE HEADERS INCLUDED ABOVE */
 
diff --git a/include/block/graph-lock.h b/include/block/graph-lock.h
new file mode 100644
index 00..82edb62cfa
--- /dev/null
+++ b/include/block/graph-lock.h
@@ -0,0 +1,139 @@
+/*
+ * Graph lock: rwlock to protect block layer graph manipulations (add/remove
+ * edges and nodes)
+ *
+ *  Copyright (c) 2022 Red Hat
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see .
+ */
+#ifndef GRAPH_LOCK_H
+#define GRAPH_LOCK_H
+
+#include "qemu/osdep.h"
+
+#include "qemu/coroutine.h"
+
+/**
+ * Graph Lock API
+ * This API provides a rwlock used to protect block layer
+ * graph modifications like edge (BdrvChild) and node (BlockDriverState)
+ * addition and removal.
+ * Currently we have 1 writer only, the Main loop, and many
+ * readers, mostly coroutines running in other AioContext thus other threads.
+ *
+ * We distinguish between writer (main loop, under BQL) that modifies the
+ * graph, and readers (all other coroutines running in various AioContext),
+ * that go through the graph edges, reading
+ * BlockDriverState ->parents and->

[PATCH 00/18] block: Introduce a block graph rwlock

2022-12-07 Thread Kevin Wolf
This series supersedes the first half of Emanuele's "Protect the block
layer with a rwlock: part 1". It introduces the basic infrastructure for
protecting the block graph (specifically parent/child links) with a
rwlock. Actually taking the reader lock in all necessary places is left
for future series.

Compared to Emanuele's series, this one adds patches to make use of
clang's Thread Safety Analysis (TSA) feature in order to statically
check at compile time that the places where we assert that we hold the
lock actually do hold it. Once we cover all relevant places, the check
can be extended to verify that all accesses of bs->children and
bs->parents hold the lock.

For reference, here is the more detailed version of our plan in
Emanuele's words from his series:

The aim is to replace the current AioContext lock with much more
fine-grained locks, aimed to protect only specific data. Currently
the AioContext lock is used pretty much everywhere, and it's not
even clear what it is protecting exactly.

The aim of the rwlock is to cover graph modifications: more
precisely, when a BlockDriverState parent or child list is modified
or read, since it can be concurrently accessed by the main loop and
iothreads.

The main assumption is that the main loop is the only one allowed to
perform graph modifications, and so far this has always held in
the current code.

The rwlock is inspired by the cpus-common.c implementation, and aims
to reduce cacheline bouncing by having per-aiocontext counter of
readers.  All details and implementation of the lock are in patch 2.

We distinguish between writer (main loop, under BQL) that modifies
the graph, and readers (all other coroutines running in various
AioContext), that go through the graph edges, reading ->parents
and ->children.  The writer (main loop) has "exclusive" access,
so it first waits for current read to finish, and then prevents
incoming ones from entering while it has the exclusive access.  The
readers (coroutines in multiple AioContext) are free to access the
graph as long as the writer is not modifying the graph.  In case it is,
they go in a CoQueue and sleep until the writer is done.

In this and following series, we try to follow the following locking
pattern:

- bdrv_co_* functions that call BlockDriver callbacks always expect
  the lock to be taken, therefore they assert.

- blk_co_* functions are called from external code outside the block
  layer, which should not have to care about the block layer's
  internal locking. Usually they also call blk_wait_while_drained().
  Therefore they take the lock internally.

The long term goal of this series is to eventually replace the
AioContext lock, so that we can get rid of it once and for all.

Emanuele Giuseppe Esposito (7):
  graph-lock: Implement guard macros
  async: Register/unregister aiocontext in graph lock list
  block: wrlock in bdrv_replace_child_noperm
  block: remove unnecessary assert_bdrv_graph_writable()
  block: assert that graph read and writes are performed correctly
  block-coroutine-wrapper.py: introduce annotations that take the graph
rdlock
  block: use co_wrapper_mixed_bdrv_rdlock in functions taking the rdlock

Kevin Wolf (10):
  block: Factor out bdrv_drain_all_begin_nopoll()
  Import clang-tsa.h
  clang-tsa: Add TSA_ASSERT() macro
  clang-tsa: Add macros for shared locks
  configure: Enable -Wthread-safety if present
  test-bdrv-drain: Fix incorrect drain assumptions
  block: Fix locking in external_snapshot_prepare()
  graph-lock: TSA annotations for lock/unlock functions
  Mark assert_bdrv_graph_readable/writable() GRAPH_RD/WRLOCK
  block: GRAPH_RDLOCK for functions only called by co_wrappers

Paolo Bonzini (1):
  graph-lock: Introduce a lock to protect block graph operations

 configure  |   1 +
 block/coroutines.h |  19 +-
 include/block/aio.h|   9 +
 include/block/block-common.h   |   9 +-
 include/block/block-global-state.h |   1 +
 include/block/block-io.h   |  53 +++--
 include/block/block_int-common.h   |  24 +--
 include/block/block_int-global-state.h |  17 --
 include/block/block_int.h  |   1 +
 include/block/graph-lock.h | 280 +
 include/qemu/clang-tsa.h   | 114 ++
 block.c|  24 ++-
 block/graph-lock.c | 275 
 block/io.c |  21 +-
 blockdev.c |   4 +
 stubs/graph-lock.c |  10 +
 tests/unit/test-bdrv-drain.c   |  18 ++
 util/async.c   |   4 +
 scripts/block-coroutine-wrapper.py |  12 ++
 block/meson.build  |   1 +
 stubs/meson.build  |   1 +
 21 files changed, 820 insertions(+), 78 deletions(-)
 create 

[PATCH 01/18] block: Factor out bdrv_drain_all_begin_nopoll()

2022-12-07 Thread Kevin Wolf
Provide a separate function that just quiesces the users of a node to
prevent new requests from coming in, but without waiting for the already
in-flight I/O to complete.

This function can be used in contexts where polling is not allowed.
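
The split can be pictured with the following toy model. The names are
hypothetical and the "poll" just completes one fake request per
iteration; the point is only that the nopoll half never waits.

```c
#include <assert.h>

static int toy_quiesce_counter;  /* > 0 means no new requests are accepted */
static int toy_in_flight;        /* requests that were already running */

/* Quiesce only: never waits, so it is usable where polling is forbidden. */
static void toy_drain_begin_nopoll(void)
{
    toy_quiesce_counter++;
}

/* One poll iteration; in this toy model it completes one request. */
static void toy_poll_once(void)
{
    if (toy_in_flight > 0) {
        toy_in_flight--;
    }
}

/* Full drain: quiesce first, then wait for in-flight requests. */
static void toy_drain_begin(void)
{
    toy_drain_begin_nopoll();
    while (toy_in_flight > 0) {
        toy_poll_once();
    }
}
```

A caller of the nopoll variant has to ensure by other means that the
remaining in-flight requests complete before relying on a fully drained
state.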

Signed-off-by: Kevin Wolf 
---
 include/block/block-global-state.h |  1 +
 block/io.c | 19 +--
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
index 1f8b54f2df..b0a3cfe6b8 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -152,6 +152,7 @@ int bdrv_inactivate_all(void);
 int bdrv_flush_all(void);
 void bdrv_close_all(void);
 void bdrv_drain_all_begin(void);
+void bdrv_drain_all_begin_nopoll(void);
 void bdrv_drain_all_end(void);
 void bdrv_drain_all(void);
 
diff --git a/block/io.c b/block/io.c
index fb..d160d2e273 100644
--- a/block/io.c
+++ b/block/io.c
@@ -466,16 +466,11 @@ static bool bdrv_drain_all_poll(void)
  * NOTE: no new block jobs or BlockDriverStates can be created between
  * the bdrv_drain_all_begin() and bdrv_drain_all_end() calls.
  */
-void bdrv_drain_all_begin(void)
+void bdrv_drain_all_begin_nopoll(void)
 {
 BlockDriverState *bs = NULL;
 GLOBAL_STATE_CODE();
 
-if (qemu_in_coroutine()) {
-bdrv_co_yield_to_drain(NULL, true, NULL, true);
-return;
-}
-
 /*
  * bdrv queue is managed by record/replay,
  * waiting for finishing the I/O requests may
@@ -500,6 +495,18 @@ void bdrv_drain_all_begin(void)
 bdrv_do_drained_begin(bs, NULL, false);
 aio_context_release(aio_context);
 }
+}
+
+void bdrv_drain_all_begin(void)
+{
+BlockDriverState *bs = NULL;
+
+if (qemu_in_coroutine()) {
+bdrv_co_yield_to_drain(NULL, true, NULL, true);
+return;
+}
+
+bdrv_drain_all_begin_nopoll();
 
 /* Now poll the in-flight requests */
 AIO_WAIT_WHILE(NULL, bdrv_drain_all_poll());
-- 
2.38.1




[PATCH 13/18] block: assert that graph read and writes are performed correctly

2022-12-07 Thread Kevin Wolf
From: Emanuele Giuseppe Esposito 

Remove the old assert_bdrv_graph_writable, and replace it with
the new version using the graph-lock API.

See the function documentation for more information.

Signed-off-by: Emanuele Giuseppe Esposito 
Signed-off-by: Kevin Wolf 
---
 include/block/block_int-global-state.h | 17 -
 include/block/graph-lock.h | 15 +++
 block.c|  4 ++--
 block/graph-lock.c | 11 +++
 4 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/include/block/block_int-global-state.h b/include/block/block_int-global-state.h
index b49f4eb35b..2f0993f6e9 100644
--- a/include/block/block_int-global-state.h
+++ b/include/block/block_int-global-state.h
@@ -310,21 +310,4 @@ void bdrv_remove_aio_context_notifier(BlockDriverState *bs,
  */
 void bdrv_drain_all_end_quiesce(BlockDriverState *bs);
 
-/**
- * Make sure that the function is running under both drain and BQL.
- * The latter protects from concurrent writings
- * from the GS API, while the former prevents concurrent reads
- * from I/O.
- */
-static inline void assert_bdrv_graph_writable(BlockDriverState *bs)
-{
-/*
- * TODO: this function is incomplete. Because the users of this
- * assert lack the necessary drains, check only for BQL.
- * Once the necessary drains are added,
- * assert also for qatomic_read(&bs->quiesce_counter) > 0
- */
-assert(qemu_in_main_thread());
-}
-
 #endif /* BLOCK_INT_GLOBAL_STATE_H */
diff --git a/include/block/graph-lock.h b/include/block/graph-lock.h
index b27d8a5fb1..85e8a53b73 100644
--- a/include/block/graph-lock.h
+++ b/include/block/graph-lock.h
@@ -135,6 +135,21 @@ void coroutine_fn bdrv_graph_co_rdunlock(void);
 void bdrv_graph_rdlock_main_loop(void);
 void bdrv_graph_rdunlock_main_loop(void);
 
+/*
+ * assert_bdrv_graph_readable:
+ * Make sure that the reader is either the main loop,
+ * or there is at least a reader holding the rdlock.
+ * In this way an incoming writer is aware of the read and waits.
+ */
+void assert_bdrv_graph_readable(void);
+
+/*
+ * assert_bdrv_graph_writable:
+ * Make sure that the writer is the main loop and has set @has_writer,
+ * so that incoming readers will pause.
+ */
+void assert_bdrv_graph_writable(void);
+
 typedef struct GraphLockable { } GraphLockable;
 
 /*
diff --git a/block.c b/block.c
index bdffadcdaa..ff53b41af3 100644
--- a/block.c
+++ b/block.c
@@ -1406,7 +1406,7 @@ static void bdrv_child_cb_attach(BdrvChild *child)
 {
 BlockDriverState *bs = child->opaque;
 
-assert_bdrv_graph_writable(bs);
+assert_bdrv_graph_writable();
 QLIST_INSERT_HEAD(&bs->children, child, next);
 if (bs->drv->is_filter || (child->role & BDRV_CHILD_FILTERED)) {
 /*
@@ -1452,7 +1452,7 @@ static void bdrv_child_cb_detach(BdrvChild *child)
 bdrv_backing_detach(child);
 }
 
-assert_bdrv_graph_writable(bs);
+assert_bdrv_graph_writable();
 QLIST_REMOVE(child, next);
 if (child == bs->backing) {
 assert(child != bs->file);
diff --git a/block/graph-lock.c b/block/graph-lock.c
index e033c6d9ac..c4d9d2c274 100644
--- a/block/graph-lock.c
+++ b/block/graph-lock.c
@@ -259,3 +259,14 @@ void bdrv_graph_rdunlock_main_loop(void)
 GLOBAL_STATE_CODE();
 assert(!qemu_in_coroutine());
 }
+
+void assert_bdrv_graph_readable(void)
+{
+assert(qemu_in_main_thread() || reader_count());
+}
+
+void assert_bdrv_graph_writable(void)
+{
+assert(qemu_in_main_thread());
+assert(qatomic_read(&has_writer));
+}
-- 
2.38.1




[PATCH 05/18] Import clang-tsa.h

2022-12-07 Thread Kevin Wolf
This defines macros that allow clang to perform Thread Safety Analysis
based on function and variable annotations that specify the locking
rules. On non-clang compilers, the annotations are ignored.
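
For readers unfamiliar with TSA, a small usage sketch follows. The macro
definitions mirror the ones in the imported header; the toy_* names and
the pthread-based lock are assumptions made up for this example and are
not part of the series. The lock/unlock helpers are marked TSA_NO_TSA
because their bodies call unannotated pthread functions.

```c
#include <assert.h>
#include <pthread.h>

#ifdef __clang__
#  define TSA(x) __attribute__((x))
#else
#  define TSA(x) /* no-op on other compilers */
#endif
#define TSA_CAPABILITY(x) TSA(capability(x))
#define TSA_GUARDED_BY(x) TSA(guarded_by(x))
#define TSA_REQUIRES(...) TSA(requires_capability(__VA_ARGS__))
#define TSA_ACQUIRE(...)  TSA(acquire_capability(__VA_ARGS__))
#define TSA_RELEASE(...)  TSA(release_capability(__VA_ARGS__))
#define TSA_NO_TSA        TSA(no_thread_safety_analysis)

/* A lock type that the analysis can track. */
typedef pthread_mutex_t TSA_CAPABILITY("mutex") toy_mutex;

static toy_mutex toy_lock = PTHREAD_MUTEX_INITIALIZER;
static int toy_counter TSA_GUARDED_BY(toy_lock);

/* Annotations go on the declarations; the definitions follow below. */
static void toy_lock_acquire(void) TSA_ACQUIRE(toy_lock) TSA_NO_TSA;
static void toy_lock_release(void) TSA_RELEASE(toy_lock) TSA_NO_TSA;
static void toy_counter_inc(void) TSA_REQUIRES(toy_lock);

static void toy_lock_acquire(void)
{
    pthread_mutex_lock(&toy_lock);
}

static void toy_lock_release(void)
{
    pthread_mutex_unlock(&toy_lock);
}

/* Callers must hold toy_lock; clang can check this at compile time. */
static void toy_counter_inc(void)
{
    toy_counter++;
}

/* Takes the lock itself, so it may be called without holding it. */
static int toy_counter_get(void)
{
    int val;

    toy_lock_acquire();
    val = toy_counter;
    toy_lock_release();
    return val;
}
```

When built with clang and -Wthread-safety, calling toy_counter_inc()
without first calling toy_lock_acquire() is expected to produce a
warning; with other compilers the code compiles unchanged and unchecked.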

Imported tsa.h from the original repository with the pthread_mutex_t
wrapper removed:

https://github.com/jhi/clang-thread-safety-analysis-for-c.git

Signed-off-by: Kevin Wolf 
---
 include/qemu/clang-tsa.h | 101 +++
 1 file changed, 101 insertions(+)
 create mode 100644 include/qemu/clang-tsa.h

diff --git a/include/qemu/clang-tsa.h b/include/qemu/clang-tsa.h
new file mode 100644
index 00..0a3361dfc8
--- /dev/null
+++ b/include/qemu/clang-tsa.h
@@ -0,0 +1,101 @@
+#ifndef CLANG_TSA_H
+#define CLANG_TSA_H
+
+/*
+ * Copyright 2018 Jarkko Hietaniemi 
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining
+ * a copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without
+ * limitation the rights to use, copy, modify, merge, publish,
+ * distribute, sublicense, and/or sell copies of the Software, and to
+ * permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be
+ * included in all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+ * IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+ * CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+ * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+ * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+ */
+
+/* http://clang.llvm.org/docs/ThreadSafetyAnalysis.html
+ *
+ * TSA is available since clang 3.6-ish.
+ */
+#ifdef __clang__
+#  define TSA(x)   __attribute__((x))
+#else
+#  define TSA(x)   /* No TSA, make TSA attributes no-ops. */
+#endif
+
+/* TSA_CAPABILITY() is used to annotate typedefs:
+ *
+ * typedef pthread_mutex_t TSA_CAPABILITY("mutex") tsa_mutex;
+ */
+#define TSA_CAPABILITY(x) TSA(capability(x))
+
+/* TSA_GUARDED_BY() is used to annotate global variables,
+ * the data is guarded:
+ *
+ * Foo foo TSA_GUARDED_BY(mutex);
+ */
+#define TSA_GUARDED_BY(x) TSA(guarded_by(x))
+
+/* TSA_PT_GUARDED_BY() is used to annotate global pointers, the data
+ * behind the pointer is guarded.
+ *
+ * Foo* ptr TSA_PT_GUARDED_BY(mutex);
+ */
+#define TSA_PT_GUARDED_BY(x) TSA(pt_guarded_by(x))
+
+/* The TSA_REQUIRES() is used to annotate functions: the caller of the
+ * function MUST hold the resource, the function will NOT release it.
+ *
+ * More than one mutex may be specified, comma-separated.
+ *
+ * void Foo(void) TSA_REQUIRES(mutex);
+ */
+#define TSA_REQUIRES(...) TSA(requires_capability(__VA_ARGS__))
+
+/* TSA_EXCLUDES() is used to annotate functions: the caller of the
+ * function MUST NOT hold resource, the function first acquires the
+ * resource, and then releases it.
+ *
+ * More than one mutex may be specified, comma-separated.
+ *
+ * void Foo(void) TSA_EXCLUDES(mutex);
+ */
+#define TSA_EXCLUDES(...) TSA(locks_excluded(__VA_ARGS__))
+
+/* TSA_ACQUIRE() is used to annotate functions: the caller of the
+ * function MUST NOT hold the resource, the function will acquire the
+ * resource, but NOT release it.
+ *
+ * More than one mutex may be specified, comma-separated.
+ *
+ * void Foo(void) TSA_ACQUIRE(mutex);
+ */
+#define TSA_ACQUIRE(...) TSA(acquire_capability(__VA_ARGS__))
+
+/* TSA_RELEASE() is used to annotate functions: the caller of the
+ * function MUST hold the resource, but the function will then release it.
+ *
+ * More than one mutex may be specified, comma-separated.
+ *
+ * void Foo(void) TSA_RELEASE(mutex);
+ */
+#define TSA_RELEASE(...) TSA(release_capability(__VA_ARGS__))
+
+/* TSA_NO_TSA is used to annotate functions.  Use only when you need to.
+ *
+ * void Foo(void) TSA_NO_TSA;
+ */
+#define TSA_NO_TSA TSA(no_thread_safety_analysis)
+
+#endif /* #ifndef CLANG_TSA_H */
-- 
2.38.1
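
For readers who have not used these macros before, here is a minimal sketch of how they fit together. It is not part of the patch: DemoMutex, demo_mu, demo_counter and the demo_* functions are made-up names, and the macro definitions are repeated from the header above so the fragment compiles on its own. With a non-clang compiler the annotations expand to nothing; with clang and -Wthread-safety, an unlocked access to demo_counter should be reported.

```c
#include <assert.h>
#include <pthread.h>

#ifdef __clang__
#  define TSA(x)   __attribute__((x))
#else
#  define TSA(x)   /* no-op outside clang */
#endif
#define TSA_CAPABILITY(x) TSA(capability(x))
#define TSA_GUARDED_BY(x) TSA(guarded_by(x))
#define TSA_REQUIRES(...) TSA(requires_capability(__VA_ARGS__))
#define TSA_ACQUIRE(...)  TSA(acquire_capability(__VA_ARGS__))
#define TSA_RELEASE(...)  TSA(release_capability(__VA_ARGS__))
#define TSA_NO_TSA        TSA(no_thread_safety_analysis)

/* Hypothetical wrapper type: the capability attribute goes on the struct. */
typedef struct TSA_CAPABILITY("mutex") DemoMutex {
    pthread_mutex_t m;
} DemoMutex;

DemoMutex demo_mu = { PTHREAD_MUTEX_INITIALIZER };
int demo_counter TSA_GUARDED_BY(demo_mu);

/* The lock/unlock bodies are opaque to the analysis, hence TSA_NO_TSA. */
void TSA_ACQUIRE(demo_mu) TSA_NO_TSA demo_lock(void)
{
    int rc = pthread_mutex_lock(&demo_mu.m);
    assert(rc == 0);
    (void)rc;
}

void TSA_RELEASE(demo_mu) TSA_NO_TSA demo_unlock(void)
{
    int rc = pthread_mutex_unlock(&demo_mu.m);
    assert(rc == 0);
    (void)rc;
}

/* The caller must already hold demo_mu. */
void TSA_REQUIRES(demo_mu) demo_bump_locked(void)
{
    demo_counter++;
}

/* Takes the lock, updates the guarded counter, returns the new value. */
int demo_bump(void)
{
    int v;

    demo_lock();
    demo_bump_locked();
    v = demo_counter;
    demo_unlock();
    return v;
}
```

Calling demo_bump() twice returns 1 and then 2. Calling demo_bump_locked() without demo_lock() first is the kind of mistake the annotations are meant to catch at compile time.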




[PATCH 14/18] graph-lock: TSA annotations for lock/unlock functions

2022-12-07 Thread Kevin Wolf
Signed-off-by: Kevin Wolf 
---
 include/block/graph-lock.h | 80 +-
 block/graph-lock.c |  3 ++
 2 files changed, 73 insertions(+), 10 deletions(-)

diff --git a/include/block/graph-lock.h b/include/block/graph-lock.h
index 85e8a53b73..50b7e7b1b6 100644
--- a/include/block/graph-lock.h
+++ b/include/block/graph-lock.h
@@ -21,6 +21,7 @@
 #define GRAPH_LOCK_H
 
 #include "qemu/osdep.h"
+#include "qemu/clang-tsa.h"
 
 #include "qemu/coroutine.h"
 
@@ -57,6 +58,35 @@
  */
 typedef struct BdrvGraphRWlock BdrvGraphRWlock;
 
+/* Dummy lock object to use for Thread Safety Analysis (TSA) */
+typedef struct TSA_CAPABILITY("graph-lock") BdrvGraphLock {
+} BdrvGraphLock;
+
+extern BdrvGraphLock graph_lock;
+
+/*
+ * clang doesn't check consistency in locking annotations between forward
+ * declarations and the function definition. Having the annotation on the
+ * definition, but not the declaration in a header file, may give the reader
+ * a false sense of security because the condition actually remains unchecked
+ * for callers in other source files.
+ *
+ * Therefore, as a convention, for public functions, GRAPH_RDLOCK and
+ * GRAPH_WRLOCK annotations should be present only in the header file.
+ */
+#define GRAPH_WRLOCK TSA_REQUIRES(graph_lock)
+#define GRAPH_RDLOCK TSA_REQUIRES_SHARED(graph_lock)
+
+/*
+ * TSA annotations are not part of function types, so checks are defeated when
+ * using a function pointer. As a workaround, annotate function pointers with
+ * this macro that will require that the lock is at least taken while reading
+ * the pointer. In most cases this is equivalent to actually protecting the
+ * function call.
+ */
+#define GRAPH_RDLOCK_PTR TSA_GUARDED_BY(graph_lock)
+#define GRAPH_WRLOCK_PTR TSA_GUARDED_BY(graph_lock)
+
 /*
  * register_aiocontext:
  * Add AioContext @ctx to the list of AioContext.
@@ -85,14 +115,14 @@ void unregister_aiocontext(AioContext *ctx);
  * This function polls. Callers must not hold the lock of any AioContext other
  * than the current one.
  */
-void bdrv_graph_wrlock(void);
+void bdrv_graph_wrlock(void) TSA_ACQUIRE(graph_lock) TSA_NO_TSA;
 
 /*
  * bdrv_graph_wrunlock:
  * Write finished, reset global has_writer to 0 and restart
  * all readers that are waiting.
  */
-void bdrv_graph_wrunlock(void);
+void bdrv_graph_wrunlock(void) TSA_RELEASE(graph_lock) TSA_NO_TSA;
 
 /*
  * bdrv_graph_co_rdlock:
@@ -116,7 +146,8 @@ void bdrv_graph_wrunlock(void);
  * loop) to take it and wait that the coroutine ends, so that
  * we always signal that a reader is running.
  */
-void coroutine_fn bdrv_graph_co_rdlock(void);
+void coroutine_fn TSA_ACQUIRE_SHARED(graph_lock) TSA_NO_TSA
+bdrv_graph_co_rdlock(void);
 
 /*
  * bdrv_graph_rdunlock:
@@ -124,7 +155,8 @@ void coroutine_fn bdrv_graph_co_rdlock(void);
  * If the writer is waiting for reads to finish (has_writer == 1), signal
  * the writer that we are done via aio_wait_kick() to let it continue.
  */
-void coroutine_fn bdrv_graph_co_rdunlock(void);
+void coroutine_fn TSA_RELEASE_SHARED(graph_lock) TSA_NO_TSA
+bdrv_graph_co_rdunlock(void);
 
 /*
  * bdrv_graph_rd{un}lock_main_loop:
@@ -132,8 +164,11 @@ void coroutine_fn bdrv_graph_co_rdunlock(void);
  * in the main loop. It is just asserting that we are not
  * in a coroutine and in GLOBAL_STATE_CODE.
  */
-void bdrv_graph_rdlock_main_loop(void);
-void bdrv_graph_rdunlock_main_loop(void);
+void TSA_ACQUIRE_SHARED(graph_lock) TSA_NO_TSA
+bdrv_graph_rdlock_main_loop(void);
+
+void TSA_RELEASE_SHARED(graph_lock) TSA_NO_TSA
+bdrv_graph_rdunlock_main_loop(void);
 
 /*
  * assert_bdrv_graph_readable:
@@ -150,6 +185,17 @@ void assert_bdrv_graph_readable(void);
  */
 void assert_bdrv_graph_writable(void);
 
+/*
+ * Calling this function tells TSA that we know that the lock is effectively
+ * taken even though we cannot prove it (yet) with GRAPH_RDLOCK. This can be
+ * useful in intermediate stages of a conversion to using the GRAPH_RDLOCK
+ * macro.
+ */
+static inline void TSA_ASSERT_SHARED(graph_lock) TSA_NO_TSA
+assume_graph_lock(void)
+{
+}
+
 typedef struct GraphLockable { } GraphLockable;
 
 /*
@@ -159,13 +205,21 @@ typedef struct GraphLockable { } GraphLockable;
  */
 #define GML_OBJ_() (&(GraphLockable) { })
 
-static inline GraphLockable *graph_lockable_auto_lock(GraphLockable *x)
+/*
+ * This is not marked as TSA_ACQUIRE() because TSA doesn't understand the
+ * cleanup attribute and would therefore complain that the graph is never
+ * unlocked. TSA_ASSERT() makes sure that the following calls know that we
+ * hold the lock while unlocking is left unchecked.
+ */
+static inline GraphLockable * TSA_ASSERT(graph_lock) TSA_NO_TSA
+graph_lockable_auto_lock(GraphLockable *x)
 {
 bdrv_graph_co_rdlock();
 return x;
 }
 
-static inline void graph_lockable_auto_unlock(GraphLockable *x)
+static inline void TSA_NO_TSA
+graph_lockable_auto_unlock(GraphLockable *x)
 {
 bdrv_graph_co_rdunlock();
 }
@@ -195,14 +249,20 @@ typedef struct G

[PATCH 15/18] Mark assert_bdrv_graph_readable/writable() GRAPH_RD/WRLOCK

2022-12-07 Thread Kevin Wolf
Signed-off-by: Kevin Wolf 
---
 include/block/block_int-common.h | 4 ++--
 include/block/graph-lock.h   | 4 ++--
 block.c  | 4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index a6bc6b7fe9..b1f0d88307 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -898,8 +898,8 @@ struct BdrvChildClass {
 void (*activate)(BdrvChild *child, Error **errp);
 int (*inactivate)(BdrvChild *child);
 
-void (*attach)(BdrvChild *child);
-void (*detach)(BdrvChild *child);
+void GRAPH_WRLOCK_PTR (*attach)(BdrvChild *child);
+void GRAPH_WRLOCK_PTR (*detach)(BdrvChild *child);
 
 /*
  * Notifies the parent that the filename of its child has changed (e.g.
diff --git a/include/block/graph-lock.h b/include/block/graph-lock.h
index 50b7e7b1b6..0c66386167 100644
--- a/include/block/graph-lock.h
+++ b/include/block/graph-lock.h
@@ -176,14 +176,14 @@ bdrv_graph_rdunlock_main_loop(void);
  * or there is at least a reader helding the rdlock.
  * In this way an incoming writer is aware of the read and waits.
  */
-void assert_bdrv_graph_readable(void);
+void GRAPH_RDLOCK assert_bdrv_graph_readable(void);
 
 /*
  * assert_bdrv_graph_writable:
  * Make sure that the writer is the main loop and has set @has_writer,
  * so that incoming readers will pause.
  */
-void assert_bdrv_graph_writable(void);
+void GRAPH_WRLOCK assert_bdrv_graph_writable(void);
 
 /*
  * Calling this function tells TSA that we know that the lock is effectively
diff --git a/block.c b/block.c
index ff53b41af3..1a82fd101a 100644
--- a/block.c
+++ b/block.c
@@ -1402,7 +1402,7 @@ static void bdrv_inherited_options(BdrvChildRole role, bool parent_is_format,
 *child_flags = flags;
 }
 
-static void bdrv_child_cb_attach(BdrvChild *child)
+static void GRAPH_WRLOCK bdrv_child_cb_attach(BdrvChild *child)
 {
 BlockDriverState *bs = child->opaque;
 
@@ -1444,7 +1444,7 @@ static void bdrv_child_cb_attach(BdrvChild *child)
 }
 }
 
-static void bdrv_child_cb_detach(BdrvChild *child)
+static void GRAPH_WRLOCK bdrv_child_cb_detach(BdrvChild *child)
 {
 BlockDriverState *bs = child->opaque;
 
-- 
2.38.1




[PATCH 06/18] clang-tsa: Add TSA_ASSERT() macro

2022-12-07 Thread Kevin Wolf
Signed-off-by: Kevin Wolf 
---
 include/qemu/clang-tsa.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/qemu/clang-tsa.h b/include/qemu/clang-tsa.h
index 0a3361dfc8..211ee0ae73 100644
--- a/include/qemu/clang-tsa.h
+++ b/include/qemu/clang-tsa.h
@@ -98,4 +98,13 @@
  */
 #define TSA_NO_TSA TSA(no_thread_safety_analysis)
 
+/*
+ * TSA_ASSERT() is used to annotate functions: This function will assert that
+ * the lock is held. When it returns, the caller of the function is assumed to
+ * already hold the resource.
+ *
+ * More than one mutex may be specified, comma-separated.
+ */
+#define TSA_ASSERT(...) TSA(assert_capability(__VA_ARGS__))
+
 #endif /* #ifndef CLANG_TSA_H */
-- 
2.38.1
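
A small sketch of where TSA_ASSERT() can help, under the assumption that some code path (here a callback) runs with the lock held but the analysis cannot see that. ToyLock and the toy_* names are invented for illustration, and the "lock" is only a flag so that the runtime assertion has something to check.

```c
#include <assert.h>
#include <stdbool.h>

#ifdef __clang__
#  define TSA(x)   __attribute__((x))
#else
#  define TSA(x)   /* no-op outside clang */
#endif
#define TSA_CAPABILITY(x) TSA(capability(x))
#define TSA_REQUIRES(...) TSA(requires_capability(__VA_ARGS__))
#define TSA_ACQUIRE(...)  TSA(acquire_capability(__VA_ARGS__))
#define TSA_RELEASE(...)  TSA(release_capability(__VA_ARGS__))
#define TSA_ASSERT(...)   TSA(assert_capability(__VA_ARGS__))
#define TSA_NO_TSA        TSA(no_thread_safety_analysis)

/* Toy single-threaded lock: a flag is enough for the illustration. */
typedef struct TSA_CAPABILITY("toy-lock") ToyLock {
    bool held;
} ToyLock;

ToyLock toy_lock;
int toy_state;

void TSA_ACQUIRE(toy_lock) TSA_NO_TSA toy_lock_take(void)
{
    assert(!toy_lock.held);
    toy_lock.held = true;
}

void TSA_RELEASE(toy_lock) TSA_NO_TSA toy_lock_drop(void)
{
    assert(toy_lock.held);
    toy_lock.held = false;
}

/* Checks at runtime; after it returns, TSA treats the lock as held. */
void TSA_ASSERT(toy_lock) TSA_NO_TSA toy_assert_locked(void)
{
    assert(toy_lock.held);
}

void TSA_REQUIRES(toy_lock) toy_update(int v)
{
    toy_state = v;
}

/* Not annotated itself, so TSA cannot prove the lock is held here. */
void toy_callback(int v)
{
    toy_assert_locked();
    toy_update(v);          /* accepted because of the assertion above */
}

int toy_run(int v)
{
    toy_lock_take();
    toy_callback(v);
    toy_lock_drop();
    return toy_state;
}
```

toy_run(7) returns 7 and leaves the lock released. Removing the toy_assert_locked() call should make clang complain about toy_update() being called without the capability.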




[PATCH 16/18] block-coroutine-wrapper.py: introduce annotations that take the graph rdlock

2022-12-07 Thread Kevin Wolf
From: Emanuele Giuseppe Esposito 

Add the co_wrapper_bdrv_rdlock and co_wrapper_mixed_bdrv_rdlock options to
the block-coroutine-wrapper.py script.

This "_bdrv_rdlock" option takes and releases the graph rdlock when a
coroutine function is created.

This means that when used together with "_mixed", the function marked
with co_wrapper_mixed_bdrv_rdlock will support both the coroutine and
non-coroutine cases, and in the latter case it will create a coroutine
that takes and releases the rdlock. When called from a coroutine, the
caller must already hold the graph lock.

Example:
void co_wrapper_mixed_bdrv_rdlock bdrv_f1();

Becomes

static void bdrv_co_enter_f1()
{
    bdrv_graph_co_rdlock();
    bdrv_co_function();
    bdrv_graph_co_rdunlock();
}

void bdrv_f1()
{
    if (qemu_in_coroutine()) {
        assume_graph_lock();
        bdrv_co_function();
    } else {
        qemu_co_enter(bdrv_co_enter_f1);
        ...
    }
}

When used alone, the function will not work in coroutine context, and
when called in non-coroutine context it will create a new coroutine that
takes care of taking and releasing the rdlock automatically.

Example:
void co_wrapper_bdrv_rdlock bdrv_f1();

Becomes

static void bdrv_co_enter_f1()
{
    bdrv_graph_co_rdlock();
    bdrv_co_function();
    bdrv_graph_co_rdunlock();
}

void bdrv_f1()
{
    assert(!qemu_in_coroutine());
    qemu_co_enter(bdrv_co_enter_f1);
    ...
}

About their usage:
- co_wrapper does not take the rdlock, so it can also be used outside
  the block layer.
- co_wrapper_mixed will be used by many blk_* functions, since the
  coroutine function needs to call blk_wait_while_drained() and
  the rdlock *must* be taken afterwards, otherwise it's a deadlock.
  In the future this annotation will go away, and blk_* will use
  co_wrapper directly.
- co_wrapper_bdrv_rdlock will be used by BlockDriver callbacks, ideally
  by all of them in the future.
- co_wrapper_mixed_bdrv_rdlock will be used by the remaining functions
  that are still called from both coroutine and non-coroutine context. In
  the future this annotation will go away, as we will split such mixed
  functions.

Signed-off-by: Emanuele Giuseppe Esposito 
Signed-off-by: Kevin Wolf 
---
 include/block/block-common.h   |  9 -
 scripts/block-coroutine-wrapper.py | 12 
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/block/block-common.h b/include/block/block-common.h
index 6cf603ab06..4749c46a5e 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -40,14 +40,21 @@
  *
  * Usage: read docs/devel/block-coroutine-wrapper.rst
  *
- * There are 2 kind of specifiers:
+ * There are 4 kind of specifiers:
  * - co_wrapper functions can be called by only non-coroutine context, because
  *   they always generate a new coroutine.
  * - co_wrapper_mixed functions can be called by both coroutine and
  *   non-coroutine context.
+ * - co_wrapper_bdrv_rdlock are co_wrapper functions but automatically take and
+ *   release the graph rdlock when creating a new coroutine
+ * - co_wrapper_mixed_bdrv_rdlock are co_wrapper_mixed functions but
+ *   automatically take and release the graph rdlock when creating a new
+ *   coroutine.
  */
 #define co_wrapper
 #define co_wrapper_mixed
+#define co_wrapper_bdrv_rdlock
+#define co_wrapper_mixed_bdrv_rdlock
 
 #include "block/dirty-bitmap.h"
 #include "block/blockjob.h"
diff --git a/scripts/block-coroutine-wrapper.py b/scripts/block-coroutine-wrapper.py
index 71a06e917a..6e087fa0b7 100644
--- a/scripts/block-coroutine-wrapper.py
+++ b/scripts/block-coroutine-wrapper.py
@@ -69,6 +69,7 @@ def __init__(self, return_type: str, name: str, args: str,
 self.struct_name = snake_to_camel(self.name)
 self.args = [ParamDecl(arg.strip()) for arg in args.split(',')]
 self.create_only_co = 'mixed' not in variant
+self.graph_rdlock = 'bdrv_rdlock' in variant
 
 subsystem, subname = self.name.split('_', 1)
 self.co_name = f'{subsystem}_co_{subname}'
@@ -123,10 +124,13 @@ def create_mixed_wrapper(func: FuncDecl) -> str:
 """
 name = func.co_name
 struct_name = func.struct_name
+graph_assume_lock = 'assume_graph_lock();' if func.graph_rdlock else ''
+
 return f"""\
 {func.return_type} {func.name}({ func.gen_list('{decl}') })
 {{
 if (qemu_in_coroutine()) {{
+{graph_assume_lock}
 return {name}({ func.gen_list('{name}') });
 }} else {{
 {struct_name} s = {{
@@ -174,6 +178,12 @@ def gen_wrapper(func: FuncDecl) -> str:
 name = func.co_name
 struct_name = func.struct_name
 
+graph_lock=''
+graph_unlock=''
+if func.graph_rdlock:
+graph_lock='bdrv_graph_co_rdlock();'
+graph_unlock='bdrv_graph_co_rdunlock();'
+
 creation_function = create_mixed_wrapper
 if func.create_only_co:
 creation_function = create_co_wrapper
@@ -193,7 +203,9 @@ def gen_wrapper(func: FuncDecl) -> str:
 {{
 {struct_name} *s = opa

[PATCH 07/18] clang-tsa: Add macros for shared locks

2022-12-07 Thread Kevin Wolf
Signed-off-by: Kevin Wolf 
---
 include/qemu/clang-tsa.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/qemu/clang-tsa.h b/include/qemu/clang-tsa.h
index 211ee0ae73..ba06fb8c92 100644
--- a/include/qemu/clang-tsa.h
+++ b/include/qemu/clang-tsa.h
@@ -62,6 +62,7 @@
  * void Foo(void) TSA_REQUIRES(mutex);
  */
 #define TSA_REQUIRES(...) TSA(requires_capability(__VA_ARGS__))
+#define TSA_REQUIRES_SHARED(...) TSA(requires_shared_capability(__VA_ARGS__))
 
 /* TSA_EXCLUDES() is used to annotate functions: the caller of the
  * function MUST NOT hold resource, the function first acquires the
@@ -82,6 +83,7 @@
  * void Foo(void) TSA_ACQUIRE(mutex);
  */
 #define TSA_ACQUIRE(...) TSA(acquire_capability(__VA_ARGS__))
+#define TSA_ACQUIRE_SHARED(...) TSA(acquire_shared_capability(__VA_ARGS__))
 
 /* TSA_RELEASE() is used to annotate functions: the caller of the
  * function MUST hold the resource, but the function will then release it.
@@ -91,6 +93,7 @@
  * void Foo(void) TSA_RELEASE(mutex);
  */
 #define TSA_RELEASE(...) TSA(release_capability(__VA_ARGS__))
+#define TSA_RELEASE_SHARED(...) TSA(release_shared_capability(__VA_ARGS__))
 
 /* TSA_NO_TSA is used to annotate functions.  Use only when you need to.
  *
@@ -106,5 +109,6 @@
  * More than one mutex may be specified, comma-separated.
  */
 #define TSA_ASSERT(...) TSA(assert_capability(__VA_ARGS__))
+#define TSA_ASSERT_SHARED(...) TSA(assert_shared_capability(__VA_ARGS__))
 
 #endif /* #ifndef CLANG_TSA_H */
-- 
2.38.1
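
To illustrate the difference between the exclusive and the shared variants, here is a toy reader/writer sketch. It is an assumption-laden example, not QEMU code: ToyRWLock and the rw_* names are hypothetical, and the lock is a single-threaded counter that only asserts correct usage. With clang, reading a TSA_GUARDED_BY() variable needs at least the shared capability, while writing it needs the exclusive one.

```c
#include <assert.h>
#include <stdbool.h>

#ifdef __clang__
#  define TSA(x)   __attribute__((x))
#else
#  define TSA(x)   /* no-op outside clang */
#endif
#define TSA_CAPABILITY(x)        TSA(capability(x))
#define TSA_GUARDED_BY(x)        TSA(guarded_by(x))
#define TSA_REQUIRES(...)        TSA(requires_capability(__VA_ARGS__))
#define TSA_REQUIRES_SHARED(...) TSA(requires_shared_capability(__VA_ARGS__))
#define TSA_ACQUIRE(...)         TSA(acquire_capability(__VA_ARGS__))
#define TSA_ACQUIRE_SHARED(...)  TSA(acquire_shared_capability(__VA_ARGS__))
#define TSA_RELEASE(...)         TSA(release_capability(__VA_ARGS__))
#define TSA_RELEASE_SHARED(...)  TSA(release_shared_capability(__VA_ARGS__))
#define TSA_NO_TSA               TSA(no_thread_safety_analysis)

/* Toy rwlock state, enough to assert that it is used consistently. */
typedef struct TSA_CAPABILITY("toy-rwlock") ToyRWLock {
    int readers;
    bool writer;
} ToyRWLock;

ToyRWLock rw_lock;
int rw_value TSA_GUARDED_BY(rw_lock);

void TSA_ACQUIRE_SHARED(rw_lock) TSA_NO_TSA rw_rdlock(void)
{
    assert(!rw_lock.writer);
    rw_lock.readers++;
}

void TSA_RELEASE_SHARED(rw_lock) TSA_NO_TSA rw_rdunlock(void)
{
    assert(rw_lock.readers > 0);
    rw_lock.readers--;
}

void TSA_ACQUIRE(rw_lock) TSA_NO_TSA rw_wrlock(void)
{
    assert(!rw_lock.writer && rw_lock.readers == 0);
    rw_lock.writer = true;
}

void TSA_RELEASE(rw_lock) TSA_NO_TSA rw_wrunlock(void)
{
    assert(rw_lock.writer);
    rw_lock.writer = false;
}

/* Readers only need the shared capability. */
int TSA_REQUIRES_SHARED(rw_lock) rw_read(void)
{
    return rw_value;
}

/* Writers need the exclusive capability. */
void TSA_REQUIRES(rw_lock) rw_write(int v)
{
    rw_value = v;
}

/* Write under the exclusive lock, then read back under the shared lock. */
int rw_roundtrip(int v)
{
    int r;

    rw_wrlock();
    rw_write(v);
    rw_wrunlock();

    rw_rdlock();
    r = rw_read();
    rw_rdunlock();
    return r;
}
```

Calling rw_write() between rw_rdlock() and rw_rdunlock() is the case clang is expected to reject, because only the shared capability is held there.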




Re: [PULL 02/10] pci-bridge/cxl_downstream: Add a CXL switch downstream port

2022-12-07 Thread Jonathan Cameron via
On Mon, 05 Dec 2022 14:59:39 +
Alex Bennée  wrote:

> Jonathan Cameron via  writes:
> 
> > On Mon, 5 Dec 2022 10:54:03 +
> > Jonathan Cameron via  wrote:
> >  
> >> On Sun, 4 Dec 2022 08:23:55 +0100
> >> Thomas Huth  wrote:
> >>   
> >> > On 04/11/2022 07.47, Thomas Huth wrote:
> >> > > On 16/06/2022 18.57, Michael S. Tsirkin wrote:  
> >> > >> From: Jonathan Cameron 
> >> > >>
> >> > >> Emulation of a simple CXL Switch downstream port.
> >> > >> The Device ID has been allocated for this use.
> >> > >>
> >> > >> Signed-off-by: Jonathan Cameron 
> >> > >> Message-Id: <20220616145126.8002-3-jonathan.came...@huawei.com>
> >> > >> Signed-off-by: Michael S. Tsirkin 
> >> > >> Reviewed-by: Michael S. Tsirkin 
> >> > >> Signed-off-by: Michael S. Tsirkin 
> >> > >> ---
> >> > >>   hw/cxl/cxl-host.c  |  43 +-
> >> > >>   hw/pci-bridge/cxl_downstream.c | 249 
> >> > >> +
> >> > >>   hw/pci-bridge/meson.build  |   2 +-
> >> > >>   3 files changed, 291 insertions(+), 3 deletions(-)
> >> > >>   create mode 100644 hw/pci-bridge/cxl_downstream.c  
> >> > > 
> >> > >   Hi!
> >> > > 
> >> > > There is a memory problem somewhere in this new device. I can make 
> >> > > QEMU 
> >> > > crash by running something like this:
> >> > > 
> >> > > $ MALLOC_PERTURB_=59 ./qemu-system-x86_64 -M x-remote \
> >> > >      -display none -monitor stdio
> >> > > QEMU 7.1.50 monitor - type 'help' for more information
> >> > > (qemu) device_add cxl-downstream
> >> > > ./qemu/qom/object.c:1188:5: runtime error: member access within 
> >> > > misaligned 
> >> > > address 0x3b3b3b3b3b3b3b3b for type 'struct Object', which requires 8 
> >> > > byte 
> >> > > alignment
> >> > > 0x3b3b3b3b3b3b3b3b: note: pointer points here
> >> > > 
> >> > > Bus error (core dumped)
> >> > > 
> >> > > Could you have a look if you've got some spare minutes?  
> >> > 
> >> > Ping! Jonathan, Michael, any news on this bug?
> >> > 
> >> > (this breaks one of my local tests, that's why it's annoying for me)
> >> Sorry, my email filters ate your earlier message.
> >> 
> >> Looking into this now. I'll note that it also happens on
> >> device_add xio3130-downstream so not specific to this new device.
> >> 
> >> So far all I've managed to do is track it to something rcu related
> >> as failing in a call to drain_call_rcu() in qmp_device_add()
> >> 
> >> Will continue digging.  
> >
> > Assuming I'm seeing the same thing...
> >
> > Problem is g_free() on the PCIBridge windows: 
> > https://elixir.bootlin.com/qemu/latest/source/hw/pci/pci_bridge.c#L235
> >
> > Is called before we get an rcu_call() to flatview_destroy() as a
> > result of the final call of flatview_unref() in address_space_set_flatview()
> > so we get a use after free.
> >
> > As to what the fix is...  Suggestions welcome!  
> 
> It sounds like this is the wrong place to free the value then. I guess
> the PCI aliases into &w->alias_io() don't get dealt with until RCU
> clean-up time.
> 
> I *think* using g_free_rcu() should be enough to ensure the free occurs
> after the rest of the RCU cleanups but maybe you should only be cleaning
> up the windows at device unrealize time? Is this a dynamic piece of
> memory which gets updated during the lifetime of the device?

There is unfortunately code that swaps it for an updated structure
in pci_bridge_update_mappings()

> 
> If the memory is being cleared with RCU then the access to the base
> pointer should be done with the appropriate qatomic_rcu_[set|read]
> functions.
>

I'm annoyingly snowed under this week with other things, but hopefully
I can get to it in a few days.  Note we are looking at an old problem
here, just one that's happening for an additional device; I'm not sure
if that really affects the urgency of a fix though.

Jonathan
 




Re: [PATCH v15 1/6] qmp: add QMP command x-query-virtio

2022-12-07 Thread Markus Armbruster
Jonah Palmer  writes:

> On 12/2/22 10:21, Markus Armbruster wrote:
>> Philippe Mathieu-Daudé  writes:
>>
>>> On 2/12/22 13:23, Jonah Palmer wrote:
 On 11/30/22 11:16, Philippe Mathieu-Daudé wrote:
> Hi,
>
> On 11/8/22 14:24, Jonah Palmer wrote:
>> From: Laurent Vivier
>>
>> This new command lists all the instances of VirtIODevices with
>> their canonical QOM path and name.
>>
>> [Jonah: @virtio_list duplicates information that already exists in
>>    the QOM composition tree. However, extracting necessary information
>>    from this tree seems to be a bit convoluted.
>>
>>    Instead, we still create our own list of realized virtio devices
>>    but use @qmp_qom_get with the device's canonical QOM path to confirm
>>    that the device exists and is realized. If the device exists but
>>    is actually not realized, then we remove it from our list (for
>>    synchronicity to the QOM composition tree).
>>
>> How could this happen?
>>
>>    Also, the QMP command @x-query-virtio is redundant as @qom-list
>>    and @qom-get are sufficient to search '/machine/' for realized
>>    virtio devices. However, @x-query-virtio is much more convenient
>>    in listing realized virtio devices.]
>>
>> Signed-off-by: Laurent Vivier
>> Signed-off-by: Jonah Palmer
>> ---
>>    hw/virtio/meson.build  |  2 ++
>>    hw/virtio/virtio-stub.c    | 14 
>>    hw/virtio/virtio.c | 44 
>>    include/hw/virtio/virtio.h |  1 +
>>    qapi/meson.build   |  1 +
>>    qapi/qapi-schema.json  |  1 +
>>    qapi/virtio.json   | 68 ++
>>    tests/qtest/qmp-cmd-test.c |  1 +
>>    8 files changed, 132 insertions(+)
>>    create mode 100644 hw/virtio/virtio-stub.c
>>    create mode 100644 qapi/virtio.json
>> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
>> index 5d607aeaa0..bdfa82e9c0 100644
>> --- a/hw/virtio/virtio.c
>> +++ b/hw/virtio/virtio.c
>> @@ -13,12 +13,18 @@
>>      #include "qemu/osdep.h"
>>    #include "qapi/error.h"
>> +#include "qapi/qmp/qdict.h"
>> +#include "qapi/qapi-commands-virtio.h"
>> +#include "qapi/qapi-commands-qom.h"
>> +#include "qapi/qapi-visit-virtio.h"
>> +#include "qapi/qmp/qjson.h"
>>    #include "cpu.h"
>>    #include "trace.h"
>>    #include "qemu/error-report.h"
>>    #include "qemu/log.h"
>>    #include "qemu/main-loop.h"
>>    #include "qemu/module.h"
>> +#include "qom/object_interfaces.h"
>>    #include "hw/virtio/virtio.h"
>>    #include "migration/qemu-file-types.h"
>>    #include "qemu/atomic.h"
>> @@ -29,6 +35,9 @@
>>    #include "sysemu/runstate.h"
>>    #include "standard-headers/linux/virtio_ids.h"
>>    +/* QAPI list of realized VirtIODevices */
>> +static QTAILQ_HEAD(, VirtIODevice) virtio_list;
>> +
>>    /*
>>     * The alignment to use between consumer and producer parts of vring.
>>     * x86 pagesize again. This is the default, used by transports like 
>> PCI
>> @@ -3698,6 +3707,7 @@ static void virtio_device_realize(DeviceState 
>> *dev, Error **errp)
>>    vdev->listener.commit = virtio_memory_listener_commit;
>>    vdev->listener.name = "virtio";
>>    memory_listener_register(&vdev->listener, vdev->dma_as);
>> +    QTAILQ_INSERT_TAIL(&virtio_list, vdev, next);
>>    }
>>      static void virtio_device_unrealize(DeviceState *dev)
>> @@ -3712,6 +3722,7 @@ static void virtio_device_unrealize(DeviceState 
>> *dev)
>>    vdc->unrealize(dev);
>>    }
>>    +    QTAILQ_REMOVE(&virtio_list, vdev, next);
>>    g_free(vdev->bus_name);
>>    vdev->bus_name = NULL;
>>    }
>> @@ -3885,6 +3896,8 @@ static void virtio_device_class_init(ObjectClass 
>> *klass, void *data)
>>    vdc->stop_ioeventfd = virtio_device_stop_ioeventfd_impl;
>>      vdc->legacy_features |= VIRTIO_LEGACY_FEATURES;
>> +
>> +    QTAILQ_INIT(&virtio_list);
>>    }
>>      bool virtio_device_ioeventfd_enabled(VirtIODevice *vdev)
>> @@ -3895,6 +3908,37 @@ bool virtio_device_ioeventfd_enabled(VirtIODevice 
>> *vdev)
>>    return virtio_bus_ioeventfd_enabled(vbus);
>>    }
>>    +VirtioInfoList *qmp_x_query_virtio(Error **errp)
>> +{
>> +    VirtioInfoList *list = NULL;
>> +    VirtioInfoList *node;
>> +    VirtIODevice *vdev;
>> +
>> +    QTAILQ_FOREACH(vdev, &virtio_list, next) {
>> +    DeviceState *dev = DEVICE(vdev);
>> +    Error *err = NULL;
>> +    QObject *obj = qmp_qom_get(dev->canonical_path, "realized", 
>> &err);
>> +
>> +    if (err == NULL) {
>> +    GString *is_realized = qobject_to_json_pretty(obj, true);
>> +  

[PATCH 03/18] graph-lock: Implement guard macros

2022-12-07 Thread Kevin Wolf
From: Emanuele Giuseppe Esposito 

Similar to the implementation in lockable.h, implement macros to
automatically take and release the rdlock.

Create the empty GraphLockable and GraphLockableMainloop structs only to
use them as types for G_DEFINE_AUTOPTR_CLEANUP_FUNC.

Signed-off-by: Emanuele Giuseppe Esposito 
Signed-off-by: Kevin Wolf 
---
 include/block/graph-lock.h | 66 ++
 1 file changed, 66 insertions(+)

diff --git a/include/block/graph-lock.h b/include/block/graph-lock.h
index 82edb62cfa..b27d8a5fb1 100644
--- a/include/block/graph-lock.h
+++ b/include/block/graph-lock.h
@@ -135,5 +135,71 @@ void coroutine_fn bdrv_graph_co_rdunlock(void);
 void bdrv_graph_rdlock_main_loop(void);
 void bdrv_graph_rdunlock_main_loop(void);
 
+typedef struct GraphLockable { } GraphLockable;
+
+/*
+ * In C, compound literals have the lifetime of an automatic variable.
+ * In C++ it would be different, but then C++ wouldn't need QemuLockable
+ * either...
+ */
+#define GML_OBJ_() (&(GraphLockable) { })
+
+static inline GraphLockable *graph_lockable_auto_lock(GraphLockable *x)
+{
+bdrv_graph_co_rdlock();
+return x;
+}
+
+static inline void graph_lockable_auto_unlock(GraphLockable *x)
+{
+bdrv_graph_co_rdunlock();
+}
+
+G_DEFINE_AUTOPTR_CLEANUP_FUNC(GraphLockable, graph_lockable_auto_unlock)
+
+#define WITH_GRAPH_RDLOCK_GUARD_(var) \
+for (g_autoptr(GraphLockable) var = graph_lockable_auto_lock(GML_OBJ_()); \
+ var; \
+ graph_lockable_auto_unlock(var), var = NULL)
+
+#define WITH_GRAPH_RDLOCK_GUARD() \
+WITH_GRAPH_RDLOCK_GUARD_(glue(graph_lockable_auto, __COUNTER__))
+
+#define GRAPH_RDLOCK_GUARD(x)   \
+g_autoptr(GraphLockable)\
+glue(graph_lockable_auto, __COUNTER__) G_GNUC_UNUSED =  \
+graph_lockable_auto_lock(GML_OBJ_())
+
+
+typedef struct GraphLockableMainloop { } GraphLockableMainloop;
+
+/*
+ * In C, compound literals have the lifetime of an automatic variable.
+ * In C++ it would be different, but then C++ wouldn't need QemuLockable
+ * either...
+ */
+#define GMLML_OBJ_() (&(GraphLockableMainloop) { })
+
+static inline GraphLockableMainloop *
+graph_lockable_auto_lock_mainloop(GraphLockableMainloop *x)
+{
+bdrv_graph_rdlock_main_loop();
+return x;
+}
+
+static inline void
+graph_lockable_auto_unlock_mainloop(GraphLockableMainloop *x)
+{
+bdrv_graph_rdunlock_main_loop();
+}
+
+G_DEFINE_AUTOPTR_CLEANUP_FUNC(GraphLockableMainloop,
+  graph_lockable_auto_unlock_mainloop)
+
+#define GRAPH_RDLOCK_GUARD_MAINLOOP(x)  \
+g_autoptr(GraphLockableMainloop)\
+glue(graph_lockable_auto, __COUNTER__) G_GNUC_UNUSED =  \
+graph_lockable_auto_lock_mainloop(GMLML_OBJ_())
+
 #endif /* GRAPH_LOCK_H */
 
-- 
2.38.1
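
The guard macros rely on the cleanup variable attribute that g_autoptr() wraps. A GLib-free sketch of the same mechanism follows, assuming GCC or clang (both support __attribute__((cleanup)) and __COUNTER__). All Demo*/demo_* names are made up, and the "lock" is just a depth counter so that the effect can be observed.

```c
#include <assert.h>
#include <stddef.h>

/* Dummy object, only there to give the cleanup handler a type to act on. */
typedef struct DemoLockable { char unused; } DemoLockable;

int demo_depth;     /* toy stand-in for "the lock is held" */

DemoLockable *demo_auto_lock(DemoLockable *x)
{
    demo_depth++;
    return x;
}

void demo_auto_unlock(DemoLockable *x)
{
    (void)x;
    demo_depth--;
}

/* The cleanup handler receives a pointer to the guarded variable. */
void demo_cleanup(DemoLockable **p)
{
    if (*p) {
        demo_auto_unlock(*p);
    }
}

#define DEMO_GLUE_(a, b) a##b
#define DEMO_GLUE(a, b)  DEMO_GLUE_(a, b)

/* Compound literal with automatic lifetime, as in the patch. */
#define DEMO_OBJ_() (&(DemoLockable) { 0 })

/* Block form: unlocks at the end of the block or on early exit. */
#define WITH_DEMO_GUARD_(var)                                        \
    for (__attribute__((cleanup(demo_cleanup))) DemoLockable *var =  \
             demo_auto_lock(DEMO_OBJ_());                            \
         var;                                                        \
         demo_auto_unlock(var), var = NULL)

#define WITH_DEMO_GUARD() \
    WITH_DEMO_GUARD_(DEMO_GLUE(demo_guard_, __COUNTER__))

/* Scope form: unlocks when the enclosing scope is left. */
#define DEMO_GUARD()                                                 \
    __attribute__((cleanup(demo_cleanup), unused))                   \
    DemoLockable *DEMO_GLUE(demo_guard_, __COUNTER__) =              \
        demo_auto_lock(DEMO_OBJ_())

/* Returns the depth seen while the scope guard is active. */
int demo_depth_inside(void)
{
    DEMO_GUARD();
    return demo_depth;
}

/* Returns from inside the guarded block if early_exit is set. */
int demo_block(int early_exit)
{
    WITH_DEMO_GUARD() {
        if (early_exit) {
            return demo_depth;  /* cleanup handler unlocks on the way out */
        }
    }
    return demo_depth;          /* loop increment already unlocked */
}
```

In the block form the loop increment unlocks and sets the variable to NULL, so the cleanup handler does nothing on normal completion; on an early return the variable is still non-NULL and the handler does the unlock instead.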




[PATCH] block/mirror: add 'write-blocking-after-ready' copy mode

2022-12-07 Thread Fiona Ebner
The new copy mode starts out in 'background' mode and switches to
'write-blocking' mode once the job transitions to ready.

Before switching to active mode and indicating that the drives are
actively synced, it is necessary to have seen and handled all guest
I/O. This is done by checking the dirty bitmap inside a drained
section. Transitioning to ready is also only done at the same time.

The new mode is useful for management applications using drive-mirror
in combination with migration. Currently, migration doesn't check on
mirror jobs before inactivating the blockdrives, so it's necessary to
either:
1) use the 'pause-before-switchover' migration capability and complete
   mirror jobs before actually switching over.
2) use 'write-blocking' copy mode for the drive mirrors.

The downside with 1) is longer downtime for the guest, while the
downside with 2) is that guest write speed is limited by the
synchronous writes to the mirror target. The newly introduced copy
mode reduces the time that limit is in effect.

Signed-off-by: Fiona Ebner 
---

See [0] for a bit more context. While the new copy mode doesn't
fundamentally improve the downside of 2) (especially when multiple
drives are mirrored), it would still improve things a little. And I
guess when trying to keep downtime short, guest write speed needs to
be limited at /some/ point (always in the context of migration with
drive-mirror of course). Ideally, that could go hand-in-hand with
migration convergence, but that would require some larger changes to
implement and introduce more coupling.

[0] https://lists.nongnu.org/archive/html/qemu-devel/2022-09/msg04886.html

 block/mirror.c   | 29 +++--
 qapi/block-core.json |  5 -
 2 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 251adc5ae0..e21b4e5e77 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -60,6 +60,7 @@ typedef struct MirrorBlockJob {
 /* Set when the target is synced (dirty bitmap is clean, nothing
  * in flight) and the job is running in active mode */
 bool actively_synced;
+bool in_active_mode;
 bool should_complete;
 int64_t granularity;
 size_t buf_size;
@@ -1035,10 +1036,31 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 if (s->in_flight == 0 && cnt == 0) {
 trace_mirror_before_flush(s);
 if (!job_is_ready(&s->common.job)) {
+if (s->copy_mode ==
+MIRROR_COPY_MODE_WRITE_BLOCKING_AFTER_READY) {
+/*
+ * Pause guest I/O to check if we can switch to active mode.
+ * To set actively_synced to true below, it is necessary to
+ * have seen and synced all guest I/O.
+ */
+s->in_drain = true;
+bdrv_drained_begin(bs);
+if (bdrv_get_dirty_count(s->dirty_bitmap) > 0) {
+bdrv_drained_end(bs);
+s->in_drain = false;
+continue;
+}
+s->in_active_mode = true;
+bdrv_disable_dirty_bitmap(s->dirty_bitmap);
+bdrv_drained_end(bs);
+s->in_drain = false;
+}
+
 if (mirror_flush(s) < 0) {
 /* Go check s->ret.  */
 continue;
 }
+
 /* We're out of the streaming phase.  From now on, if the job
  * is cancelled we will actually complete all pending I/O and
  * report completion.  This way, block-job-cancel will leave
@@ -1443,7 +1465,7 @@ static int coroutine_fn bdrv_mirror_top_do_write(BlockDriverState *bs,
 if (s->job) {
 copy_to_target = s->job->ret >= 0 &&
  !job_is_cancelled(&s->job->common.job) &&
- s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
+ s->job->in_active_mode;
 }
 
 if (copy_to_target) {
@@ -1494,7 +1516,7 @@ static int coroutine_fn bdrv_mirror_top_pwritev(BlockDriverState *bs,
 if (s->job) {
 copy_to_target = s->job->ret >= 0 &&
  !job_is_cancelled(&s->job->common.job) &&
- s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
+ s->job->in_active_mode;
 }
 
 if (copy_to_target) {
@@ -1792,7 +1814,10 @@ static BlockJob *mirror_start_job(
 goto fail;
 }
 if (s->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING) {
+s->in_active_mode = true;
 bdrv_disable_dirty_bitmap(s->dirty_bitmap);
+} else {
+s->in_active_mode = false;
 }
 
 ret = block_job_add_bdrv(&s->common, "source", bs, 0,
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 95ac4fa634..2a983ed78d 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core
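
The mode switch described in the commit message can be pictured with a toy model. This is only a sketch under strong simplifications: the Toy* types and toy_* functions are hypothetical, there is no real I/O, and the drained section is reduced to a comment because the model is single-threaded. It is meant to show the ordering (catch up in the background, confirm the bitmap is clean, then switch to active mode and become ready), not the actual job code.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    TOY_COPY_BACKGROUND,
    TOY_COPY_WRITE_BLOCKING,
    TOY_COPY_WRITE_BLOCKING_AFTER_READY,
} ToyCopyMode;

typedef struct ToyMirror {
    ToyCopyMode copy_mode;
    bool in_active_mode;    /* guest writes are copied synchronously */
    bool ready;
    int dirty_count;        /* chunks still to be copied in the background */
} ToyMirror;

void toy_mirror_start(ToyMirror *s, ToyCopyMode mode)
{
    s->copy_mode = mode;
    /* Only plain write-blocking starts out in active mode. */
    s->in_active_mode = (mode == TOY_COPY_WRITE_BLOCKING);
    s->ready = false;
    s->dirty_count = 0;
}

/* A guest write either goes to the target right away or marks a chunk dirty. */
void toy_guest_write(ToyMirror *s)
{
    if (!s->in_active_mode) {
        s->dirty_count++;
    }
}

/* One iteration of the job loop; returns true once the job is ready. */
bool toy_mirror_iterate(ToyMirror *s)
{
    if (s->dirty_count > 0) {
        s->dirty_count--;               /* background copy of one chunk */
        return s->ready;
    }
    if (!s->ready) {
        if (s->copy_mode == TOY_COPY_WRITE_BLOCKING_AFTER_READY) {
            /*
             * The real code would hold a drained section here, so no
             * guest write can slip in between the check and the switch.
             */
            if (s->dirty_count > 0) {
                return false;
            }
            s->in_active_mode = true;
        }
        s->ready = true;
    }
    return s->ready;
}
```

With two guest writes before the first iteration, the model needs two iterations to catch up and becomes ready (and active) on the third; later guest writes no longer dirty the bitmap.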

Re: [PATCH 1/1] qemu-iotests/stream-under-throttle: do not shutdown QEMU

2022-12-07 Thread Thomas Huth

On 07/12/2022 14.14, Christian Borntraeger wrote:

Without a kernel or boot disk a QEMU on s390 will exit (usually with a
disabled wait state). This breaks the stream-under-throttle test case.
Do not exit qemu if on s390.

Signed-off-by: Christian Borntraeger 
---
  tests/qemu-iotests/tests/stream-under-throttle | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/tests/qemu-iotests/tests/stream-under-throttle b/tests/qemu-iotests/tests/stream-under-throttle
index 8d2d9e16840d..c24dfbcaa2f2 100755
--- a/tests/qemu-iotests/tests/stream-under-throttle
+++ b/tests/qemu-iotests/tests/stream-under-throttle
@@ -88,6 +88,8 @@ class TestStreamWithThrottle(iotests.QMPTestCase):
 'x-iops-total=1,x-bps-total=104857600')
  self.vm.add_blockdev(self.vm.qmp_to_opts(blockdev))
  self.vm.add_device('virtio-blk,iothread=iothr0,drive=throttled-node')
+if iotests.qemu_default_machine == 's390-ccw-virtio':
+self.vm.add_args('-no-shutdown')
  self.vm.launch()


I guess you could even add that unconditionally for all architectures?

Anyway:
Reviewed-by: Thomas Huth 




[PATCH for 7.2?] target/i386: Remove compilation errors when -Werror=maybe-uninitialized

2022-12-07 Thread Eric Auger
Initialize r0-3 to avoid compilation errors when
-Werror=maybe-uninitialized is used

../target/i386/ops_sse.h: In function ‘helper_vpermdq_ymm’:
../target/i386/ops_sse.h:2495:13: error: ‘r3’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
 2495 | d->Q(3) = r3;
  | ^~~~
../target/i386/ops_sse.h:2494:13: error: ‘r2’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
 2494 | d->Q(2) = r2;
  | ^~~~
../target/i386/ops_sse.h:2493:13: error: ‘r1’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
 2493 | d->Q(1) = r1;
  | ^~~~
../target/i386/ops_sse.h:2492:13: error: ‘r0’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
 2492 | d->Q(0) = r0;
  | ^~~~

Signed-off-by: Eric Auger 
Fixes: 790684776861 ("target/i386: reimplement 0x0f 0x3a, add AVX")

---

Am I the only one getting this? Or is something wrong in my setup?
---
 target/i386/ops_sse.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index 3cbc36a59d..b77071b8da 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -2451,7 +2451,7 @@ void glue(helper_vpgatherqq, SUFFIX)(CPUX86State *env,
 #if SHIFT >= 2
 void helper_vpermdq_ymm(Reg *d, Reg *v, Reg *s, uint32_t order)
 {
-uint64_t r0, r1, r2, r3;
+uint64_t r0 = 0, r1 = 0, r2 = 0, r3 = 0;
 
 switch (order & 3) {
 case 0:
-- 
2.37.3
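
For readers unfamiliar with this class of warning, here is a minimal
sketch in C of the pattern that tends to trigger it. The function
pick_lane() and its arguments are hypothetical and are not taken from
ops_sse.h. The sketch only shows a switch over a masked value whose
cases cover every possible value, which a compiler may still fail to
prove exhaustive, so a zero initializer is one way to keep the build
quiet.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the code from target/i386/ops_sse.h.
 * (order & 3) can only be 0..3, so every path assigns r. Without the
 * initializer a compiler may still report "may be used uninitialized"
 * because it does not have to prove the switch is exhaustive.
 */
uint64_t pick_lane(const uint64_t lanes[4], uint32_t order)
{
    uint64_t r = 0; /* initializer added only to silence the warning */

    switch (order & 3) {
    case 0:
        r = lanes[0];
        break;
    case 1:
        r = lanes[1];
        break;
    case 2:
        r = lanes[2];
        break;
    case 3:
        r = lanes[3];
        break;
    }
    return r;
}
```

An alternative with the same effect would be a default: branch that
aborts; which of the two is preferable here is a judgment call for the
maintainers.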




Re: [PATCH for 7.2?] target/i386: Remove compilation errors when -Werror=maybe-uninitialized

2022-12-07 Thread Eric Auger



On 12/7/22 14:24, Eric Auger wrote:
> Initialize r0-3 to avoid compilation errors when
> -Werror=maybe-uninitialized is used
>
> ../target/i386/ops_sse.h: In function ‘helper_vpermdq_ymm’:
> ../target/i386/ops_sse.h:2495:13: error: ‘r3’ may be used uninitialized in 
> this function [-Werror=maybe-uninitialized]
>  2495 | d->Q(3) = r3;
>   | ^~~~
> ../target/i386/ops_sse.h:2494:13: error: ‘r2’ may be used uninitialized in 
> this function [-Werror=maybe-uninitialized]
>  2494 | d->Q(2) = r2;
>   | ^~~~
> ../target/i386/ops_sse.h:2493:13: error: ‘r1’ may be used uninitialized in 
> this function [-Werror=maybe-uninitialized]
>  2493 | d->Q(1) = r1;
>   | ^~~~
> ../target/i386/ops_sse.h:2492:13: error: ‘r0’ may be used uninitialized in 
> this function [-Werror=maybe-uninitialized]
>  2492 | d->Q(0) = r0;
>   | ^~~~
>
> Signed-off-by: Eric Auger 
> Fixes: 790684776861 ("target/i386: reimplement 0x0f 0x3a, add AVX")
>
> ---
>
> Am I the only one getting this? Or anything wrong in my setup.

With Stefan's correct address. Forgive me for the noise.

Eric
> ---
>  target/i386/ops_sse.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
> index 3cbc36a59d..b77071b8da 100644
> --- a/target/i386/ops_sse.h
> +++ b/target/i386/ops_sse.h
> @@ -2451,7 +2451,7 @@ void glue(helper_vpgatherqq, SUFFIX)(CPUX86State *env,
>  #if SHIFT >= 2
>  void helper_vpermdq_ymm(Reg *d, Reg *v, Reg *s, uint32_t order)
>  {
> -uint64_t r0, r1, r2, r3;
> +uint64_t r0 = 0, r1 = 0, r2 = 0, r3 = 0;
>  
>  switch (order & 3) {
>  case 0:




Re: [PULL 02/10] pci-bridge/cxl_downstream: Add a CXL switch downstream port

2022-12-07 Thread Thomas Huth

On 07/12/2022 14.21, Jonathan Cameron wrote:

On Mon, 05 Dec 2022 14:59:39 +
Alex Bennée  wrote:


Jonathan Cameron via  writes:


On Mon, 5 Dec 2022 10:54:03 +
Jonathan Cameron via  wrote:
  

On Sun, 4 Dec 2022 08:23:55 +0100
Thomas Huth  wrote:
   

On 04/11/2022 07.47, Thomas Huth wrote:

On 16/06/2022 18.57, Michael S. Tsirkin wrote:

From: Jonathan Cameron 

Emulation of a simple CXL Switch downstream port.
The Device ID has been allocated for this use.

Signed-off-by: Jonathan Cameron 
Message-Id: <20220616145126.8002-3-jonathan.came...@huawei.com>
Signed-off-by: Michael S. Tsirkin 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
   hw/cxl/cxl-host.c  |  43 +-
   hw/pci-bridge/cxl_downstream.c | 249 +
   hw/pci-bridge/meson.build  |   2 +-
   3 files changed, 291 insertions(+), 3 deletions(-)
   create mode 100644 hw/pci-bridge/cxl_downstream.c


   Hi!

There is a memory problem somewhere in this new device. I can make QEMU
crash by running something like this:

$ MALLOC_PERTURB_=59 ./qemu-system-x86_64 -M x-remote \
      -display none -monitor stdio
QEMU 7.1.50 monitor - type 'help' for more information
(qemu) device_add cxl-downstream
./qemu/qom/object.c:1188:5: runtime error: member access within misaligned
address 0x3b3b3b3b3b3b3b3b for type 'struct Object', which requires 8 byte
alignment
0x3b3b3b3b3b3b3b3b: note: pointer points here

Bus error (core dumped)

Could you have a look if you've got some spare minutes?


Ping! Jonathan, Michael, any news on this bug?

(this breaks one of my local tests, that's why it's annoying for me)

Sorry, my email filters ate your earlier message.

Looking into this now. I'll note that it also happens on
device_add xio3130-downstream so not specific to this new device.

So far all I've managed to do is track it to something RCU related:
it fails in a call to drain_call_rcu() in qmp_device_add().

Will continue digging.


Assuming I'm seeing the same thing...

Problem is g_free() on the PCIBridge windows:
https://elixir.bootlin.com/qemu/latest/source/hw/pci/pci_bridge.c#L235

Is called before we get an rcu_call() to flatview_destroy() as a
result of the final call of flatview_unref() in address_space_set_flatview()
so we get a use after free.

As to what the fix is...  Suggestions welcome!


It sounds like this is the wrong place to free the value then. I guess
the PCI aliases into &w->alias_io() don't get dealt with until RCU
clean-up time.

I *think* using g_free_rcu() should be enough to ensure the free occurs
after the rest of the RCU cleanups but maybe you should only be cleaning
up the windows at device unrealize time? Is this a dynamic piece of
memory which gets updated during the lifetime of the device?


There is unfortunately code that swaps it for an updated structure
in pci_bridge_update_mappings()



If the memory is being cleared with RCU then the access to the base
pointer should be done with the appropriate qatomic_rcu_[set|read]
functions.



I'm annoyingly snowed under this week with other things, but hopefully
can get to it in a few days.  Note we are looking at an old problem
here, just one that's happening for an additional device, not sure
if that really affects urgency of fix though.


It's too late now for QEMU 7.2 anyway, so there is no hurry, I think.

 Thomas
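
The ordering problem discussed in this thread (a buffer freed
immediately while a deferred teardown callback still reads it) can be
illustrated with a toy model in C. This is only a sketch under stated
assumptions: defer_call(), run_grace_period(), struct windows and
struct view are hypothetical names, the queue is a plain FIFO, and
nothing here is QEMU's RCU API. It shows why queueing the free behind
the pending callback avoids the use after free, which is the idea
behind the g_free_rcu() suggestion above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a deferred-callback queue; NOT QEMU's RCU code. */
struct deferred {
    void (*fn)(void *);
    void *arg;
};

#define MAX_DEFERRED 8
static struct deferred queue[MAX_DEFERRED];
static size_t queue_len;

/* Queue fn(arg) to run at the next "grace period". 0 on success. */
int defer_call(void (*fn)(void *), void *arg)
{
    if (queue_len == MAX_DEFERRED) {
        return -1;
    }
    queue[queue_len].fn = fn;
    queue[queue_len].arg = arg;
    queue_len++;
    return 0;
}

/* Run all queued callbacks in the order they were queued. */
void run_grace_period(void)
{
    for (size_t i = 0; i < queue_len; i++) {
        queue[i].fn(queue[i].arg);
    }
    queue_len = 0;
}

struct windows { int alias_io; };                /* hypothetical */
struct view { struct windows *w; int last_seen; };

/* Deferred teardown that still dereferences the windows. */
void view_destroy(void *opaque)
{
    struct view *v = opaque;
    v->last_seen = v->w->alias_io;
}

void windows_free(void *opaque)
{
    free(opaque);
}

/* Returns the value the teardown observed, or -1 on allocation failure. */
int demo(void)
{
    struct windows *w = malloc(sizeof(*w));
    struct view v = { .w = w, .last_seen = -1 };

    if (!w) {
        return -1;
    }
    w->alias_io = 42;
    defer_call(view_destroy, &v);
    /* Calling free(w) right here would leave view_destroy() reading
     * freed memory. Deferring the free keeps w valid until then. */
    defer_call(windows_free, w);
    run_grace_period();
    return v.last_seen;
}
```

Whether the real callbacks run in queue order, and whether the window
pointer also needs qatomic_rcu_read()/qatomic_rcu_set(), is exactly
what the thread leaves open.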




[PATCH for 8.0 0/2] virtio-iommu: Fix Replay

2022-12-07 Thread Eric Auger
When assigning VFIO devices protected by a virtio-iommu we need to replay
the mappings when adding a new IOMMU MR and when attaching a device to
a domain. While we do a "remap" we currently fail to first unmap the
existing IOVA mapping and just map the new one. With some device/group
topology this can lead to errors in VFIO when trying to DMA_MAP IOVA
ranges onto existing ones.

Eric Auger (2):
  virtio-iommu: Add unmap on virtio_iommu_remap()
  virtio-iommu: Fix replay on device attach

 hw/virtio/virtio-iommu.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

-- 
2.37.3
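
As a rough illustration of why a replay needs to unmap before mapping,
here is a small self-contained C model. All names (toy_map, toy_unmap,
toy_remap, struct mapping) are hypothetical; this is not the
virtio-iommu or VFIO code. The only behaviour borrowed from the series
is that mapping a range onto an existing one fails with -EEXIST.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MAPPINGS 16

/* One IOVA interval [low, high] mapped to a physical address. */
struct mapping {
    uint64_t low, high, phys;
    bool used;
};

static struct mapping table[MAX_MAPPINGS];

static bool overlaps(const struct mapping *m, uint64_t low, uint64_t high)
{
    return m->used && low <= m->high && m->low <= high;
}

/* Map [low, high]; refuse ranges that overlap an existing mapping. */
int toy_map(uint64_t low, uint64_t high, uint64_t phys)
{
    struct mapping *slot = NULL;

    for (size_t i = 0; i < MAX_MAPPINGS; i++) {
        if (overlaps(&table[i], low, high)) {
            return -EEXIST;
        }
        if (!table[i].used && !slot) {
            slot = &table[i];
        }
    }
    if (!slot) {
        return -ENOSPC;
    }
    *slot = (struct mapping){ .low = low, .high = high,
                              .phys = phys, .used = true };
    return 0;
}

/* Drop every mapping that overlaps [low, high]. */
void toy_unmap(uint64_t low, uint64_t high)
{
    for (size_t i = 0; i < MAX_MAPPINGS; i++) {
        if (overlaps(&table[i], low, high)) {
            table[i].used = false;
        }
    }
}

/* Replay-safe variant: invalidate first, then map. */
int toy_remap(uint64_t low, uint64_t high, uint64_t phys)
{
    toy_unmap(low, high);
    return toy_map(low, high, phys);
}
```

In this model a second toy_map() of the same interval fails, while
toy_remap() succeeds, which mirrors the unmap-then-map order the two
patches introduce.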




[PATCH for 8.0 2/2] virtio-iommu: Fix replay on device attach

2022-12-07 Thread Eric Auger
When attaching a device to a domain, we replay the existing
domain mappings for that device. If there are several VFIO
devices in the same group on the guest, we may end up with
duplicate mapping attempts because the mappings already exist
on the VFIO side. So let's do a proper remap, i.e. first unmap
and then map.

Signed-off-by: Eric Auger 
Fixes: 2f6eeb5f0bb ("virtio-iommu: Call memory notifiers in attach/detach")
---
 hw/virtio/virtio-iommu.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
index 30334c85aa..099dec1f31 100644
--- a/hw/virtio/virtio-iommu.c
+++ b/hw/virtio/virtio-iommu.c
@@ -277,19 +277,21 @@ static gboolean virtio_iommu_notify_unmap_cb(gpointer key, gpointer value,
 return false;
 }
 
-static gboolean virtio_iommu_notify_map_cb(gpointer key, gpointer value,
-   gpointer data)
+static gboolean virtio_iommu_notify_remap_cb(gpointer key, gpointer value,
+ gpointer data)
 {
 VirtIOIOMMUMapping *mapping = (VirtIOIOMMUMapping *) value;
 VirtIOIOMMUInterval *interval = (VirtIOIOMMUInterval *) key;
 IOMMUMemoryRegion *mr = (IOMMUMemoryRegion *) data;
 
+virtio_iommu_notify_unmap(mr, interval->low, interval->high);
 virtio_iommu_notify_map(mr, interval->low, interval->high,
 mapping->phys_addr, mapping->flags);
 
 return false;
 }
 
+
 static void virtio_iommu_detach_endpoint_from_domain(VirtIOIOMMUEndpoint *ep)
 {
 VirtIOIOMMUDomain *domain = ep->domain;
@@ -489,7 +491,7 @@ static int virtio_iommu_attach(VirtIOIOMMU *s,
 virtio_iommu_switch_address_space(sdev);
 
 /* Replay domain mappings on the associated memory region */
-g_tree_foreach(domain->mappings, virtio_iommu_notify_map_cb,
+g_tree_foreach(domain->mappings, virtio_iommu_notify_remap_cb,
ep->iommu_mr);
 
 return VIRTIO_IOMMU_S_OK;
-- 
2.37.3




[PATCH for 8.0 1/2] virtio-iommu: Add unmap on virtio_iommu_remap()

2022-12-07 Thread Eric Auger
Following the replay() callback documentation in memory.h, we
shall first invalidate (notify with flag == IOMMU_NONE) and
then map for existing mappings. The code currently skips the
unmap and just does the map. This may lead to duplicate mapping
attempts on the VFIO side (leading to spurious -EEXIST DMA_MAP
failures). Add the unmap.

Signed-off-by: Eric Auger 
Fixes: 308e5e1b5f8 ("virtio-iommu: Add replay() memory region callback")
---
 hw/virtio/virtio-iommu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
index 62e07ec2e4..30334c85aa 100644
--- a/hw/virtio/virtio-iommu.c
+++ b/hw/virtio/virtio-iommu.c
@@ -1034,6 +1034,7 @@ static gboolean virtio_iommu_remap(gpointer key, gpointer value, gpointer data)
 
 trace_virtio_iommu_remap(mr->parent_obj.name, interval->low, interval->high,
  mapping->phys_addr);
+virtio_iommu_notify_unmap(mr, interval->low, interval->high);
 virtio_iommu_notify_map(mr, interval->low, interval->high,
 mapping->phys_addr, mapping->flags);
 return false;
-- 
2.37.3




Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

2022-12-07 Thread Chao Peng
On Tue, Dec 06, 2022 at 02:57:04PM +, Fuad Tabba wrote:
> Hi,
> 
> On Fri, Dec 2, 2022 at 6:18 AM Chao Peng  wrote:
> >
> > From: "Kirill A. Shutemov" 
> >
> > Introduce 'memfd_restricted' system call with the ability to create
> > memory areas that are restricted from userspace access through ordinary
> > MMU operations (e.g. read/write/mmap). The memory content is expected to
> > be used through the new in-kernel interface by a third kernel module.
> >
> > memfd_restricted() is useful for scenarios where a file descriptor(fd)
> > can be used as an interface into mm but want to restrict userspace's
> > ability on the fd. Initially it is designed to provide protections for
> > KVM encrypted guest memory.
> >
> > Normally KVM uses memfd memory via mmapping the memfd into KVM userspace
> > (e.g. QEMU) and then using the mmaped virtual address to setup the
> > mapping in the KVM secondary page table (e.g. EPT). With confidential
> > computing technologies like Intel TDX, the memfd memory may be encrypted
> > with special key for special software domain (e.g. KVM guest) and is not
> > expected to be directly accessed by userspace. Precisely, userspace
> > access to such encrypted memory may lead to host crash so should be
> > prevented.
> >
> > memfd_restricted() provides semantics required for KVM guest encrypted
> > memory support that a fd created with memfd_restricted() is going to be
> > used as the source of guest memory in confidential computing environment
> > and KVM can directly interact with core-mm without the need to expose
> > the memoy content into KVM userspace.
> 
> nit: memory

Ya!

> 
> >
> > KVM userspace is still in charge of the lifecycle of the fd. It should
> > pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> > obtain the physical memory page and then uses it to populate the KVM
> > secondary page table entries.
> >
> > The userspace restricted memfd can be fallocate-ed or hole-punched
> > from userspace. When hole-punched, KVM can get notified through
> > invalidate_start/invalidate_end() callbacks, KVM then gets chance to
> > remove any mapped entries of the range in the secondary page tables.
> >
> > Machine check can happen for memory pages in the restricted memfd,
> > instead of routing this directly to userspace, we call the error()
> > callback that KVM registered. KVM then gets chance to handle it
> > correctly.
> >
> > memfd_restricted() itself is implemented as a shim layer on top of real
> > memory file systems (currently tmpfs). Pages in restrictedmem are marked
> > as unmovable and unevictable, this is required for current confidential
> > usage. But in future this might be changed.
> >
> > By default memfd_restricted() prevents userspace read, write and mmap.
> > By defining new bit in the 'flags', it can be extended to support other
> > restricted semantics in the future.
> >
> > The system call is currently wired up for x86 arch.
> 
> Reviewed-by: Fuad Tabba 
> After wiring the system call for arm64 (on qemu/arm64):
> Tested-by: Fuad Tabba 

Thanks.
Chao
> 
> Cheers,
> /fuad
> 
> 
> 
> >
> > Signed-off-by: Kirill A. Shutemov 
> > Signed-off-by: Chao Peng 
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
> >  include/linux/restrictedmem.h  |  71 ++
> >  include/linux/syscalls.h   |   1 +
> >  include/uapi/asm-generic/unistd.h  |   5 +-
> >  include/uapi/linux/magic.h |   1 +
> >  kernel/sys_ni.c|   3 +
> >  mm/Kconfig |   4 +
> >  mm/Makefile|   1 +
> >  mm/memory-failure.c|   3 +
> >  mm/restrictedmem.c | 318 +
> >  11 files changed, 408 insertions(+), 1 deletion(-)
> >  create mode 100644 include/linux/restrictedmem.h
> >  create mode 100644 mm/restrictedmem.c
> >
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> > b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 320480a8db4f..dc70ba90247e 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -455,3 +455,4 @@
> >  448i386process_mreleasesys_process_mrelease
> >  449i386futex_waitv sys_futex_waitv
> >  450i386set_mempolicy_home_node sys_set_mempolicy_home_node
> > +451i386memfd_restrictedsys_memfd_restricted
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> > b/arch/x86/entry/syscalls/syscall_64.tbl
> > index c84d12608cd2..06516abc8318 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -372,6 +372,7 @@
> >  448common  process_mreleasesys_process_mrelease
> >  449common  futex_waitv sys_futex_waitv
> >  450common  set_mempolicy_home_node sys_set_mempolicy_home_node
> > +451common  memfd_restrictedsys_mem

Re: [PATCH 1/1] qemu-iotests/stream-under-throttle: do not shutdown QEMU

2022-12-07 Thread Christian Borntraeger




Am 07.12.22 um 14:23 schrieb Thomas Huth:

On 07/12/2022 14.14, Christian Borntraeger wrote:

Without a kernel or boot disk a QEMU on s390 will exit (usually with a
disabled wait state). This breaks the stream-under-throttle test case.
Do not exit qemu if on s390.

Signed-off-by: Christian Borntraeger 
---
  tests/qemu-iotests/tests/stream-under-throttle | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/tests/qemu-iotests/tests/stream-under-throttle 
b/tests/qemu-iotests/tests/stream-under-throttle
index 8d2d9e16840d..c24dfbcaa2f2 100755
--- a/tests/qemu-iotests/tests/stream-under-throttle
+++ b/tests/qemu-iotests/tests/stream-under-throttle
@@ -88,6 +88,8 @@ class TestStreamWithThrottle(iotests.QMPTestCase):
 'x-iops-total=1,x-bps-total=104857600')
  self.vm.add_blockdev(self.vm.qmp_to_opts(blockdev))
  self.vm.add_device('virtio-blk,iothread=iothr0,drive=throttled-node')
+    if iotests.qemu_default_machine == 's390-ccw-virtio':
+    self.vm.add_args('-no-shutdown')
  self.vm.launch()


I guess you could even add that unconditionally for all architectures?


Maybe. It might even fix other architectures with the same problem, but
I don't know if that's the case.
So we can start with this fix and then remove the if at a later point in
time if necessary/useful.


Anyway:
Reviewed-by: Thomas Huth 





Re: [PATCH for 7.2?] target/i386: Remove compilation errors when -Werror=maybe-uninitialized

2022-12-07 Thread Stefan Hajnoczi
On Wed, 7 Dec 2022 at 08:31, Eric Auger  wrote:
> On 12/7/22 14:24, Eric Auger wrote:
> > Initialize r0-3 to avoid compilation errors when
> > -Werror=maybe-uninitialized is used
> >
> > ../target/i386/ops_sse.h: In function ‘helper_vpermdq_ymm’:
> > ../target/i386/ops_sse.h:2495:13: error: ‘r3’ may be used uninitialized in 
> > this function [-Werror=maybe-uninitialized]
> >  2495 | d->Q(3) = r3;
> >   | ^~~~
> > ../target/i386/ops_sse.h:2494:13: error: ‘r2’ may be used uninitialized in 
> > this function [-Werror=maybe-uninitialized]
> >  2494 | d->Q(2) = r2;
> >   | ^~~~
> > ../target/i386/ops_sse.h:2493:13: error: ‘r1’ may be used uninitialized in 
> > this function [-Werror=maybe-uninitialized]
> >  2493 | d->Q(1) = r1;
> >   | ^~~~
> > ../target/i386/ops_sse.h:2492:13: error: ‘r0’ may be used uninitialized in 
> > this function [-Werror=maybe-uninitialized]
> >  2492 | d->Q(0) = r0;
> >   | ^~~~
> >
> > Signed-off-by: Eric Auger 
> > Fixes: 790684776861 ("target/i386: reimplement 0x0f 0x3a, add AVX")
> >
> > ---
> >
> > Am I the only one getting this? Or anything wrong in my setup.
>
> With Stefan's correct address. Forgive me for the noise.

When is -Wmaybe-uninitialized used? QEMU's build system doesn't set
it. Unless it's automatically set by meson this must be a manual
--extra-cflags= option you set.

If you added it manually then let's fix this in 8.0 since it's not
tested/supported and very few people will see this issue.

Stefan



Re: [PATCH 00/18] block: Introduce a block graph rwlock

2022-12-07 Thread Emanuele Giuseppe Esposito



Am 07/12/2022 um 14:18 schrieb Kevin Wolf:
> This series supersedes the first half of Emanuele's "Protect the block
> layer with a rwlock: part 1". It introduces the basic infrastructure for
> protecting the block graph (specifically parent/child links) with a
> rwlock. Actually taking the reader lock in all necessary places is left
> for future series.
> 
> Compared to Emanuele's series, this one adds patches to make use of
> clang's Thread Safety Analysis (TSA) feature in order to statically
> check at compile time that the places where we assert that we hold the
> lock actually do hold it. Once we cover all relevant places, the check
> can be extended to verify that all accesses of bs->children and
> bs->parents hold the lock.
> 
> For reference, here is the more detailed version of our plan in
> Emanuele's words from his series:
> 
> The aim is to replace the current AioContext lock with much
> fine-grained locks, aimed to protect only specific data. Currently
> the AioContext lock is used pretty much everywhere, and it's not
> even clear what it is protecting exactly.
> 
> The aim of the rwlock is to cover graph modifications: more
> precisely, when a BlockDriverState parent or child list is modified
> or read, since it can be concurrently accessed by the main loop and
> iothreads.
> 
> The main assumption is that the main loop is the only one allowed to
> perform graph modifications, and so far this has always been held by
> the current code.
> 
> The rwlock is inspired from cpus-common.c implementation, and aims
> to reduce cacheline bouncing by having per-aiocontext counter of
> readers.  All details and implementation of the lock are in patch 2.
> 
> We distinguish between writer (main loop, under BQL) that modifies
> the graph, and readers (all other coroutines running in various
> AioContext), that go through the graph edges, reading ->parents
> and->children.  The writer (main loop)  has an "exclusive" access,
> so it first waits for current read to finish, and then prevents
> incoming ones from entering while it has the exclusive access.  The
> readers (coroutines in multiple AioContext) are free to access the
> graph as long the writer is not modifying the graph.  In case it is,
> they go in a CoQueue and sleep until the writer is done.
> 
> In this and following series, we try to follow the following locking
> pattern:
> 
> - bdrv_co_* functions that call BlockDriver callbacks always expect
>   the lock to be taken, therefore they assert.
> 
> - blk_co_* functions are called from external code outside the block
>   layer, which should not have to care about the block layer's
>   internal locking. Usually they also call blk_wait_while_drained().
>   Therefore they take the lock internally.
> 
> The long term goal of this series is to eventually replace the
> AioContext lock, so that we can get rid of it once and for all.
> 
> Emanuele Giuseppe Esposito (7):
>   graph-lock: Implement guard macros
>   async: Register/unregister aiocontext in graph lock list
>   block: wrlock in bdrv_replace_child_noperm
>   block: remove unnecessary assert_bdrv_graph_writable()
>   block: assert that graph read and writes are performed correctly
>   block-coroutine-wrapper.py: introduce annotations that take the graph
> rdlock
>   block: use co_wrapper_mixed_bdrv_rdlock in functions taking the rdlock
> 
> Kevin Wolf (10):
>   block: Factor out bdrv_drain_all_begin_nopoll()
>   Import clang-tsa.h
>   clang-tsa: Add TSA_ASSERT() macro
>   clang-tsa: Add macros for shared locks
>   configure: Enable -Wthread-safety if present
>   test-bdrv-drain: Fix incorrrect drain assumptions
>   block: Fix locking in external_snapshot_prepare()
>   graph-lock: TSA annotations for lock/unlock functions
>   Mark assert_bdrv_graph_readable/writable() GRAPH_RD/WRLOCK
>   block: GRAPH_RDLOCK for functions only called by co_wrappers
> 
> Paolo Bonzini (1):
>   graph-lock: Introduce a lock to protect block graph operations
> 
Reviewed-by: Emanuele Giuseppe Esposito 

^ I am curious to see if I am allowed to have my r-b also on my patches :)
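
The reader/writer scheme described in the quoted cover letter
(per-AioContext reader counters, a writer that first shuts out new
readers and then waits for existing ones) can be modelled in a few
lines of C. This is a single-threaded sketch under loud assumptions:
the names try_rdlock(), try_wrlock() and NUM_CONTEXTS are hypothetical,
the functions return false where the real lock would sleep in a
CoQueue or poll, and the atomics and memory barriers a real
implementation needs are left out. See patch 2 of the series for the
actual design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_CONTEXTS 4 /* hypothetical number of reader contexts */

/* One counter per context, so readers do not share a cache line. */
static unsigned reader_count[NUM_CONTEXTS];
static bool has_writer;

static unsigned total_readers(void)
{
    unsigned sum = 0;

    for (size_t i = 0; i < NUM_CONTEXTS; i++) {
        sum += reader_count[i];
    }
    return sum;
}

/* Reader side: announce ourselves, then back off if a writer is active. */
bool try_rdlock(size_t ctx)
{
    reader_count[ctx]++;
    if (has_writer) {
        reader_count[ctx]--; /* a real lock would queue and sleep here */
        return false;
    }
    return true;
}

void rdunlock(size_t ctx)
{
    assert(reader_count[ctx] > 0);
    reader_count[ctx]--;
}

/* Writer side: block new readers first, then check for existing ones. */
bool try_wrlock(void)
{
    if (has_writer) {
        return false;
    }
    has_writer = true;
    if (total_readers() != 0) {
        has_writer = false; /* a real lock would wait for readers instead */
        return false;
    }
    return true;
}

void wrunlock(void)
{
    has_writer = false;
}
```

The point of the per-context counters is that a reader only touches
its own slot on the fast path; only the writer has to sum over all of
them.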

>  configure  |   1 +
>  block/coroutines.h |  19 +-
>  include/block/aio.h|   9 +
>  include/block/block-common.h   |   9 +-
>  include/block/block-global-state.h |   1 +
>  include/block/block-io.h   |  53 +++--
>  include/block/block_int-common.h   |  24 +--
>  include/block/block_int-global-state.h |  17 --
>  include/block/block_int.h  |   1 +
>  include/block/graph-lock.h | 280 +
>  include/qemu/clang-tsa.h   | 114 ++
>  block.c|  24 ++-
>  block/graph-lock.c | 275 
>  block/io.c |  21 +-
>  blockdev.c 

Re: [PATCH for 7.2?] target/i386: Remove compilation errors when -Werror=maybe-uninitialized

2022-12-07 Thread Eric Auger
Hi Stefan,

On 12/7/22 15:09, Stefan Hajnoczi wrote:
> On Wed, 7 Dec 2022 at 08:31, Eric Auger  wrote:
>> On 12/7/22 14:24, Eric Auger wrote:
>>> Initialize r0-3 to avoid compilation errors when
>>> -Werror=maybe-uninitialized is used
>>>
>>> ../target/i386/ops_sse.h: In function ‘helper_vpermdq_ymm’:
>>> ../target/i386/ops_sse.h:2495:13: error: ‘r3’ may be used uninitialized in 
>>> this function [-Werror=maybe-uninitialized]
>>>  2495 | d->Q(3) = r3;
>>>   | ^~~~
>>> ../target/i386/ops_sse.h:2494:13: error: ‘r2’ may be used uninitialized in 
>>> this function [-Werror=maybe-uninitialized]
>>>  2494 | d->Q(2) = r2;
>>>   | ^~~~
>>> ../target/i386/ops_sse.h:2493:13: error: ‘r1’ may be used uninitialized in 
>>> this function [-Werror=maybe-uninitialized]
>>>  2493 | d->Q(1) = r1;
>>>   | ^~~~
>>> ../target/i386/ops_sse.h:2492:13: error: ‘r0’ may be used uninitialized in 
>>> this function [-Werror=maybe-uninitialized]
>>>  2492 | d->Q(0) = r0;
>>>   | ^~~~
>>>
>>> Signed-off-by: Eric Auger 
>>> Fixes: 790684776861 ("target/i386: reimplement 0x0f 0x3a, add AVX")
>>>
>>> ---
>>>
>>> Am I the only one getting this? Or anything wrong in my setup.
>> With Stefan's correct address. Forgive me for the noise.
> When is -Wmaybe-uninitialized used? QEMU's build system doesn't set
> it. Unless it's automatically set by meson this must be a manual
> --extra-cflags= option you set.

I am using this configure cmd line:

./configure --prefix=/usr --sysconfdir=/etc --libexecdir=/usr/lib/qemu
--target-list=x86_64-softmmu --docdir=/usr/share/doc/qemu --enable-kvm
--extra-cflags=-O --enable-trace-backends=log --python=/usr/bin/python3
--extra-cflags=-Wall --extra-cflags=-Wundef
--extra-cflags=-Wwrite-strings --extra-cflags=-Wmissing-prototypes
--extra-cflags=-fno-strict-aliasing --extra-cflags=-fno-common
--extra-cflags=-Werror=type-limits
>
> If you added it manually then let's fix this in 8.0 since it's not
> tested/supported and very few people will see this issue.

Thanks

Eric
>
> Stefan
>




Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes

2022-12-07 Thread Chao Peng
On Tue, Dec 06, 2022 at 10:34:32AM -0300, Fabiano Rosas wrote:
> Chao Peng  writes:
> 
> > In confidential computing usages, whether a page is private or shared is
> > necessary information for KVM to perform operations like page fault
> > handling, page zapping etc. There are other potential use cases for
> > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > or exec-only, etc.) without having to modify memslots.
> >
> > Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > userspace to operate on the per-page memory attributes.
> >   - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> > a guest memory range.
> >   - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> > memory attributes.
> >
> > KVM internally uses xarray to store the per-page memory attributes.
> >
> > Suggested-by: Sean Christopherson 
> > Signed-off-by: Chao Peng 
> > Link: https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com/
> > ---
> >  Documentation/virt/kvm/api.rst | 63 
> >  arch/x86/kvm/Kconfig   |  1 +
> >  include/linux/kvm_host.h   |  3 ++
> >  include/uapi/linux/kvm.h   | 17 
> >  virt/kvm/Kconfig   |  3 ++
> >  virt/kvm/kvm_main.c| 76 ++
> >  6 files changed, 163 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 5617bc4f899f..bb2f709c0900 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
> >  The "pad" and "reserved" fields may be used for future extensions and 
> > should be
> >  set to 0s by userspace.
> >  
> > +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> > +-
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: u64 memory attributes bitmask(out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Returns supported memory attributes bitmask. Supported memory attributes 
> > will
> > +have the corresponding bits set in u64 memory attributes bitmask.
> > +
> > +The following memory attributes are defined::
> > +
> > +  #define KVM_MEMORY_ATTRIBUTE_READ  (1ULL << 0)
> > +  #define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
> > +  #define KVM_MEMORY_ATTRIBUTE_EXECUTE   (1ULL << 2)
> > +  #define KVM_MEMORY_ATTRIBUTE_PRIVATE   (1ULL << 3)
> > +
> > +4.139 KVM_SET_MEMORY_ATTRIBUTES
> > +-
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: struct kvm_memory_attributes(in/out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Sets memory attributes for pages in a guest memory range. Parameters are
> > +specified via the following structure::
> > +
> > +  struct kvm_memory_attributes {
> > +   __u64 address;
> > +   __u64 size;
> > +   __u64 attributes;
> > +   __u64 flags;
> > +  };
> > +
> > +The user sets the per-page memory attributes to a guest memory range 
> > indicated
> > +by address/size, and in return KVM adjusts address and size to reflect the
> > +actual pages of the memory range have been successfully set to the 
> > attributes.
> 
> This wording could cause some confusion, what about a simpler:
> 
> "reflect the range of pages that had its attributes successfully set"

Thanks, this is much better.

> 
> > +If the call returns 0, "address" is updated to the last successful address 
> > + 1
> > +and "size" is updated to the remaining address size that has not been set
> > +successfully.
> 
> "address + 1 page" or "subsequent page" perhaps.
> 
> In fact, wouldn't this all become simpler if size were number of pages 
> instead?

It indeed becomes better if the size is number of pages and the address
is gfn, but I think we don't want to imply that the page size is 4K to
userspace.
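
To make the address/size update semantics discussed above concrete,
here is a sketch in C of the retry loop a caller might write. It is an
assumption-heavy illustration: toy_set_attrs() is a mock that stands in
for the ioctl and deliberately handles only two pages per call,
TOY_PAGE_SIZE is fixed at 4096 for the mock only, and struct
toy_memory_attributes merely copies the field layout quoted from
api.rst. The final uAPI wording is still under review in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE      4096u /* assumption for this mock only */
#define TOY_PAGES_PER_CALL 2u    /* mock makes partial progress */

/* Same field layout as the struct quoted from api.rst. */
struct toy_memory_attributes {
    uint64_t address;
    uint64_t size;
    uint64_t attributes;
    uint64_t flags;
};

static uint64_t toy_pages_set; /* pages the mock has "updated" so far */

/*
 * Mock of the set-attributes call: processes at most
 * TOY_PAGES_PER_CALL pages, then advances address and shrinks size
 * to describe the part that is still pending.
 */
int toy_set_attrs(struct toy_memory_attributes *a)
{
    uint64_t chunk = (uint64_t)TOY_PAGE_SIZE * TOY_PAGES_PER_CALL;

    if (a->flags || a->address % TOY_PAGE_SIZE || a->size % TOY_PAGE_SIZE) {
        return -1;
    }
    if (chunk > a->size) {
        chunk = a->size;
    }
    toy_pages_set += chunk / TOY_PAGE_SIZE;
    a->address += chunk;
    a->size -= chunk;
    return 0;
}

/* Retry until the whole range is done. Returns call count or -1. */
int toy_set_attrs_all(uint64_t address, uint64_t size, uint64_t attributes)
{
    struct toy_memory_attributes a = {
        .address = address,
        .size = size,
        .attributes = attributes,
        .flags = 0,
    };
    int calls = 0;

    while (a.size > 0) {
        if (toy_set_attrs(&a) < 0) {
            return -1;
        }
        calls++;
    }
    return calls;
}
```

The loop checks both the return value and the remaining size, as the
documentation text asks the user to do.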

> 
> > The user should check the return value as well as the size to
> > +decide if the operation succeeded for the whole range or not. The user may 
> > want
> > +to retry the operation with the returned address/size if the previous 
> > range was
> > +partially successful.
> > +
> > +Both address and size should be page aligned and the supported attributes 
> > can be
> > +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> > +
> > +The "flags" field may be used for future extensions and should be set to 
> > 0s.
> > +
> 
> ...
> 
> > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > +  struct kvm_memory_attributes *attrs)
> > +{
> > +   gfn_t start, end;
> > +   unsigned long i;
> > +   void *entry;
> > +   u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > +
> > +   /* flags is currently not used. */
> > +   if (attrs->flags)
> > +   return -EINVAL;
> > +   if (a

Re: [PATCH for-8.0] hw/rtc/mc146818rtc: Make this rtc device target independent

2022-12-07 Thread Bernhard Beschow



Am 6. Dezember 2022 20:06:41 UTC schrieb Thomas Huth :
>The only code that is really, really target dependent is the apic-related
>code in rtc_policy_slew_deliver_irq(). By moving this code into the hw/i386/
>folder (renamed to rtc_apic_policy_slew_deliver_irq()) and passing this
>function as parameter to mc146818_rtc_init(), we can make the RTC completely
>target-independent.
>
>Signed-off-by: Thomas Huth 
>---
> include/hw/rtc/mc146818rtc.h |  7 +--
> hw/alpha/dp264.c |  2 +-
> hw/hppa/machine.c|  2 +-
> hw/i386/microvm.c|  3 ++-
> hw/i386/pc.c | 10 +-
> hw/mips/jazz.c   |  2 +-
> hw/ppc/pnv.c |  2 +-
> hw/rtc/mc146818rtc.c | 34 +++---
> hw/rtc/meson.build   |  3 +--
> 9 files changed, 32 insertions(+), 33 deletions(-)
>
>diff --git a/include/hw/rtc/mc146818rtc.h b/include/hw/rtc/mc146818rtc.h
>index 1db0fcee92..c687953cc4 100644
>--- a/include/hw/rtc/mc146818rtc.h
>+++ b/include/hw/rtc/mc146818rtc.h
>@@ -46,14 +46,17 @@ struct RTCState {
> Notifier clock_reset_notifier;
> LostTickPolicy lost_tick_policy;

This lost_tick_policy attribute along with its enum is now redundant and can be 
removed. Removing it avoids an error condition (see below).

> Notifier suspend_notifier;
>+bool (*policy_slew_deliver_irq)(RTCState *s);
> QLIST_ENTRY(RTCState) link;
> };
> 
> #define RTC_ISA_IRQ 8
> 
>-ISADevice *mc146818_rtc_init(ISABus *bus, int base_year,
>- qemu_irq intercept_irq);
>+ISADevice *mc146818_rtc_init(ISABus *bus, int base_year, qemu_irq intercept_irq,
>+ bool (*policy_slew_deliver_irq)(RTCState *s));
> void rtc_set_memory(ISADevice *dev, int addr, int val);
> int rtc_get_memory(ISADevice *dev, int addr);
>+bool rtc_apic_policy_slew_deliver_irq(RTCState *s);
>+void qmp_rtc_reset_reinjection(Error **errp);
> 
> #endif /* HW_RTC_MC146818RTC_H */
>diff --git a/hw/alpha/dp264.c b/hw/alpha/dp264.c
>index c502c8c62a..8723942b52 100644
>--- a/hw/alpha/dp264.c
>+++ b/hw/alpha/dp264.c
>@@ -118,7 +118,7 @@ static void clipper_init(MachineState *machine)
> qdev_connect_gpio_out(i82378_dev, 0, isa_irq);
> 
> /* Since we have an SRM-compatible PALcode, use the SRM epoch.  */
>-mc146818_rtc_init(isa_bus, 1900, rtc_irq);
>+mc146818_rtc_init(isa_bus, 1900, rtc_irq, NULL);
> 
> /* VGA setup.  Don't bother loading the bios.  */
> pci_vga_init(pci_bus);
>diff --git a/hw/hppa/machine.c b/hw/hppa/machine.c
>index de1cc7ab71..311031714a 100644
>--- a/hw/hppa/machine.c
>+++ b/hw/hppa/machine.c
>@@ -232,7 +232,7 @@ static void machine_hppa_init(MachineState *machine)
> assert(isa_bus);
> 
> /* Realtime clock, used by firmware for PDC_TOD call. */
>-mc146818_rtc_init(isa_bus, 2000, NULL);
>+mc146818_rtc_init(isa_bus, 2000, NULL, NULL);
> 
> /* Serial ports: Lasi and Dino use a 7.272727 MHz clock. */
> serial_mm_init(addr_space, LASI_UART_HPA + 0x800, 0,
>diff --git a/hw/i386/microvm.c b/hw/i386/microvm.c
>index 170a331e3f..d0ed4dca50 100644
>--- a/hw/i386/microvm.c
>+++ b/hw/i386/microvm.c
>@@ -267,7 +267,8 @@ static void microvm_devices_init(MicrovmMachineState *mms)
> 
> if (mms->rtc == ON_OFF_AUTO_ON ||
> (mms->rtc == ON_OFF_AUTO_AUTO && !kvm_enabled())) {
>-rtc_state = mc146818_rtc_init(isa_bus, 2000, NULL);
>+rtc_state = mc146818_rtc_init(isa_bus, 2000, NULL,
>+  rtc_apic_policy_slew_deliver_irq);
> microvm_set_rtc(mms, rtc_state);
> }
> 
>diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>index 546b703cb4..650e7bc199 100644
>--- a/hw/i386/pc.c
>+++ b/hw/i386/pc.c
>@@ -1244,6 +1244,13 @@ static void pc_superio_init(ISABus *isa_bus, bool create_fdctrl,
> g_free(a20_line);
> }
> 
>+bool rtc_apic_policy_slew_deliver_irq(RTCState *s)
>+{
>+apic_reset_irq_delivered();
>+qemu_irq_raise(s->irq);
>+return apic_get_irq_delivered();
>+}
>+
> void pc_basic_device_init(struct PCMachineState *pcms,
>   ISABus *isa_bus, qemu_irq *gsi,
>   ISADevice **rtc_state,
>@@ -1299,7 +1306,8 @@ void pc_basic_device_init(struct PCMachineState *pcms,
> pit_alt_irq = qdev_get_gpio_in(hpet, HPET_LEGACY_PIT_INT);
> rtc_irq = qdev_get_gpio_in(hpet, HPET_LEGACY_RTC_INT);
> }
>-*rtc_state = mc146818_rtc_init(isa_bus, 2000, rtc_irq);
>+*rtc_state = mc146818_rtc_init(isa_bus, 2000, rtc_irq,
>+   rtc_apic_policy_slew_deliver_irq);
> 
> qemu_register_boot_set(pc_boot_set, *rtc_state);
> 
>diff --git a/hw/mips/jazz.c b/hw/mips/jazz.c
>index 6aefe9a61b..50fbd57b23 100644
>--- a/hw/mips/jazz.c
>+++ b/hw/mips/jazz.c
>@@ -356,7 +356,7 @@ static void mips_jazz_init(MachineState *machine,
> fdctrl_init_sysbus(qdev_get_gpio_in(rc4030, 1), 0x80003000, fds);
> 
> /* Real time clock */
>-mc146818_rtc_init(i
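
The approach in the quoted patch, handing the target-specific delivery function to the init routine as a parameter, can be sketched generically in C. This is only a minimal illustration: DemoRtc, demo_rtc_init(), demo_rtc_tick() and demo_board_deliver() are hypothetical names, not QEMU API, and the fallback taken when the hook is NULL is an assumption, since the hw/rtc/mc146818rtc.c part of the diff is not shown here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for RTCState; only the fields this sketch needs. */
typedef struct DemoRtc DemoRtc;
typedef bool (*DemoDeliverFn)(DemoRtc *s);

struct DemoRtc {
    int irq_raised;            /* counts how often the IRQ line was raised */
    DemoDeliverFn deliver_irq; /* target hook; NULL on boards without one */
};

/* Generic init: the device just stores whatever hook the board passes in. */
static void demo_rtc_init(DemoRtc *s, DemoDeliverFn deliver_irq)
{
    s->irq_raised = 0;
    s->deliver_irq = deliver_irq;
}

/*
 * Generic tick path: use the hook if one was supplied, otherwise raise
 * the IRQ directly (assumed fallback) and report it as delivered.
 */
static bool demo_rtc_tick(DemoRtc *s)
{
    if (s->deliver_irq) {
        return s->deliver_irq(s);
    }
    s->irq_raised++;
    return true;
}

/* Board-specific hook, standing in for rtc_apic_policy_slew_deliver_irq(). */
static bool demo_board_deliver(DemoRtc *s)
{
    s->irq_raised++;
    return false; /* pretend the interrupt controller coalesced the IRQ */
}
```

A board without the hook would call demo_rtc_init(&rtc, NULL), mirroring the mc146818_rtc_init(..., NULL) call sites in the diff; an x86-like board would pass demo_board_deliver. The generic device code never needs to include target headers.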

Re: [PATCH v3 02/13] tcg/s390x: Remove TCG_REG_TB

2022-12-07 Thread Richard Henderson

On 12/7/22 01:45, Thomas Huth wrote:

On 06/12/2022 23.22, Richard Henderson wrote:

On 12/6/22 13:29, Ilya Leoshkevich wrote:

This change doesn't seem to affect that, but what is the minimum
supported s390x qemu host? z900?


Possibly z990, if I'm reading the gcc processor_flags_table[] correctly; 
long-displacement-facility is definitely a minimum.


We probably should revisit what the minimum for TCG should be, assert those features at 
startup, and drop the corresponding runtime tests.


If we consider the official IBM support statement:

https://www.ibm.com/support/pages/system/files/inline-files/IBM%20Mainframe%20Life%20Cycle%20History%20V2.10%20-%20Sept%2013%202022_1.pdf

... that would mean that the z10 and all older machines are not supported 
anymore.


Thanks for the pointer.  It would appear that z114 exits support at the end of this month, 
which would leave z12 as minimum supported cpu.


Even assuming z196 gets us extended-immediate, general-insn-extension, load-on-condition, 
and distinct-operands, which are all quite important to TCG, and constitute almost all of 
the current runtime checks.


The other metric would be matching the set of supported cpus from the set of supported os 
distributions, but I would be ready to believe z196 is below the minimum there too.



r~



Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes

2022-12-07 Thread Chao Peng
On Tue, Dec 06, 2022 at 03:07:27PM +, Fuad Tabba wrote:
> Hi,
> 
> On Fri, Dec 2, 2022 at 6:18 AM Chao Peng  wrote:
> >
> > In confidential computing usages, whether a page is private or shared is
> > necessary information for KVM to perform operations like page fault
> > handling, page zapping etc. There are other potential use cases for
> > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > or exec-only, etc.) without having to modify memslots.
> >
> > Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > userspace to operate on the per-page memory attributes.
> >   - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> > a guest memory range.
> >   - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> > memory attributes.
> >
> > KVM internally uses xarray to store the per-page memory attributes.
> >
> > Suggested-by: Sean Christopherson 
> > Signed-off-by: Chao Peng 
> > Link: https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com/
> > ---
> >  Documentation/virt/kvm/api.rst | 63 
> >  arch/x86/kvm/Kconfig   |  1 +
> >  include/linux/kvm_host.h   |  3 ++
> >  include/uapi/linux/kvm.h   | 17 
> >  virt/kvm/Kconfig   |  3 ++
> >  virt/kvm/kvm_main.c| 76 ++
> >  6 files changed, 163 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 5617bc4f899f..bb2f709c0900 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
> >  The "pad" and "reserved" fields may be used for future extensions and should be
> >  set to 0s by userspace.
> >
> > +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> > +-
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: u64 memory attributes bitmask(out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Returns supported memory attributes bitmask. Supported memory attributes will
> > +have the corresponding bits set in u64 memory attributes bitmask.
> > +
> > +The following memory attributes are defined::
> > +
> > +  #define KVM_MEMORY_ATTRIBUTE_READ  (1ULL << 0)
> > +  #define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
> > +  #define KVM_MEMORY_ATTRIBUTE_EXECUTE   (1ULL << 2)
> > +  #define KVM_MEMORY_ATTRIBUTE_PRIVATE   (1ULL << 3)
> > +
> > +4.139 KVM_SET_MEMORY_ATTRIBUTES
> > +-
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: struct kvm_memory_attributes(in/out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Sets memory attributes for pages in a guest memory range. Parameters are
> > +specified via the following structure::
> > +
> > +  struct kvm_memory_attributes {
> > +   __u64 address;
> > +   __u64 size;
> > +   __u64 attributes;
> > +   __u64 flags;
> > +  };
> > +
> > +The user sets the per-page memory attributes to a guest memory range indicated
> > +by address/size, and in return KVM adjusts address and size to reflect the
> > +actual pages of the memory range have been successfully set to the attributes.
> > +If the call returns 0, "address" is updated to the last successful address + 1
> > +and "size" is updated to the remaining address size that has not been set
> > +successfully. The user should check the return value as well as the size to
> > +decide if the operation succeeded for the whole range or not. The user may want
> > +to retry the operation with the returned address/size if the previous range was
> > +partially successful.
> > +
> > +Both address and size should be page aligned and the supported attributes can be
> > +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> > +
> > +The "flags" field may be used for future extensions and should be set to 0s.
> > +
> >  5. The kvm_run structure
> >  
> >
> > @@ -8270,6 +8323,16 @@ structure.
> >  When getting the Modified Change Topology Report value, the attr->addr
> >  must point to a byte where the value will be stored or retrieved from.
> >
> > +8.40 KVM_CAP_MEMORY_ATTRIBUTES
> > +--
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm
> > +
> > +This capability indicates KVM supports per-page memory attributes and ioctls
> > +KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
> > +
> >  9. Known KVM API problems
> >  =
> >
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index fbeaa9ddef59..a8e379a3afee 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -49,6 
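
The retry semantics described in the quoted KVM_SET_MEMORY_ATTRIBUTES documentation (on a 0 return, "address" and "size" are advanced to the part of the range not yet set) could be consumed by userspace roughly as follows. This is a sketch under stated assumptions: the struct is a local copy of the layout quoted above, demo_set_attrs_fn is a hypothetical wrapper standing in for the real ioctl call, the mock models a kernel that handles at most two pages per call, and returning -EAGAIN when no progress is made is this sketch's choice, not something the patch specifies.

```c
#include <errno.h>
#include <stdint.h>

/* Local copy of the layout quoted from the patch's api.rst text. */
struct demo_memory_attributes {
    uint64_t address;
    uint64_t size;
    uint64_t attributes;
    uint64_t flags;
};

/* Hypothetical wrapper type around ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, ...). */
typedef int (*demo_set_attrs_fn)(struct demo_memory_attributes *attrs);

/* Keep calling until the whole range is covered or an error occurs. */
static int demo_set_attrs_all(demo_set_attrs_fn set_attrs,
                              uint64_t address, uint64_t size,
                              uint64_t attributes)
{
    struct demo_memory_attributes attrs = {
        .address = address,
        .size = size,
        .attributes = attributes,
        .flags = 0, /* must be zero per the quoted documentation */
    };

    while (attrs.size > 0) {
        uint64_t before = attrs.size;
        int ret = set_attrs(&attrs);

        if (ret < 0) {
            return ret;
        }
        if (attrs.size >= before) {
            return -EAGAIN; /* no progress; avoid spinning forever */
        }
    }
    return 0;
}

/* Mock "kernel" that applies at most two pages per call (illustration only). */
#define DEMO_PAGE_SIZE 4096ULL
static int demo_mock_calls;

static int demo_mock_set_attrs(struct demo_memory_attributes *attrs)
{
    uint64_t chunk = 2 * DEMO_PAGE_SIZE;

    if (attrs->flags) {
        return -EINVAL;
    }
    if (chunk > attrs->size) {
        chunk = attrs->size;
    }
    attrs->address += chunk; /* first address not yet set */
    attrs->size -= chunk;    /* remaining size */
    demo_mock_calls++;
    return 0;
}
```

With the mock, a five-page range takes three calls (2 + 2 + 1 pages). A real caller would pass a wrapper that issues the ioctl on the VM file descriptor instead of demo_mock_set_attrs.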

Re: Thoughts on removing the TARGET_I386 part of hw/display/vga/vbe_portio_list[]

2022-12-07 Thread Mark Cave-Ayland

On 06/12/2022 16:23, Richard Henderson wrote:


On 12/6/22 10:02, Peter Maydell wrote:

On Tue, 6 Dec 2022 at 15:56, Philippe Mathieu-Daudé  wrote:


On 6/12/22 13:30, Dr. David Alan Gilbert wrote:

I don't know that bit of qemu well enough to know whether the cpu part
of qemu should be splitting the unaligned accesses or not.

All I/O accesses are gated thru access_with_adjusted_size() in
softmmu/memory.c.

There is an old access_with_adjusted_size_unaligned() version [1] from
Andrew and a more recent series [2] from Richard. Maybe the latter fixes
some long-standing bug [3] we have here?


There definitely are some unaddressed bugs there -- maybe this
is the time to work through what semantics we want that
softmmu code to provide and fix the bugs...


Yes, indeed.  Let's not forget Mark C-A's m68k bug[1] which so far has no 
resolution.

r~

[1] https://gitlab.com/qemu-project/qemu/-/issues/360


That would definitely be useful: since Richard worked on this series, I managed to 
develop a hack that allows me to work around the issue for my particular use-case 
which is why I haven't been focusing on this.


The main concerns are listed in the above issue at 
https://gitlab.com/qemu-project/qemu/-/issues/360#note_597130838. Defining the 
behaviour doesn't seem too bad, but it is likely some things that unintentionally 
depend upon the existing behaviour will break.



ATB,

Mark.



Re: [PATCH v3] hw/pvrdma: Protect against buggy or malicious guest driver

2022-12-07 Thread Claudio Fontana
On 4/5/22 12:31, Marcel Apfelbaum wrote:
> Hi Yuval,
> Thank you for the changes.
> 
> On Sun, Apr 3, 2022 at 11:54 AM Yuval Shaia  wrote:
>>
>> Guest driver might execute HW commands when shared buffers are not yet
>> allocated.
>> This could happen on purpose (malicious guest) or because of some other
>> guest/host address mapping error.
>> We need to protect against such case.
>>
>> Fixes: CVE-2022-1050
>>
>> Reported-by: Raven 
>> Signed-off-by: Yuval Shaia 
>> ---
>> v1 -> v2:
>> * Commit message changes
>> v2 -> v3:
>> * Exclude cosmetic changes
>> ---
>>  hw/rdma/vmw/pvrdma_cmd.c | 6 ++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
>> index da7ddfa548..89db963c46 100644
>> --- a/hw/rdma/vmw/pvrdma_cmd.c
>> +++ b/hw/rdma/vmw/pvrdma_cmd.c
>> @@ -796,6 +796,12 @@ int pvrdma_exec_cmd(PVRDMADev *dev)
>>
>>  dsr_info = &dev->dsr_info;
>>
>> +if (!dsr_info->dsr) {
>> +/* Buggy or malicious guest driver */
>> +rdma_error_report("Exec command without dsr, req or rsp buffers");
>> +goto out;
>> +}
>> +
>>  if (dsr_info->req->hdr.cmd >= sizeof(cmd_handlers) /
>>sizeof(struct cmd_handler)) {
>>  rdma_error_report("Unsupported command");
>> --
>> 2.20.1
>>
> 
> cc-ing Peter and Philippe for a question:
> Do we have a "Security Fixes" or a "Misc" subtree? Otherwise it will
> have to wait a week or so.
> 
> Reviewed by: Marcel Apfelbaum 
> Thanks,
> Marcel
> 

Hi all,

The patch is reviewed; is anything holding back the inclusion of this security fix?

Thanks,

Claudio



Re: [PATCH for-8.0] hw/rtc/mc146818rtc: Make this rtc device target independent

2022-12-07 Thread Bernhard Beschow



On 6 December 2022 20:06:41 UTC, Thomas Huth wrote:
>The only code that is really, really target dependent is the apic-related
>code in rtc_policy_slew_deliver_irq(). By moving this code into the hw/i386/
>folder (renamed to rtc_apic_policy_slew_deliver_irq()) and passing this
>function as parameter to mc146818_rtc_init(), we can make the RTC completely
>target-independent.
>
>Signed-off-by: Thomas Huth 
>---
> include/hw/rtc/mc146818rtc.h |  7 +--
> hw/alpha/dp264.c |  2 +-
> hw/hppa/machine.c|  2 +-
> hw/i386/microvm.c|  3 ++-
> hw/i386/pc.c | 10 +-
> hw/mips/jazz.c   |  2 +-
> hw/ppc/pnv.c |  2 +-
> hw/rtc/mc146818rtc.c | 34 +++---
> hw/rtc/meson.build   |  3 +--
> 9 files changed, 32 insertions(+), 33 deletions(-)
>
>diff --git a/include/hw/rtc/mc146818rtc.h b/include/hw/rtc/mc146818rtc.h
>index 1db0fcee92..c687953cc4 100644
>--- a/include/hw/rtc/mc146818rtc.h
>+++ b/include/hw/rtc/mc146818rtc.h
>@@ -46,14 +46,17 @@ struct RTCState {
> Notifier clock_reset_notifier;
> LostTickPolicy lost_tick_policy;
> Notifier suspend_notifier;
>+bool (*policy_slew_deliver_irq)(RTCState *s);
> QLIST_ENTRY(RTCState) link;
> };
> 
> #define RTC_ISA_IRQ 8
> 
>-ISADevice *mc146818_rtc_init(ISABus *bus, int base_year,
>- qemu_irq intercept_irq);
>+ISADevice *mc146818_rtc_init(ISABus *bus, int base_year, qemu_irq intercept_irq,
>+ bool (*policy_slew_deliver_irq)(RTCState *s));
> void rtc_set_memory(ISADevice *dev, int addr, int val);
> int rtc_get_memory(ISADevice *dev, int addr);
>+bool rtc_apic_policy_slew_deliver_irq(RTCState *s);

Can we move this declaration into pc.h, since the function is implemented in 
pc.c? That makes it clearer that it is only used on x86 and avoids a "dangling" 
declaration for all other architectures.

Thanks,
Bernhard
 
>+void qmp_rtc_reset_reinjection(Error **errp);
> 
> #endif /* HW_RTC_MC146818RTC_H */
>diff --git a/hw/alpha/dp264.c b/hw/alpha/dp264.c
>index c502c8c62a..8723942b52 100644
>--- a/hw/alpha/dp264.c
>+++ b/hw/alpha/dp264.c
>@@ -118,7 +118,7 @@ static void clipper_init(MachineState *machine)
> qdev_connect_gpio_out(i82378_dev, 0, isa_irq);
> 
> /* Since we have an SRM-compatible PALcode, use the SRM epoch.  */
>-mc146818_rtc_init(isa_bus, 1900, rtc_irq);
>+mc146818_rtc_init(isa_bus, 1900, rtc_irq, NULL);
> 
> /* VGA setup.  Don't bother loading the bios.  */
> pci_vga_init(pci_bus);
>diff --git a/hw/hppa/machine.c b/hw/hppa/machine.c
>index de1cc7ab71..311031714a 100644
>--- a/hw/hppa/machine.c
>+++ b/hw/hppa/machine.c
>@@ -232,7 +232,7 @@ static void machine_hppa_init(MachineState *machine)
> assert(isa_bus);
> 
> /* Realtime clock, used by firmware for PDC_TOD call. */
>-mc146818_rtc_init(isa_bus, 2000, NULL);
>+mc146818_rtc_init(isa_bus, 2000, NULL, NULL);
> 
> /* Serial ports: Lasi and Dino use a 7.272727 MHz clock. */
> serial_mm_init(addr_space, LASI_UART_HPA + 0x800, 0,
>diff --git a/hw/i386/microvm.c b/hw/i386/microvm.c
>index 170a331e3f..d0ed4dca50 100644
>--- a/hw/i386/microvm.c
>+++ b/hw/i386/microvm.c
>@@ -267,7 +267,8 @@ static void microvm_devices_init(MicrovmMachineState *mms)
> 
> if (mms->rtc == ON_OFF_AUTO_ON ||
> (mms->rtc == ON_OFF_AUTO_AUTO && !kvm_enabled())) {
>-rtc_state = mc146818_rtc_init(isa_bus, 2000, NULL);
>+rtc_state = mc146818_rtc_init(isa_bus, 2000, NULL,
>+  rtc_apic_policy_slew_deliver_irq);
> microvm_set_rtc(mms, rtc_state);
> }
> 
>diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>index 546b703cb4..650e7bc199 100644
>--- a/hw/i386/pc.c
>+++ b/hw/i386/pc.c
>@@ -1244,6 +1244,13 @@ static void pc_superio_init(ISABus *isa_bus, bool create_fdctrl,
> g_free(a20_line);
> }
> 
>+bool rtc_apic_policy_slew_deliver_irq(RTCState *s)
>+{
>+apic_reset_irq_delivered();
>+qemu_irq_raise(s->irq);
>+return apic_get_irq_delivered();
>+}
>+
> void pc_basic_device_init(struct PCMachineState *pcms,
>   ISABus *isa_bus, qemu_irq *gsi,
>   ISADevice **rtc_state,
>@@ -1299,7 +1306,8 @@ void pc_basic_device_init(struct PCMachineState *pcms,
> pit_alt_irq = qdev_get_gpio_in(hpet, HPET_LEGACY_PIT_INT);
> rtc_irq = qdev_get_gpio_in(hpet, HPET_LEGACY_RTC_INT);
> }
>-*rtc_state = mc146818_rtc_init(isa_bus, 2000, rtc_irq);
>+*rtc_state = mc146818_rtc_init(isa_bus, 2000, rtc_irq,
>+   rtc_apic_policy_slew_deliver_irq);
> 
> qemu_register_boot_set(pc_boot_set, *rtc_state);
> 
>diff --git a/hw/mips/jazz.c b/hw/mips/jazz.c
>index 6aefe9a61b..50fbd57b23 100644
>--- a/hw/mips/jazz.c
>+++ b/hw/mips/jazz.c
>@@ -356,7 +356,7 @@ static void mips_jazz_init(MachineState *machine,
> fdctrl_init_sysbus(qdev_get_gpio_in(rc4030, 1), 0

Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

2022-12-07 Thread Chao Peng
On Tue, Dec 06, 2022 at 12:39:18PM +, Fuad Tabba wrote:
> Hi Chao,
> 
> On Tue, Dec 6, 2022 at 11:58 AM Chao Peng  wrote:
> >
> > On Mon, Dec 05, 2022 at 09:03:11AM +, Fuad Tabba wrote:
> > > Hi Chao,
> > >
> > > On Fri, Dec 2, 2022 at 6:18 AM Chao Peng  wrote:
> > > >
> > > > In memory encryption usage, guest memory may be encrypted with special
> > > > key and can be accessed only by the guest itself. We call such memory
> > > > private memory. It's valueless and sometimes can cause problem to allow
> > > > userspace to access guest private memory. This new KVM memslot extension
> > > > allows guest private memory being provided through a restrictedmem
> > > > backed file descriptor(fd) and userspace is restricted to access the
> > > > bookmarked memory in the fd.
> > > >
> > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > > > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > > > userspace to instruct KVM to provide guest memory through restricted_fd.
> > > > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > > > and the size is 'memory_size'.
> > > >
> > > > The extended memslot can still have the userspace_addr(hva). When use, a
> > > > single memslot can maintain both private memory through restricted_fd
> > > > and shared memory through userspace_addr. Whether the private or shared
> > > > part is visible to guest is maintained by other KVM code.
> > > >
> > > > A restrictedmem_notifier field is also added to the memslot structure to
> > > > allow the restricted_fd's backing store to notify KVM the memory change,
> > > > KVM then can invalidate its page table entries or handle memory errors.
> > > >
> > > > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > > > and right now it is selected on X86_64 only.
> > > >
> > > > To make future maintenance easy, internally use a binary compatible
> > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > '_ext' variants.
> > > >
> > > > Co-developed-by: Yu Zhang 
> > > > Signed-off-by: Yu Zhang 
> > > > Signed-off-by: Chao Peng 
> > > > Reviewed-by: Fuad Tabba 
> > > > Tested-by: Fuad Tabba 
> > >
> > > V9 of this patch [*] had KVM_CAP_PRIVATE_MEM, but it's not in this
> > > patch series anymore. Any reason you removed it, or is it just an
> > > omission?
> >
> > We had some discussion in v9 [1] to add generic memory attributes ioctls
> > and KVM_CAP_PRIVATE_MEM can be implemented as a new
> > KVM_MEMORY_ATTRIBUTE_PRIVATE flag via KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES()
> > ioctl [2]. The api doc has been updated:
> >
> > +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> > +  KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl) …
> >
> >
> > [1] https://lore.kernel.org/linux-mm/y2wb48kd0j4vg...@google.com/
> > [2]
> > https://lore.kernel.org/linux-mm/20221202061347.1070246-3-chao.p.p...@linux.intel.com/
> 
> I see. I just retested it with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES,
> and my Reviewed/Tested-by still apply.

Thanks for the info.

Chao
> 
> Cheers,
> /fuad
> 
> >
> > Thanks,
> > Chao
> > >
> > > [*] https://lore.kernel.org/linux-mm/20221025151344.3784230-3-chao.p.p...@linux.intel.com/
> > >
> > > Thanks,
> > > /fuad
> > >
> > > > ---
> > > >  Documentation/virt/kvm/api.rst | 40 ++-
> > > >  arch/x86/kvm/Kconfig   |  2 ++
> > > >  arch/x86/kvm/x86.c |  2 +-
> > > >  include/linux/kvm_host.h   |  8 --
> > > >  include/uapi/linux/kvm.h   | 28 +++
> > > >  virt/kvm/Kconfig   |  3 +++
> > > >  virt/kvm/kvm_main.c| 49 --
> > > >  7 files changed, 114 insertions(+), 18 deletions(-)
> > > >
> > > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > > index bb2f709c0900..99352170c130 100644
> > > > --- a/Documentation/virt/kvm/api.rst
> > > > +++ b/Documentation/virt/kvm/api.rst
> > > > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
> > > >  :Capability: KVM_CAP_USER_MEMORY
> > > >  :Architectures: all
> > > >  :Type: vm ioctl
> > > > -:Parameters: struct kvm_userspace_memory_region (in)
> > > > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> > > >  :Returns: 0 on success, -1 on error
> > > >
> > > >  ::
> > > > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> > > > __u64 userspace_addr; /* start of the userspace allocated memory */
> > > >};
> > > >
> > > > +  struct kvm_userspace_memory_region_ext {
> > > > +   struct kvm_userspace_memory_region region;
> > > > +   __u64 restricted_offset;
> > > > +   __u32 restricted_fd;
> > > > +   __u32 pad1;
> > > > +   __u64 pad2[14];
> > > > +  };
> > > > +
> > > >/* for kvm_memory_region::flags */
> > > >#define KVM_MEM_LOG_DIRTY_PAGES  (1UL << 0)
> > > >#define KVM_MEM_READONLY (1UL << 1)
> > > > +  #d

Re: [PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit

2022-12-07 Thread Chao Peng
On Tue, Dec 06, 2022 at 03:47:20PM +, Fuad Tabba wrote:
> Hi,
> 
> On Fri, Dec 2, 2022 at 6:19 AM Chao Peng  wrote:
> >
> > This new KVM exit allows userspace to handle memory-related errors. It
> > indicates an error happens in KVM at guest memory range [gpa, gpa+size).
> > The flags includes additional information for userspace to handle the
> > error. Currently bit 0 is defined as 'private memory' where '1'
> > indicates error happens due to private memory access and '0' indicates
> > error happens due to shared memory access.
> >
> > When private memory is enabled, this new exit will be used for KVM to
> > exit to userspace for shared <-> private memory conversion in memory
> > encryption usage. In such usage, typically there are two kind of memory
> > conversions:
> >   - explicit conversion: happens when guest explicitly calls into KVM
> > to map a range (as private or shared), KVM then exits to userspace
> > to perform the map/unmap operations.
> >   - implicit conversion: happens in KVM page fault handler where KVM
> > exits to userspace for an implicit conversion when the page is in a
> > different state than requested (private or shared).
> >
> > Suggested-by: Sean Christopherson 
> > Co-developed-by: Yu Zhang 
> > Signed-off-by: Yu Zhang 
> > Signed-off-by: Chao Peng 
> > Reviewed-by: Fuad Tabba 
> > ---
> >  Documentation/virt/kvm/api.rst | 22 ++
> >  include/uapi/linux/kvm.h   |  8 
> >  2 files changed, 30 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 99352170c130..d9edb14ce30b 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6634,6 +6634,28 @@ array field represents return values. The userspace should update the return
> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >
> > +::
> > +
> > +   /* KVM_EXIT_MEMORY_FAULT */
> > +   struct {
> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 0)
> > +   __u64 flags;
> 
> I see you've removed the padding and increased the flag size.

Yes, Sean suggested this, and it also looks good to me.

Chao
> 
> Reviewed-by: Fuad Tabba 
> Tested-by: Fuad Tabba 
> 
> Cheers,
> /fuad
> 
> 
> 
> 
> > +   __u64 gpa;
> > +   __u64 size;
> > +   } memory;
> > +
> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> > +encountered a memory error which is not handled by KVM kernel module and
> > +userspace may choose to handle it. The 'flags' field indicates the memory
> > +properties of the exit.
> > +
> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> > +   private memory access when the bit is set. Otherwise the memory error is
> > +   caused by shared memory access when the bit is clear.
> > +
> > +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> > +may handle the error and return to KVM to retry the previous memory access.
> > +
> >  ::
> >
> >  /* KVM_EXIT_NOTIFY */
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 13bff963b8b0..c7e9d375a902 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -300,6 +300,7 @@ struct kvm_xen_exit {
> >  #define KVM_EXIT_RISCV_SBI35
> >  #define KVM_EXIT_RISCV_CSR36
> >  #define KVM_EXIT_NOTIFY   37
> > +#define KVM_EXIT_MEMORY_FAULT 38
> >
> >  /* For KVM_EXIT_INTERNAL_ERROR */
> >  /* Emulate instruction failed. */
> > @@ -541,6 +542,13 @@ struct kvm_run {
> >  #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0)
> > __u32 flags;
> > } notify;
> > +   /* KVM_EXIT_MEMORY_FAULT */
> > +   struct {
> > +#define KVM_MEMORY_EXIT_FLAG_PRIVATE   (1ULL << 0)
> > +   __u64 flags;
> > +   __u64 gpa;
> > +   __u64 size;
> > +   } memory;
> > /* Fix the size of the union. */
> > char padding[256];
> > };
> > --
> > 2.25.1
> >
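
As a rough illustration of how userspace might decode the exit described in the quoted documentation, the sketch below mirrors the 'memory' member locally and extracts the private/shared bit and the faulting range [gpa, gpa + size). The struct and helper names (demo_memory_exit, demo_fault_is_private, demo_fault_end) are hypothetical; only the field layout and the bit-0 meaning come from the quoted patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag value copied from the quoted patch (KVM_MEMORY_EXIT_FLAG_PRIVATE). */
#define DEMO_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 0)

/* Local mirror of the 'memory' member the patch adds to struct kvm_run. */
struct demo_memory_exit {
    uint64_t flags;
    uint64_t gpa;
    uint64_t size;
};

/* True if the fault was caused by a private memory access (bit 0 set). */
static bool demo_fault_is_private(const struct demo_memory_exit *m)
{
    return (m->flags & DEMO_MEMORY_EXIT_FLAG_PRIVATE) != 0;
}

/* Exclusive end of the faulting range [gpa, gpa + size). */
static uint64_t demo_fault_end(const struct demo_memory_exit *m)
{
    return m->gpa + m->size;
}
```

A VMM run loop would typically inspect these values when the exit reason is KVM_EXIT_MEMORY_FAULT, convert the range to the requested state, and then re-enter the guest so the access is retried.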



Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry

2022-12-07 Thread Chao Peng
On Tue, Dec 06, 2022 at 10:34:11PM -0800, Isaku Yamahata wrote:
> On Tue, Dec 06, 2022 at 07:56:23PM +0800,
> Chao Peng  wrote:
> 
> > > > -   if (unlikely(kvm->mmu_invalidate_in_progress) &&
> > > > -   hva >= kvm->mmu_invalidate_range_start &&
> > > > -   hva < kvm->mmu_invalidate_range_end)
> > > > -   return 1;
> > > > +   if (unlikely(kvm->mmu_invalidate_in_progress)) {
> > > > +   /*
> > > > +* Dropping mmu_lock after bumping mmu_invalidate_in_progress
> > > > +* but before updating the range is a KVM bug.
> > > > +*/
> > > > +   if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> > > > +kvm->mmu_invalidate_range_end == INVALID_GPA))
> > > 
> > > INVALID_GPA is an x86-specific define in
> > > arch/x86/include/asm/kvm_host.h, so this doesn't build on other
> > > architectures. The obvious fix is to move it to
> > > include/linux/kvm_host.h.
> > 
> > Hmm, INVALID_GPA is defined as ZERO for x86, not 100% confident this is
> > correct choice for other architectures, but after search it has not been
> > used for other architectures, so should be safe to make it common.
> 
> INVALID_GPA is defined as all bit 1.  Please notice "~" (tilde).
> 
> #define INVALID_GPA (~(gpa_t)0)

Thanks for pointing that out. Moving it to include/linux/kvm_host.h still looks right.
Chao
> -- 
> Isaku Yamahata 
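
The point about the tilde can be checked with a few lines of C. In this sketch demo_gpa_t is an assumed stand-in for the kernel's gpa_t (a 64-bit unsigned type), and DEMO_INVALID_GPA copies the quoted definition; both names are local to the example.

```c
#include <stdint.h>

/* Assumption: stands in for the kernel's gpa_t, a 64-bit unsigned type. */
typedef uint64_t demo_gpa_t;

/* Same expression as the quoted INVALID_GPA: complement of zero, all bits set. */
#define DEMO_INVALID_GPA (~(demo_gpa_t)0)

/* Returns nonzero only for the all-ones sentinel, never for address 0. */
static int demo_gpa_is_invalid(demo_gpa_t gpa)
{
    return gpa == DEMO_INVALID_GPA;
}
```

Because the cast happens before the complement, the result is the maximum value of the 64-bit type rather than a narrower int-sized -1, and guest physical address 0 stays a valid address.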



Re: [PATCH for-8.0] hw/rtc/mc146818rtc: Make this rtc device target independent

2022-12-07 Thread Mark Cave-Ayland

On 06/12/2022 20:06, Thomas Huth wrote:


The only code that is really, really target dependent is the apic-related
code in rtc_policy_slew_deliver_irq(). By moving this code into the hw/i386/
folder (renamed to rtc_apic_policy_slew_deliver_irq()) and passing this
function as parameter to mc146818_rtc_init(), we can make the RTC completely
target-independent.

Signed-off-by: Thomas Huth 
---
  include/hw/rtc/mc146818rtc.h |  7 +--
  hw/alpha/dp264.c |  2 +-
  hw/hppa/machine.c|  2 +-
  hw/i386/microvm.c|  3 ++-
  hw/i386/pc.c | 10 +-
  hw/mips/jazz.c   |  2 +-
  hw/ppc/pnv.c |  2 +-
  hw/rtc/mc146818rtc.c | 34 +++---
  hw/rtc/meson.build   |  3 +--
  9 files changed, 32 insertions(+), 33 deletions(-)

diff --git a/include/hw/rtc/mc146818rtc.h b/include/hw/rtc/mc146818rtc.h
index 1db0fcee92..c687953cc4 100644
--- a/include/hw/rtc/mc146818rtc.h
+++ b/include/hw/rtc/mc146818rtc.h
@@ -46,14 +46,17 @@ struct RTCState {
  Notifier clock_reset_notifier;
  LostTickPolicy lost_tick_policy;
  Notifier suspend_notifier;
+bool (*policy_slew_deliver_irq)(RTCState *s);
  QLIST_ENTRY(RTCState) link;
  };
  
  #define RTC_ISA_IRQ 8
  
-ISADevice *mc146818_rtc_init(ISABus *bus, int base_year,

- qemu_irq intercept_irq);
+ISADevice *mc146818_rtc_init(ISABus *bus, int base_year, qemu_irq intercept_irq,
+ bool (*policy_slew_deliver_irq)(RTCState *s));
  void rtc_set_memory(ISADevice *dev, int addr, int val);
  int rtc_get_memory(ISADevice *dev, int addr);
+bool rtc_apic_policy_slew_deliver_irq(RTCState *s);
+void qmp_rtc_reset_reinjection(Error **errp);
  
  #endif /* HW_RTC_MC146818RTC_H */

diff --git a/hw/alpha/dp264.c b/hw/alpha/dp264.c
index c502c8c62a..8723942b52 100644
--- a/hw/alpha/dp264.c
+++ b/hw/alpha/dp264.c
@@ -118,7 +118,7 @@ static void clipper_init(MachineState *machine)
  qdev_connect_gpio_out(i82378_dev, 0, isa_irq);
  
  /* Since we have an SRM-compatible PALcode, use the SRM epoch.  */

-mc146818_rtc_init(isa_bus, 1900, rtc_irq);
+mc146818_rtc_init(isa_bus, 1900, rtc_irq, NULL);
  
  /* VGA setup.  Don't bother loading the bios.  */

  pci_vga_init(pci_bus);
diff --git a/hw/hppa/machine.c b/hw/hppa/machine.c
index de1cc7ab71..311031714a 100644
--- a/hw/hppa/machine.c
+++ b/hw/hppa/machine.c
@@ -232,7 +232,7 @@ static void machine_hppa_init(MachineState *machine)
  assert(isa_bus);
  
  /* Realtime clock, used by firmware for PDC_TOD call. */

-mc146818_rtc_init(isa_bus, 2000, NULL);
+mc146818_rtc_init(isa_bus, 2000, NULL, NULL);
  
  /* Serial ports: Lasi and Dino use a 7.272727 MHz clock. */

  serial_mm_init(addr_space, LASI_UART_HPA + 0x800, 0,
diff --git a/hw/i386/microvm.c b/hw/i386/microvm.c
index 170a331e3f..d0ed4dca50 100644
--- a/hw/i386/microvm.c
+++ b/hw/i386/microvm.c
@@ -267,7 +267,8 @@ static void microvm_devices_init(MicrovmMachineState *mms)
  
  if (mms->rtc == ON_OFF_AUTO_ON ||

  (mms->rtc == ON_OFF_AUTO_AUTO && !kvm_enabled())) {
-rtc_state = mc146818_rtc_init(isa_bus, 2000, NULL);
+rtc_state = mc146818_rtc_init(isa_bus, 2000, NULL,
+  rtc_apic_policy_slew_deliver_irq);
  microvm_set_rtc(mms, rtc_state);
  }
  
diff --git a/hw/i386/pc.c b/hw/i386/pc.c

index 546b703cb4..650e7bc199 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1244,6 +1244,13 @@ static void pc_superio_init(ISABus *isa_bus, bool create_fdctrl,
  g_free(a20_line);
  }
  
+bool rtc_apic_policy_slew_deliver_irq(RTCState *s)

+{
+apic_reset_irq_delivered();
+qemu_irq_raise(s->irq);
+return apic_get_irq_delivered();
+}
+
  void pc_basic_device_init(struct PCMachineState *pcms,
ISABus *isa_bus, qemu_irq *gsi,
ISADevice **rtc_state,
@@ -1299,7 +1306,8 @@ void pc_basic_device_init(struct PCMachineState *pcms,
  pit_alt_irq = qdev_get_gpio_in(hpet, HPET_LEGACY_PIT_INT);
  rtc_irq = qdev_get_gpio_in(hpet, HPET_LEGACY_RTC_INT);
  }
-*rtc_state = mc146818_rtc_init(isa_bus, 2000, rtc_irq);
+*rtc_state = mc146818_rtc_init(isa_bus, 2000, rtc_irq,
+   rtc_apic_policy_slew_deliver_irq);
  
  qemu_register_boot_set(pc_boot_set, *rtc_state);
  
diff --git a/hw/mips/jazz.c b/hw/mips/jazz.c
index 6aefe9a61b..50fbd57b23 100644
--- a/hw/mips/jazz.c
+++ b/hw/mips/jazz.c
@@ -356,7 +356,7 @@ static void mips_jazz_init(MachineState *machine,
  fdctrl_init_sysbus(qdev_get_gpio_in(rc4030, 1), 0x80003000, fds);
  
  /* Real time clock */
-mc146818_rtc_init(isa_bus, 1980, NULL);
+mc146818_rtc_init(isa_bus, 1980, NULL, NULL);
  memory_region_init_io(rtc, NULL, &rtc_ops, NULL, "rtc", 0x1000);
  memory_region_add_subregion(address_space, 0x80004000, rtc);
  
d
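
The approach in this patch, handing a target-specific function pointer to a generic init routine, can be sketched in isolation. The following is a minimal illustration only, not the QEMU code: DemoRTC, demo_rtc_init(), demo_rtc_raise() and demo_board_deliver_irq() are hypothetical names, and the behaviour when the hook is NULL (simply raising the line) is an assumption, since the hw/rtc/mc146818rtc.c part of the diff is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device state; not QEMU's RTCState. */
typedef struct DemoRTC DemoRTC;
struct DemoRTC {
    int irq_level;                    /* models the state of the IRQ line */
    bool (*deliver_irq)(DemoRTC *s);  /* board-specific hook, may be NULL */
};

/* Generic device code: stores the hook and knows nothing about the board. */
void demo_rtc_init(DemoRTC *s, bool (*deliver_irq)(DemoRTC *s))
{
    assert(s != NULL);
    s->irq_level = 0;
    s->deliver_irq = deliver_irq;
}

/*
 * Generic interrupt path. If the board supplied a hook, defer to it;
 * otherwise (assumed fallback) raise the line and report success.
 */
bool demo_rtc_raise(DemoRTC *s)
{
    assert(s != NULL);
    if (s->deliver_irq) {
        return s->deliver_irq(s);
    }
    s->irq_level = 1;
    return true;
}

/*
 * Board-specific hook, playing the role that
 * rtc_apic_policy_slew_deliver_irq() plays for the PC machines.
 * Here it pretends the interrupt controller did not accept the IRQ.
 */
bool demo_board_deliver_irq(DemoRTC *s)
{
    s->irq_level = 1;
    return false;
}
```

A board without special delivery handling would call demo_rtc_init(&rtc, NULL), while a PC-like board would pass demo_board_deliver_irq; the generic file then needs no target-specific includes.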

Re: [PATCH for-8.0] hw/rtc/mc146818rtc: Make this rtc device target independent

2022-12-07 Thread Bernhard Beschow



On 7 December 2022 08:43:31 UTC, Thomas Huth wrote:
>On 07/12/2022 00.38, Bernhard Beschow wrote:
>> 
>> 
>> On 6 December 2022 20:06:41 UTC, Thomas Huth wrote:
>>> The only code that is really, really target dependent is the apic-related
>>> code in rtc_policy_slew_deliver_irq(). By moving this code into the hw/i386/
>>> folder (renamed to rtc_apic_policy_slew_deliver_irq()) and passing this
>>> function as parameter to mc146818_rtc_init(), we can make the RTC completely
>>> target-independent.
>>> 
>>> Signed-off-by: Thomas Huth 
>>> ---
>>> include/hw/rtc/mc146818rtc.h |  7 +--
>>> hw/alpha/dp264.c |  2 +-
>>> hw/hppa/machine.c|  2 +-
>>> hw/i386/microvm.c|  3 ++-
>>> hw/i386/pc.c | 10 +-
>>> hw/mips/jazz.c   |  2 +-
>>> hw/ppc/pnv.c |  2 +-
>>> hw/rtc/mc146818rtc.c | 34 +++---
>>> hw/rtc/meson.build   |  3 +--
>>> 9 files changed, 32 insertions(+), 33 deletions(-)
>>> 
>>> diff --git a/include/hw/rtc/mc146818rtc.h b/include/hw/rtc/mc146818rtc.h
>>> index 1db0fcee92..c687953cc4 100644
>>> --- a/include/hw/rtc/mc146818rtc.h
>>> +++ b/include/hw/rtc/mc146818rtc.h
>>> @@ -46,14 +46,17 @@ struct RTCState {
>>>  Notifier clock_reset_notifier;
>>>  LostTickPolicy lost_tick_policy;
>>>  Notifier suspend_notifier;
>>> +bool (*policy_slew_deliver_irq)(RTCState *s);
>>>  QLIST_ENTRY(RTCState) link;
>>> };
>>> 
>>> #define RTC_ISA_IRQ 8
>>> 
>>> -ISADevice *mc146818_rtc_init(ISABus *bus, int base_year,
>>> - qemu_irq intercept_irq);
>>> +ISADevice *mc146818_rtc_init(ISABus *bus, int base_year, qemu_irq intercept_irq,
>>> + bool (*policy_slew_deliver_irq)(RTCState *s));
>>> void rtc_set_memory(ISADevice *dev, int addr, int val);
>>> int rtc_get_memory(ISADevice *dev, int addr);
>>> +bool rtc_apic_policy_slew_deliver_irq(RTCState *s);
>>> +void qmp_rtc_reset_reinjection(Error **errp);
>>> 
>>> #endif /* HW_RTC_MC146818RTC_H */
>>> diff --git a/hw/alpha/dp264.c b/hw/alpha/dp264.c
>>> index c502c8c62a..8723942b52 100644
>>> --- a/hw/alpha/dp264.c
>>> +++ b/hw/alpha/dp264.c
>>> @@ -118,7 +118,7 @@ static void clipper_init(MachineState *machine)
>>>  qdev_connect_gpio_out(i82378_dev, 0, isa_irq);
>>> 
>>>  /* Since we have an SRM-compatible PALcode, use the SRM epoch.  */
>>> -mc146818_rtc_init(isa_bus, 1900, rtc_irq);
>>> +mc146818_rtc_init(isa_bus, 1900, rtc_irq, NULL);
>>> 
>>>  /* VGA setup.  Don't bother loading the bios.  */
>>>  pci_vga_init(pci_bus);
>>> diff --git a/hw/hppa/machine.c b/hw/hppa/machine.c
>>> index de1cc7ab71..311031714a 100644
>>> --- a/hw/hppa/machine.c
>>> +++ b/hw/hppa/machine.c
>>> @@ -232,7 +232,7 @@ static void machine_hppa_init(MachineState *machine)
>>>  assert(isa_bus);
>>> 
>>>  /* Realtime clock, used by firmware for PDC_TOD call. */
>>> -mc146818_rtc_init(isa_bus, 2000, NULL);
>>> +mc146818_rtc_init(isa_bus, 2000, NULL, NULL);
>>> 
>>>  /* Serial ports: Lasi and Dino use a 7.272727 MHz clock. */
>>>  serial_mm_init(addr_space, LASI_UART_HPA + 0x800, 0,
>>> diff --git a/hw/i386/microvm.c b/hw/i386/microvm.c
>>> index 170a331e3f..d0ed4dca50 100644
>>> --- a/hw/i386/microvm.c
>>> +++ b/hw/i386/microvm.c
>>> @@ -267,7 +267,8 @@ static void microvm_devices_init(MicrovmMachineState *mms)
>>> 
>>>  if (mms->rtc == ON_OFF_AUTO_ON ||
>>>  (mms->rtc == ON_OFF_AUTO_AUTO && !kvm_enabled())) {
>>> -rtc_state = mc146818_rtc_init(isa_bus, 2000, NULL);
>>> +rtc_state = mc146818_rtc_init(isa_bus, 2000, NULL,
>>> +  rtc_apic_policy_slew_deliver_irq);
>>>  microvm_set_rtc(mms, rtc_state);
>>>  }
>>> 
>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>> index 546b703cb4..650e7bc199 100644
>>> --- a/hw/i386/pc.c
>>> +++ b/hw/i386/pc.c
>>> @@ -1244,6 +1244,13 @@ static void pc_superio_init(ISABus *isa_bus, bool create_fdctrl,
>>>  g_free(a20_line);
>>> }
>>> 
>>> +bool rtc_apic_policy_slew_deliver_irq(RTCState *s)
>>> +{
>>> +apic_reset_irq_delivered();
>>> +qemu_irq_raise(s->irq);
>>> +return apic_get_irq_delivered();
>>> +}
>>> +
>>> void pc_basic_device_init(struct PCMachineState *pcms,
>>>ISABus *isa_bus, qemu_irq *gsi,
>>>ISADevice **rtc_state,
>>> @@ -1299,7 +1306,8 @@ void pc_basic_device_init(struct PCMachineState *pcms,
>>>  pit_alt_irq = qdev_get_gpio_in(hpet, HPET_LEGACY_PIT_INT);
>>>  rtc_irq = qdev_get_gpio_in(hpet, HPET_LEGACY_RTC_INT);
>>>  }
>>> -*rtc_state = mc146818_rtc_init(isa_bus, 2000, rtc_irq);
>>> +*rtc_state = mc146818_rtc_init(isa_bus, 2000, rtc_irq,
>>> +   rtc_apic_policy_slew_deliver_irq);
>> 
>> In my PIIX consolidation series [1] I'm instantiating the RTC in the south 
>> bridges since embedding 

Re: [PATCH V9 00/46] Live Update

2022-12-07 Thread Steven Sistare
This series desperately needs review in its intersection with live migration.
The code in other areas has been reviewed and revised multiple times -- thank
you!

David, Juan, can you spare some time to review this?  I have done my best to
order the patches logically (see the labelled groups in this email), and to
provide a complete and clear cover letter and commit messages. Can I do
anything to facilitate, like doing a code walkthrough via Zoom?

And of course, I welcome anyone's feedback.

Here is the original posting.

https://lore.kernel.org/qemu-devel/1658851843-236870-1-git-send-email-steven.sist...@oracle.com/

- Steve

On 7/26/2022 12:09 PM, Steve Sistare wrote:
> This version of the live update patch series integrates live update into the
> live migration framework.  The new interfaces are:
>   * mode (migration parameter)
>   * cpr-exec-args (migration parameter)
>   * file (migration URI)
>   * migrate-mode-enable (command-line argument)
>   * only-cpr-capable (command-line argument)
> 
> Provide the cpr-exec and cpr-reboot migration modes for live update.  These
> save and restore VM state, with minimal guest pause time, so that qemu may be
> updated to a new version in between.  The caller sets the mode parameter
> before invoking the migrate or migrate-incoming commands.
> 
> In cpr-reboot mode, the migrate command saves state to a file, allowing
> one to quit qemu, reboot to an updated kernel, start an updated version of
> qemu, and resume via the migrate-incoming command.  The caller must specify
> a migration URI that writes to and reads from a file.  Unlike normal mode,
> the use of certain local storage options does not block the migration, but
> the caller must not modify guest block devices between the quit and restart.
> The guest RAM memory-backend must be shared, and the @x-ignore-shared
> migration capability must be set, to avoid saving it to the file.  Guest RAM
> must be non-volatile across reboot, which can be achieved by backing it with
> a dax device, or /dev/shm PKRAM as proposed in
> https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yzn...@oracle.com
> but this is not enforced.  The restarted qemu arguments must match those used
> to initially start qemu, plus the -incoming option.
> 
> The reboot mode supports vfio devices if the caller first suspends the guest,
> such as by issuing guest-suspend-ram to the qemu guest agent.  The guest
> drivers' suspend methods flush outstanding requests and re-initialize the
> devices, and thus there is no device state to save and restore.  After
> issuing migrate-incoming, the caller must issue a system_wakeup command to
> resume.
> 
> In cpr-exec mode, the migrate command saves state to a file and directly
> exec's a new version of qemu on the same host, replacing the original process
> while retaining its PID.  The caller must specify a migration URI that writes
> to and reads from a file, and resumes execution via the migrate-incoming
> command.  Arguments for the new qemu process are taken from the cpr-exec-args
> migration parameter, and must include the -incoming option.
> 
> Guest RAM must be backed by a memory backend with share=on, but cannot be
> memory-backend-ram.  The memory is re-mmap'd in the updated process, so guest
> ram is efficiently preserved in place, albeit with new virtual addresses.
> In addition, the '-migrate-mode-enable cpr-exec' option is required.  This
> causes secondary guest ram blocks (those not specified on the command line)
> to be allocated by mmap'ing a memfd.  The memfds are kept open across exec,
> their values are saved in special cpr state which is retrieved after exec,
> and they are re-mmap'd.  Since guest RAM is not copied, and storage blocks
> are not migrated, the caller must disable all capabilities related to page
> and block copy.  The implementation ignores all related parameters.
> 
> The exec mode supports vfio devices by preserving the vfio container, group,
> device, and event descriptors across the qemu re-exec, and by updating DMA
> mapping virtual addresses using VFIO_DMA_UNMAP_FLAG_VADDR and
> VFIO_DMA_MAP_FLAG_VADDR as defined in
>   
> https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sist...@oracle.com
> and integrated in Linux kernel 5.12.
> 
> Here is an example of updating qemu from v7.0.50 to v7.0.51 using exec mode.
> The software update is performed while the guest is running to minimize
> downtime.
> 
> window 1                                      | window 2
>                                               |
> # qemu-system-$arch ...                       |
>   -migrate-mode-enable cpr-exec               |
> QEMU 7.0.50 monitor - type 'help' ...         |
> (qemu) info status                            |
> VM status: running                            |
>                                               | # yum update qemu
> (qemu) migrate_set_parameter mode cpr-exec    |
> (qemu) migrate_set_parameter cpr-exec-a

RE: [PATCH] configure: Fix check-tcg not executing any tests

2022-12-07 Thread Mukilan Thiyagarajan (QUIC)
Thank you for the pointers, Philippe 😊 Will keep them in mind for future
patches.

Regards,
Mukilan

-----Original Message-----
From: Philippe Mathieu-Daudé
Sent: Wednesday, December 7, 2022 2:37 PM
To: Mukilan Thiyagarajan (QUIC); qemu-devel@nongnu.org; Brian Cain; Matheus Bernardino (QUIC)
Cc: Alex Bennée; Richard Henderson; Paolo Bonzini; Thomas Huth
Subject: Re: [PATCH] configure: Fix check-tcg not executing any tests

Hi Mukilan,

On 7/12/22 09:23, Mukilan Thiyagarajan wrote:
> After configuring with --target-list=hexagon-linux-user
> running `make check-tcg` just prints the following:
> 
> ```
> make: Nothing to be done for 'check-tcg'
> ```
> 
> In the probe_target_compiler function, the 'break'
> command is used incorrectly. There are no lexically
> enclosing loops associated with that break command which
> is an unspecified behaviour in the POSIX standard.
> 
> The dash shell implementation aborts the currently executing
> loop, in this case, causing the rest of the logic for the loop
> in line 2490 to be skipped, which means no Makefiles are
> generated for the tcg target tests.
> 
> Fixes: c3b570b5a9a24d25 (configure: don't enable
> cross compilers unless in target_list)

When posting a patch fixing an issue introduced by another one,
you'll get more feedback if you Cc the author/reviewers of that
patch.

Cc'ing the maintainers also helps in getting your patch picked
up :) See:

https://www.qemu.org/docs/master/devel/submitting-a-patch.html#cc-the-relevant-maintainer

I've Cc'ed the corresponding developers for you.

Regards,

Phil.

> Signed-off-by: Mukilan Thiyagarajan 
> ---
>   configure | 4 +---
>   1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/configure b/configure
> index 26c7bc5154..7a804fb657 100755
> --- a/configure
> +++ b/configure
> @@ -1881,9 +1881,7 @@ probe_target_compiler() {
> # We shall skip configuring the target compiler if the user didn't
> # bother enabling an appropriate guest. This avoids building
> # extraneous firmware images and tests.
> -  if test "${target_list#*$1}" != "$1"; then
> -  break;
> -  else
> +  if test "${target_list#*$1}" = "$1"; then
> return 1
> fi
>   



Re: [PATCH for 7.2?] target/i386: Remove compilation errors when -Werror=maybe-uninitialized

2022-12-07 Thread Stefan Hajnoczi
On Wed, 7 Dec 2022 at 09:34, Eric Auger  wrote:
>
> Hi Stefan,
>
> On 12/7/22 15:09, Stefan Hajnoczi wrote:
> > On Wed, 7 Dec 2022 at 08:31, Eric Auger  wrote:
> >> On 12/7/22 14:24, Eric Auger wrote:
> >>> Initialize r0-3 to avoid compilation errors when
> >>> -Werror=maybe-uninitialized is used
> >>>
> >>> ../target/i386/ops_sse.h: In function ‘helper_vpermdq_ymm’:
> >>> ../target/i386/ops_sse.h:2495:13: error: ‘r3’ may be used uninitialized 
> >>> in this function [-Werror=maybe-uninitialized]
> >>>  2495 | d->Q(3) = r3;
> >>>   | ^~~~
> >>> ../target/i386/ops_sse.h:2494:13: error: ‘r2’ may be used uninitialized 
> >>> in this function [-Werror=maybe-uninitialized]
> >>>  2494 | d->Q(2) = r2;
> >>>   | ^~~~
> >>> ../target/i386/ops_sse.h:2493:13: error: ‘r1’ may be used uninitialized 
> >>> in this function [-Werror=maybe-uninitialized]
> >>>  2493 | d->Q(1) = r1;
> >>>   | ^~~~
> >>> ../target/i386/ops_sse.h:2492:13: error: ‘r0’ may be used uninitialized 
> >>> in this function [-Werror=maybe-uninitialized]
> >>>  2492 | d->Q(0) = r0;
> >>>   | ^~~~
> >>>
> >>> Signed-off-by: Eric Auger 
> >>> Fixes: 790684776861 ("target/i386: reimplement 0x0f 0x3a, add AVX")
> >>>
> >>> ---
> >>>
> >>> Am I the only one getting this? Or anything wrong in my setup.
> >> With Stefan's correct address. Forgive me for the noise.
> > When is -Wmaybe-uninitialized used? QEMU's build system doesn't set
> > it. Unless it's automatically set by meson this must be a manual
> > --extra-cflags= option you set.
>
> I am using this configure cmd line:
>
> ./configure --prefix=/usr --sysconfdir=/etc --libexecdir=/usr/lib/qemu
> --target-list=x86_64-softmmu --docdir=/usr/share/doc/qemu --enable-kvm
> --extra-cflags=-O --enable-trace-backends=log --python=/usr/bin/python3
> --extra-cflags=-Wall --extra-cflags=-Wundef
> --extra-cflags=-Wwrite-strings --extra-cflags=-Wmissing-prototypes
> --extra-cflags=-fno-strict-aliasing --extra-cflags=-fno-common
> --extra-cflags=-Werror=type-limits
> >
> > If you added it manually then let's fix this in 8.0 since it's not
> > tested/supported and very few people will see this issue.

Did you create the ./configure command-line manually? Do you think
other people will hit this?

Stefan
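
For readers unfamiliar with the warning being discussed, here is a minimal sketch of the usual pattern behind it. It assumes (this is not confirmed by the excerpt above) that the variables are assigned only inside a switch whose cases cover every possible value but which has no default branch, so a compiler may fail to prove that an assignment always happens. demo_select_lane() is a hypothetical function, not the code from ops_sse.h, and whether a given GCC version actually warns on it depends on version and optimization level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: pick one of four 64-bit lanes using the low
 * two bits of sel. All four values of (sel & 3) are handled, but
 * without the initializer a compiler may still report that r "may be
 * used uninitialized", because the switch has no default branch.
 */
uint64_t demo_select_lane(const uint64_t lanes[4], unsigned sel)
{
    uint64_t r = 0;   /* explicit initializer silences the warning */

    assert(lanes != NULL);
    switch (sel & 3) {
    case 0:
        r = lanes[0];
        break;
    case 1:
        r = lanes[1];
        break;
    case 2:
        r = lanes[2];
        break;
    case 3:
        r = lanes[3];
        break;
    }
    return r;
}
```

An alternative to the initializer is a default branch that aborts, which documents that the path is unreachable instead of supplying a dummy value.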



Re: [External] Re: [RFC PATCH] migration: reduce time of loading non-iterable vmstate

2022-12-07 Thread Chuang Xu


On 2022/12/6 12:28 AM, Peter Xu wrote:

Chuang,

No worry on the delay; you're faster than when I read yours. :)

On Mon, Dec 05, 2022 at 02:56:15PM +0800, Chuang Xu wrote:

As a start, maybe you can try with poison address_space_to_flatview() (by
e.g. checking the start_pack_mr_change flag and assert it is not set)
during this process to see whether any call stack can even try to
dereference a flatview.

It's just that I didn't figure a good way to "prove" its validity, even if
I think this is an interesting idea worth thinking to shrink the downtime.

Thanks for your suggestions!
I used a thread-local variable to identify whether the current thread is a
migration thread (the main thread of the target qemu), and I modified the code
of qemu_coroutine_switch to make sure the thread-local variable is true only in
the process_incoming_migration_co call stack. If the target qemu detects that
start_pack_mr_change is set and address_space_to_flatview() is called in a
non-migration thread or non-migration coroutine, it will crash.

Are you using the thread var just to avoid the assert triggering in the
migration thread when committing memory changes?

I think _maybe_ another cleaner way to sanity check this is directly upon
the depth:

static inline FlatView *address_space_to_flatview(AddressSpace *as)
{
 /*
  * Before using any flatview, sanity check we're not during a memory
  * region transaction or the map can be invalid.  Note that this can
  * also be called during commit phase of memory transaction, but that
  * should also only happen when the depth decreases to 0 first.
  */
 assert(memory_region_transaction_depth == 0);
 return qatomic_rcu_read(&as->current_map);
}

That should also cover the safe cases of memory transaction commits during
migration.


Peter, I tried this way and found that the target qemu will crash.

Here is the gdb backtrace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x7ff2929d851a in __GI_abort () at abort.c:118
#2  0x7ff2929cfe67 in __assert_fail_base (fmt=, 
assertion=assertion@entry=0x55a32578cdc0 "memory_region_transaction_depth == 0", 
file=file@entry=0x55a32575d9b0 "/data00/migration/qemu-5.2.0/include/exec/memory.h",
line=line@entry=766, function=function@entry=0x55a32578d6e0 
<__PRETTY_FUNCTION__.20463> "address_space_to_flatview") at assert.c:92
#3  0x7ff2929cff12 in __GI___assert_fail (assertion=assertion@entry=0x55a32578cdc0 
"memory_region_transaction_depth == 0", file=file@entry=0x55a32575d9b0 
"/data00/migration/qemu-5.2.0/include/exec/memory.h", line=line@entry=766,
function=function@entry=0x55a32578d6e0 <__PRETTY_FUNCTION__.20463> 
"address_space_to_flatview") at assert.c:101
#4  0x55a324b2ed5e in address_space_to_flatview (as=0x55a326132580 
) at 
/data00/migration/qemu-5.2.0/include/exec/memory.h:766
#5  0x55a324e79559 in address_space_to_flatview (as=0x55a326132580 
) at ../softmmu/memory.c:811
#6  address_space_get_flatview (as=0x55a326132580 ) at 
../softmmu/memory.c:805
#7  0x55a324e96474 in address_space_cache_init (cache=cache@entry=0x55a32a4fb000, 
as=, addr=addr@entry=68404985856, len=len@entry=4096, 
is_write=false) at ../softmmu/physmem.c:3307
#8  0x55a324ea9cba in virtio_init_region_cache (vdev=0x55a32985d9a0, n=0) 
at ../hw/virtio/virtio.c:185
#9  0x55a324eaa615 in virtio_load (vdev=0x55a32985d9a0, f=, 
version_id=) at ../hw/virtio/virtio.c:3203
#10 0x55a324c6ab96 in vmstate_load_state (f=f@entry=0x55a329dc0c00, 
vmsd=0x55a325fc1a60 , opaque=0x55a32985d9a0, version_id=1) 
at ../migration/vmstate.c:143
#11 0x55a324cda138 in vmstate_load (f=0x55a329dc0c00, se=0x55a329941c90) at 
../migration/savevm.c:913
#12 0x55a324cdda34 in qemu_loadvm_section_start_full (mis=0x55a3284ef9e0, 
f=0x55a329dc0c00) at ../migration/savevm.c:2741
#13 qemu_loadvm_state_main (f=f@entry=0x55a329dc0c00, 
mis=mis@entry=0x55a3284ef9e0) at ../migration/savevm.c:2939
#14 0x55a324cdf66a in qemu_loadvm_state (f=0x55a329dc0c00) at 
../migration/savevm.c:3021
#15 0x55a324d14b4e in process_incoming_migration_co (opaque=) at ../migration/migration.c:574
#16 0x55a32501ae3b in coroutine_trampoline (i0=, i1=) at ../util/coroutine-ucontext.c:173
#17 0x7ff2929e8000 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#18 0x7ffed80dc2a0 in ?? ()
#19 0x in ?? ()

address_space_cache_init() is the only caller of address_space_to_flatview()
I can find in the vmstate_load call stack so far. Although I think the mr used
by address_space_cache_init() won't be affected by the delay of
memory_region_transaction_commit(), we really need a mechanism to prevent
the modified mr from being used.

Maybe we can build a stale list:
If a subregion is added, add its parent to the stale list (considering that
the new subregion's priority has uncertain effects on flatviews).
If a subregion is deleted, add itself to the stale list.
When memory_region_transaction_commit() regenerates flatviews, cle

Re: [PATCH for 7.2?] target/i386: Remove compilation errors when -Werror=maybe-uninitialized

2022-12-07 Thread Eric Auger
Hi Stefan,

On 12/7/22 16:55, Stefan Hajnoczi wrote:
> On Wed, 7 Dec 2022 at 09:34, Eric Auger  wrote:
>> Hi Stefan,
>>
>> On 12/7/22 15:09, Stefan Hajnoczi wrote:
>>> On Wed, 7 Dec 2022 at 08:31, Eric Auger  wrote:
>>>> On 12/7/22 14:24, Eric Auger wrote:
>>>>> Initialize r0-3 to avoid compilation errors when
>>>>> -Werror=maybe-uninitialized is used
>>>>>
>>>>> ../target/i386/ops_sse.h: In function ‘helper_vpermdq_ymm’:
>>>>> ../target/i386/ops_sse.h:2495:13: error: ‘r3’ may be used uninitialized 
>>>>> in this function [-Werror=maybe-uninitialized]
>>>>>  2495 | d->Q(3) = r3;
>>>>>   | ^~~~
>>>>> ../target/i386/ops_sse.h:2494:13: error: ‘r2’ may be used uninitialized 
>>>>> in this function [-Werror=maybe-uninitialized]
>>>>>  2494 | d->Q(2) = r2;
>>>>>   | ^~~~
>>>>> ../target/i386/ops_sse.h:2493:13: error: ‘r1’ may be used uninitialized 
>>>>> in this function [-Werror=maybe-uninitialized]
>>>>>  2493 | d->Q(1) = r1;
>>>>>   | ^~~~
>>>>> ../target/i386/ops_sse.h:2492:13: error: ‘r0’ may be used uninitialized 
>>>>> in this function [-Werror=maybe-uninitialized]
>>>>>  2492 | d->Q(0) = r0;
>>>>>   | ^~~~
>>>>>
>>>>> Signed-off-by: Eric Auger 
>>>>> Fixes: 790684776861 ("target/i386: reimplement 0x0f 0x3a, add AVX")
>>>>>
>>>>> ---
>>>>>
>>>>> Am I the only one getting this? Or anything wrong in my setup.
>>>> With Stefan's correct address. Forgive me for the noise.
>>> When is -Wmaybe-uninitialized used? QEMU's build system doesn't set
>>> it. Unless it's automatically set by meson this must be a manual
>>> --extra-cflags= option you set.
>> I am using this configure cmd line:
>>
>> ./configure --prefix=/usr --sysconfdir=/etc --libexecdir=/usr/lib/qemu
>> --target-list=x86_64-softmmu --docdir=/usr/share/doc/qemu --enable-kvm
>> --extra-cflags=-O --enable-trace-backends=log --python=/usr/bin/python3
>> --extra-cflags=-Wall --extra-cflags=-Wundef
>> --extra-cflags=-Wwrite-strings --extra-cflags=-Wmissing-prototypes
>> --extra-cflags=-fno-strict-aliasing --extra-cflags=-fno-common
>> --extra-cflags=-Werror=type-limits
>>> If you added it manually then let's fix this in 8.0 since it's not
>>> tested/supported and very few people will see this issue.
> Did you create the ./configure command-line manually? Do you think
> other people will hit this?
No, I did not. I just tried to install a fresh qemu repo and ran the
above configure command. You should be able to reproduce it, I think.

I am actually surprised nobody hit that already.

Thanks

Eric
>
> Stefan
>




Re: [PATCH 00/18] block: Introduce a block graph rwlock

2022-12-07 Thread Kevin Wolf
On 07.12.2022 at 15:12, Emanuele Giuseppe Esposito wrote:
> > Emanuele Giuseppe Esposito (7):
> >   graph-lock: Implement guard macros
> >   async: Register/unregister aiocontext in graph lock list
> >   block: wrlock in bdrv_replace_child_noperm
> >   block: remove unnecessary assert_bdrv_graph_writable()
> >   block: assert that graph read and writes are performed correctly
> >   block-coroutine-wrapper.py: introduce annotations that take the graph
> > rdlock
> >   block: use co_wrapper_mixed_bdrv_rdlock in functions taking the rdlock
> > 
> > Kevin Wolf (10):
> >   block: Factor out bdrv_drain_all_begin_nopoll()
> >   Import clang-tsa.h
> >   clang-tsa: Add TSA_ASSERT() macro
> >   clang-tsa: Add macros for shared locks
> >   configure: Enable -Wthread-safety if present
> >   test-bdrv-drain: Fix incorrrect drain assumptions
> >   block: Fix locking in external_snapshot_prepare()
> >   graph-lock: TSA annotations for lock/unlock functions
> >   Mark assert_bdrv_graph_readable/writable() GRAPH_RD/WRLOCK
> >   block: GRAPH_RDLOCK for functions only called by co_wrappers
> > 
> > Paolo Bonzini (1):
> >   graph-lock: Introduce a lock to protect block graph operations
> > 
> Reviewed-by: Emanuele Giuseppe Esposito 

Thanks!

> ^ I am curious to see if I am allowed to have my r-b also on my patches :)

That's actually a good question. I wondered myself whether I should add
my R-b to patches that I picked up from you, but which already have my
S-o-b now, of course, and are possibly modified by me.

I would say you're allowed as long as you actually reviewed them in the
version I sent to make sure that I didn't mess them up. :-)
And similarly I'll probably add my R-b on patches that contain code from
you.

Kevin



