On Mon, May 13, 2024 at 8:37 PM Daniel Henrique Barboza < dbarb...@ventanamicro.com> wrote:
> Hi Frank, > > > On 5/8/24 08:15, Daniel Henrique Barboza wrote: > > Hi Frank, > > > > I'll reply with that I've done so far. Still missing some stuff: > > > > On 5/2/24 08:37, Frank Chang wrote: > >> Hi Daniel, > >> > >> Daniel Henrique Barboza <dbarb...@ventanamicro.com> 於 2024年3月8日 週五 > 上午12:04寫道: > >>> > >>> From: Tomasz Jeznach <tjezn...@rivosinc.com> > >>> > >>> The RISC-V IOMMU specification is now ratified as-per the RISC-V > >>> international process. The latest frozen specifcation can be found > >>> at: > >>> > >>> > https://github.com/riscv-non-isa/riscv-iommu/releases/download/v1.0/riscv-iommu.pdf > >>> > >>> Add the foundation of the device emulation for RISC-V IOMMU, which > >>> includes an IOMMU that has no capabilities but MSI interrupt support > and > >>> fault queue interfaces. We'll add add more features incrementally in > the > >>> next patches. > >>> > >>> Co-developed-by: Sebastien Boeuf <s...@rivosinc.com> > >>> Signed-off-by: Sebastien Boeuf <s...@rivosinc.com> > >>> Signed-off-by: Tomasz Jeznach <tjezn...@rivosinc.com> > >>> Signed-off-by: Daniel Henrique Barboza <dbarb...@ventanamicro.com> > >>> --- > >>> hw/riscv/Kconfig | 4 + > > (...) > > >>> + > >>> + s->iommus.le_next = NULL; > >>> + s->iommus.le_prev = NULL; > >>> + QLIST_INIT(&s->spaces); > >>> + qemu_cond_init(&s->core_cond); > >>> + qemu_mutex_init(&s->core_lock); > >>> + qemu_spin_init(&s->regs_lock); > >>> + qemu_thread_create(&s->core_proc, "riscv-iommu-core", > >>> + riscv_iommu_core_proc, s, QEMU_THREAD_JOINABLE); > >> > >> In our experience, using QEMU thread increases the latency of command > >> queue processing, > >> which leads to the potential IOMMU fence timeout in the Linux driver > >> when using IOMMU with KVM, > >> e.g. booting the guest Linux. > >> > >> Is it possible to remove the thread from the IOMMU just like ARM, AMD, > >> and Intel IOMMU models? > > > > Interesting. We've been using this emulation internally in Ventana, with > > KVM and VFIO, and didn't experience this issue. Drew is on CC and can > talk > > more about it. > > > > That said, I don't mind this change, assuming it's feasible to make it > for this > > first version. I'll need to check it how other IOMMUs are doing it. > > > I removed the threading and it seems to be working fine without it. I'll > commit this > change for v3. > > > > > > > > >> > >>> +} > >>> + > > > > (...) > > > >>> + > >>> +static AddressSpace *riscv_iommu_find_as(PCIBus *bus, void *opaque, > int devfn) > >>> +{ > >>> + RISCVIOMMUState *s = (RISCVIOMMUState *) opaque; > >>> + PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn); > >>> + AddressSpace *as = NULL; > >>> + > >>> + if (pdev && pci_is_iommu(pdev)) { > >>> + return s->target_as; > >>> + } > >>> + > >>> + /* Find first registered IOMMU device */ > >>> + while (s->iommus.le_prev) { > >>> + s = *(s->iommus.le_prev); > >>> + } > >>> + > >>> + /* Find first matching IOMMU */ > >>> + while (s != NULL && as == NULL) { > >>> + as = riscv_iommu_space(s, PCI_BUILD_BDF(pci_bus_num(bus), > devfn)); > >> > >> For pci_bus_num(), > >> riscv_iommu_find_as() can be called at the very early stage > >> where software has no chance to enumerate the bus numbers. > > I investigated and this doesn't seem to be a problem. This function is > called at the > last step of the realize() steps of both riscv_iommu_pci_realize() and > riscv_iommu_sys_realize(), and by that time the pci_bus_num() is already > assigned. > Other iommus use pci_bus_num() into their own get_address_space() > callbacks like > this too. > Hi Daniel, IIUC, pci_bus_num() by default is assigned to pcibus_num(): static int pcibus_num(PCIBus *bus) { if (pci_bus_is_root(bus)) { return 0; /* pci host bridge */ } return bus->parent_dev->config[PCI_SECONDARY_BUS]; } If the bus is not the root bus, it tries to read the bus' parent device's secondary bus number (PCI_SECONDARY_BUS) field in the PCI configuration space. This field should be programmable by the SW during PCIe enumeration. But I don't think SW has a chance to be executed before riscv_iommu_sys_realize() is called, since it's pretty early before CPU's execution unless RISC-V IOMMU is hot-plugged. Even if RISC-V IOMMU is hot-plugged, I think riscv_iommu_sys_realize() is still called before SW aware of the existence of IOMMU on the PCI topology tree. Do you think this matches your observation? Regards, Frank Chang > > > Thanks, > > > Daniel > > > > > > I'll see how other IOMMUs are handling their iommu_find_as() > > > > > > Thanks, > > > > > > Daniel > > > > > >> > >> > >> > >> > >>> + s = s->iommus.le_next; > >>> + } > >>> + > >>> + return as ? as : &address_space_memory; > >>> +} > >>> + > >>> +static const PCIIOMMUOps riscv_iommu_ops = { > >>> + .get_address_space = riscv_iommu_find_as, > >>> +}; > >>> + > >>> +void riscv_iommu_pci_setup_iommu(RISCVIOMMUState *iommu, PCIBus *bus, > >>> + Error **errp) > >>> +{ > >>> + if (bus->iommu_ops && > >>> + bus->iommu_ops->get_address_space == riscv_iommu_find_as) { > >>> + /* Allow multiple IOMMUs on the same PCIe bus, link known > devices */ > >>> + RISCVIOMMUState *last = (RISCVIOMMUState *)bus->iommu_opaque; > >>> + QLIST_INSERT_AFTER(last, iommu, iommus); > >>> + } else if (bus->iommu_ops == NULL) { > >>> + pci_setup_iommu(bus, &riscv_iommu_ops, iommu); > >>> + } else { > >>> + error_setg(errp, "can't register secondary IOMMU for PCI bus > #%d", > >>> + pci_bus_num(bus)); > >>> + } > >>> +} > >>> + > >>> +static int riscv_iommu_memory_region_index(IOMMUMemoryRegion > *iommu_mr, > >>> + MemTxAttrs attrs) > >>> +{ > >>> + return attrs.unspecified ? RISCV_IOMMU_NOPASID : (int)attrs.pasid; > >>> +} > >>> + > >>> +static int riscv_iommu_memory_region_index_len(IOMMUMemoryRegion > *iommu_mr) > >>> +{ > >>> + RISCVIOMMUSpace *as = container_of(iommu_mr, RISCVIOMMUSpace, > iova_mr); > >>> + return 1 << as->iommu->pasid_bits; > >>> +} > >>> + > >>> +static void riscv_iommu_memory_region_init(ObjectClass *klass, void > *data) > >>> +{ > >>> + IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_CLASS(klass); > >>> + > >>> + imrc->translate = riscv_iommu_memory_region_translate; > >>> + imrc->notify_flag_changed = riscv_iommu_memory_region_notify; > >>> + imrc->attrs_to_index = riscv_iommu_memory_region_index; > >>> + imrc->num_indexes = riscv_iommu_memory_region_index_len; > >>> +} > >>> + > >>> +static const TypeInfo riscv_iommu_memory_region_info = { > >>> + .parent = TYPE_IOMMU_MEMORY_REGION, > >>> + .name = TYPE_RISCV_IOMMU_MEMORY_REGION, > >>> + .class_init = riscv_iommu_memory_region_init, > >>> +}; > >>> + > >>> +static void riscv_iommu_register_mr_types(void) > >>> +{ > >>> + type_register_static(&riscv_iommu_memory_region_info); > >>> + type_register_static(&riscv_iommu_info); > >>> +} > >>> + > >>> +type_init(riscv_iommu_register_mr_types); > >>> diff --git a/hw/riscv/riscv-iommu.h b/hw/riscv/riscv-iommu.h > >>> new file mode 100644 > >>> index 0000000000..6f740de690 > >>> --- /dev/null > >>> +++ b/hw/riscv/riscv-iommu.h > >>> @@ -0,0 +1,141 @@ > >>> +/* > >>> + * QEMU emulation of an RISC-V IOMMU (Ziommu) > >>> + * > >>> + * Copyright (C) 2022-2023 Rivos Inc. > >>> + * > >>> + * This program is free software; you can redistribute it and/or > modify > >>> + * it under the terms of the GNU General Public License as published > by > >>> + * the Free Software Foundation; either version 2 of the License. > >>> + * > >>> + * This program is distributed in the hope that it will be useful, > >>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of > >>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > >>> + * GNU General Public License for more details. > >>> + * > >>> + * You should have received a copy of the GNU General Public License > along > >>> + * with this program; if not, see <http://www.gnu.org/licenses/>. > >>> + */ > >>> + > >>> +#ifndef HW_RISCV_IOMMU_STATE_H > >>> +#define HW_RISCV_IOMMU_STATE_H > >>> + > >>> +#include "qemu/osdep.h" > >>> +#include "qom/object.h" > >>> + > >>> +#include "hw/riscv/iommu.h" > >>> + > >>> +struct RISCVIOMMUState { > >>> + /*< private >*/ > >>> + DeviceState parent_obj; > >>> + > >>> + /*< public >*/ > >>> + uint32_t version; /* Reported interface version number */ > >>> + uint32_t pasid_bits; /* process identifier width */ > >>> + uint32_t bus; /* PCI bus mapping for non-root endpoints */ > >>> + > >>> + uint64_t cap; /* IOMMU supported capabilities */ > >>> + uint64_t fctl; /* IOMMU enabled features */ > >>> + > >>> + bool enable_off; /* Enable out-of-reset OFF mode (DMA > disabled) */ > >>> + bool enable_msi; /* Enable MSI remapping */ > >>> + > >>> + /* IOMMU Internal State */ > >>> + uint64_t ddtp; /* Validated Device Directory Tree Root > Pointer */ > >>> + > >>> + dma_addr_t cq_addr; /* Command queue base physical address */ > >>> + dma_addr_t fq_addr; /* Fault/event queue base physical address > */ > >>> + dma_addr_t pq_addr; /* Page request queue base physical address > */ > >>> + > >>> + uint32_t cq_mask; /* Command queue index bit mask */ > >>> + uint32_t fq_mask; /* Fault/event queue index bit mask */ > >>> + uint32_t pq_mask; /* Page request queue index bit mask */ > >>> + > >>> + /* interrupt notifier */ > >>> + void (*notify)(RISCVIOMMUState *iommu, unsigned vector); > >>> + > >>> + /* IOMMU State Machine */ > >>> + QemuThread core_proc; /* Background processing thread */ > >>> + QemuMutex core_lock; /* Global IOMMU lock, used for cache/regs > updates */ > >>> + QemuCond core_cond; /* Background processing wake up signal */ > >>> + unsigned core_exec; /* Processing thread execution actions */ > >>> + > >>> + /* IOMMU target address space */ > >>> + AddressSpace *target_as; > >>> + MemoryRegion *target_mr; > >>> + > >>> + /* MSI / MRIF access trap */ > >>> + AddressSpace trap_as; > >>> + MemoryRegion trap_mr; > >>> + > >>> + GHashTable *ctx_cache; /* Device translation Context > Cache */ > >>> + > >>> + /* MMIO Hardware Interface */ > >>> + MemoryRegion regs_mr; > >>> + QemuSpin regs_lock; > >>> + uint8_t *regs_rw; /* register state (user write) */ > >>> + uint8_t *regs_wc; /* write-1-to-clear mask */ > >>> + uint8_t *regs_ro; /* read-only mask */ > >>> + > >>> + QLIST_ENTRY(RISCVIOMMUState) iommus; > >>> + QLIST_HEAD(, RISCVIOMMUSpace) spaces; > >>> +}; > >>> + > >>> +void riscv_iommu_pci_setup_iommu(RISCVIOMMUState *iommu, PCIBus *bus, > >>> + Error **errp); > >>> + > >>> +/* private helpers */ > >>> + > >>> +/* Register helper functions */ > >>> +static inline uint32_t riscv_iommu_reg_mod32(RISCVIOMMUState *s, > >>> + unsigned idx, uint32_t set, uint32_t clr) > >>> +{ > >>> + uint32_t val; > >>> + qemu_spin_lock(&s->regs_lock); > >>> + val = ldl_le_p(s->regs_rw + idx); > >>> + stl_le_p(s->regs_rw + idx, (val & ~clr) | set); > >>> + qemu_spin_unlock(&s->regs_lock); > >>> + return val; > >>> +} > >>> + > >>> +static inline void riscv_iommu_reg_set32(RISCVIOMMUState *s, > >>> + unsigned idx, uint32_t set) > >>> +{ > >>> + qemu_spin_lock(&s->regs_lock); > >>> + stl_le_p(s->regs_rw + idx, set); > >>> + qemu_spin_unlock(&s->regs_lock); > >>> +} > >>> + > >>> +static inline uint32_t riscv_iommu_reg_get32(RISCVIOMMUState *s, > >>> + unsigned idx) > >>> +{ > >>> + return ldl_le_p(s->regs_rw + idx); > >>> +} > >>> + > >>> +static inline uint64_t riscv_iommu_reg_mod64(RISCVIOMMUState *s, > >>> + unsigned idx, uint64_t set, uint64_t clr) > >>> +{ > >>> + uint64_t val; > >>> + qemu_spin_lock(&s->regs_lock); > >>> + val = ldq_le_p(s->regs_rw + idx); > >>> + stq_le_p(s->regs_rw + idx, (val & ~clr) | set); > >>> + qemu_spin_unlock(&s->regs_lock); > >>> + return val; > >>> +} > >>> + > >>> +static inline void riscv_iommu_reg_set64(RISCVIOMMUState *s, > >>> + unsigned idx, uint64_t set) > >>> +{ > >>> + qemu_spin_lock(&s->regs_lock); > >>> + stq_le_p(s->regs_rw + idx, set); > >>> + qemu_spin_unlock(&s->regs_lock); > >>> +} > >>> + > >>> +static inline uint64_t riscv_iommu_reg_get64(RISCVIOMMUState *s, > >>> + unsigned idx) > >>> +{ > >>> + return ldq_le_p(s->regs_rw + idx); > >>> +} > >>> + > >>> + > >>> + > >>> +#endif > >>> diff --git a/hw/riscv/trace-events b/hw/riscv/trace-events > >>> new file mode 100644 > >>> index 0000000000..42a97caffa > >>> --- /dev/null > >>> +++ b/hw/riscv/trace-events > >>> @@ -0,0 +1,11 @@ > >>> +# See documentation at docs/devel/tracing.rst > >>> + > >>> +# riscv-iommu.c > >>> +riscv_iommu_new(const char *id, unsigned b, unsigned d, unsigned f) > "%s: device attached %04x:%02x.%d" > >>> +riscv_iommu_flt(const char *id, unsigned b, unsigned d, unsigned f, > uint64_t reason, uint64_t iova) "%s: fault %04x:%02x.%u reason: 0x%"PRIx64" > iova: 0x%"PRIx64 > >>> +riscv_iommu_pri(const char *id, unsigned b, unsigned d, unsigned f, > uint64_t iova) "%s: page request %04x:%02x.%u iova: 0x%"PRIx64 > >>> +riscv_iommu_dma(const char *id, unsigned b, unsigned d, unsigned f, > unsigned pasid, const char *dir, uint64_t iova, uint64_t phys) "%s: > translate %04x:%02x.%u #%u %s 0x%"PRIx64" -> 0x%"PRIx64 > >>> +riscv_iommu_msi(const char *id, unsigned b, unsigned d, unsigned f, > uint64_t iova, uint64_t phys) "%s: translate %04x:%02x.%u MSI 0x%"PRIx64" > -> 0x%"PRIx64 > >>> +riscv_iommu_cmd(const char *id, uint64_t l, uint64_t u) "%s: command > 0x%"PRIx64" 0x%"PRIx64 > >>> +riscv_iommu_notifier_add(const char *id) "%s: dev-iotlb notifier > added" > >>> +riscv_iommu_notifier_del(const char *id) "%s: dev-iotlb notifier > removed" > >>> diff --git a/hw/riscv/trace.h b/hw/riscv/trace.h > >>> new file mode 100644 > >>> index 0000000000..b88504b750 > >>> --- /dev/null > >>> +++ b/hw/riscv/trace.h > >>> @@ -0,0 +1,2 @@ > >>> +#include "trace/trace-hw_riscv.h" > >>> + > >>> diff --git a/include/hw/riscv/iommu.h b/include/hw/riscv/iommu.h > >>> new file mode 100644 > >>> index 0000000000..403b365893 > >>> --- /dev/null > >>> +++ b/include/hw/riscv/iommu.h > >>> @@ -0,0 +1,36 @@ > >>> +/* > >>> + * QEMU emulation of an RISC-V IOMMU (Ziommu) > >>> + * > >>> + * Copyright (C) 2022-2023 Rivos Inc. > >>> + * > >>> + * This program is free software; you can redistribute it and/or > modify > >>> + * it under the terms of the GNU General Public License as published > by > >>> + * the Free Software Foundation; either version 2 of the License. > >>> + * > >>> + * This program is distributed in the hope that it will be useful, > >>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of > >>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > >>> + * GNU General Public License for more details. > >>> + * > >>> + * You should have received a copy of the GNU General Public License > along > >>> + * with this program; if not, see <http://www.gnu.org/licenses/>. > >>> + */ > >>> + > >>> +#ifndef HW_RISCV_IOMMU_H > >>> +#define HW_RISCV_IOMMU_H > >>> + > >>> +#include "qemu/osdep.h" > >>> +#include "qom/object.h" > >>> + > >>> +#define TYPE_RISCV_IOMMU "riscv-iommu" > >>> +OBJECT_DECLARE_SIMPLE_TYPE(RISCVIOMMUState, RISCV_IOMMU) > >>> +typedef struct RISCVIOMMUState RISCVIOMMUState; > >>> + > >>> +#define TYPE_RISCV_IOMMU_MEMORY_REGION "riscv-iommu-mr" > >>> +typedef struct RISCVIOMMUSpace RISCVIOMMUSpace; > >>> + > >>> +#define TYPE_RISCV_IOMMU_PCI "riscv-iommu-pci" > >>> +OBJECT_DECLARE_SIMPLE_TYPE(RISCVIOMMUStatePci, RISCV_IOMMU_PCI) > >>> +typedef struct RISCVIOMMUStatePci RISCVIOMMUStatePci; > >>> + > >>> +#endif > >>> diff --git a/meson.build b/meson.build > >>> index c59ca496f2..75e56f3282 100644 > >>> --- a/meson.build > >>> +++ b/meson.build > >>> @@ -3361,6 +3361,7 @@ if have_system > >>> 'hw/rdma', > >>> 'hw/rdma/vmw', > >>> 'hw/rtc', > >>> + 'hw/riscv', > >>> 'hw/s390x', > >>> 'hw/scsi', > >>> 'hw/sd', > >>> -- > >>> 2.43.2 > >>> > >>> >