On 05/04/2018 06:51 AM, David Gibson wrote:
> On Thu, May 03, 2018 at 06:06:14PM +0200, Cédric Le Goater wrote:
>> On 05/03/2018 07:35 AM, David Gibson wrote:
>>> On Thu, Apr 26, 2018 at 11:27:21AM +0200, Cédric Le Goater wrote:
>>>> On 04/26/2018 09:11 AM, David Gibson wrote:
>>>>> On Thu, Apr 19, 2018 at 02:43:02PM +0200, Cédric Le Goater wrote:
>>>>>> The XIVE presenter engine uses a set of registers to handle priority
>>>>>> management and interrupt acknowledgment among other things. The most
>>>>>> important ones are:
>>>>>>
>>>>>> - Interrupt Priority Register (PIPR)
>>>>>> - Interrupt Pending Buffer (IPB)
>>>>>> - Current Processor Priority (CPPR)
>>>>>> - Notification Source Register (NSR)
>>>>>>
>>>>>> There is one set of registers per level of privilege, four in all:
>>>>>> HW, HV pool, OS and User. These are called rings. All registers are
>>>>>> accessible through a specific MMIO region called the Thread Interrupt
>>>>>> Management Area (TIMA) but, depending on the privilege level of the
>>>>>> CPU, the view of the TIMA is filtered. The sPAPR machine runs at the
>>>>>> OS privilege level and therefore can only access the OS and the User
>>>>>> rings. The others are for hypervisor levels.
>>>>>>
>>>>>> The CPU interrupt state is modeled with a XiveNVT object which stores
>>>>>> the values of the different registers. The different TIMA views are
>>>>>> mapped at the same address for each CPU and 'current_cpu' is used to
>>>>>> retrieve the XiveNVT holding the ring registers.
>>>>>>
>>>>>> Signed-off-by: Cédric Le Goater <c...@kaod.org>
>>>>>> ---
>>>>>>
>>>>>> Changes since v2:
>>>>>>
>>>>>> - introduced the XiveFabric interface
>>>>>>
>>>>>>  hw/intc/spapr_xive.c        |  25 ++++
>>>>>>  hw/intc/xive.c              | 279 ++++++++++++++++++++++++++++++++++++++++++++
>>>>>>  include/hw/ppc/spapr_xive.h |   5 +
>>>>>>  include/hw/ppc/xive.h       |  31 +++++
>>>>>>  include/hw/ppc/xive_regs.h  |  84 +++++++++++++
>>>>>>  5 files changed, 424 insertions(+)
>>>>>>
>>>>>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>>>>>> index 90cde8a4082d..f07832bf0a00 100644
>>>>>> --- a/hw/intc/spapr_xive.c
>>>>>> +++ b/hw/intc/spapr_xive.c
>>>>>> @@ -13,6 +13,7 @@
>>>>>>  #include "target/ppc/cpu.h"
>>>>>>  #include "sysemu/cpus.h"
>>>>>>  #include "monitor/monitor.h"
>>>>>> +#include "hw/ppc/spapr.h"
>>>>>>  #include "hw/ppc/spapr_xive.h"
>>>>>>  #include "hw/ppc/xive.h"
>>>>>>  #include "hw/ppc/xive_regs.h"
>>>>>> @@ -95,6 +96,22 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
>>>>>>
>>>>>>      /* Allocate the Interrupt Virtualization Table */
>>>>>>      xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
>>>>>> +
>>>>>> +    /* The Thread Interrupt Management Area has the same address for
>>>>>> +     * each chip. On sPAPR, we only need to expose the User and OS
>>>>>> +     * level views of the TIMA.
>>>>>> +     */
>>>>>> +    xive->tm_base = XIVE_TM_BASE;
>>>>>
>>>>> The constant should probably have PAPR in the name somewhere, since
>>>>> it's just for PAPR machines (same for the ESB mappings, actually).
>>>>
>>>> ok.
>>>>
>>>> I have also made 'tm_base' a property, like 'vc_base' for ESBs, in
>>>> case we want to change the value when the guest is instantiated.
>>>> I doubt it, but this is an address in the global address space, so
>>>> letting the machine have control is better, I think.
>>>
>>> I agree.
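For reference, the property change I have in mind looks something like
this. It is only a sketch: the 'tm-base' property name, the sPAPRXive
struct name and the SPAPR_XIVE_TM_BASE constant (renamed with PAPR in
it, as suggested above) are illustrative, not final.

    static Property spapr_xive_properties[] = {
        /* Address of the TIMA in the global address space. The machine
         * can override it when the device is instantiated. */
        DEFINE_PROP_UINT64("tm-base", sPAPRXive, tm_base, SPAPR_XIVE_TM_BASE),
        DEFINE_PROP_END_OF_LIST(),
    };

    static void spapr_xive_class_init(ObjectClass *klass, void *data)
    {
        DeviceClass *dc = DEVICE_CLASS(klass);

        dc->realize = spapr_xive_realize;
        dc->props = spapr_xive_properties;
    }

The realize routine then drops the 'xive->tm_base = XIVE_TM_BASE;'
assignment and simply uses the property value.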
>>>
>>>>>> +
>>>>>> +    memory_region_init_io(&xive->tm_mmio_user, OBJECT(xive),
>>>>>> +                          &xive_tm_user_ops, xive, "xive.tima.user",
>>>>>> +                          1ull << TM_SHIFT);
>>>>>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_user);
>>>>>> +
>>>>>> +    memory_region_init_io(&xive->tm_mmio_os, OBJECT(xive),
>>>>>> +                          &xive_tm_os_ops, xive, "xive.tima.os",
>>>>>> +                          1ull << TM_SHIFT);
>>>>>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_os);
>>>>>>  }
>>>>>>
>>>>>>  static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
>>>>>> @@ -104,6 +121,13 @@ static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
>>>>>>      return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
>>>>>>  }
>>>>>>
>>>>>> +static XiveNVT *spapr_xive_get_nvt(XiveFabric *xf, uint32_t server)
>>>>>> +{
>>>>>> +    PowerPCCPU *cpu = spapr_find_cpu(server);
>>>>>> +
>>>>>> +    return cpu ? XIVE_NVT(cpu->intc) : NULL;
>>>>>> +}
>>>>>
>>>>> So this is a bit of a tangent, but I've been thinking of implementing
>>>>> a scheme where there's an opaque pointer in the cpu structure for the
>>>>> use of the machine. I'm planning for that to replace the intc pointer
>>>>> (which isn't really used directly by the cpu). That would allow us to
>>>>> have spapr put a structure there and have both xics and xive pointers
>>>>> which could be useful later on.
>>>>
>>>> ok. That should simplify the patchset at the end, in which we need to
>>>> switch the 'intc' pointer.
>>>>
>>>>> I think we'd need something similar to correctly handle migration of
>>>>> the VPA state, which is currently horribly broken.
>>>>>
>>>>>> +
>>>>>>  static const VMStateDescription vmstate_spapr_xive_ive = {
>>>>>>      .name = TYPE_SPAPR_XIVE "/ive",
>>>>>>      .version_id = 1,
>>>>>> @@ -143,6 +167,7 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
>>>>>>      dc->vmsd = &vmstate_spapr_xive;
>>>>>>
>>>>>>      xfc->get_ive = spapr_xive_get_ive;
>>>>>> +    xfc->get_nvt = spapr_xive_get_nvt;
>>>>>>  }
>>>>>>
>>>>>>  static const TypeInfo spapr_xive_info = {
>>>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>>>> index dccad0318834..5691bb9474e4 100644
>>>>>> --- a/hw/intc/xive.c
>>>>>> +++ b/hw/intc/xive.c
>>>>>> @@ -14,7 +14,278 @@
>>>>>>  #include "sysemu/cpus.h"
>>>>>>  #include "sysemu/dma.h"
>>>>>>  #include "monitor/monitor.h"
>>>>>> +#include "hw/ppc/xics.h"        /* for ICP_PROP_CPU */
>>>>>>  #include "hw/ppc/xive.h"
>>>>>> +#include "hw/ppc/xive_regs.h"
>>>>>> +
>>>>>> +/*
>>>>>> + * XIVE Interrupt Presenter
>>>>>> + */
>>>>>> +
>>>>>> +static uint64_t xive_nvt_accept(XiveNVT *nvt)
>>>>>> +{
>>>>>> +    return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static void xive_nvt_set_cppr(XiveNVT *nvt, uint8_t cppr)
>>>>>> +{
>>>>>> +    if (cppr > XIVE_PRIORITY_MAX) {
>>>>>> +        cppr = 0xff;
>>>>>> +    }
>>>>>> +
>>>>>> +    nvt->ring_os[TM_CPPR] = cppr;
>>>>>
>>>>> Surely this needs to recheck if we should be interrupting the cpu?
>>>>
>>>> yes. In patch 9, when we introduce the nvt notify routine.
>>>
>>> Ok.
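To make the recheck concrete, patch 9 does roughly the following. This
is a simplified sketch: the ipb_to_pipr helper, the xive_nvt_notify name
and the 'output' qemu_irq field are illustrative (clz32 comes from
"qemu/host-utils.h").

    static uint8_t ipb_to_pipr(uint8_t ipb)
    {
        /* Priority 0 is the most favored, so the PIPR is the bit
         * number of the most significant bit set in the IPB. */
        return ipb ? clz32((uint32_t)ipb << 24) : 0xff;
    }

    static void xive_nvt_notify(XiveNVT *nvt)
    {
        /* Recompute the pending priority after any register change */
        nvt->ring_os[TM_PIPR] = ipb_to_pipr(nvt->ring_os[TM_IPB]);

        /* Signal the CPU only if the pending priority is more favored
         * (smaller) than the current processor priority. */
        if (nvt->ring_os[TM_PIPR] < nvt->ring_os[TM_CPPR]) {
            nvt->ring_os[TM_NSR] |= TM_QW1_NSR_EO;
            qemu_irq_raise(nvt->output);
        }
    }

xive_nvt_set_cppr() then ends with a call to xive_nvt_notify(), since
lowering the CPPR may let a pending interrupt through.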
>>>
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * OS Thread Interrupt Management Area MMIO
>>>>>> + */
>>>>>> +static uint64_t xive_tm_read_special(XiveNVT *nvt, hwaddr offset,
>>>>>> +                                     unsigned size)
>>>>>> +{
>>>>>> +    uint64_t ret = -1;
>>>>>> +
>>>>>> +    if (offset == TM_SPC_ACK_OS_REG && size == 2) {
>>>>>> +        ret = xive_nvt_accept(nvt);
>>>>>> +    } else {
>>>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid TIMA read @%"
>>>>>> +                      HWADDR_PRIx" size %d\n", offset, size);
>>>>>> +    }
>>>>>> +
>>>>>> +    return ret;
>>>>>> +}
>>>>>> +
>>>>>> +#define TM_RING(offset) ((offset) & 0xf0)
>>>>>> +
>>>>>> +static uint64_t xive_tm_os_read(void *opaque, hwaddr offset,
>>>>>> +                                unsigned size)
>>>>>> +{
>>>>>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
>>>>>
>>>>> So, as I said on a previous version of this, we can actually correctly
>>>>> represent different mappings in different cpu spaces, by exploiting
>>>>> cpu->as and not just having them all point to &address_space_memory.
>>>>
>>>> Yes, you did, and I haven't studied the question yet. For the next version.
>>>
>>> So, it's possible that using the cpu->as thing will be more trouble
>>> than it's worth.
>>
>> One of the problems is the number of memory regions to use, one per cpu,
>
> Well, we're already going to have an NVT object for each cpu, yes? So
> a memory region per-cpu doesn't seem like a big stretch.
>
>> and the KVM support.
>
> And I really don't see how the memory regions impact KVM.
The TIMA is set up when the KVM device is initialized, using a specific
ioctl to get an fd on an MMIO region from the host. It is then passed to
the guest as a 'ram_device', same for the ESBs. This is not a common
region.

>> Having a single region is much easier.
>>
>>> I am a little concerned about using current_cpu though.
>>> First, will it work with KVM with kernel_irqchip=off - the
>>> cpus are running truly concurrently,
>>
>> FWIW, I haven't seen any issue yet while stress testing.
>
> Ok.
>
>>> but we still need to work out who's poking at the TIMA.
>>
>> I understand. The registers are accessed by the current cpu to set the
>> CPPR and to ack an interrupt. But when we route an event, we also access
>> and modify the registers. Do you suggest some locking? I am not sure
>> how the TIMA region accesses are protected against the routing, which is
>> necessarily initiated by an ESB MMIO, though.
>
> Locking isn't really the issue. I mean, we do need locking, but the
> BQL should provide that. The issue is what exactly "current" means
> in the context of multiple concurrently running cpus. Does it
> always mean what we need it to mean in every context we might call
> this from?

I would say so.

C.

>>> Second, are there any cases where we might
>>> need to trip this "on behalf of" a specific cpu that's not the current
>>> one.
>>
>> ah. yes. sort of :) only in powernv, when the xive is reset (and when
>> dumping the state for debug).
>>
>> The IC has a way to indirectly access the registers of a HW thread.
>> It first sets the PC_TCTXT_INDIR_THRDID register with the PIR of
>> the targeted thread, and then loads on the indirect TIMA can be done
>> as if it were the current thread. The indirect TIMA is mapped 4 pages
>> after the IC BAR.
>>
>> The resulting memory region op is a little ugly and might need
>> some rework:
>>
>> static uint64_t xive_tm_hv_read(void *opaque, hwaddr offset,
>>                                 unsigned size)
>> {
>>     PowerPCCPU **cpuptr = opaque;
>>     PowerPCCPU *cpu = *cpuptr ? *cpuptr : POWERPC_CPU(current_cpu);
>>     ...
>>
>>
>>>>>> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
>>>>>> +    uint64_t ret = -1;
>>>>>> +    int i;
>>>>>> +
>>>>>> +    if (offset >= TM_SPC_ACK_EBB) {
>>>>>> +        return xive_tm_read_special(nvt, offset, size);
>>>>>> +    }
>>>>>> +
>>>>>> +    if (TM_RING(offset) != TM_QW1_OS) {
>>>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid access to non-OS ring @%"
>>>>>> +                      HWADDR_PRIx"\n", offset);
>>>>>> +        return ret;
>>>>>
>>>>> Just returning -1 would be clearer here.
>>>>
>>>> ok.
>>>>
>>>>>
>>>>>> +    }
>>>>>> +
>>>>>> +    ret = 0;
>>>>>> +    for (i = 0; i < size; i++) {
>>>>>> +        ret |= (uint64_t) nvt->regs[offset + i] << (8 * (size - i - 1));
>>>>>> +    }
>>>>>> +
>>>>>> +    return ret;
>>>>>> +}
>>>>>> +
>>>>>> +static bool xive_tm_is_readonly(uint8_t offset)
>>>>>> +{
>>>>>> +    return offset != TM_QW1_OS + TM_CPPR;
>>>>>> +}
>>>>>> +
>>>>>> +static void xive_tm_write_special(XiveNVT *nvt, hwaddr offset,
>>>>>> +                                  uint64_t value, unsigned size)
>>>>>> +{
>>>>>> +    /* TODO: support TM_SPC_SET_OS_PENDING */
>>>>>> +
>>>>>> +    /* TODO: support TM_SPC_ACK_OS_EL */
>>>>>> +}
>>>>>> +
>>>>>> +static void xive_tm_os_write(void *opaque, hwaddr offset,
>>>>>> +                             uint64_t value, unsigned size)
>>>>>> +{
>>>>>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
>>>>>> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
>>>>>> +    int i;
>>>>>> +
>>>>>> +    if (offset >= TM_SPC_ACK_EBB) {
>>>>>> +        xive_tm_write_special(nvt, offset, value, size);
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    if (TM_RING(offset) != TM_QW1_OS) {
>>>>>
>>>>> Why have this if you have separate OS and user regions as you appear
>>>>> to do below?
>>>>
>>>> This is another problem we are trying to solve.
>>>>
>>>> The registers a CPU can access depend on the TIMA view it is using.
>>>> The OS TIMA view only sees the OS ring registers. The HV view sees all.
>>>>
>>>>> Or to look at it another way, shouldn't it be possible to make the
>>>>> read/write accessors the same for the OS and user rings?
>>>>
>>>> For some parts yes, but the special load/store addresses are different
>>>> for each view, and so are the read-only registers. It seemed easier to
>>>> duplicate.
>>>>
>>>> I think the problem will become clearer (or worse) with pnv, which uses
>>>> the HV mode.
>>>
>>> Oh. I had the impression that each ring had a basically identical set
>>> of registers and you just had access to the region for your ring and
>>> the ones below. Are you saying instead it's basically a single block
>>> of registers with different privilege levels for each of them?
>>
>> yes. I think I answered this question more clearly in a previous email.
>>
>>> [snip]
>>>>>> +}
>>>>>> +
>>>>>> +static void xive_nvt_unrealize(DeviceState *dev, Error **errp)
>>>>>> +{
>>>>>> +    qemu_unregister_reset(xive_nvt_reset, dev);
>>>>>> +}
>>>>>> +
>>>>>> +static void xive_nvt_init(Object *obj)
>>>>>> +{
>>>>>> +    XiveNVT *nvt = XIVE_NVT(obj);
>>>>>> +
>>>>>> +    nvt->ring_os = &nvt->regs[TM_QW1_OS];
>>>>>
>>>>> The ring_os field is basically pointless, being just an offset into a
>>>>> structure you already have. A macro or inline would be a better idea.
>>>>
>>>> ok. I liked the idea, but I agree it's overkill to have an init routine
>>>> just for this. I will find something.
>>>
>>> That too, but it's also something that looks like an optimization but
>>> isn't, which is bad practice. On modern cpus math is cheap (and this
>>> is just a trivial offset), memory accesses are expensive. You're
>>> essentially caching this offset - raising all the usual invalidation
>>> questions for a cache - when caching it is *more* expensive than just
>>> computing it every time.
>>
>> ok. removing this offset was a good opportunity to generalize the
>> routing algorithm and use a 'ring' parameter in all routines. Same
>> for the accept path.
>>
>>
>> C.
>>
>
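PS: for completeness, the 'ring' parameter version is along these lines.
The xive_nvt_ring helper name is made up for illustration; callers pass
TM_QW1_OS for the OS ring, TM_QW3_HV_PHYS for the HW ring, etc.

    static inline uint8_t *xive_nvt_ring(XiveNVT *nvt, uint8_t ring)
    {
        /* A ring is just a fixed offset into the register file, so
         * computing it each time is cheaper than caching a pointer. */
        return &nvt->regs[ring];
    }

    static void xive_nvt_set_cppr(XiveNVT *nvt, uint8_t ring, uint8_t cppr)
    {
        uint8_t *regs = xive_nvt_ring(nvt, ring);

        if (cppr > XIVE_PRIORITY_MAX) {
            cppr = 0xff;
        }

        regs[TM_CPPR] = cppr;
    }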