On 27.09.2011, at 19:20, Blue Swirl wrote:

> On Tue, Sep 27, 2011 at 5:03 PM, Alexander Graf <ag...@suse.de> wrote:
>>
>> On 27.09.2011, at 18:53, Blue Swirl wrote:
>>
>>> On Tue, Sep 27, 2011 at 3:59 PM, Alexander Graf <ag...@suse.de> wrote:
>>>>
>>>> On 27.09.2011, at 17:50, Blue Swirl wrote:
>>>>
>>>>> On Mon, Sep 26, 2011 at 11:19 PM, Scott Wood <scottw...@freescale.com> wrote:
>>>>>> On 09/24/2011 05:00 AM, Alexander Graf wrote:
>>>>>>> On 24.09.2011, at 10:44, Blue Swirl wrote:
>>>>>>>> On Sat, Sep 24, 2011 at 8:03 AM, Alexander Graf <ag...@suse.de> wrote:
>>>>>>>>> On 24.09.2011, at 09:41, Blue Swirl wrote:
>>>>>>>>>> On Mon, Sep 19, 2011 at 4:12 PM, Scott Wood <scottw...@freescale.com> wrote:
>>>>>>>>>>> The goal with the spin table stuff, suboptimal as it is, was
>>>>>>>>>>> something that would work on any powerpc implementation. Other
>>>>>>>>>>> implementation-specific release mechanisms are allowed, and are
>>>>>>>>>>> indicated by a property in the cpu node, but only if the loader
>>>>>>>>>>> knows that the OS supports it.
>>>>>>>>>>>
>>>>>>>>>>>> IIUC the spec that includes these bits is not finalized yet. It is,
>>>>>>>>>>>> however, in use on all u-boot versions for e500 that I'm aware of,
>>>>>>>>>>>> and it is the method Linux uses to bring up secondary CPUs.
>>>>>>>>>>>
>>>>>>>>>>> It's in ePAPR 1.0, which has been out for a while now. ePAPR 1.1
>>>>>>>>>>> was just released, which clarifies some things such as WIMG.
>>>>>>>>>>>
>>>>>>>>>>>> Stuart / Scott, do you have any pointers to documentation where
>>>>>>>>>>>> the spinning is explained?
>>>>>>>>>>>
>>>>>>>>>>> https://www.power.org/resources/downloads/Power_ePAPR_APPROVED_v1.1.pdf
>>>>>>>>>>
>>>>>>>>>> Chapter 5.5.2 describes the table. This is actually an interface
>>>>>>>>>> between the OS and Open Firmware; obviously there can't be a real
>>>>>>>>>> hardware device that magically loads r3 etc.
>>>>>>
>>>>>> Not Open Firmware, but rather an ePAPR-compliant loader.
>>>>>
>>>>> 'boot program to client program interface definition'.
>>>>>
>>>>>>>>>> The device method would break abstraction layers,
>>>>>>
>>>>>> Which abstraction layers?
>>>>>
>>>>> QEMU system emulation emulates hardware, not software. Hardware
>>>>> devices don't touch CPU registers.
>>>>
>>>> The great part about this emulated device is that it's basically guest
>>>> software running in host context. To the guest, it's not a device in
>>>> the ordinary sense, such as vmport, but rather the same as software
>>>> running on another core, just that the other core isn't running any
>>>> software.
>>>>
>>>> Sure, if you consider this a device, it does break abstraction layers.
>>>> Just consider it as the host running guest code, then it makes
>>>> sense :).
>>>>
>>>>>
>>>>>>>>>> it's much like the
>>>>>>>>>> vmport stuff on x86. Using a hypercall would be a small improvement.
>>>>>>>>>> Instead it should be possible to implement a small boot ROM which
>>>>>>>>>> puts the secondary CPUs into a managed halt state without spinning;
>>>>>>>>>> then the boot CPU could send an IPI to a halted CPU to wake it up
>>>>>>>>>> based on the spin table, just like real HW would do.
>>>>>>
>>>>>> The spin table, with no IPI or halt state, is what real HW does (or
>>>>>> rather, what software does on real HW) today. It's ugly and
>>>>>> inefficient, but it should work everywhere. Anything else would be
>>>>>> dependent on a specific HW implementation.
>>>>>
>>>>> Yes. Hardware doesn't ever implement the spin table.
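
For reference, the table being argued over is the one from ePAPR v1.1, section
5.5.2: the boot program parks each secondary CPU spinning on a per-CPU table
entry, and the OS releases it by filling that entry in. A minimal sketch of
both sides, under that reading of the spec, is below; the struct layout follows
the ePAPR description, but the field names and the jump_to_kernel() helper are
purely illustrative, not taken from any real U-Boot or Linux source.

    #include <stdint.h>

    /* One spin-table entry per secondary CPU, pointed to by the
     * cpu-release-addr property in that CPU's device tree node. */
    struct spin_table_entry {
        uint64_t entry_addr;   /* set to 1 by the boot program; the OS
                                  writes the release address here last  */
        uint64_t r3;           /* value to load into r3 before entry    */
        uint32_t rsvd;
        uint32_t pir;          /* value to program into PIR             */
    };

    /* Hypothetical helper: set PIR and r3, then branch to the entry point. */
    extern void jump_to_kernel(uint64_t addr, uint64_t r3, uint32_t pir);

    /* Secondary CPU side: parked by the boot program, polling its entry. */
    static void secondary_spin(volatile struct spin_table_entry *ent)
    {
        while (ent->entry_addr == 1) {
            /* spin until the OS releases us */
        }
        jump_to_kernel(ent->entry_addr, ent->r3, ent->pir);
    }

    /* OS side: release one secondary CPU. */
    static void release_secondary(volatile struct spin_table_entry *ent,
                                  uint64_t release_addr, uint64_t r3_val)
    {
        ent->r3 = r3_val;
        __asm__ __volatile__("sync" ::: "memory");  /* order the stores */
        ent->entry_addr = release_addr;             /* written last     */
    }

The approach under discussion replaces the guest-side secondary_spin() loop
with QEMU itself watching the table and starting the vCPU once entry_addr
changes, which is where the "device touching CPU registers" objection comes
from.
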
>>>>>
>>>>>>>>>> On Sparc32 OpenBIOS this
>>>>>>>>>> is something like a few lines of ASM on both sides.
>>>>>>>>>
>>>>>>>>> That sounds pretty close to what I had implemented in v1. Back then
>>>>>>>>> the only comment was to do it using this method from Scott.
>>>>>>
>>>>>> I had some comments on the actual v1 implementation as well. :-)
>>>>>>
>>>>>>>>> So we have the choice between having code inside the guest that
>>>>>>>>> spins, maybe even only checks every x ms by programming a timer,
>>>>>>>>> or we can try to make an event out of the memory write. V1 was
>>>>>>>>> the former, v2 (this one) is the latter. This version performs a
>>>>>>>>> lot better and is easier to understand.
>>>>>>>>
>>>>>>>> The abstraction layers should not be broken lightly; I suppose some
>>>>>>>> performance or laziness^Wlocal optimization reasons were behind the
>>>>>>>> vmport design too. The ideal way to solve this could be to detect a
>>>>>>>> spinning CPU and optimize that for all architectures. That could be
>>>>>>>> tricky though (if a CPU remains in the same TB for extended periods,
>>>>>>>> inspect the TB: if it performs a loop with a single load instruction,
>>>>>>>> replace the load by a special wait operation for any memory stores
>>>>>>>> to that page).
>>>>>>
>>>>>> How's that going to work with KVM?
>>>>>>
>>>>>>> In fact, the whole kernel loading way we go today is pretty much
>>>>>>> wrong. We should rather do it similar to OpenBIOS, where firmware
>>>>>>> always loads and then pulls the kernel from QEMU using a PV
>>>>>>> interface. At that point, we would have to implement such an
>>>>>>> optimization as you suggest. Or implement a hypercall :).
>>>>>>
>>>>>> I think the current approach is more usable for most purposes. If you
>>>>>> start U-Boot instead of a kernel, how do you pass information on from
>>>>>> the user (kernel, rfs, etc)? Require the user to create flash
>>>>>> images[1]?
>>>>>
>>>>> No, for example OpenBIOS gets the kernel command line from the fw_cfg
>>>>> device.
>>>>>
>>>>>> Maybe that's a useful mode of operation in some cases, but I don't
>>>>>> think we should be slavishly bound to it. Think of the current
>>>>>> approach as something between whole-system and userspace emulation.
>>>>>
>>>>> This is similar to the ARM, M68k and Xtensa semi-hosting mode, only at
>>>>> a lower level than the kernel. Perhaps this mode should be enabled with
>>>>> the -semihosting flag or a new flag. Then the bare metal version could
>>>>> be run without the flag.
>>>>
>>>> And then we'd have 2 implementations for running in system emulation
>>>> mode and need to maintain both. I don't think that scales very well.
>>>
>>> No, but such hacks are not common.
>>>
>>>>>
>>>>>> Where does the device tree come from? How do you tell the guest what
>>>>>> devices it has, especially in virtualization scenarios with non-PCI
>>>>>> passthrough devices, or custom qdev instantiations?
>>>>>>
>>>>>>> But at least we'd always be running the same guest software stack.
>>>>>>
>>>>>> No we wouldn't. Any U-Boot that runs under QEMU would have to be
>>>>>> heavily modified, unless we want to implement a ton of random device
>>>>>> emulation, at least one extra memory translation layer (LAWs, localbus
>>>>>> windows, CCSRBAR, and such), hacks to allow locked cache lines to
>>>>>> operate despite a lack of backing store, etc.
>>>>>
>>>>> I'd say HW emulation business as usual. Now with the new memory API,
>>>>> it should be possible to emulate the caches with line locking and TLBs
>>>>> etc.; this was not previously possible.
>>>>> IIRC implementing locked cache lines would allow x86 to boot
>>>>> unmodified coreboot.
>>>>
>>>> So how would you emulate cache lines with line locking on KVM?
>>>
>>> The cache would be an MMIO device which registers to handle all memory
>>> space. Configuring the cache controller changes how the device
>>> operates. Put this device between the CPU and memory and other devices.
>>> Performance would probably be horrible, so the CPU should disable the
>>> device automatically after some time.
>>
>> So how would you execute code on this region then? :)
>
> Easy, fix QEMU to allow executing from MMIO. (Yeah, I forgot about that.)
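
As a rough illustration of what is being suggested here (and of the "locked
cache lines with no backing store" problem mentioned above): the cache-as-RAM
window could be modelled as an MMIO region backed by a private buffer, mapped
over guest RAM with higher priority. This is only a sketch against the
memory API as it looked in 2011 (owner-less memory_region_init_io(),
target_phys_addr_t); the names are made up and the endianness handling is
deliberately simplified.

    #include "memory.h"   /* QEMU memory API; header path varies by version */

    typedef struct CacheAsRamState {
        MemoryRegion iomem;
        uint8_t *buf;             /* private backing store for locked lines */
    } CacheAsRamState;

    static uint64_t car_read(void *opaque, target_phys_addr_t addr,
                             unsigned size)
    {
        CacheAsRamState *s = opaque;
        /* Simplified: store/load in host order, enough for scratch RAM. */
        switch (size) {
        case 1:  return *(uint8_t *)(s->buf + addr);
        case 2:  return *(uint16_t *)(s->buf + addr);
        case 4:  return *(uint32_t *)(s->buf + addr);
        default: return *(uint64_t *)(s->buf + addr);
        }
    }

    static void car_write(void *opaque, target_phys_addr_t addr,
                          uint64_t val, unsigned size)
    {
        CacheAsRamState *s = opaque;
        switch (size) {
        case 1:  *(uint8_t *)(s->buf + addr) = val;  break;
        case 2:  *(uint16_t *)(s->buf + addr) = val; break;
        case 4:  *(uint32_t *)(s->buf + addr) = val; break;
        default: *(uint64_t *)(s->buf + addr) = val; break;
        }
    }

    static const MemoryRegionOps car_ops = {
        .read = car_read,
        .write = car_write,
        .endianness = DEVICE_NATIVE_ENDIAN,
    };

    /* Map the window over the range firmware expects to lock; unmap it
     * again (memory_region_del_subregion) once the guest reconfigures the
     * cache controller, so the slow path doesn't stay in place forever. */
    static void car_map(CacheAsRamState *s, MemoryRegion *sysmem,
                        target_phys_addr_t base, uint64_t size)
    {
        s->buf = g_malloc0(size);
        memory_region_init_io(&s->iomem, &car_ops, s, "cache-as-ram", size);
        memory_region_add_subregion_overlap(sysmem, base, &s->iomem, 1);
    }

Executing code out of such a region is exactly the catch: TCG could be taught
to fetch from MMIO, but under KVM there is no equivalent, which is the
limitation raised in the reply below.
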
It's not quite as easy to fix KVM to do the same though, unfortunately. We'd
have to either implement a full instruction emulator in the kernel (x86
style) or transfer all state from KVM into QEMU to execute it there (hell
breaks loose). Neither alternative is exactly appealing.

>
>>>
>>>> However, we already have a number of hacks in SeaBIOS to run in QEMU,
>>>> so I don't see an issue in adding a few here and there in u-boot. The
>>>> memory pressure is a real issue though. I'm not sure how we'd manage
>>>> that one. Maybe we could try and reuse the host u-boot binary? heh
>>>
>>> I don't think SeaBIOS breaks layering except for fw_cfg.
>>
>> I'm not saying we're breaking layering there. I'm saying that changing
>> u-boot is not so bad, since it's the same as what we do with SeaBIOS. It
>> was an argument in favor of your position.
>
> Never mind then ;-)
>
>>> For extremely memory limited situations, perhaps QEMU (or Native KVM
>>> Tool for a lean and mean version) could be run without glibc, inside
>>> the kernel or even interfacing directly with the hypervisor. I'd also
>>> continue making it possible to disable building unused devices and
>>> features.
>>
>> I'm pretty sure you're not the only one with that goal ;).
>
> Great, let's do it. VGA comes first :)


Alex