On 27.09.2011, at 19:20, Blue Swirl wrote:

> On Tue, Sep 27, 2011 at 5:03 PM, Alexander Graf <ag...@suse.de> wrote:
>>
>> On 27.09.2011, at 18:53, Blue Swirl wrote:
>>
>>> On Tue, Sep 27, 2011 at 3:59 PM, Alexander Graf <ag...@suse.de> wrote:
>>>>
>>>> On 27.09.2011, at 17:50, Blue Swirl wrote:
>>>>
>>>>> On Mon, Sep 26, 2011 at 11:19 PM, Scott Wood <scottw...@freescale.com> wrote:
>>>>>> On 09/24/2011 05:00 AM, Alexander Graf wrote:
>>>>>>> On 24.09.2011, at 10:44, Blue Swirl wrote:
>>>>>>>> On Sat, Sep 24, 2011 at 8:03 AM, Alexander Graf <ag...@suse.de> wrote:
>>>>>>>>> On 24.09.2011, at 09:41, Blue Swirl wrote:
>>>>>>>>>> On Mon, Sep 19, 2011 at 4:12 PM, Scott Wood <scottw...@freescale.com> wrote:
>>>>>>>>>>> The goal with the spin table stuff, suboptimal as it is, was
>>>>>>>>>>> something that would work on any powerpc implementation. Other
>>>>>>>>>>> implementation-specific release mechanisms are allowed, and are
>>>>>>>>>>> indicated by a property in the cpu node, but only if the loader
>>>>>>>>>>> knows that the OS supports it.
>>>>>>>>>>>
>>>>>>>>>>>> IIUC the spec that includes these bits is not finalized yet. It is,
>>>>>>>>>>>> however, in use on all u-boot versions for e500 that I'm aware of,
>>>>>>>>>>>> and it is the method Linux uses to bring up secondary CPUs.
>>>>>>>>>>>
>>>>>>>>>>> It's in ePAPR 1.0, which has been out for a while now. ePAPR 1.1
>>>>>>>>>>> was just released, which clarifies some things such as WIMG.
>>>>>>>>>>>
>>>>>>>>>>>> Stuart / Scott, do you have any pointers to documentation where
>>>>>>>>>>>> the spinning is explained?
>>>>>>>>>>>
>>>>>>>>>>> https://www.power.org/resources/downloads/Power_ePAPR_APPROVED_v1.1.pdf
>>>>>>>>>>
>>>>>>>>>> Chapter 5.5.2 describes the table. This is actually an interface
>>>>>>>>>> between the OS and Open Firmware; obviously there can't be a real
>>>>>>>>>> hardware device that magically loads r3 etc.
>>>>>>
>>>>>> Not Open Firmware, but rather an ePAPR-compliant loader.
>>>>>
>>>>> 'boot program to client program interface definition'.
>>>>>
>>>>>>>>>> The device method would break abstraction layers,
>>>>>>
>>>>>> Which abstraction layers?
>>>>>
>>>>> QEMU system emulation emulates hardware, not software. Hardware
>>>>> devices don't touch CPU registers.
>>>>
>>>> The great part about this emulated device is that it's basically guest
>>>> software running in host context. To the guest, it's not a device in
>>>> the ordinary sense, such as vmport, but rather the same as software
>>>> running on another core, just that the other core isn't running any
>>>> software.
>>>>
>>>> Sure, if you consider this a device, it does break abstraction layers.
>>>> Just consider it as the host running guest code, then it makes
>>>> sense :).
>>>>
>>>>>
>>>>>>>>>> it's much like the
>>>>>>>>>> vmport stuff on x86. Using a hypercall would be a small improvement.
>>>>>>>>>> Instead it should be possible to implement a small boot ROM which
>>>>>>>>>> puts the secondary CPUs into a managed halt state without spinning;
>>>>>>>>>> then the boot CPU could send an IPI to a halted CPU to wake it up
>>>>>>>>>> based on the spin table, just like real HW would do.
>>>>>>
>>>>>> The spin table, with no IPI or halt state, is what real HW does (or
>>>>>> rather, what software does on real HW) today. It's ugly and
>>>>>> inefficient, but it should work everywhere. Anything else would be
>>>>>> dependent on a specific HW implementation.
>>>>>
>>>>> Yes. Hardware doesn't ever implement the spin table.
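
For reference, the table being argued over is the one from ePAPR v1.1, section
5.5.2: the boot program parks each secondary CPU spinning on a per-CPU table
entry, and the OS releases it by filling that entry in. A minimal sketch of
both sides, under that reading of the spec, is below; the struct layout follows
the ePAPR description, but the field names and the jump_to_kernel() helper are
purely illustrative, not taken from any real U-Boot or Linux source.

    #include <stdint.h>

    /* One spin-table entry per secondary CPU, pointed to by the
     * cpu-release-addr property in that CPU's device tree node. */
    struct spin_table_entry {
        uint64_t entry_addr;   /* set to 1 by the boot program; the OS
                                  writes the release address here last  */
        uint64_t r3;           /* value to load into r3 before entry    */
        uint32_t rsvd;
        uint32_t pir;          /* value to program into PIR             */
    };

    /* Hypothetical helper: set PIR and r3, then branch to the entry point. */
    extern void jump_to_kernel(uint64_t addr, uint64_t r3, uint32_t pir);

    /* Secondary CPU side: parked by the boot program, polling its entry. */
    static void secondary_spin(volatile struct spin_table_entry *ent)
    {
        while (ent->entry_addr == 1) {
            /* spin until the OS releases us */
        }
        jump_to_kernel(ent->entry_addr, ent->r3, ent->pir);
    }

    /* OS side: release one secondary CPU. */
    static void release_secondary(volatile struct spin_table_entry *ent,
                                  uint64_t release_addr, uint64_t r3_val)
    {
        ent->r3 = r3_val;
        __asm__ __volatile__("sync" ::: "memory");  /* order the stores */
        ent->entry_addr = release_addr;             /* written last     */
    }

The approach under discussion replaces the guest-side secondary_spin() loop
with QEMU itself watching the table and starting the vCPU once entry_addr
changes, which is where the "device touching CPU registers" objection comes
from.
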
>>>>>
>>>>>>>>>> On Sparc32 OpenBIOS this
>>>>>>>>>> is something like a few lines of ASM on both sides.
>>>>>>>>>
>>>>>>>>> That sounds pretty close to what I had implemented in v1. Back then
>>>>>>>>> the only comment was to do it using this method from Scott.
>>>>>>
>>>>>> I had some comments on the actual v1 implementation as well. :-)
>>>>>>
>>>>>>>>> So we have the choice between having code inside the guest that
>>>>>>>>> spins, maybe even only checks every x ms by programming a timer,
>>>>>>>>> or we can try to make an event out of the memory write. V1 was
>>>>>>>>> the former, v2 (this one) is the latter. This version performs a
>>>>>>>>> lot better and is easier to understand.
>>>>>>>>
>>>>>>>> The abstraction layers should not be broken lightly; I suppose some
>>>>>>>> performance or laziness^Wlocal optimization reasons were behind the
>>>>>>>> vmport design too. The ideal way to solve this could be to detect a
>>>>>>>> spinning CPU and optimize that for all architectures. That could be
>>>>>>>> tricky though (if a CPU remains in the same TB for extended periods,
>>>>>>>> inspect the TB: if it performs a loop with a single load instruction,
>>>>>>>> replace the load by a special wait operation for any memory stores
>>>>>>>> to that page).
>>>>>>
>>>>>> How's that going to work with KVM?
>>>>>>
>>>>>>> In fact, the whole kernel loading way we go today is pretty much
>>>>>>> wrong. We should rather do it similar to OpenBIOS, where firmware
>>>>>>> always loads and then pulls the kernel from QEMU using a PV
>>>>>>> interface. At that point, we would have to implement such an
>>>>>>> optimization as you suggest. Or implement a hypercall :).
>>>>>>
>>>>>> I think the current approach is more usable for most purposes. If you
>>>>>> start U-Boot instead of a kernel, how do you pass information on from
>>>>>> the user (kernel, rfs, etc)? Require the user to create flash
>>>>>> images[1]?
>>>>>
>>>>> No, for example OpenBIOS gets the kernel command line from the fw_cfg
>>>>> device.
>>>>>
>>>>>> Maybe that's a useful mode of operation in some cases, but I don't
>>>>>> think we should be slavishly bound to it. Think of the current
>>>>>> approach as something between whole-system and userspace emulation.
>>>>>
>>>>> This is similar to the ARM, M68k and Xtensa semi-hosting mode, only at
>>>>> a lower level than the kernel. Perhaps this mode should be enabled with
>>>>> the -semihosting flag or a new flag. Then the bare metal version could
>>>>> be run without the flag.
>>>>
>>>> And then we'd have 2 implementations for running in system emulation
>>>> mode and need to maintain both. I don't think that scales very well.
>>>
>>> No, but such hacks are not common.
>>>
>>>>>
>>>>>> Where does the device tree come from? How do you tell the guest what
>>>>>> devices it has, especially in virtualization scenarios with non-PCI
>>>>>> passthrough devices, or custom qdev instantiations?
>>>>>>
>>>>>>> But at least we'd always be running the same guest software stack.
>>>>>>
>>>>>> No we wouldn't. Any U-Boot that runs under QEMU would have to be
>>>>>> heavily modified, unless we want to implement a ton of random device
>>>>>> emulation, at least one extra memory translation layer (LAWs, localbus
>>>>>> windows, CCSRBAR, and such), hacks to allow locked cache lines to
>>>>>> operate despite a lack of backing store, etc.
>>>>>
>>>>> I'd say HW emulation business as usual. Now with the new memory API,
>>>>> it should be possible to emulate the caches with line locking and TLBs
>>>>> etc.; this was not previously possible.
>>>>> IIRC implementing locked cache lines would allow x86 to boot
>>>>> unmodified coreboot.
>>>>
>>>> So how would you emulate cache lines with line locking on KVM?
>>>
>>> The cache would be an MMIO device which registers to handle all memory
>>> space. Configuring the cache controller changes how the device
>>> operates. Put this device between the CPU and memory and other devices.
>>> Performance would probably be horrible, so the CPU should disable the
>>> device automatically after some time.
>>
>> So how would you execute code on this region then? :)
>
> Easy, fix QEMU to allow executing from MMIO. (Yeah, I forgot about that.)
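
As a rough illustration of what is being suggested here (and of the "locked
cache lines with no backing store" problem mentioned above): the cache-as-RAM
window could be modelled as an MMIO region backed by a private buffer, mapped
over guest RAM with higher priority. This is only a sketch against the
memory API as it looked in 2011 (owner-less memory_region_init_io(),
target_phys_addr_t); the names are made up and the endianness handling is
deliberately simplified.

    #include "memory.h"   /* QEMU memory API; header path varies by version */

    typedef struct CacheAsRamState {
        MemoryRegion iomem;
        uint8_t *buf;             /* private backing store for locked lines */
    } CacheAsRamState;

    static uint64_t car_read(void *opaque, target_phys_addr_t addr,
                             unsigned size)
    {
        CacheAsRamState *s = opaque;
        /* Simplified: store/load in host order, enough for scratch RAM. */
        switch (size) {
        case 1:  return *(uint8_t *)(s->buf + addr);
        case 2:  return *(uint16_t *)(s->buf + addr);
        case 4:  return *(uint32_t *)(s->buf + addr);
        default: return *(uint64_t *)(s->buf + addr);
        }
    }

    static void car_write(void *opaque, target_phys_addr_t addr,
                          uint64_t val, unsigned size)
    {
        CacheAsRamState *s = opaque;
        switch (size) {
        case 1:  *(uint8_t *)(s->buf + addr) = val;  break;
        case 2:  *(uint16_t *)(s->buf + addr) = val; break;
        case 4:  *(uint32_t *)(s->buf + addr) = val; break;
        default: *(uint64_t *)(s->buf + addr) = val; break;
        }
    }

    static const MemoryRegionOps car_ops = {
        .read = car_read,
        .write = car_write,
        .endianness = DEVICE_NATIVE_ENDIAN,
    };

    /* Map the window over the range firmware expects to lock; unmap it
     * again (memory_region_del_subregion) once the guest reconfigures the
     * cache controller, so the slow path doesn't stay in place forever. */
    static void car_map(CacheAsRamState *s, MemoryRegion *sysmem,
                        target_phys_addr_t base, uint64_t size)
    {
        s->buf = g_malloc0(size);
        memory_region_init_io(&s->iomem, &car_ops, s, "cache-as-ram", size);
        memory_region_add_subregion_overlap(sysmem, base, &s->iomem, 1);
    }

Executing code out of such a region is exactly the catch: TCG could be taught
to fetch from MMIO, but under KVM there is no equivalent, which is the
limitation raised in the reply below.
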
It's not quite as easy to fix KVM to do the same though, unfortunately. We'd
have to either implement a full instruction emulator in the kernel (x86
style) or transfer all state from KVM into QEMU to execute it there (hell
breaks loose). Neither alternative is exactly appealing.

>
>>>
>>>> However, we already have a number of hacks in SeaBIOS to run in QEMU,
>>>> so I don't see an issue in adding a few here and there in u-boot. The
>>>> memory pressure is a real issue though. I'm not sure how we'd manage
>>>> that one. Maybe we could try and reuse the host u-boot binary? heh
>>>
>>> I don't think SeaBIOS breaks layering except for fw_cfg.
>>
>> I'm not saying we're breaking layering there. I'm saying that changing
>> u-boot is not so bad, since it's the same as what we do with SeaBIOS. It
>> was an argument in favor of your position.
>
> Never mind then ;-)
>
>>> For extremely memory limited situations, perhaps QEMU (or Native KVM
>>> Tool for a lean and mean version) could be run without glibc, inside
>>> the kernel or even interfacing directly with the hypervisor. I'd also
>>> continue making it possible to disable building unused devices and
>>> features.
>>
>> I'm pretty sure you're not the only one with that goal ;).
>
> Great, let's do it. VGA comes first :)


Alex