On Mon, Aug 29, 2011 at 12:07:05PM +0200, Cherry G. Mathew wrote:
> JM> On Mon, 22 Aug 2011 12:47:40 +0200, Manuel Bouyer wrote:
> >>> This is slightly more complicated than it appears. Some of the
> >>> "ops" in a per-cpu queue may have ordering dependencies with
> >>> other cpu queues, and I think this would be hard to express
> >>> trivially. (an example would be a pte update on one queue, and
> >>> a read of the same pte on another queue - these cases are
> >>> quite analogous (although completely unrelated))
> >>
> 
> Hi,
> 
> So I had a better look at this - implemented per-cpu queues and messed
> with locking a bit:
> 
> >> reads don't go through the xpq queue, do they ?
> 
> JM> Nope, PTEs are directly obtained from the recursive mappings
> JM> (vtopte/kvtopte).
> 
> Let's call these "out of band" reads. But see below for "in-band" reads.
> 
> JM> Content is "obviously" only writable by the hypervisor (so it can
> JM> keep control of its mapping alone).
> 
> >> I think this is similar to a tlb flush but the other way round, I
> >> guess we could use an IPI for this too.
> 
> JM> IIRC that's what the current native x86 code does: it uses an
> JM> IPI to signal other processors that a shootdown is necessary.
> 
> Xen's TLB_FLUSH operation is synchronous, and doesn't require an IPI
> (within the domain), which makes the queue ordering even more important
> (to make sure that stale ptes are not reloaded before the per-cpu queue
> has made progress). Yes, we can implement a roundabout IPI-driven
> queueflush + tlbflush scheme (described below), but that would be
> performance sensitive, and the basic issue won't go away, imho.
> 
> Let's stick to the xpq ops for a second, ignoring "out-of-band" reads
> (for which I agree that your assertion, that locking needs to be done
> at a higher level, holds true).
> 
> The question here, really, is: what are the global ordering requirements
> of per-cpu memory op queues, given the following basic "ops":
> 
> i) write memory (via MMU_NORMAL_PT_UPDATE, MMU_MACHPHYS_UPDATE)
> ii) read memory
> via:
> MMUEXT_PIN_L1_TABLE
> MMUEXT_PIN_L2_TABLE
> MMUEXT_PIN_L3_TABLE
> MMUEXT_PIN_L4_TABLE
> MMUEXT_UNPIN_TABLE

This is when adding/removing a page table to/from a pmap. When this
occurs, the pmap is locked, isn't it?

> MMUEXT_NEW_BASEPTR
> MMUEXT_NEW_USER_BASEPTR

This is a context switch.

> MMUEXT_TLB_FLUSH_LOCAL
> MMUEXT_INVLPG_LOCAL
> MMUEXT_TLB_FLUSH_MULTI
> MMUEXT_INVLPG_MULTI
> MMUEXT_TLB_FLUSH_ALL
> MMUEXT_INVLPG_ALL
> MMUEXT_FLUSH_CACHE

This may, or may not, cause a read. This usually happens after updating
the pmap, and I guess this also happens with the pmap locked (I have not
carefully checked). So couldn't we just use the pmap lock for this?
I suspect the same lock will also be needed for out-of-band reads at
some point (right now they're protected by splvm()).
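To make that concrete, here is roughly what I have in mind (an untested
sketch only, not the actual x86_xpmap.c code; XPQ_MAX and the single
static queue are simplifications of the real per-cpu xpq code, and the
header names are from memory): the queue is only appended to and flushed
with the pmap lock held, and the flush is a single mmu_update hypercall.

	#include <sys/types.h>
	#include <sys/systm.h>		/* panic() */
	#include <machine/pte.h>	/* pt_entry_t */
	#include <xen/hypervisor.h>	/* HYPERVISOR_mmu_update(), mmu_update_t */

	#define XPQ_MAX	128			/* arbitrary batch size */

	static mmu_update_t xpq_queue[XPQ_MAX];	/* one per cpu in practice */
	static int xpq_idx;

	/* Hand the whole batch to the hypervisor in one hypercall. */
	static void
	xpq_flush_queue(void)
	{
		int ok;

		if (xpq_idx == 0)
			return;
		if (HYPERVISOR_mmu_update(xpq_queue, xpq_idx, &ok,
		    DOMID_SELF) < 0)
			panic("HYPERVISOR_mmu_update failed");
		xpq_idx = 0;
	}

	/* Queue one pte write; the caller is expected to hold the pmap lock. */
	static void
	xpq_queue_pte_update(paddr_t ptr, pt_entry_t val)
	{
		xpq_queue[xpq_idx].ptr = ptr | MMU_NORMAL_PT_UPDATE;
		xpq_queue[xpq_idx].val = val;
		if (++xpq_idx == XPQ_MAX)
			xpq_flush_queue();
	}

The TLB flush (MMUEXT_INVLPG_MULTI and friends) would then be issued
right after xpq_flush_queue(), still under the same pmap lock, so
readers of the same ptes on other cpus are serialized by that lock
rather than by the queue itself.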
> [...]
> 
> >>> I'm thinking that it might be easier and more justifiable to
> >>> nuke the current queue scheme and implement shadow page tables,
> >>> which would fit more naturally and efficiently with CAS pte
> >>> updates, etc.
> >>
> >> I'm not sure this would completely fix the issue: with shadow
> >> page tables you can't use a CAS to assure atomic operation with
> >> the hardware TLB, as this is, precisely, a shadow PT and not the
> >> one used by hardware.
> 
> Definitely worth looking into, imho. I'm not very comfortable with
> the queue based scheme for MP.
> 
> The CAS doesn't provide any guarantees with the TLB on native h/w,
> afaict.

What makes you think so? I think the hw TLB also does a CAS to update
the referenced and dirty bits in the PTE, otherwise we couldn't rely on
these bits; this would be bad especially for the dirty bit.

> If you do a CAS pte update, and the update succeeded, it's a
> good idea to invalidate + shootdown anyway (even on baremetal).

Yes, of course invalidation is needed after updating the PTE. But using
a true CAS is important to get the referenced and dirty bits right.

-- 
Manuel Bouyer <bou...@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--
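PS: to be explicit about what I mean by a "true CAS" (a rough, untested
sketch; pmap_pte_cas_update() is just an illustrative name, not an
existing function, and I'm assuming 64-bit ptes (amd64/PAE) so
atomic_cas_64() applies):

	#include <sys/atomic.h>		/* atomic_cas_64() */
	#include <machine/pte.h>	/* pt_entry_t, PG_U, PG_M */

	/*
	 * Replace *ptep with npte and return the pte that was actually
	 * there.  A plain "read, modify, write" could lose a dirty bit
	 * set by the MMU between the read and the write; the CAS loop
	 * cannot.
	 */
	static pt_entry_t
	pmap_pte_cas_update(volatile pt_entry_t *ptep, pt_entry_t npte)
	{
		pt_entry_t opte;

		do {
			opte = *ptep;
		} while (atomic_cas_64(ptep, opte, npte) != opte);

		/* opte reliably carries any PG_U/PG_M the hardware set */
		return opte;
	}

The caller still has to invalidate + shoot down the TLB afterwards, as
you say; the CAS only guarantees we don't lose the referenced/dirty bits
while swapping the pte.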