On Tue, Nov 29, 2011 at 10:36:42AM -0600, Becky Bruce wrote: > > On Nov 28, 2011, at 11:25 PM, Benjamin Herrenschmidt wrote: > > > On Mon, 2011-10-10 at 15:50 -0500, Becky Bruce wrote: > >> From: Becky Bruce <bec...@kernel.crashing.org> > >> > >> This updates the hugetlb page table code to handle 64-bit FSL_BOOKE. > >> The previous 32-bit work counted on the inner levels of the page table > >> collapsing. > > > > Seriously, my brain hurts !!! > > Now you know how I felt when I got the original code from David :)
Heh, I can't blame you. Between the constraints of the hardware and fitting in with x86, that thing's accreted into a horrible pile. > > So I've tried to understand that code and so far, what I came up with is > > this, please reply and let me know if I'm full of crack or what ... > > > > - David's code has entire levels "branching off" into hugepd's which > > are hugetlb specific page dirs. That requires address space constrainst. > > Yes. > > > - The hack you do with HUGEPD_PGD_SHIFT defined to PGDIR_SHIFT and > > HUGEPD_PUD_SHIFT to PUD_SHIFT consists of collasping that back into the > > normal levels. > > That exists so Jimi's code for A2 and mine can coexist peacefully > without ifdeffing huge blocks because we peel off the page table at > different points for a hugepd. In my case, if I have a 4M page, > that is handled by having 2 entries at the 2M layer, each of which > is a pointer to the same pte stored in a kmem_cache created for that > purpose. In the server/A2 case, they peel off at the layer above 4M > and have a sub-table kmem_cache that has a bunch of same-size huge > page ptes (this is all because of the slice constraint). > > > - I really don't understand what you are doing in __hugepte_alloc(). It > > seems to me that you are trying to make things still point to some kind > > of separate hugepd dir with the hugeptes in it and have the page tables > > point to that but I don't quite get it. > > In your example below, the alloc code is: > 1) allocating a small kmem_cache for the pte > > 2) filling in 8 entries at the 2M level with the pointers to that pte, with > the upper bit munged to indicate huge and bits in the lower region that store > the huge page size because it can't be encoded in the book3e pte format > > > - Couldn't we just instead ditch the whole hugepd concept alltogether > > and simply have the huge ptes in the page table at the right level, > > using possibly multiple consecutive of them for a single page when > > needed ? > > > > Example: 4K base page size. PMD maps 2M. a 16M page could be > > representing by having 8 consecutive hugeptes pointing to the same huge > > page in the PMD directory. > > We currently have 8 consecutive PMD entries that are pointers to the > same kmem_cache that holds the actual PTE. I did this for a few > reasons: > > 1) I was trying to stay as close to what David had done as possible > > 2) symmetry - in every other case entries at higher levels of the > normal page table are pointers to something, and it's easy to > identify that something is a pointer to hugepte using David's > upper-bit-flipping trick. If we have an actual entry mixed in with > the pointers it might be hard to tell that's it's an actual PTE and > not a pointer without getting information from somewhere else > (i.e. the vma) > > 3) I was trying to avoid having multiple copies of the actual pte - > this way it's easy to do stuff like change the perms on the PTE, > since I only have to modify one copy > > 4) I needed the information laid out for quick TLB miss > fault-handling of hugepte tlb misses (see below). > > > I believe we always "know" when accessing a PTE whether it's going to be > > huge or not and if it's huge, the page size. IE. All the functions we > > care about either get the vma (which gets you everything you need) or > > the size (huge_pte_alloc). > > An exception is the 32-bit fault hander asm code. It does a walk of > the page table to reload the tlb. We need to be able to easily > identify that we're walking into a hugepage area so we know to load > the tlbcam. Having the pointer there with the munged upper bit that > David devised is very convenient for that. Also, the Book3e page > table format does not allow us to represent page sizes > 32m. So > that's encoded in the hugepd instead (and not in the pte). > > I'm not sure how to get around this without slowing things down. I > originally had a slower handler and it turned out to impact > performance of several important workloads and my perf guys griped > at me. I was actually eventually planning to rewrite the 64b fsl > book3e handler to deal with this in asm as well. Large workloads on > our systems do a lot of tlbcam entry replacement due to 1) the small > size of the tlbcam and 2) the lack of any hardware replacement > policy on that array. > > There are other places where we'd have to modify the code to have > the vma available (not that it's hard to do, but it's not floating > around everywhere). And there may be other places where this is an > issue - I'd have to go dig around a bit to answer that. > > For the record, I hate the idea of not being able to walk the page > table without going elsewhere for information. IMHO I should be > able to tell everything I need to load a TLB entry from there > without digging up a vma. I agree, all things being equal, but there might be tradeoffs that make it the least bad option. > > This should be much simpler than what we have today. > > > > That way, we have a completely generic accross-the-board way to store > > huge pte's in our page tables, which is totally disconnected from the > > slices. The later remains a purely server-only thing which only affects > > the SLB code and get_unmapped_area(). > > David and Jimi will have to comment about whether they can flatten > out their stuff to just store PTEs. A lot of my code exists because > I was attempting to be as close to the IBM implementation as > possible. I was talking with Ben about this yesterday, and I think it can probably be done. > > That means that we'll have to find another option for PowerEN giant > > indirects but that's a non issue at the moment. I think we can keep the > > complexity local to the PowerEN code by doing shadows there if we need > > to. > > > > What do you think ? > > I'm happy if we can come up with another solution that still allows > me to do my miss fault handling efficiently in asm.... What we do on > FSL Booke with storing multiple pointers to a single pte is actually > a fairly clean solution given the constraints. It's only in the > context of having a different implementation coexisting with it that > makes things start getting complicated. If you have some > suggestions on another way to deal with this, I'm all ears. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson _______________________________________________ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev