On 6/27/13, Chris Torek <to...@elf.torek.net> wrote:
> OK, I wasted :-) way too much time, but here's a text file that
> can be comment-ified or stored somewhere alongside the code or
> whatever...
>
> (While drawing this I realized that there's at least one "wasted"
> page if the machine has .5 TB or less: we can just leave zero
> slots in the corresponding L4 direct-map entries. But that would
> require switching to the bcopy() method also mentioned below. Or
> indexing into vmspace0.vm_pmap.pm_pml4, which is basically the
> same thing.)
>
> Chris
>
> -----
>
> There are six -- or sometimes five -- sets of pages allocated here
> at boot time to map physical memory in two ways. Note that each
> page, regardless of level, stores 512 PTEs (or PDEs or PDPs, but
> let's just use PTE here and prefix it with "level" as needed: 4,
> 3, 2, or 1).
>
> There is one page for the top level, L4, page table entries. Each
> L4 PTE maps 512 GB of space. No L4 PTE can be marked "stop here":
> it is either marked "this address is invalid", or it points to one
> physically-addressed page full of L3 PTEs. Eventually, those L3
> PTEs will map-or-reject half a terabyte. 512 entries, each mapping
> .5 TB, allow us to map 256 TB, which is as much as the hardware
> supports (there are, in effect, only 48 virtual address bits: the
> top 16 bits must match bit 47).
>
> The L4 entry halfway down, at PML4PML4I, is set to point back to
> this page itself; that's the "recursive page table" for user
> space, which we do nothing else with at boot time.
>
> We need (up to) NDMPML4E pages, each holding 512 L3 PTEs, for the
> direct map space. If the processor supports 1 GB pages, an L3 PTE
> can be marked with "stop here" and these L3 PTEs each grant (or
> forbid) access to 1 GB of physical space at a time. A system
> with, say, 3 GB of RAM starting at 0 can map it all with three L3
> PTEs: "address 0 is valid for 1GB", "address 1GB is valid for
> 1GB", "address 2GB is valid for 1GB". The remaining L3 PTEs are
> zero, making the remaining address space invalid.
>
> If the processor does not support 1 GB pages, or if there is less
> than 1 GB of RAM "at the end" (e.g., if the system has 4.5 GB),
> the L3 PTEs may need to point to more pages holding L2 PTEs.
> These L2 PTEs always map 2 MB pages. Each page of L2 PTEs
> maps 1 GB. So a machine with 4.5 GB and 1 GB mappings needs one L3
> page with four valid 1 GB L3 PTEs and then one L3 PTE pointing to
> one page of L2 PTEs. That one page of L2 PTEs is half-filled,
> containing 256 2MB-sized PTEs, mapping the final 512 MB. The
> remaining half of that page is zero, making the remaining
> addresses invalid.
>
> Pictorially, and adding the names of the physical page(s), thus
> far we have this. (Note, the L4 PTE page is drawn more than twice
> as tall as the L3 and L2 pages, just to get space for arrows.)
>
>               LEVEL 4:               LEVEL 3:            LEVEL 2:
>                         _._
>    KPML4phys           v   \
>               +---------+   |
>               |   0:    |   |
>               |---------|   |
>               |   1:    |   |        DMPDPphys           DMPDphys
>               (   ...   )   |    .-> +---------+         +----------------+
>               | 255:    |   |   /    | 0:  0GB |     .-> |   0: 4GB       |
>               |---------|   |  |     | 1:  1GB |    /    |   1: 4GB+2MB   |
>    PML4PML4I: | 256: *--|--/   |     | 2:  2GB |   /     |   2: 4GB+4MB   |
>               |---------|      |     | 3:  3GB |  /      (      ...       )
>               | 257:    |      |     | 4:   *--|-/       | 255: 4.5GB-2MB |
>               |   ...   |      |     | 5:      |         | 256:           |
>  ________     |---------|      |     (   ...   )         | 257:           |
> /    DMPML4I: |      *--|-----/      | 511:    |         (      ...       )
> NDMPML4E      |---------|            +---------+         +----------------+
> \________     |      *--|----------> | 0:      |
>               |---------|            | 1:      |
>               |         |            | 2:      |   (These are used only
>               |---------|            | 3:      |    if the system has more
>               |   ...   |            (   ...   )    than 512 GB)
>             ( |---------|      )     | 509:    |
>             ( | 510: see below )     | 510:    |
>             ( |---------|      )     | 511:    |
>             ( | 511: see below )     +---------+
>               +---------+
>
> If the hardware supports 1GB pages, "ndm1g" is the number of
> gigabyte entries (4 in the example above). Otherwise it's just
> zero. Meanwhile "ndmpdp" is the number of gigabytes of RAM that
> need to be mapped, rounded up, in this case 5. Thus, if
> ndmpdp > ndm1g, we need ndmpdp-ndm1g pages to hold some L2 PTEs.
>
> Now we get to the weirder case of the kernel itself (both its
> non-direct-mapped dynamically allocated virtual memory, and its
> text/data/bss). The branch offset limitations encourage the
> placement of the kernel's text, etc., in the last 2 GB of virtual
> space, i.e., starting at 0xffff.ffff.8000.0000. But, we want
> a reasonable amount of room for dynamic VM. So we give the kernel
> at least 512 GB of VM -- that's one L4 PTE -- while making sure that
> the text snuggles up close to the end of the space, in that last 2 GB
> of the at-least-512-GB area.
>
> Meanwhile, the boot loader has loaded the kernel into relatively
> low physical memory addresses.
>
> If KPML4I is 511 (and it actually is), this uses the final L4 slot
> to map the kernel. If we want to allow kernel VM to have more
> than 512 GB available, though, we need extra space below KPML4I,
> i.e., starting at KPML4BASE. So we allocate NKPML4E pages that
> we set up as L3 PTEs, and point the last NKPML4E slots in the L4
> page table here. If NKPML4E is 4, for instance, we will have
> this:
>
>     last part of KPML4phys:
>     (   ...   )     .----> [page #0 of all-zero L3 PTEs]
>     | DMPML4I |    /
>     (   ...   )   |   .--> [page #1 of all-zero L3 PTEs]
>     | 507:    |   |  /
>     | 508: *--|--/  |  .-> [page #2 of all-zero L3 PTEs]
>     | 509: *--|----/  |
>     | 510: *--|------/
>     | 511: *--|----------> [page #3 of L3 PTEs, see below]
>     +---------+
>
> The reason for having those "empty" (all-zero) PTE pages is that
> whenever new processes are created, in pmap_pinit(), they have
> their (new) L4 PTE page set up to point to the *same* physical
> pages that the kernel is using. Thus, if the kernel creates or
> destroys any level-3-or-below mapping by writing into any of the
> above four pages, that mapping is also created/destroyed in all
> processes. Similarly, the NDMPML4E pages starting at DMPDPphys are
> mapped identically in all processes. The kernel can therefore
> "borrow" a user pmap at any time, i.e., there's no need to adjust
> the CPU's CR3 on entry to the kernel.
>
> (If we used bcopy() to copy the kernel pmap's NKPML4E and NDMPML4E
> entries into the new pmap, the L3 pages would not have to be
> physically contiguous, but the KVA ones would still all have to
> exist. It's free to allocate physically contiguous pages here
> anyway though.)
>
> So, the last NKPML4E slots in KPML4phys point to the following
> page tables, which use all of L3, L2, and L1 style PTEs. (Note
> that we did not need any L1 PTEs for the direct map, which always
> uses 2MB or 1GB super-pages.)
>
>         LEVEL 3:              LEVEL 2:              LEVEL 1:
>
>    (assuming NKPML4E=4)                             (nkpt pages)
>         KPDPphys                                    KPTphys
>         +---------+                                 +---------------+
> page 0  |   0:    |                             .-> |   0: 0 KB     |
>         |   1:    |                            /    |   1: 4 KB     |
>         |   2:    |                           /     |   2: 8 KB     |
>         |   3:    |                          /      |   3: 12 KB    |
>         (   ...   )                         |       (      ...      )
>         | 509:    |                         |       | 509: 2MB-12KB |
>         | 510:    |                         |       | 510: 2MB-8KB  |
>         | 511:    |                         |       | 511: 2MB-4KB  |
>         +---------+                         |       +---------------+
> page 1  |   0:    |                         |   .-> |   0: 2 MB     |
>         |   1:    |                         |  /    |   1: 2MB+4KB  |
>         |   2:    |                         | |     (      ...      )
>         |   3:    |                         | |     (      ...      )
>         (   ...   )                         | |     +---------------+
>         | 509:    |                         | | .-> (      ...      )
>         | 510:    |                         | | |   (      ...      )
>         | 511:    |           KPDphys       | | |   +---------------+
>         +---------+           +---------+   | | | ..(  ... ... ...  )
> page 2  |   0:    |     .---> |   0: *--|--/  | | .     [etc]
>         |   1:    |    /      |   1: *--|----/  | .
>         |   2:    |   |       |   2: *--|------/  .
>         |   3:    |   |       |   3: *--|-----....
>         (   ...   )   |       (   ...   )
>         | 509:    |   |       | 509: ...|
>         | 510:    |   |       | 510: ...|
>         | 511:    |   |       | 511: ...|
>         +---------+   |       +---------+
> page 3  |   0:    |   |   .-> |   0: ...|
>         |   1:    |   |  /    (   ...   )
>         |   2:    |   | |     (   ...   )
>         |   3:    |   | |     (   ...   )
>         (   ...   )   | |     (   ...   )
>         | 509:    |   | |     (   ...   )
>         | 510: *--|--/  |     (   ...   )
>         | 511: *--|----/      | 511:    |
>         +---------+           +---------+
>
> There are nkpdpe pages at KPDphys, where nkpdpe is either 1 or 2.
> One page maps 1 GB, and the other page maps the remaining 1 GB.
> Remember that kernel text+data+bss lives in the final 2 GB of the
> virtual address space, so there cannot be more than 2 GB. These
> one or two pages map nkpt pages at KPTphys.

Added two VM gurus to CC.

_______________________________________________
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"