Re: expanding amd64 past the 1TB limit

Chris Torek Thu, 27 Jun 2013 14:42:53 -0700

OK, I wasted :-) way too much time, but here's a text file that
can be comment-ified or stored somewhere alongside the code or
whatever...

(While drawing this I realized that there's at least one "wasted"
page if the machine has .5 TB or less: we can just leave zero
slots in the corresponding L4 direct-map entries.  But that would
require switching to the bcopy() method also mentioned below.  Or
indexing into vmspace0.vm_pmap.pm_pml4, which is basically the
same thing.)

Chris

    -----

There are six -- or sometimes five -- sets of pages allocated here
at boot time to map physical memory in two ways.  Note that each
page, regardless of level, stores 512 PTEs (or PDEs or PDPs, but
let's just use PTE here and prefix it with "level" as needed: 4,
3, 2, or 1.)

There is one page for the top level, L4, page table entries.  Each
L4 PTE maps 512 GB of space.  No L4 PTE can be marked "stop here":
it is either marked "this address is invalid", or it points to one
physically-addressed page full of L3 PTEs.  Eventually, those L3
PTEs will map-or-reject half a terabyte.  512 entries, each mapping
.5 TB, allow us to map 256 TB, which is as much as the hardware
supports (there are, in effect, only 48 virtual address bits: the
top 16 bits, 63 through 48, must match bit 47).
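
To make the arithmetic concrete, here is a minimal sketch (not
kernel code; all names are made up for illustration) that picks the
512-way index at each level out of a virtual address and checks the
"top 16 bits match bit 47" rule.  It assumes only what is stated
above: 512 entries per page, hence 9 index bits per level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Each page-table page holds 512 entries, so each level uses 9 bits. */
#define ENTRIES_PER_PAGE 512u
#define L1_SHIFT 12	/* one L1 PTE covers 4 KB */
#define L2_SHIFT 21	/* one L2 PTE covers 2 MB */
#define L3_SHIFT 30	/* one L3 PTE covers 1 GB */
#define L4_SHIFT 39	/* one L4 PTE covers 512 GB */

/* Index into the page of PTEs whose entries each cover 2^shift bytes. */
static unsigned
pte_index(uint64_t va, int shift)
{
	assert(shift == L1_SHIFT || shift == L2_SHIFT ||
	    shift == L3_SHIFT || shift == L4_SHIFT);
	return (unsigned)((va >> shift) & (ENTRIES_PER_PAGE - 1));
}

/* True if bits 63..48 are all copies of bit 47. */
static bool
is_canonical(uint64_t va)
{
	uint64_t top = va >> 47;	/* bits 63..47: 17 bits */

	return top == 0 || top == 0x1ffff;
}
```

For example, pte_index(0x140000000, L3_SHIFT) is 5, since that
address is exactly 5 GB, and 512 << L4_SHIFT is 2^48 bytes, the
256 TB mentioned above.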

The L4 entry halfway down, at PML4PML4I, is set to point back to
this page itself; that's the "recursive page table" for user
space, which we do nothing else with at boot time.

We need (up to) NDMPML4E pages, each holding 512 L3 PTEs, for the
direct map space.  If the processor supports 1 GB pages, an L3 PTE
can be marked with "stop here" and these L3 PTEs each grant (or
forbid) access to 1 GB of physical space at a time.  A system
with, say, 3 GB of RAM starting at 0 can map it all with three L3
PTEs: "address 0 is valid for 1GB", "address 1GB is valid for
1GB", "address 2GB is valid for 1GB".  The remaining L3 PTEs are
zero, making the remaining address space invalid.

If the processor does not support 1 GB pages, or if there is less
than 1 GB of RAM "at the end" (e.g., if the system has 4.5 GB),
the L3 PTEs may need to point to more pages holding L2 PTEs.
These L2 PTEs always support 2 MB pages.  Each page of L2 PTEs
maps 1 GB. So a machine with 4.5 GB and 1 GB mappings needs one L3
page with four valid 1 GB L3 PTEs and then one L3 PTE pointing to
one page of L2 PTEs.  That one page of L2 PTEs is half-filled,
containing 256 2MB-sized PTEs, mapping the 512 MB.  The remaining
half of that page is zero, making the remaining addresses invalid.

Pictorially, and adding the names of the physical page(s), thus
far we have this.  (Note, the L4 PTE page is drawn more than twice
as tall as the L3 and L2 pages, just to get space for arrows.)

              LEVEL 4:                LEVEL 3:             LEVEL 2:
                      _._
          KPML4phys  v   \
             +---------+  |
             |  0:     |  |
             |---------|  |
             |  1:     |  |         DMPDPphys              DMPDphys
             (   ...   )  |    .-> +---------+         +----------------+
             | 255:    |  |   /    |  0: 0GB |     .-> |  0: 4GB        |
             |---------|  |  |     |  1: 1GB |    /    |  1: 4GB+2MB    |
  PML4PML4I: | 256: *--|--/  |     |  2: 2GB |   /     |  2: 4GB+4MB    |
             |---------|     |     |  3: 3GB |  /      (      ...       )
             | 257:    |     |     |  4:  *--|-/       | 255: 4.5GB-2MB |
             |   ...   |     |     |  5:     |         | 256:           |
  ________   |---------|     |     (   ...   )         | 257:           |
 /  DMPML4I: |      *--|-----/     | 511:    |         (      ...       )
 NDMPML4E    |---------|           +---------+         +----------------+
 \________   |      *--|---------> |   0:    |
             |---------|           |   1:    |
             |         |           |   2:    |  (These are used only
             |---------|           |   3:    |   if the system has more
             |   ...   |           (   ...   )   than 512 GB)
           ( |---------|      )    | 509:    |
           ( | 510: see below )    | 510:    |
           ( |---------|      )    | 511:    |
           ( | 511: see below )    +---------+
             +---------+

If the hardware supports 1GB pages, "ndm1g" is the number of
gigabyte entries (4 in the example above).  Otherwise it's just
zero.  Meanwhile "ndmpdp" is the number of gigabytes of RAM that
need to be mapped, in this case 5.  Thus, if ndmpdp > ndm1g, we
need ndmpdp-ndm1g pages to hold some L2 PTEs.
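
The counting can be sketched as below.  This is only an
illustration of the arithmetic in the text, under the assumptions
that RAM is contiguous starting at address 0 and that no minimum is
imposed on the counts; the struct and function names are invented
here and are not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GB1 (1ULL << 30)
#define MB2 (2ULL << 20)

struct dmap_counts {
	uint64_t ndmpdp;	/* gigabytes to map, rounded up */
	uint64_t ndm1g;		/* gigabytes mapped by 1 GB L3 PTEs */
	uint64_t l2_pages;	/* pages of L2 PTEs: ndmpdp - ndm1g */
	uint64_t tail_l2_ptes;	/* valid 2 MB PTEs for a partial last GB */
};

/* Count direct-map entries for ram_bytes of RAM starting at 0. */
static struct dmap_counts
dmap_count(uint64_t ram_bytes, bool have_1g_pages)
{
	struct dmap_counts c;

	c.ndmpdp = (ram_bytes + GB1 - 1) / GB1;
	c.ndm1g = have_1g_pages ? ram_bytes / GB1 : 0;
	assert(c.ndm1g <= c.ndmpdp);
	c.l2_pages = c.ndmpdp - c.ndm1g;
	/* 0 when RAM is a whole number of gigabytes. */
	c.tail_l2_ptes = (ram_bytes % GB1 + MB2 - 1) / MB2;
	return c;
}
```

With 4.5 GB and 1 GB pages this gives ndmpdp = 5, ndm1g = 4, one
page of L2 PTEs, and 256 valid entries in that page.  Without 1 GB
pages, ndm1g is 0 and all five gigabytes need their own L2 page.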

Now we get to the weirder case of the kernel itself (both its
non-direct-mapped dynamically allocated virtual memory, and its
text/data/bss).  The branch offset limitations encourage the
placement of the kernel's text, etc., in the last 2 GB of virtual
space, i.e., starting at 0xffff.ffff.8000.0000.  But, we want
a reasonable amount of room for dynamic VM.  So we give the kernel
at least 512 GB of VM -- that's one L4 PTE -- while making sure that
the text snuggles up close to the end of the space, in that last 2 GB
of the at-least-512-GB area.
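
As a rough check of where things land, the sketch below computes
the first virtual address covered by a given L4 slot and the start
of the last N bytes of the address space.  The helper names are
hypothetical; the only facts used are that one L4 PTE covers 512 GB
and that bits 63..48 copy bit 47.

```c
#include <assert.h>
#include <stdint.h>

#define GB1	(1ULL << 30)
#define L4_SPAN	(512 * GB1)	/* one L4 PTE covers 512 GB */

/* Lowest canonical virtual address covered by L4 slot `slot`. */
static uint64_t
l4_slot_base(unsigned slot)
{
	uint64_t va;

	assert(slot < 512);
	va = (uint64_t)slot * L4_SPAN;		/* fills bits 47..39 */
	if (slot >= 256)			/* bit 47 set: sign-extend */
		va |= 0xffff000000000000ULL;
	return va;
}

/* Start of the last `bytes` bytes of the 64-bit address space. */
static uint64_t
last_bytes_start(uint64_t bytes)
{
	return (uint64_t)0 - bytes;	/* unsigned wraparound is defined */
}
```

Here l4_slot_base(511) is 0xffffff8000000000 and
last_bytes_start(2 * GB1) is 0xffffffff80000000, so the final 2 GB
begin 510 GB into the last L4 slot.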

Meanwhile, the boot loader has loaded the kernel into relatively
low physical memory addresses.

If KPML4I is 511 (and it actually is), this uses the final L4 slot
to map the kernel.  If we want to allow kernel VM to have more
than 512 GB available, though, we need extra space below KPML4I,
i.e., starting at KPML4BASE.  So we allocate NKPML4E pages that
we set up as L3 PTEs, and point the last NKPML4E slots in the L4
page table here.  If NKPML4E is 4, for instance, we will have
this:

  last part of KPML4phys:
             (   ...   )    .----> [page #0 of all-zero L3 PTEs]
             | DMPML4I |   /
             (   ...   )   |  .--> [page #1 of all-zero L3 PTEs]
             | 507:    |   | /
             | 508: *--|--/  | .-> [page #2 of all-zero L3 PTEs]
             | 509: *--|----/  |
             | 510: *--|------/
             | 511: *--|---------> [page #3 of L3 PTEs, see below]
             +---------+

The reason for having those "empty" (all-zero) PTE pages is that
whenever new processes are created, in pmap_pinit(), they have
their (new) L4 PTE page set up to point to the *same* physical
pages that the kernel is using.  Thus, if the kernel creates or
destroys any level-3-or-below mapping by writing into any of the
above four pages, that mapping is also created/destroyed in all
processes.  Similarly, the NDMPML4E pages starting at DMPDPphys are
mapped identically in all processes.  The kernel can therefore
"borrow" a user pmap at any time, i.e., there's no need to adjust
the CPU's CR3 on entry to the kernel.
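
The sharing effect can be shown with a small user-space toy.  This
is not pmap_pinit(); it is a sketch that assumes a "PTE" is just
the ordinary address of the next-level page (no flag bits, no
physical addresses), with invented names and a fixed count of 4
kernel slots as in the example above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NPTE		512			/* entries per page */
#define TOY_NKPML4E	4			/* hypothetical count */
#define TOY_KPML4BASE	(NPTE - TOY_NKPML4E)	/* first kernel slot */

struct pt_page {
	uint64_t pte[NPTE];
};

/*
 * Point the last TOY_NKPML4E slots of a fresh L4 page at the
 * kernel's L3 pages, which are contiguous in this toy (an array).
 */
static void
toy_pinit(struct pt_page *new_l4, struct pt_page *kernel_l3)
{
	memset(new_l4, 0, sizeof(*new_l4));
	for (int i = 0; i < TOY_NKPML4E; i++)
		new_l4->pte[TOY_KPML4BASE + i] =
		    (uint64_t)(uintptr_t)&kernel_l3[i];
}

/* Follow an L4 slot to the L3 page it names. */
static struct pt_page *
toy_l3(const struct pt_page *l4, int slot)
{
	assert(slot >= 0 && slot < NPTE);
	return (struct pt_page *)(uintptr_t)l4->pte[slot];
}
```

If two L4 pages are set up with toy_pinit() from the same L3 array,
a store through toy_l3(&a, 511)->pte[510] is immediately visible
through toy_l3(&b, 511)->pte[510], because both slots name the same
page.  That is the whole point of keeping the all-zero L3 pages
around.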

(If we used bcopy() to copy the kernel pmap's NKPML4E and NDMPML4E
entries into the new pmap, the L3 pages would not have to be
physically contiguous, but the KVA ones would still all have to
exist.  It's free to allocate physically contiguous pages here
anyway though.)

So, the last NKPML4E slots in KPML4phys point to the following
page tables, which use all of L3, L2, and L1 style PTEs.  (Note
that we did not need any L1 PTEs for the direct map, which always
uses 2MB or 1GB super-pages.)

          LEVEL 3:         LEVEL 2:                 LEVEL 1:

    (assuming NKPML4E=4)                            (nkpt pages)
         KPDPphys                                      KPTphys
        +---------+                               +---------------+
 page 0 |  0:     |                           .-> |  0:      0 KB |
        |  1:     |                          /    |  1:      4 KB |
        |  2:     |                         /     |  2:      8 KB |
        |  3:     |                        /      |  3:     12 KB |
        (   ...   )                       |       (      ...      )
        | 509:    |                       |       | 509: 2MB-12KB |
        | 510:    |                       |       | 510: 2MB-8KB  |
        | 511:    |                       |       | 511: 2MB-4KB  |
        +---------+                       |       +---------------+
 page 1 |  0:     |                       |   .-> |  0:      2 MB |
        |  1:     |                       |  /    |  1:   2MB+4KB |
        |  2:     |                       | |     (      ...      )
        |  3:     |                       | |     (      ...      )
        (   ...   )                       | |     +---------------+
        | 509:    |                       | | .-> (      ...      )
        | 510:    |                       | | |   (      ...      )
        | 511:    |            KPDphys    | | |   +---------------+
        +---------+          +---------+  | | | ..(  ... ... ...  )
 page 2 |  0:     |    .---> |  0:  *--|--/ | | .       [etc]
        |  1:     |   /      |  1:  *--|---/  | .
        |  2:     |  |       |  2:  *--|-----/ .
        |  3:     |  |       |  3:  *--|---....
        (   ...   )  |       (   ...   )
        | 509:    |  |       | 509: ...|
        | 510:    |  |       | 510: ...|
        | 511:    |  |       | 511: ...|
        +---------+  |       +---------+
 page 3 |  0:     |  |   .-> |  0:  ...|
        |  1:     |  |  /    (   ...   )
        |  2:     |  |  |    (   ...   )
        |  3:     |  |  |    (   ...   )
        (   ...   )  |  |    (   ...   )
        | 509:    |  |  |    (   ...   )
        | 510: *--|--/  |    (   ...   )
        | 511: *--|----/     | 511:    |
        +---------+          +---------+

There are nkpdpe pages at KPDphys, where nkpdpe is either 1 or 2.
One page maps 1 GB, and the other page maps the remaining 1 GB.
Remember that kernel text+data+bss lives in the final 2 GB of the
virtual address space, so there cannot be more than 2 GB.  These
one or two pages map nkpt pages at KPTphys.
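
For a feel of the sizes involved, here is a sketch of the page
counts, assuming the kernel region is mapped with 4 KB pages
throughout.  How the boot code actually chooses nkpt is not covered
above, so this shows only the arithmetic; the function names are
invented.

```c
#include <assert.h>
#include <stdint.h>

#define MB2 (2ULL << 20)	/* mapped by one full page of L1 PTEs */
#define GB1 (1ULL << 30)	/* mapped by one full page of L2 PTEs */

/* Round-up division. */
static uint64_t
howmany_u64(uint64_t n, uint64_t unit)
{
	return (n + unit - 1) / unit;
}

/* Pages of L1 PTEs needed to map `bytes` with 4 KB pages. */
static uint64_t
l1_pages_for(uint64_t bytes)
{
	assert(bytes <= 2 * GB1);	/* text+data+bss fit in 2 GB */
	return howmany_u64(bytes, MB2);
}

/* Pages of L2 PTEs needed to point at that many L1 pages. */
static uint64_t
l2_pages_for(uint64_t l1_pages)
{
	return howmany_u64(l1_pages, 512);
}
```

Mapping 64 MB this way takes 32 pages of L1 PTEs and one page of L2
PTEs.  The full 2 GB takes 1024 pages of L1 PTEs and two pages of
L2 PTEs, which is why the count at KPDphys never exceeds 2.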
