On Thu, 6 Oct 2022 at 08:34, a b <blue_3...@hotmail.com> wrote:
>
> Thanks a lot Peter for the clarification. It is very helpful.
>
> My naive understanding is that each MMU has only one TLB, so why do we
> need an array of CPUTLBDescFast structures? How do these different
> CPUTLBDescFast data structures correlate with a hardware TLB?
>
> 220 typedef struct CPUTLB {
> 221     CPUTLBCommon c;
> 222     CPUTLBDesc d[NB_MMU_MODES];
> 223     CPUTLBDescFast f[NB_MMU_MODES];
> 224 } CPUTLB;

QEMU's "TLB" doesn't really correlate with a hardware TLB
except in that they're serving vaguely similar purposes.
A hardware TLB is a h/w structure which accelerates the lookup
  virtual-address => (physical-address, permissions)
QEMU's TLB is a software data structure which accelerates
the lookup
  virtual-address => (host virtual address or device MemoryRegion structure)

It's not an emulation of the "real" CPU TLB. (Note that this
means that you can't use QEMU to look at performance behaviour
around whether guest code is hitting or missing in the TLB,
and that the size of QEMU's TLB is unrelated to the size of a
TLB on the real CPU.)
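
For concreteness, each entry in QEMU's software TLB looks
roughly like the sketch below. This is a simplified version of
CPUTLBEntry from include/exec/cpu-defs.h: the real definition
also packs flag bits into the low bits of the address fields
and pads the struct out to a power-of-two size.

#include <stdint.h>

/* Simplified sketch only; assumes a 64-bit guest, so that
 * target_ulong is uint64_t. */
typedef uint64_t target_ulong;

typedef struct CPUTLBEntry {
    /* Guest virtual page address this entry matches, one
     * comparator per access kind (load, store, ifetch). */
    target_ulong addr_read;
    target_ulong addr_write;
    target_ulong addr_code;
    /* For RAM-backed pages: value to add to the guest virtual
     * address to get the host virtual address of the memory. */
    uintptr_t addend;
} CPUTLBEntry;

On a hit for a RAM page the host address is just the guest
address plus addend; accesses that can't be handled that way
(device MemoryRegions, watchpoints, ...) are flagged so that
they fall through to the slow path described below.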

Further, the set of things that can be done fast in hardware
differs from the set of things that can be done fast in
software. In hardware, a TLB is a "content-addressable
memory" that essentially checks every entry in parallel to
find the match in fixed time. In that kind of hardware it's
easy to add checks like "and it must match the right ASID"
or "and it must be an entry for EL2" without making the
lookup any slower. In software you can't do that kind of
parallel lookup, so we must use a different structure.
Instead of having one TLB that stores entries for multiple
contexts at once and checking at lookup time that the
context is correct, we have effectively a separate TLB for
each context. That lets us look up the address in an O(1)
data structure which has exactly one slot for that address,
and know that if an entry is present it is the correct one.

The aim of the QEMU TLB design is to make the "fast path"
lookup of guest virtual address to host virtual address for
RAM accesses as fast as possible (it is a handful of
instructions directly generated as part of the JIT output);
the slow path for faults, accesses to emulated devices, and
so on is handled in C code and is less performance-critical.
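
Very roughly, those generated instructions behave like the C
sketch below. This is a hand-written illustration, not actual
QEMU code: guest_load_fast and slow_path_load are made-up
names, the table passed in is the f[mmu_idx] table for
whichever translation context the access uses, and the flag
bits in the address fields are ignored. It reuses the
CPUTLBEntry and target_ulong definitions sketched above.

#include <stddef.h>
#include <stdint.h>

#define TARGET_PAGE_BITS 12   /* illustrative value */
#define TARGET_PAGE_MASK (~(target_ulong)((1 << TARGET_PAGE_BITS) - 1))

/* C helper for the slow path: TLB refill, MMIO, faults, ... */
uint64_t slow_path_load(target_ulong vaddr);

static uint64_t guest_load_fast(CPUTLBEntry *table, size_t n_entries,
                                target_ulong vaddr)
{
    /* Exactly one candidate slot per address: index by the
     * virtual page number. */
    CPUTLBEntry *e = &table[(vaddr >> TARGET_PAGE_BITS) & (n_entries - 1)];

    if (e->addr_read == (vaddr & TARGET_PAGE_MASK)) {
        /* Hit: host address = guest address + cached addend. */
        return *(uint64_t *)(uintptr_t)(vaddr + e->addend);
    }
    /* Miss, or a page that can't be handled inline. */
    return slow_path_load(vaddr);
}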

> Why do we want to store a shifted (n_entries-1) in mask?
> 184 typedef struct CPUTLBDescFast {
> 185     /* Contains (n_entries - 1) << CPU_TLB_ENTRY_BITS */
> 186     uintptr_t mask;
> 187     /* The array of tlb entries itself. */
> 188     CPUTLBEntry *table;
> 189 } CPUTLBDescFast QEMU_ALIGNED(2 * sizeof(void *));

The mask field is a pre-calculated value that is going to
be used as part of the "given a virtual address, find the
table entry" lookup. Because the number of entries in the table
varies, the part of the address we need to use as the index
also varies. We pre-calculate the mask in a convenient format
for the generated JIT code because if we stored just n_entries
here it would cost us an extra instruction or two in the fast path.
(To understand these data structures you probably also want
to look at the code that generates the lookup sequence, which
you can find under tcg/, usually in a function named
tcg_out_tlb_load or tcg_out_tlb_read or similar.)
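
Concretely, continuing the sketch above: if we stored a plain
n_entries, the generated code would have to compute an entry
index and then scale it by sizeof(CPUTLBEntry) before it could
address the table. Because mask is stored already shifted left
by CPU_TLB_ENTRY_BITS (log2 of the entry size; the entry is
padded to a power of two precisely so that this works), one
shift and one AND produce a byte offset directly. Illustrative
only, using the CPUTLBDescFast you quoted:

#define CPU_TLB_ENTRY_BITS 5   /* illustrative: 32-byte entries */

static inline CPUTLBEntry *tlb_fast_entry(CPUTLBDescFast *fast,
                                          target_ulong vaddr)
{
    /* Byte offset of this address's slot within fast->table. */
    uintptr_t offset = (vaddr >> (TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS))
                       & fast->mask;
    return (CPUTLBEntry *)((uintptr_t)fast->table + offset);
}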

> Why doesn't CPUTLBEntry have information like ASID, shared
> (or global) bits?  How do we know if the TLB entry is a match
> for a particular process?

We don't store the ASID because it would be slow to check it
when we get a TLB hit, and it would be too expensive to have
an entire separate TLB per ASID. Instead we simply flush the
appropriate TLB when the ASID changes. That means we can
rely on a TLB hit being for the current context/process.
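
As an illustration of "flush on ASID change" (a sketch only;
the real Arm version lives in the TTBR write handling in
target/arm/helper.c, and extract_asid and the ttbr field here
are made-up names, while tlb_flush() is the real QEMU API that
discards every cached entry for a vCPU):

static void set_ttbr(CPUState *cpu, CPUArchState *env, uint64_t new_ttbr)
{
    /* Switching to a different ASID: drop all cached
     * translations, so that any later TLB hit must belong to
     * the new context. */
    if (extract_asid(env->ttbr) != extract_asid(new_ttbr)) {
        tlb_flush(cpu);
    }
    env->ttbr = new_ttbr;
}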

-- PMM
