On Thu, 6 Oct 2022 at 08:34, a b <blue_3...@hotmail.com> wrote:
>
> Thanks a lot Peter for the clarification. It is very helpful.
>
> My naive understanding is that each MMU has only 1 TLB, so why do we
> need an array of CPUTLBDescFast structures? How do these different
> CPUTLBDescFast data structures correlate with a hardware TLB?
>
> typedef struct CPUTLB {
>     CPUTLBCommon c;
>     CPUTLBDesc d[NB_MMU_MODES];
>     CPUTLBDescFast f[NB_MMU_MODES];
> } CPUTLB;
QEMU's "TLB" doesn't really correlate with a hardware TLB except in
that they're serving vaguely similar purposes. A hardware TLB is a h/w
structure which accelerates the lookup
 virtual-address => (physical-address, permissions)
QEMU's TLB is a software data structure which accelerates the lookup
 virtual-address => (host virtual address or device MemoryRegion structure)
It's not an emulation of the "real" CPU TLB. (Note that this means
that you can't use QEMU to look at performance behaviour around
whether guest code is hitting or missing in the TLB, and that the size
of QEMU's TLB is unrelated to the size of a TLB on the real CPU.)

Further, the set of things that can be done fast in hardware differs
from the set of things that can be done fast in software. In hardware,
a TLB is a "content-addressable memory" that essentially checks every
entry in parallel to find the match in fixed time. In this kind of
hardware it's easy to add checks like "and it should match the right
ASID" or "and it must be an entry for EL2" without making the lookup
slower. In software, you can't do that kind of parallel lookup, so we
must use a different structure. Instead of having one TLB that can
store entries for multiple contexts at once, where we check that the
context is correct when we look up the address, we have effectively a
separate TLB for each context, so we can look up the address in an
O(1) data structure that has exactly one entry for the address, and
know that if it is present it is the correct entry. That is why there
is an array of CPUTLBDescFast structures: one per MMU index, i.e. one
per translation context, rather than one per hardware TLB.

The aim of the QEMU TLB design is to make the "fast path" lookup of
guest virtual address to host virtual address for RAM accesses as fast
as possible (it is a handful of instructions directly generated as
part of the JIT output); the slow path for faults, hardware accesses,
etc., is handled in C code and is less performance critical. (The
second sketch at the end of this mail shows roughly what that fast
path does.)

> Why do we want to store a shifted (n_entries-1) in mask?
>
> typedef struct CPUTLBDescFast {
>     /* Contains (n_entries - 1) << CPU_TLB_ENTRY_BITS */
>     uintptr_t mask;
>     /* The array of tlb entries itself. */
>     CPUTLBEntry *table;
> } CPUTLBDescFast QEMU_ALIGNED(2 * sizeof(void *));

The mask field is a pre-calculated value that is going to be used as
part of the "given a virtual address, find the table entry" lookup.
Because the number of entries in the table varies, the part of the
address we need to use as the index also varies. We pre-calculate the
mask in a convenient format for the generated JIT code, because if we
stored just n_entries here it would cost us an extra instruction or
two in the fast path. (The first sketch at the end of this mail shows
that index calculation.)

(To understand these data structures you probably want to also be
looking at the code that generates the lookup code, which you can find
under tcg/, usually in a function named tcg_out_tlb_load or
tcg_out_tlb_read or similar.)

> Why doesn't CPUTLBEntry have information like ASID, shared
> (or global) bits? How do we know if the TLB entry is a match
> for a particular process?

We don't store the ASID because it would be slow to do a check on it
when we got a TLB hit, and it would be too expensive to have an entire
separate TLB per-ASID. Instead we simply flush the appropriate TLB
when the ASID is changed. That means that we can rely on a TLB hit
being for the current context/process.

-- PMM
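
P.S. Here is a rough, self-contained C sketch of how the mask/table
pair is used to find the one candidate entry for an address. This is
not the actual QEMU code (the real definitions and lookup live under
include/exec/ and accel/tcg/, plus the per-backend JIT emitters under
tcg/), and the constants and the CPUTLBEntry layout below are
simplified example values assuming a 64-bit host:

#include <stdint.h>

#define TARGET_PAGE_BITS    12  /* example: 4K guest pages */
#define CPU_TLB_ENTRY_BITS  5   /* log2(sizeof(CPUTLBEntry)) */
#define NB_MMU_MODES        4   /* example: the real value is per-target */

typedef struct CPUTLBEntry {
    /* comparators: page address if that access type is allowed, else -1 */
    uint64_t addr_read;
    uint64_t addr_write;
    uint64_t addr_code;
    /* for RAM pages: host address == guest address + addend */
    uintptr_t addend;
} CPUTLBEntry;

typedef struct CPUTLBDescFast {
    uintptr_t mask;             /* (n_entries - 1) << CPU_TLB_ENTRY_BITS */
    CPUTLBEntry *table;
} CPUTLBDescFast;

typedef struct CPUTLB {
    CPUTLBDescFast f[NB_MMU_MODES];  /* one fast table per MMU index */
} CPUTLB;

/*
 * Find the single entry that could describe 'addr' in the TLB for
 * 'mmu_idx'.  Because mask is pre-shifted by CPU_TLB_ENTRY_BITS, the
 * shift-and-AND already produces a byte offset into the table, so the
 * generated code never needs a separate "index * sizeof(CPUTLBEntry)"
 * step -- that is the instruction or two the pre-shifted mask saves.
 */
static inline CPUTLBEntry *tlb_entry_sketch(CPUTLB *tlb, int mmu_idx,
                                            uint64_t addr)
{
    CPUTLBDescFast *fast = &tlb->f[mmu_idx];
    uintptr_t offset = (addr >> (TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS))
                       & fast->mask;
    return (CPUTLBEntry *)((uintptr_t)fast->table + offset);
}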
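
And, reusing those types, here is roughly what the generated fast path
then does with the entry for a load. Again only a sketch: the real
generated code also folds alignment checks and flag bits into the
comparison, and the slow-path handling (page table walks, faults,
MMIO) is C code in accel/tcg/. The point to notice is that there is no
ASID anywhere in the comparison, which is why a context switch flushes
the relevant per-mmu-idx table instead:

#define TARGET_PAGE_MASK  (~(((uint64_t)1 << TARGET_PAGE_BITS) - 1))

/* stand-in for the real C slow path */
void *slow_path_load(CPUTLB *tlb, int mmu_idx, uint64_t addr);

static inline void *guest_to_host_for_load(CPUTLB *tlb, int mmu_idx,
                                           uint64_t addr)
{
    CPUTLBEntry *e = tlb_entry_sketch(tlb, mmu_idx, addr);

    if (e->addr_read == (addr & TARGET_PAGE_MASK)) {
        /*
         * Hit: this must be the right entry for the current
         * context/process, because the table was flushed when the
         * context changed.  For RAM the host pointer is just the
         * guest address plus the per-page addend.
         */
        return (void *)(uintptr_t)(addr + e->addend);
    }
    /* Miss (or not plain RAM): hand off to the slow path in C. */
    return slow_path_load(tlb, mmu_idx, addr);
}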