Thanks Peter.

How does QEMU deal with different page sizes? Does a 2MB hugepage have a single 
corresponding TLB entry, or is it partitioned into 512 4K pages with 512 TLB 
entries?

Does a CPUTLBDescFast always hold TLB entries for a single process? Is it 
always flushed/restored on a context switch?

Is MMU-IDX used for different translation regimes, or for exception levels?

How about the ITLB? It looks like QEMU has a unified (rather than split I/D) 
TLB implementation, since the TLB entries have read/write/execute flags. Am I 
correct?


I am exploring reconstructing a guest TLB (i.e. guest VA --> guest PA) for the 
running process (i.e. I can live with a TLB that only covers the running 
process). I found that execlog.c calls qemu_plugin_get_hwaddr to get the guest 
PA. A quick eyeballing of that function suggests it populates 
data->v.ram.hostaddr with a host VA rather than a guest PA (see line 1699 
below). Am I correct?

What is the correct way to construct a guest TLB for the running process from 
QEMU's data structures at runtime?

1681 bool tlb_plugin_lookup(CPUState *cpu, target_ulong addr, int mmu_idx,
1682                        bool is_store, struct qemu_plugin_hwaddr *data)
1683 {
1684     CPUArchState *env = cpu->env_ptr;
1685     CPUTLBEntry *tlbe = tlb_entry(env, mmu_idx, addr);
1686     uintptr_t index = tlb_index(env, mmu_idx, addr);
1687     target_ulong tlb_addr = is_store ? tlb_addr_write(tlbe) : tlbe->addr_read;
1688
1689     if (likely(tlb_hit(tlb_addr, addr))) {
1690         /* We must have an iotlb entry for MMIO */
1691         if (tlb_addr & TLB_MMIO) {
1692             CPUIOTLBEntry *iotlbentry;
1693             iotlbentry = &env_tlb(env)->d[mmu_idx].iotlb[index];
1694             data->is_io = true;
1695             data->v.io.section = iotlb_to_section(cpu, iotlbentry->addr, iotlbentry->attrs);
1696             data->v.io.offset = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
1697         } else {
1698             data->is_io = false;
1699             data->v.ram.hostaddr = (void *)((uintptr_t)addr + tlbe->addend);
1700         }
1701         return true;
1702     } else {
1703         SavedIOTLB *saved = &cpu->saved_iotlb;
1704         data->is_io = true;
1705         data->v.io.section = saved->section;
1706         data->v.io.offset = saved->mr_offset;
1707         return true;
1708     }
1709 }
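
For context, here is roughly what I am experimenting with on the plugin side. 
This is only a sketch modelled on execlog.c's use of the memory callback: 
vcpu_mem is my own callback name, record_mapping() is a hypothetical 
placeholder for whatever per-process guest VA -> guest PA table I end up 
keeping, and I am assuming qemu_plugin_hwaddr_phys_addr() is the right call 
for the guest PA.

#include <qemu-plugin.h>

/* Registered per instruction from the tb_trans callback, as execlog.c does:
 * qemu_plugin_register_vcpu_mem_cb(insn, vcpu_mem, QEMU_PLUGIN_CB_NO_REGS,
 *                                  QEMU_PLUGIN_MEM_RW, NULL);
 */
static void vcpu_mem(unsigned int cpu_index, qemu_plugin_meminfo_t info,
                     uint64_t vaddr, void *udata)
{
    struct qemu_plugin_hwaddr *hwaddr = qemu_plugin_get_hwaddr(info, vaddr);

    if (hwaddr && !qemu_plugin_hwaddr_is_io(hwaddr)) {
        /* Guest physical address of the access */
        uint64_t paddr = qemu_plugin_hwaddr_phys_addr(hwaddr);

        /* record_mapping() is a placeholder for my own guest VA -> guest PA
         * table for the current process. */
        record_mapping(vaddr, paddr);
    }
}

(I am not sure whether this is the intended interface, or whether I need to 
look at the CPUTLB structures directly, hence the question above.)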

Thanks a bunch!

Regards
________________________________
From: Peter Maydell <peter.mayd...@linaro.org>
Sent: October 6, 2022 10:50 AM
To: a b <blue_3...@hotmail.com>
Cc: qemu-devel@nongnu.org <qemu-devel@nongnu.org>
Subject: Re: A few QEMU questions

On Thu, 6 Oct 2022 at 08:34, a b <blue_3...@hotmail.com> wrote:
>
> Thanks a lot Peter for the clarification. It is very helpful.
>
> My naive understanding is that each MMU has only 1 TLB, so why do we need an 
> array of CPUTLBDescFast structures? How do these different CPUTLBDescFast 
> data structures correlate with a hardware TLB?
>
> 220 typedef struct CPUTLB {
> 221     CPUTLBCommon c;
> 222     CPUTLBDesc d[NB_MMU_MODES];
> 223     CPUTLBDescFast f[NB_MMU_MODES];
> 224 } CPUTLB;

QEMU's "TLB" doesn't really correlate with a hardware TLB
except in that they're serving vaguely similar purposes.
A hardware TLB is a h/w structure which accelerates the lookup
  virtual-address => (physical-address, permissions)
QEMU's TLB is a software data structure which accelerates
the lookup
  virtual-address => (host virtual address or device MemoryRegion structure)

It's not an emulation of the "real" CPU TLB. (Note that this
means that you can't use QEMU to look at performance behaviour
around whether guest code is hitting or missing in the TLB,
and that the size of QEMU's TLB is unrelated to the size of a
TLB on the real CPU.)

Further, the set of things that can be done fast in hardware
differs from the set of things that can be done fast in
software. In hardware, a TLB is a "content-addressable
memory" that essentially checks every entry in parallel to
find the match in fixed time. In this kind of hardware it's
easy to add checks like "and it should match the right ASID"
or "and it must be an entry for EL2" without it making the
lookup slower. In software, you can't do that kind of parallel
lookup, so we must use a different structure. Instead of
having one TLB that can store entries for multiple contexts
at once and where we check that the context is correct when
we look up the address, we have effectively a separate TLB
for each context, so we can look up the address in an O(1)
data structure that has exactly one entry for the address,
and know that if it is present it is the correct entry.

The aim of the QEMU TLB design is to make the "fast path"
lookup of guest virtual address to host virtual address for
RAM accesses as fast as possible (it is a handful of
instructions directly generated as part of the JIT output);
the slow path for faults, hardware accesses, etc, is handled
in C code and is less performance critical.

> Why do we want to store a shifted (n_entries-1) in mask?
> 184 typedef struct CPUTLBDescFast {
> 185     /* Contains (n_entries - 1) << CPU_TLB_ENTRY_BITS */
> 186     uintptr_t mask;
> 187     /* The array of tlb entries itself. */
> 188     CPUTLBEntry *table;
> 189 } CPUTLBDescFast QEMU_ALIGNED(2 * sizeof(void *));

The mask field is a pre-calculated value that is going to
be used as part of the "given a virtual address, find the
table entry" lookup. Because the number of entries in the table
varies, the part of the address we need to use as the index
also varies. We pre-calculate the mask in a convenient format
for the generated JIT code because if we stored just n_entries
here it would cost us an extra instruction or two in the fast path.
(To understand these data structures you probably want to also
be looking at the code that generates the lookup code, which
you can find under tcg/, usually in a function named
tcg_out_tlb_load or tcg_out_tlb_read or similar.)
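
For illustration, the C slow-path equivalent of that lookup is the
tlb_index()/tlb_entry() pair, roughly (paraphrasing include/exec/cpu_ldst.h,
so check the source for the exact form):

/* Find the TLB index corresponding to the mmu_idx + address pair.  */
static inline uintptr_t tlb_index(CPUArchState *env, uintptr_t mmu_idx,
                                  target_ulong addr)
{
    uintptr_t size_mask = env_tlb(env)->f[mmu_idx].mask >> CPU_TLB_ENTRY_BITS;

    return (addr >> TARGET_PAGE_BITS) & size_mask;
}

/* Find the TLB entry corresponding to the mmu_idx + address pair.  */
static inline CPUTLBEntry *tlb_entry(CPUArchState *env, uintptr_t mmu_idx,
                                     target_ulong addr)
{
    return &env_tlb(env)->f[mmu_idx].table[tlb_index(env, mmu_idx, addr)];
}

The JITted fast path avoids the extra shift: it shifts the address right by
(TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS) and ANDs it with the pre-shifted mask,
so the result is already a byte offset it can add straight to the table
pointer.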

> Why doesn't CPUTLBEntry have information like ASID, shared
> (or global) bits?  How do we know if the TLB entry is a match
> for a particular process?

We don't store the ASID because it would be slow to do a check
on it when we got a TLB hit, and it would be too expensive to
have an entire separate TLB per-ASID. Instead we simply flush
the appropriate TLB when the ASID is changed. That means that
we can rely on a TLB hit being for the current context/process.
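
(As a concrete example, on Arm the TTBR write handler in target/arm/helper.c
does roughly this; paraphrased from memory, so check the source for the exact
form:

static void vmsa_ttbr_write(CPUARMState *env, const ARMCPRegInfo *ri,
                            uint64_t value)
{
    /* If the ASID changes (with a 64-bit write), we must flush the TLB.  */
    if (cpreg_field_is_64bit(ri) &&
        extract64(raw_read(env, ri) ^ value, 48, 16) != 0) {
        ARMCPU *cpu = env_archcpu(env);
        tlb_flush(CPU(cpu));
    }
    raw_write(env, ri, value);
}

so a write that changes the ASID field simply throws away the cached
translations rather than trying to keep entries for multiple ASIDs around.)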

-- PMM
