On Fri, May 08, 2015 at 14:51:58 -0700, Richard Henderson wrote:
> Ouch. 24 of 64 wasted bytes for 64-bit?
>
> I wonder if there's a better way we can encode this to avoid 3 copies of the
> virtual address for read/write/code. Or if we're better off using more than
> one insn to multiply by a non-power-of-two.
Adding one more instruction works well. Performance tests (# time ./selftest.sh
for [1]) show no appreciable difference. The patch (i386-only) is appended.

[1] http://wiki.qemu.org/download/ppc-virtexml507-linux-2_6_34.tgz

> Or if the hardware multiplier is
> fast enough just multiply by the proper constant.

This option should be slower (3-cycle latency for IMUL on Ivy Bridge/Haswell,
probably slower on others), but would make the code very simple. Unfortunately
I couldn't write a patch to do it due to my poor grasp of the TCG backend code.
If someone provides a patch, I'd be glad to test it.

If the appended patch ends up being the preferred option, I can extend it to
support all the TCG backends. I cannot, however, test all of them--I only have
access to x86 and ppc hardware at the moment.

Thanks,

		Emilio

[PATCH] i386-only: remove sizeof(CPUTLBEntry)=pow2 constraint

Breaks all non-i386 TCG backends! Do not apply.

Signed-off-by: Emilio G. Cota <c...@braap.org>
---
 include/exec/cpu-defs.h | 23 +++++++++++++++++------
 tcg/i386/tcg-target.c   | 12 +++++++-----
 2 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 8891f16..716052a 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -96,14 +96,25 @@ typedef struct CPUTLBEntry {
     /* Addend to virtual address to get host address.  IO accesses
        use the corresponding iotlb value.  */
     uintptr_t addend;
-    /* padding to get a power of two size */
-    uint8_t dummy[(1 << CPU_TLB_ENTRY_BITS) -
-                  (sizeof(target_ulong) * 4 +
-                   ((-sizeof(target_ulong) * 4) & (sizeof(uintptr_t) - 1)) +
-                   sizeof(uintptr_t))];
 } CPUTLBEntry;
 
-QEMU_BUILD_BUG_ON(sizeof(CPUTLBEntry) != (1 << CPU_TLB_ENTRY_BITS));
+/*
+ * Fast TLB hits are essential for softmmu performance. Since we hand-code in
+ * assembly the TCG check for TLB hits, we define here a pair of constants that
+ * allow us to use only shifts to obtain a TLBEntry address given its index.
+ * NOTE: The constants below should thus be updated every time changes are
+ *       made to the CPUTLBEntry struct. See the compile-time consistency check below.
+ */
+#if TARGET_LONG_BITS == 64
+#define CPU_TLB_ENTRY_OUT_SHIFT 3
+#define CPU_TLB_ENTRY_IN_SHIFT 2
+#else
+#define CPU_TLB_ENTRY_OUT_SHIFT (HOST_LONG_BITS == 32 ? 2 : 3)
+#define CPU_TLB_ENTRY_IN_SHIFT (HOST_LONG_BITS == 32 ? 2 : 1)
+#endif
+
+QEMU_BUILD_BUG_ON(sizeof(CPUTLBEntry) != \
+                  (1 + BIT(CPU_TLB_ENTRY_IN_SHIFT)) << CPU_TLB_ENTRY_OUT_SHIFT);
 
 /* The IOTLB is not accessed directly inline by generated TCG code,
  * so the CPUIOTLBEntry layout is not as critical as that of the
diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index ab63823..54250f5 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -1195,15 +1195,17 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
     tcg_out_mov(s, htype, r0, addrlo);
     tcg_out_mov(s, ttype, r1, addrlo);
 
-    tcg_out_shifti(s, SHIFT_SHR + hrexw, r0,
-                   TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
+    tcg_out_shifti(s, SHIFT_SHR + hrexw, r0, TARGET_PAGE_BITS);
 
     tgen_arithi(s, ARITH_AND + trexw, r1,
                 TARGET_PAGE_MASK | ((1 << s_bits) - 1), 0);
-    tgen_arithi(s, ARITH_AND + hrexw, r0,
-                (CPU_TLB_SIZE - 1) << CPU_TLB_ENTRY_BITS, 0);
+    tgen_arithi(s, ARITH_AND + hrexw, r0, CPU_TLB_SIZE - 1, 0);
 
-    tcg_out_modrm_sib_offset(s, OPC_LEA + hrexw, r0, TCG_AREG0, r0, 0,
+    tcg_out_modrm_sib_offset(s, OPC_LEA + hrexw, r0, r0, r0,
+                             CPU_TLB_ENTRY_IN_SHIFT, 0);
+
+    tcg_out_modrm_sib_offset(s, OPC_LEA + hrexw, r0, TCG_AREG0, r0,
+                             CPU_TLB_ENTRY_OUT_SHIFT,
                              offsetof(CPUArchState, tlb_table[mem_index][0])
                              + which);
-- 
1.8.3
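
For anyone wondering what the two LEAs compute, below is a minimal,
self-contained C sketch (not QEMU code) of the address arithmetic. The
constants and the env_base/table_off parameters are illustrative stand-ins
for TCG_AREG0 and offsetof(CPUArchState, tlb_table[mem_index][0]) + which;
the shift values assume a 64-bit target on a 64-bit host.

  /* Sketch of the TLB entry address computation done by the two LEAs. */
  #include <assert.h>
  #include <stdint.h>
  #include <stdio.h>

  #define TARGET_PAGE_BITS 12
  #define CPU_TLB_SIZE     256          /* entries per MMU mode (example) */
  #define ENTRY_IN_SHIFT   2            /* stands for CPU_TLB_ENTRY_IN_SHIFT */
  #define ENTRY_OUT_SHIFT  3            /* stands for CPU_TLB_ENTRY_OUT_SHIFT */
  /* sizeof(CPUTLBEntry) must equal (1 + 2^IN_SHIFT) << OUT_SHIFT, here 40 */
  #define ENTRY_SIZE       ((1 + (1 << ENTRY_IN_SHIFT)) << ENTRY_OUT_SHIFT)

  /* env_base stands for TCG_AREG0 (the CPUArchState pointer); table_off for
   * offsetof(CPUArchState, tlb_table[mem_index][0]) + which. */
  static uintptr_t tlb_entry_addr(uintptr_t env_base, uintptr_t table_off,
                                  uint64_t vaddr)
  {
      uintptr_t idx = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);

      /* lea r0, [r0 + r0 * 2^IN_SHIFT]  ->  idx * (1 + 2^IN_SHIFT) */
      uintptr_t scaled = idx + (idx << ENTRY_IN_SHIFT);

      /* lea r0, [env + r0 * 2^OUT_SHIFT + table_off] */
      uintptr_t addr = env_base + (scaled << ENTRY_OUT_SHIFT) + table_off;

      /* Same result as a plain multiply by the entry size. */
      assert(addr == env_base + idx * ENTRY_SIZE + table_off);
      return addr;
  }

  int main(void)
  {
      printf("entry size = %d bytes\n", ENTRY_SIZE);  /* 40 */
      printf("addr = %#lx\n",
             (unsigned long)tlb_entry_addr(0x1000000, 0x80, 0xdeadbeef));
      return 0;
  }

The point of dropping the power-of-two padding is that index *
sizeof(CPUTLBEntry) factors as (index + (index << IN_SHIFT)) << OUT_SHIFT, so
on i386/x86-64 it fits in the two LEAs above and avoids an IMUL; the same
factoring should map onto a shift-and-add plus a shift on other backends.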