On 05/31/2013 11:07 AM, Jani Kokkonen wrote:
> +/* Load and compare a TLB entry, leaving the flags set. Leaves X2 pointing
> +   to the tlb entry. Clobbers X0,X1,X2,X3 and TMP. */
> +
> +static void tcg_out_tlb_read(TCGContext *s, TCGReg addr_reg,
> +                             int s_bits, uint8_t **label_ptr, int tlb_offset)
> +{
You copied the comment from ARM, and it isn't correct.  You generate
branches.

> +    TCGReg base = TCG_AREG0;
> +
> +    tcg_out_shr(s, 1, TCG_REG_TMP, addr_reg, TARGET_PAGE_BITS);
> +    tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_X1, tlb_offset);
> +    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, base, TCG_REG_X1, 0);
> +    tcg_out_andi(s, 1, TCG_REG_X0, TCG_REG_TMP, CPU_TLB_BITS, 0);
> +    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, TCG_REG_X2,
> +                  TCG_REG_X0, -CPU_TLB_ENTRY_BITS);
> +#if TARGET_LONG_BITS == 64
> +    tcg_out_ldst(s, LDST_64, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
> +#else
> +    tcg_out_ldst(s, LDST_32, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
> +#endif
> +    /* check alignment */
> +    if (s_bits) {
> +        tcg_out_tst(s, 1, addr_reg, s_bits, 0);
> +        label_ptr[0] = s->code_ptr;
> +        tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
> +    }
> +    tcg_out_cmp(s, 1, TCG_REG_X3, TCG_REG_TMP, -TARGET_PAGE_BITS);
> +    label_ptr[1] = s->code_ptr;
> +    tcg_out_goto_cond_noaddr(s, TCG_COND_NE);

I'm positive that the branch predictor would be happier with a single
branch rather than the two you generate here.  It ought to be possible
to use a different set of insns to do this in one go.  How about
something like

    @ extract the tlb index from the address
    ubfm    w0, addr_reg, TARGET_PAGE_BITS, CPU_TLB_BITS

    @ add any "high bits" from the tlb offset
    @ noting that env will be much smaller than 24 bits.
    add     x1, env, tlb_offset & 0xfff000

    @ zap the tlb index from the address for compare
    @ this is all high bits plus 0-3 low bits set, so this
    @ should match a logical immediate.
    and     w/x2, addr_reg, TARGET_PAGE_MASK | ((1 << s_bits) - 1)

    @ merge the tlb index into the env+tlb_offset
    add     x1, x1, x0, lsl #3

    @ load the tlb comparator.  the 12-bit scaled offset
    @ form will fit the bits remaining from above, given that
    @ we're loading an aligned object, and so the low 2/3 bits
    @ will be clear.
    ldr     w/x0, [x1, tlb_offset & 0xfff]

    @ load the tlb addend.  do this early to avoid stalling.
    @ the addend_offset differs from tlb_offset by 1-3 words.
    @ given that we've got overlap between the scaled 12-bit
    @ value and the 12-bit shifted value above, this also ought
    @ to always be representable.
    ldr     x3, [x1, (tlb_offset & 0xfff) + (addend_offset - tlb_offset)]

    @ perform the comparison
    cmp     w/x0, w/x2

    @ generate the complete host address in parallel with the cmp.
    add     x3, x3, addr_reg            @ 64-bit guest
    add     x3, x3, addr_reg, uxtw      @ 32-bit guest

    bne     miss_label

Note that the w/x above indicates the ext setting that ought to be
used, depending on the address size of the guest.

This is at least 2 insns shorter than your sequence.

Have you looked at doing the out-of-line tlb miss sequence right from
the very beginning?  It's not that much more difficult to accomplish
than the inline tlb miss.  See CONFIG_QEMU_LDST_OPTIMIZATION, and the
implementation in tcg/arm.  You won't need two nops after the call;
aarch64 can do all the required extensions and data movement
operations in a single insn.

r~
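
For anyone wanting to sanity-check the arithmetic, here is a minimal
standalone C model of the lookup that the sequence above performs:
extract the tlb index, mask the address down to the comparator bits,
and add the per-entry addend to form the host address.  The constants,
the two-field entry layout, the example values, and the
tlb_entry/tlb_lookup names are simplified stand-ins for illustration,
not QEMU's real CPUTLBEntry or anything from the patch:

    /* Minimal sketch, not QEMU code: constants and layout are
       simplified stand-ins for CPUTLBEntry. */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define TARGET_PAGE_BITS 12
    #define TARGET_PAGE_MASK (~(((uint64_t)1 << TARGET_PAGE_BITS) - 1))
    #define CPU_TLB_BITS     8
    #define CPU_TLB_SIZE     (1 << CPU_TLB_BITS)

    typedef struct {
        uint64_t comparator;  /* page-aligned guest address of the entry */
        uint64_t addend;      /* host = guest + addend on a hit */
    } tlb_entry;

    /* Return the host address on a hit, 0 to signal the slow path. */
    static uint64_t tlb_lookup(const tlb_entry *tlb, uint64_t addr, int s_bits)
    {
        /* ubfm w0, addr_reg, TARGET_PAGE_BITS, CPU_TLB_BITS */
        uint64_t index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);

        /* and w/x2, addr_reg, TARGET_PAGE_MASK | ((1 << s_bits) - 1):
           keep the page number plus the low alignment bits, so a
           misaligned access fails the compare and goes slow-path. */
        uint64_t cmp = addr & (TARGET_PAGE_MASK | (((uint64_t)1 << s_bits) - 1));

        if (cmp != tlb[index].comparator) {
            return 0;                        /* bne miss_label */
        }
        return addr + tlb[index].addend;     /* add x3, x3, addr_reg */
    }

    int main(void)
    {
        static tlb_entry tlb[CPU_TLB_SIZE];
        uint64_t guest = 0x40002008;
        uint64_t index = (guest >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);

        tlb[index].comparator = guest & TARGET_PAGE_MASK;
        tlb[index].addend = 0x7f0000000000ull;

        /* aligned 4-byte access: hit */
        printf("hit:  %#" PRIx64 "\n", tlb_lookup(tlb, guest, 2));
        /* misaligned 4-byte access: compare fails, take the slow path */
        printf("miss: %#" PRIx64 "\n", tlb_lookup(tlb, guest + 1, 2));
        return 0;
    }

The reason the separate alignment test disappears is the or of
((1 << s_bits) - 1) into the mask: a misaligned access leaves low bits
set, fails the compare against the page-aligned comparator, and takes
the same miss path as a tlb miss, so one bne covers both cases.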