On 05/31/2013 11:07 AM, Jani Kokkonen wrote:
> +/* Load and compare a TLB entry, leaving the flags set. Leaves X2 pointing
> +   to the tlb entry. Clobbers X0,X1,X2,X3 and TMP. */
> +
> +static void tcg_out_tlb_read(TCGContext *s, TCGReg addr_reg,
> +                             int s_bits, uint8_t **label_ptr, int tlb_offset)
> +{
You copied the comment from ARM, and it isn't correct.  You generate
branches.

> +    TCGReg base = TCG_AREG0;
> +
> +    tcg_out_shr(s, 1, TCG_REG_TMP, addr_reg, TARGET_PAGE_BITS);
> +    tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_X1, tlb_offset);
> +    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, base, TCG_REG_X1, 0);
> +    tcg_out_andi(s, 1, TCG_REG_X0, TCG_REG_TMP, CPU_TLB_BITS, 0);
> +    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, TCG_REG_X2,
> +                  TCG_REG_X0, -CPU_TLB_ENTRY_BITS);
> +#if TARGET_LONG_BITS == 64
> +    tcg_out_ldst(s, LDST_64, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
> +#else
> +    tcg_out_ldst(s, LDST_32, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
> +#endif
> +    /* check alignment */
> +    if (s_bits) {
> +        tcg_out_tst(s, 1, addr_reg, s_bits, 0);
> +        label_ptr[0] = s->code_ptr;
> +        tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
> +    }
> +    tcg_out_cmp(s, 1, TCG_REG_X3, TCG_REG_TMP, -TARGET_PAGE_BITS);
> +    label_ptr[1] = s->code_ptr;
> +    tcg_out_goto_cond_noaddr(s, TCG_COND_NE);

I'm positive that the branch predictor would be happier with a single
branch rather than the two you generate here.  It ought to be possible
to use a different set of insns to do this in one go.  How about
something like

    @ extract the tlb index from the address
    ubfm    w0, addr_reg, TARGET_PAGE_BITS, CPU_TLB_BITS

    @ add any "high bits" from the tlb offset
    @ noting that env will be much smaller than 24 bits.
    add     x1, env, tlb_offset & 0xfff000

    @ zap the tlb index from the address for compare
    @ this is all high bits plus 0-3 low bits set, so this
    @ should match a logical immediate.
    and     w/x2, addr_reg, TARGET_PAGE_MASK | ((1 << s_bits) - 1)

    @ merge the tlb index into the env+tlb_offset
    add     x1, x1, x0, lsl #3

    @ load the tlb comparator.  the 12-bit scaled offset
    @ form will fit the bits remaining from above, given that
    @ we're loading an aligned object, and so the low 2/3 bits
    @ will be clear.
    ldr     w/x0, [x1, tlb_offset & 0xfff]

    @ load the tlb addend.  do this early to avoid stalling.
    @ the addend_offset differs from tlb_offset by 1-3 words.
    @ given that we've got overlap between the scaled 12-bit
    @ value and the 12-bit shifted value above, this also ought
    @ to always be representable.
    ldr     x3, [x1, (tlb_offset & 0xfff) + (addend_offset - tlb_offset)]

    @ perform the comparison
    cmp     w/x0, w/x2

    @ generate the complete host address in parallel with the cmp.
    add     x3, x3, addr_reg            @ 64-bit guest
    add     x3, x3, addr_reg, uxtw      @ 32-bit guest

    bne     miss_label

Note that the w/x above indicates the ext setting that ought to be
used, depending on the address size of the guest.

This is at least 2 insns shorter than your sequence.

Have you looked at doing the out-of-line tlb miss sequence right from
the very beginning?  It's not that much more difficult to accomplish
than the inline tlb miss.  See CONFIG_QEMU_LDST_OPTIMIZATION, and the
implementation in tcg/arm.  You won't need two nops after the call;
aarch64 can do all the required extensions and data movement
operations in a single insn.

r~
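
For anyone wanting to sanity-check the arithmetic, here is a minimal
standalone C model of the lookup that the sequence above performs:
extract the tlb index, mask the address down to the comparator bits,
and add the per-entry addend to form the host address.  The constants,
the two-field entry layout, the example values, and the
tlb_entry/tlb_lookup names are simplified stand-ins for illustration,
not QEMU's real CPUTLBEntry or anything from the patch:

    /* Minimal sketch, not QEMU code: constants and layout are
       simplified stand-ins for CPUTLBEntry. */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define TARGET_PAGE_BITS 12
    #define TARGET_PAGE_MASK (~(((uint64_t)1 << TARGET_PAGE_BITS) - 1))
    #define CPU_TLB_BITS     8
    #define CPU_TLB_SIZE     (1 << CPU_TLB_BITS)

    typedef struct {
        uint64_t comparator;  /* page-aligned guest address of the entry */
        uint64_t addend;      /* host = guest + addend on a hit */
    } tlb_entry;

    /* Return the host address on a hit, 0 to signal the slow path. */
    static uint64_t tlb_lookup(const tlb_entry *tlb, uint64_t addr, int s_bits)
    {
        /* ubfm w0, addr_reg, TARGET_PAGE_BITS, CPU_TLB_BITS */
        uint64_t index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);

        /* and w/x2, addr_reg, TARGET_PAGE_MASK | ((1 << s_bits) - 1):
           keep the page number plus the low alignment bits, so a
           misaligned access fails the compare and goes slow-path. */
        uint64_t cmp = addr & (TARGET_PAGE_MASK | (((uint64_t)1 << s_bits) - 1));

        if (cmp != tlb[index].comparator) {
            return 0;                        /* bne miss_label */
        }
        return addr + tlb[index].addend;     /* add x3, x3, addr_reg */
    }

    int main(void)
    {
        static tlb_entry tlb[CPU_TLB_SIZE];
        uint64_t guest = 0x40002008;
        uint64_t index = (guest >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);

        tlb[index].comparator = guest & TARGET_PAGE_MASK;
        tlb[index].addend = 0x7f0000000000ull;

        /* aligned 4-byte access: hit */
        printf("hit:  %#" PRIx64 "\n", tlb_lookup(tlb, guest, 2));
        /* misaligned 4-byte access: compare fails, take the slow path */
        printf("miss: %#" PRIx64 "\n", tlb_lookup(tlb, guest + 1, 2));
        return 0;
    }

The reason the separate alignment test disappears is the or of
((1 << s_bits) - 1) into the mask: a misaligned access leaves low bits
set, fails the compare against the page-aligned comparator, and takes
the same miss path as a tlb miss, so one bne covers both cases.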