Hi Lluís

We can probably generate vector instructions using TCG, e.g. add support
to TCG to emit vector instructions directly into the code cache.
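
To make the idea concrete, here is a rough, purely illustrative sketch of
the kind of parallel tag compare such emitted code would boil down to,
written with SSE2 intrinsics. This is not QEMU code: VictimTLBSet and
victim_tlb_lookup are made-up names, and 32-bit page tags are assumed for
simplicity.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdint.h>

    /* Hypothetical 4-way victim TLB set, 32-bit page tags for simplicity. */
    typedef struct {
        uint32_t  tag[4];     /* page-aligned guest virtual addresses */
        uintptr_t addend[4];  /* host - guest address offsets */
    } VictimTLBSet;

    /* Compare all four tags at once; return the hitting way, or -1 on miss. */
    static inline int victim_tlb_lookup(const VictimTLBSet *set,
                                        uint32_t page_tag)
    {
        __m128i tags = _mm_loadu_si128((const __m128i *)set->tag);
        __m128i hit  = _mm_cmpeq_epi32(tags, _mm_set1_epi32(page_tag));
        int mask = _mm_movemask_ps(_mm_castsi128_ps(hit)); /* one bit per way */
        return mask ? __builtin_ctz(mask) : -1;
    }

In generated code the same sequence is only a handful of instructions
(movdqu, pcmpeqd, movmskps, bsf), which is what emitting it directly from
the TCG backend would buy.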

Why would a larger TLB make some operations slower? The TLB is a
direct-mapped hash, and a lookup should be O(1) there. In cputlb,
CPU_TLB_SIZE is always used to index into the TLB, i.e. (X & (CPU_TLB_SIZE
 - 1)).
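
For reference, the indexing above amounts to something like this. It is a
simplified stand-in, not the exact cputlb.c code, with the constants as
they would be for a 4 KiB-page x86 target and the current 1<<8-entry TLB.

    #include <stdint.h>

    #define TARGET_PAGE_BITS 12                   /* 4 KiB guest pages */
    #define CPU_TLB_BITS     8
    #define CPU_TLB_SIZE     (1 << CPU_TLB_BITS)  /* 256 entries */

    /* Direct-mapped index of a guest virtual address: one shift, one mask. */
    static inline unsigned tlb_index(uint64_t vaddr)
    {
        return (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
    }

Growing CPU_TLB_SIZE does not change this O(1) lookup; it only widens the
mask.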

Thank you
Xin


On Wed, Nov 27, 2013 at 5:12 AM, Lluís Vilanova <vilan...@ac.upc.edu> wrote:

> Xin Tong writes:
>
> > I am trying to implement an out-of-line TLB lookup for QEMU
> > softmmu-x86-64 on an x86-64 machine, potentially for better instruction
> > cache performance. I have a few questions.
> >
> > 1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are
> > generated when tcg_out_tb_finalize is called. When a TLB lookup misses,
> > it jumps to the generated slow path, and the slow path refills the TLB,
> > does the load/store and jumps to the next emulated instruction. I am
> > wondering whether it is easy to outline the code for the slow path. I am
> > thinking that when the TLB misses, the outlined TLB lookup code should
> > generate a call out to qemu_ld/st_helpers[opc & ~MO_SIGN] and rewalk the
> > TLB after it is refilled. This code is off the critical path, so it is
> > not as important as the code when the TLB hits.
> > 2. Why not use a TLB of bigger size? Currently the TLB has 1<<8 entries.
> > The TLB lookup is 10 x86 instructions, but every miss needs ~450
> > instructions (I measured this using Intel PIN), so even if the miss rate
> > is low (say 3%) the overall time spent in cpu_x86_handle_mmu_fault is
> > still significant. I am thinking the TLB may need to be organized in a
> > set-associative fashion to reduce conflict misses, e.g. 2-way set
> > associative to reduce the miss rate, or have a victim TLB that is 4-way
> > associative and use x86 SIMD instructions to do the lookup once the
> > direct-mapped TLB misses. Has anybody done any work on this front?
> > 3. What are some of the drawbacks of using a super-large TLB, i.e. a TLB
> > with 4K entries?
>
> Using vector intrinsics for the TLB lookup will probably make the code
> less portable. I don't know how compatible the GCC and LLVM vectorizing
> intrinsics are with each other (since there have been some efforts to make
> QEMU also compile with LLVM).
>
> A larger TLB will make some operations slower (e.g., look for CPU_TLB_SIZE
> in cputlb.c), but the higher hit ratio could pay off, although I don't
> know how the current size was chosen.
>
>
> Lluis
>
> --
>  "And it's much the same thing with knowledge, for whenever you learn
>  something new, the whole world becomes that much richer."
>  -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
>  Tollbooth
>
>
