Hi,

I have found that adding a small (8-entry) fully associative victim TLB (http://en.wikipedia.org/wiki/Victim_Cache) in front of the refill path (the page table walk) significantly improves the performance of QEMU x86_64 system emulation mode on the SPECint2006 benchmarks. This is primarily because the primary TLB is direct-mapped and suffers from conflict misses. I have this implemented on QEMU trunk and would like to contribute it back to QEMU. Where should I start?
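Here is a minimal sketch of the idea (illustrative only: VictimTLBEntry, vtlb_lookup and vtlb_insert are made-up names for this mail, not the actual patch):

    /* Sketch: on a miss in the direct-mapped primary TLB, probe a small
     * fully associative victim array before falling back to the page
     * table walk.  Entries evicted from the primary TLB are stashed here
     * instead of being discarded, which is what absorbs conflict misses. */
    #include <stdbool.h>
    #include <stdint.h>

    #define VTLB_ENTRIES 8

    typedef struct {
        uint64_t vaddr;   /* guest virtual page address (page-aligned) */
        uint64_t paddr;   /* translated physical page address          */
        bool     valid;
    } VictimTLBEntry;

    typedef struct {
        VictimTLBEntry entries[VTLB_ENTRIES];
        unsigned next;    /* round-robin replacement cursor            */
    } VictimTLB;

    /* Probe all 8 entries; the real scheme would also swap a hit entry
     * with the entry just evicted from the primary TLB. */
    static bool vtlb_lookup(VictimTLB *v, uint64_t vaddr, uint64_t *paddr)
    {
        for (unsigned i = 0; i < VTLB_ENTRIES; i++) {
            if (v->entries[i].valid && v->entries[i].vaddr == vaddr) {
                *paddr = v->entries[i].paddr;
                return true;
            }
        }
        return false;
    }

    /* Called on primary-TLB eviction instead of dropping the mapping. */
    static void vtlb_insert(VictimTLB *v, uint64_t vaddr, uint64_t paddr)
    {
        v->entries[v->next] = (VictimTLBEntry){ vaddr, paddr, true };
        v->next = (v->next + 1) % VTLB_ENTRIES;
    }
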
Xin

On Tue, Dec 17, 2013 at 8:22 PM, Xin Tong <trent.t...@gmail.com> wrote:
> Why is the QEMU TLB organized by mode? E.g., on x86 there are 3
> modes. What I think is that there may be conflicts between virtual
> addresses and physical addresses, and organizing it by mode guarantees
> that QEMU does not hit a physical-address translation entry while in
> user mode, and vice versa?
>
> Thank you,
> Xin
>
> On Tue, Dec 17, 2013 at 10:52 PM, Xin Tong <trent.t...@gmail.com> wrote:
>> On Sun, Dec 8, 2013 at 2:54 AM, Xin Tong <trent.t...@gmail.com> wrote:
>>> On Thu, Nov 28, 2013 at 8:12 AM, Lluís Vilanova <vilan...@ac.upc.edu> wrote:
>>>> Xin Tong writes:
>>>>
>>>> > Hi Lluís,
>>>> > We can probably generate vector intrinsics using TCG, e.g. add
>>>> > support to TCG to emit vector instructions directly into the code
>>>> > cache.
>>>>
>>>> There was some discussion long ago about adding vector instructions
>>>> to TCG, but I don't remember what the conclusion was.
>>>>
>>>> Also remember that using vector instructions will "emulate" a
>>>> low-associativity TLB; I don't know how much better than a 1-way TLB
>>>> that will be, though.
>>>>
>>>> > Why would a larger TLB make some operations slower? The TLB is a
>>>> > direct-mapped hash and lookup should be O(1) there. In cputlb,
>>>> > CPU_TLB_SIZE is always used to index into the TLB, i.e.
>>>> > (X & (CPU_TLB_SIZE - 1)).
>>>>
>>>> It would make TLB invalidations slower (e.g., see 'tlb_flush' in
>>>> "cputlb.c"). And right now QEMU performs full TLB invalidations more
>>>> frequently than the equivalent HW needs to, although I suppose that
>>>> should be quantified too.
>>
>> I see QEMU executing ~1M instructions per context switch for
>> qemu-system-x86_64. Is this because the periodic timer interrupt is
>> delivered in real time while QEMU is significantly slower than real
>> hw?
>>
>> Xin
>>
>>> You are right, Lluís. QEMU does context switch quite a bit more often
>>> than real hw; this is probably primarily because QEMU is orders of
>>> magnitude slower than real hw. I am wondering where the timer is
>>> emulated in qemu-system-x86_64. I imagine the guest OS must program
>>> the timers to generate interrupts for context switches.
>>>
>>> Another question: what happens when a vcpu is stuck in an infinite
>>> loop? QEMU must need a timer interrupt somewhere as well?
>>>
>>> Is my understanding correct?
>>>
>>> Xin
>>>>
>>>> Lluís
>>>>
>>>> --
>>>> "And it's much the same thing with knowledge, for whenever you learn
>>>> something new, the whole world becomes that much richer."
>>>>   -- The Princess of Pure Reason, as told by Norton Juster in The
>>>>      Phantom Tollbooth
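(For reference, a simplified stand-in for the per-mode, direct-mapped TLB discussed above; the constants echo cputlb.c, but the structures and tlb_flush_all below are illustrative, not QEMU's actual code:)

    #include <stdint.h>
    #include <string.h>

    #define NB_MMU_MODES 3          /* e.g. the 3 modes on x86 */
    #define CPU_TLB_BITS 8
    #define CPU_TLB_SIZE (1 << CPU_TLB_BITS)
    #define TARGET_PAGE_BITS 12

    typedef struct { uint64_t addr; uintptr_t addend; } TLBEntry;

    static TLBEntry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];

    /* O(1) lookup: one mask picks the slot.  Two hot pages whose index
     * bits collide keep evicting each other -- these are the conflict
     * misses a victim TLB is meant to absorb. */
    static TLBEntry *tlb_entry(int mmu_idx, uint64_t vaddr)
    {
        unsigned idx = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
        return &tlb_table[mmu_idx][idx];
    }

    /* A full flush touches NB_MMU_MODES * CPU_TLB_SIZE entries, so
     * growing CPU_TLB_SIZE makes every full invalidation proportionally
     * slower.  Setting the bytes to -1 makes every comparison miss. */
    static void tlb_flush_all(void)
    {
        memset(tlb_table, -1, sizeof(tlb_table));
    }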