Hi Richard,

Thanks for the feedback. Please find some comments inline.
On Mon, Mar 27, 2017 at 6:57 AM, Richard Henderson <r...@twiddle.net> wrote:
>
> 128MB is really quite large. I doubt doubling the cache size will really
> help that much. That said, it's really quite trivial to make this change,
> if you'd like to experiment.
>
> FWIW, I rarely see TB flushes for alpha -- not one during an entire gcc
> bootstrap. Now, this is usually with 4GB ram, which by default implies
> 512MB translation cache. But it does mean that, given an ideal guest, TB
> flushes should not dominate anything at all.
>
> If you're seeing multiple flushes during the startup of a browser, your
> guest must be flushing for other reasons than the code_gen_buffer being
> full.
>

This is indeed the case. From commit a9353fe897ca onwards, we flush the
entire TB cache instead of invalidating a single TB in
breakpoint_invalidate(). Now that MTTCG has added proper tb/mmap locking,
we can revert that commit. I will do so once the merge window opens.

>
>> * Implement an LRU translation block code cache.
>
> The major problem you'll encounter is how to manage allocation in this case.
>
> The current mechanism means that it is trivial to not know how much code is
> going to be generated for a given set of TCG opcodes. When we reach the
> high-water mark, we've run out of room. We then flush everything and start
> over at the beginning of the buffer.
>
> If you manage the cache with an allocator, you'll need to know in advance
> how much code is going to be generated. This is going to require that you
> either (1) severely over-estimate the space required (qemu_ld generates lots
> more code than just add), (2) severely increase the time required, by
> generating code twice, or (3) somewhat increase the time required, by
> generating position-independent code into an external buffer and copying it
> into place after determining the size.
>

(3) seems to be the only feasible option, but I am not sure how easy it is
to generate position-independent code. Do you think it can be done as a
GSoC project?

>
>> * Avoid consistency overhead for strong memory model guests by generating
>> load-acquire and store-release instructions.
>
> This is probably required for good performance of the user-only code path,
> but considering the number of other insns required for the system tlb
> lookup, I'm surprised that the memory barrier matters.
>

I know that having some experimental data would help show the benefit
accurately, but my observation from generating a store-release instruction
instead of a store+fence sequence is that it makes the system noticeably
more usable. I will try to collect this data for a Linux x86 guest.
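Just to make the comparison concrete, here is a minimal C11 sketch of the
two patterns (this is not QEMU/TCG code, and the function names are made
up); on an aarch64 host the first typically compiles to str + dmb, the
second to a single stlr:

  /* Minimal sketch, not QEMU code: emulating a strong-model guest store on
   * an aarch64 host.  The two variants are not exactly equivalent in the
   * C11 model; the point is only the difference in generated insns. */
  #include <stdatomic.h>

  void store_with_fence(_Atomic int *p, int v)
  {
      /* today: plain store followed by a separate fence (str + dmb) */
      atomic_store_explicit(p, v, memory_order_relaxed);
      atomic_thread_fence(memory_order_seq_cst);
  }

  void store_release(_Atomic int *p, int v)
  {
      /* proposed: a single store-release instruction (stlr) */
      atomic_store_explicit(p, v, memory_order_release);
  }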
>
> I think it would be interesting to place TranslationBlock structures into
> the same memory block as code_gen_buffer, immediately before the code that
> implements the TB.
>
> Consider what happens within every TB:
>
> (1) We have one or more references to the TB address, via exit_tb.
>
> For aarch64, this will normally require 2-4 insns.
>
> # alpha-softmmu
> 0x7f75152114: d0ffb320 adrp x0, #-0x99a000 (addr 0x7f747b8000)
> 0x7f75152118: 91004c00 add x0, x0, #0x13 (19)
> 0x7f7515211c: 17ffffc3 b #-0xf4 (addr 0x7f75152028)
>
> # alpha-linux-user
> 0x00569500: d2800260 mov x0, #0x13
> 0x00569504: f2b59820 movk x0, #0xacc1, lsl #16
> 0x00569508: f2c00fe0 movk x0, #0x7f, lsl #32
> 0x0056950c: 17ffffdf b #-0x84 (addr 0x569488)
>
> We would reduce this to one insn, always, if the TB were close by, since the
> ADR instruction has a range of 1MB.
>
> (2) We have zero to two references to a linked TB, via goto_tb.
>
> Your stated goal above for eliminating the code_gen_buffer maximum of 128MB
> can be done in two ways.
>
> (2A) Raise the maximum to 2GB. For this we would align an instruction pair,
> adrp+add, to compute the address; the following insn would branch. The
> update code would write a new destination by modifying the adrp+add with a
> single 64-bit store.
>
> (2B) Eliminate the maximum altogether by referencing the destination
> directly in the TB. This is the !USE_DIRECT_JUMP path. It is normally not
> used on 64-bit targets because computing the full 64-bit address of the TB
> is harder, or just as hard, as computing the full 64-bit address of the
> destination.
>
> However, if the TB is nearby, aarch64 can load the address from
> TB.jmp_target_addr in one insn, with LDR (literal). This pc-relative load
> also has a 1MB range.
>
> This has the side benefit that it is much quicker to re-link TBs, both in
> the computation of the code for the destination as well as re-flushing the
> icache.

This (2B) is the idea I had in mind, ideally as a combination of both
approaches above: if the destination falls outside the 1MB range, we take
the penalty and generate the full 64-bit address.

>
> In addition, I strongly suspect the 1,342,177 entries (153MB) that we
> currently allocate for tcg_ctx.tb_ctx.tbs, given a 512MB code_gen_buffer, is
> excessive.
>
> If we co-allocate the TB and the code, then we get exactly the right number
> of TBs allocated with no further effort.
>
> There will be some additional memory wastage, since we'll want to keep the
> code and the data in different cache lines and that means padding, but I
> don't think that'll be significant. Indeed, given the above over-allocation
> will probably still be a net savings.
>

If you think the project makes sense, I will add it to the GSoC wiki so
that others can also apply for it. Please let me know if you are interested
in mentoring it along with Alex.

Thanks,
--
Pranith
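P.S. To make sure I understood the co-allocation idea, here is a rough
sketch of the layout I am picturing (made-up structure and names, not the
real TranslationBlock, and assuming 64-byte host cache lines):

  /* Rough sketch only: each block header lives inside code_gen_buffer,
   * padded to a cache line, with its host code placed immediately after. */
  #include <stdint.h>

  #define CODE_GEN_ALIGN 64u      /* assumed host cache line size */

  struct tb_hdr {                 /* hypothetical trimmed-down TB header */
      uint64_t guest_pc;          /* guest PC this block translates */
      uint32_t code_size;         /* bytes of host code that follow */
      uint32_t flags;
  };

  /* Host code starts at the first cache line boundary after the header. */
  static inline void *tb_code(struct tb_hdr *tb)
  {
      uintptr_t p = (uintptr_t)(tb + 1);
      return (void *)((p + CODE_GEN_ALIGN - 1) &
                      ~(uintptr_t)(CODE_GEN_ALIGN - 1));
  }

  /* The next header starts at the next cache line after this block's code,
   * so walking the buffer enumerates every TB with no separate tbs[] array. */
  static inline struct tb_hdr *tb_next(struct tb_hdr *tb)
  {
      uintptr_t end = (uintptr_t)tb_code(tb) + tb->code_size;
      return (struct tb_hdr *)((end + CODE_GEN_ALIGN - 1) &
                               ~(uintptr_t)(CODE_GEN_ALIGN - 1));
  }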