BALATON Zoltan <bala...@eik.bme.hu> writes:

> On Mon, 28 Apr 2025, Richard Henderson wrote:
>> On 4/28/25 06:26, BALATON Zoltan wrote:
>>> I have tried profiling the dst in real card vfio vram with dcbz
>>> case (with 100 iterations instead of 10000 in above tests) but I'm
>>> not sure I understand the results. vperm and dcbz show up but not
>>> too high. Can somebody explain what is happening here and where the
>>> overhead likely comes from? Here is the profile result I got:
>>> Samples: 104K of event 'cycles:Pu', Event count (approx.): 122371086557
>>>    Children      Self  Command          Shared Object            Symbol
>>> -   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc          [.] cpu_exec_loop
>>>     - 98.49% cpu_exec_loop
>>>        - 98.48% cpu_tb_exec
>>>           - 90.95% 0x7f4e705d8f15
>>>                helper_ldub_mmu
>>>                do_ld_mmio_beN
>>>              - cpu_io_recompile
>>>                 - 45.79% cpu_loop_exit_noexc
>>
>> I think the real problem is the number of loop exits due to i/o.  If
>> I'm reading this rightly, 45% of execution is in cpu_io_recompile.
>>
>> I/O can only happen as the last insn of a translation block.
>
> I'm not sure I understand this. A comment above cpu_io_recompile says
> "In deterministic execution mode, instructions doing device I/Os must
> be at the end of the TB." Is that wrong? Otherwise shouldn't this only
> apply if running with icount or something like that?

That comment should be fixed. It used to be true only for icount mode,
but another race bug meant we needed to honour device access as the
last insn in both modes.

>
>> When we detect that it has happened in the middle of a translation
>> block, we abort the block, compile a new one, and restart execution.
>
> Where does that happen? The calls of cpu_io_recompile in this case
> seem to come from io_prepare which is called from do_ld16_mmio_beN if
> (!cpu->neg.can_do_io) but I don't see how can_do_io is set.

It is set inline by set_can_do_io().

>> Where this becomes a bottleneck is when this same translation block
>> is in a loop.  Exactly this case of memset/memcpy of VRAM.  This
>> could be addressed by invalidating the previous translation block
>> and creating a new one which always ends with the i/o.
>
> And where to do that? cpu_io_recompile just exits the TB but what
> generates the new TB? I need some more clues to understand how to do
> this.

  cpu->cflags_next_tb = curr_cflags(cpu) | CF_MEMI_ONLY | CF_NOIRQ | n;

sets the cflags for the next TB; the lookup for that TB will typically
fail to find an existing one and so regenerate it with those flags.
Normally cflags_next_tb is empty.

>
> Regards,
> BALATON Zoltan

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro
