BALATON Zoltan <bala...@eik.bme.hu> writes:

> On Tue, 29 Apr 2025, Alex Bennée wrote:
>> BALATON Zoltan <bala...@eik.bme.hu> writes:
>>> On Mon, 28 Apr 2025, Richard Henderson wrote:
>>>> On 4/28/25 06:26, BALATON Zoltan wrote:
>>>>> I have tried profiling the dst-in-real-card-VFIO-VRAM-with-dcbz case
>>>>> (with 100 iterations instead of 10000 as in the tests above) but I'm
>>>>> not sure I understand the results. vperm and dcbz show up but not
>>>>> too high. Can somebody explain what is happening here and where the
>>>>> overhead likely comes from? Here is the profile result I got:
>>>>> Samples: 104K of event 'cycles:Pu', Event count (approx.): 122371086557
>>>>>    Children      Self  Command          Shared Object            Symbol
>>>>> -   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc          [.] cpu_exec_loop
>>>>>     - 98.49% cpu_exec_loop
>>>>>        - 98.48% cpu_tb_exec
>>>>>           - 90.95% 0x7f4e705d8f15
>>>>>                helper_ldub_mmu
>>>>>                do_ld_mmio_beN
>>>>>              - cpu_io_recompile
>>>>>                 - 45.79% cpu_loop_exit_noexc
>>>>
>>>> I think the real problem is the number of loop exits due to i/o.  If
>>>> I'm reading this rightly, 45% of execution is in cpu_io_recompile.
>>>>
>>>> I/O can only happen as the last insn of a translation block.
>>>
>>> I'm not sure I understand this. A comment above cpu_io_recompile says
>>> "In deterministic execution mode, instructions doing device I/Os must
>>> be at the end of the TB." Is that wrong? Otherwise shouldn't this only
>>> apply if running with icount or something like that?
>>
That comment should be fixed. It used to be true only for icount mode,
but a later race bug means we now need to honour device access as the
last insn in both modes.
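
Concretely the MMIO slow path checks the flag before touching the
device, roughly like this (simplified from io_prepare() in
accel/tcg/cputlb.c, exact shape varies by tree):

    if (!cpu->neg.can_do_io) {
        /* The access is not the last insn of the current TB: abort
         * it, arrange for a retranslation that ends at this insn,
         * and restart.  Does not return.  */
        cpu_io_recompile(cpu, retaddr);
    }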
>>
>>>
>>>> When we detect that it has happened in the middle of a translation
>>>> block, we abort the block, compile a new one, and restart execution.
>>>
>>> Where does that happen? The calls to cpu_io_recompile in this case
>>> seem to come from io_prepare, which is called from do_ld16_mmio_beN
>>> if (!cpu->neg.can_do_io), but I don't see where can_do_io is set.
>>
>> Inline by set_can_do_io()
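
To expand on that: set_can_do_io() emits a store into the generated
code itself, so the flag flips as the TB runs. From memory it looks
roughly like this in accel/tcg/translator.c (simplified, check your
tree):

    static void set_can_do_io(DisasContextBase *db, bool val)
    {
        /* Store 'val' into cpu->neg.can_do_io from generated code,
         * so it is false mid-TB and true around the final insn. */
        tcg_gen_st8_i32(tcg_constant_i32(val), tcg_env,
                        offsetof(ArchCPU, parent_obj.neg.can_do_io) -
                        offsetof(ArchCPU, env));
    }

The translator arranges for it to be true only around the last insn
of the TB, which is what the MMIO slow-path check relies on.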
>
> That one I've found, but I don't know where the cpu_loop_exit at the
> end of cpu_io_recompile returns to.

cpu_loop_exit longjmp's back to the top of the execution loop.
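
Roughly (simplified from cpu_loop_exit() and the matching sigsetjmp()
set up before the execution loop; names from a current tree):

    void cpu_loop_exit(CPUState *cpu)
    {
        /* Undo the setting in cpu_tb_exec.  */
        cpu->neg.can_do_io = true;
        siglongjmp(cpu->jmp_env, 1);   /* does not return */
    }

    /* ... and in cpu_exec(), before entering the loop: */
    if (sigsetjmp(cpu->jmp_env, 0) != 0) {
        /* every cpu_loop_exit() lands here; we then fall back
         * into the loop and look up the next TB */
    }

So nothing returns in the C sense; control reappears at the sigsetjmp
and the loop carries on from there.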

>
>>>> Where this becomes a bottleneck is when this same translation block
>>>> is in a loop.  Exactly this case of memset/memcpy of VRAM.  This
>>>> could be addressed by invalidating the previous translation block
>>>> and creating a new one which always ends with the i/o.
>>>
>>> And where to do that? cpu_io_recompile just exits the TB but what
>>> generates the new TB? I need some more clues to understand how to do
>>> this.
>>
>>  cpu->cflags_next_tb = curr_cflags(cpu) | CF_MEMI_ONLY | CF_NOIRQ | n;
>>
>> sets the cflags for the next TB, which will typically miss in the TB
>> lookup and be regenerated. Normally cflags_next_tb is empty.
>
> Shouldn't this only regenerate the next TB on the first loop iteration
> and not afterwards?

If we've been here before (needing n insns from the same base address)
we will have a cached translation we can re-use. It doesn't stop the
longer TB being executed again as we re-enter the loop though; see the
sketch below.
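
Roughly what the top of cpu_exec_loop() does each iteration
(simplified from accel/tcg/cpu-exec.c; note cflags are part of the TB
lookup key):

    cflags = cpu->cflags_next_tb;
    if (cflags == -1) {
        cflags = curr_cflags(cpu);   /* the normal case */
    } else {
        cpu->cflags_next_tb = -1;    /* one-shot: only this lookup */
    }

    tb = tb_lookup(cpu, pc, cs_base, flags, cflags);
    if (tb == NULL) {
        tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
    }

So after the first trip round the loop the shortened I/O TB is found
in the cache rather than re-translated, but each iteration still
enters the long TB first, hits the MMIO access mid-block and has to
longjmp out, which is where the cpu_io_recompile time in your profile
comes from.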

>
> Regards,
> BALATON Zoltan

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro
