BALATON Zoltan <bala...@eik.bme.hu> writes:

> On Wed, 30 Apr 2025, Nicholas Piggin wrote:
>> On Wed Apr 30, 2025 at 7:09 AM AEST, BALATON Zoltan wrote:
>>> On Tue, 29 Apr 2025, Alex Bennée wrote:
>>>> BALATON Zoltan <bala...@eik.bme.hu> writes:
>>>>> On Tue, 29 Apr 2025, Alex Bennée wrote:
>>>>>> BALATON Zoltan <bala...@eik.bme.hu> writes:
>>>>>>> On Mon, 28 Apr 2025, Richard Henderson wrote:
>>>>>>>> On 4/28/25 06:26, BALATON Zoltan wrote:
<snip>
>>>>
>>>> if we've been here before (needing n insn from the base addr) we will
>>>> have a cached translation we can re-use. It doesn't stop the longer TB
>>>> being called again as we re-enter a loop.
>>>
>>> So then maybe it should at least check if there's already a cached TB
>>> where it can continue before calling cpu_io_recompile in io_prepare and
>>> only recompile if needed?
>>
>> It basically does do that AFAIKS. cpu_io_recompile() is a misleading
>> name: it does not cause a recompile, it just updates cflags and exits.
>> The next entry will look up a TB that has just 1 insn and enter that.
>
> After reading it I came to the same conclusion, but then I don't
> understand what causes the problem. Is it just that it will exit the
> loop for every IO to look up the recompiled TB? It looks like it tries
> to chain TBs, so why does that not work here?

Any MMIO access has to come via the slow path. Any MMIO also currently
has to be the last instruction in a block in case the operation triggers
a change in the translation regime that needs to be picked up by the
next instruction you execute.

This is a pathological case when modelling VRAM on a device because it's
going to be slow either way. At least if you model the multi-byte access
with a helper you can amortise some of the cost of the MMU lookup with a
single probe_() call.
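
Something like this, as a rough untested sketch (helper_fill_block and its
arguments are invented, the DEF_HELPER_* declaration and the call from the
translator are omitted, and header/signature details differ between QEMU
versions):

#include "qemu/osdep.h"
#include "cpu.h"
#include "exec/exec-all.h"
#include "exec/cpu_ldst.h"

/* Fill len bytes at addr with value. The range must not cross a page
 * boundary (a dcbz cache line never does). mmu_idx is whatever index
 * the translator passes in, which is target-specific. */
void helper_fill_block(CPUArchState *env, target_ulong addr,
                       uint32_t len, uint32_t value, uint32_t mmu_idx)
{
    /* One softmmu lookup for the whole range instead of one per store. */
    void *host = probe_access(env, addr, len, MMU_DATA_STORE,
                              mmu_idx, GETPC());

    if (host) {
        /* Backed by host memory: fill it directly. */
        memset(host, value, len);
    } else {
        /* True MMIO (or watchpoints): fall back to the per-byte slow path. */
        for (uint32_t i = 0; i < len; i++) {
            cpu_stb_mmuidx_ra(env, addr + i, value, mmu_idx, GETPC());
        }
    }
}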

>>> I was thinking maybe we need a flag or counter
>>> to see if cpu_io_recompile is called more than once and, after a limit,
>>> invalidate the TB and create two new ones: the first ending at the I/O,
>>> and then what cpu_io_recompile does now. As I understood it, that was
>>> what Richard suggested, but I don't know how to do that.
>>
>> memset/cpy routines had kind of the same problem with real hardware.
>> They wanted to use vector instructions for best performance, but when
>> those were used on MMIO they would trap and be very slow.
>
> Why do those trap on MMIO on a real machine? These routines were tested
> on real machines, and the reasoning for using the widest possible access
> was that a PCI transfer has overhead which is minimised by
> transferring more bits in one op. I think they also verified that it
> works at least for the 32-bit CPUs up to G4 that were used on real
> AmigaNG machines. There are some benchmark results here:
> https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS?start=60 which
> is also where the benchmark I used comes from, so this should be
> similar. I think the MemCopy on that page has the plain unoptimised copy
> as Copy to/from VRAM and optimised routines similar to this benchmark
> as Read/Write Pixel Array, but it's not easy to search. Some of the
> machines like the Pegasos II and AmigaOne XE were made with both G3 and
> G4 CPUs, so if I find a result from those with the same graphics card it
> could show if AltiVec is faster (although the G4s were also clocked
> higher, so not directly comparable). Some results there are also from
> QEMU, mostly those with the SiliconMotion 502, but that does not
> have this problem; only vfio-pci pass-through does.

They don't - what we need is a RAM-like-device model for QEMU
where we can relax the translation rules, because we know we are writing
to RAM-like things that don't have registers or other state-changing
behaviour.

The poor behaviour is because QEMU currently treats all MMIO as
potentially system-state-altering, whereas for VRAM it doesn't need to.
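
For purely emulated display devices the usual way to avoid the trapping is
to back the VRAM BAR with a plain RAM region and keep only the register BAR
as MMIO, so guest loads and stores to VRAM go through the softmmu TLB and
never hit the slow path. That doesn't cover the vfio-pci case, but it shows
the difference. A rough sketch of the realize hook of a made-up device
(MyDevState, mydev_reg_* and the sizes are all invented; only the memory
and PCI API calls are real):

#include "qemu/osdep.h"
#include "qemu/units.h"
#include "hw/pci/pci.h"

typedef struct MyDevState {
    PCIDevice parent_obj;
    MemoryRegion vram;
    MemoryRegion regs;
} MyDevState;

static uint64_t mydev_reg_read(void *opaque, hwaddr addr, unsigned size)
{
    return 0;   /* register model elided */
}

static void mydev_reg_write(void *opaque, hwaddr addr, uint64_t val,
                            unsigned size)
{
    /* register model elided */
}

static const MemoryRegionOps mydev_reg_ops = {
    .read = mydev_reg_read,
    .write = mydev_reg_write,
    .endianness = DEVICE_LITTLE_ENDIAN,
};

static void mydev_realize(PCIDevice *dev, Error **errp)
{
    MyDevState *s = (MyDevState *)dev;

    /* VRAM as plain RAM: no read/write callbacks, so accesses are not
     * treated as MMIO and do not have to end the translation block. */
    memory_region_init_ram(&s->vram, OBJECT(dev), "mydev.vram",
                           16 * MiB, errp);
    pci_register_bar(dev, 0, PCI_BASE_ADDRESS_MEM_PREFETCH, &s->vram);

    /* Registers keep explicit MMIO ops, so side effects stay correct. */
    memory_region_init_io(&s->regs, OBJECT(dev), &mydev_reg_ops, s,
                          "mydev.regs", 4 * KiB);
    pci_register_bar(dev, 1, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->regs);
}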

> So maybe it's something
> with how vfio-pci maps PCI memory BARs?

I don't know about vfio-pci but blob resources mapped via virtio-gpu
just appear as chunks of RAM to the guest - hence no trapping.

>
>> Problem is we don't know ahead of time if some routine will access
>> MMIO or not. You could recompile it with fewer instructions but then
>> it will be slow when used for regular memory.
>>
>> Heuristics are tough because you could have e.g. one initial big
>> memset that clears an MMIO region by iterating many times over an
>> inner loop of dcbz instructions, but is then never used again for MMIO
>> while remaining important for regular page clearing. You could perhaps
>> make something that dynamically decays or periodically recompiles to
>> the non-IO case, but then the complexity goes up.

We can't have heuristics when we must prioritise correctness. However, we
could expand the device model to make the exact behaviour of different
devices clear and optimise when we know it is safe.

>> I would prefer not to do that just for a microbenchmark, but if
>> you think it is a reasonable overall win for the average workloads of
>> your users then perhaps.
>
> I'm still trying to understand what to optimise. So far it looks like
> dcbz has the least impact, then vperm is a bit bigger but still only
> about a few percent, and the biggest impact is still not known for sure.
> We see faster access on real machines that run on slower PCIe
> (only 4x at best), while CPU benchmarks don't show slower performance
> on QEMU; only accessing the passed-through card's VRAM is slower than
> expected. But if there's a trap involved, I've found before that
> exceptions are slower with QEMU, although I did not see evidence of
> that in the profile.
>
> Regards,
> BALATON Zoltan

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro
