On Thu, 15 Feb 2024, Richard Henderson wrote:

> On 2/14/24 22:57, Alexander Monakov wrote:
> > 
> > On Wed, 14 Feb 2024, Richard Henderson wrote:
> > 
> >> v3: https://patchew.org/QEMU/20240206204809.9859-1-amona...@ispras.ru/
> >>
> >> Changes for v4:
> >>    - Keep separate >= 256 entry point, but only keep constant length
> >>      check inline.  This allows the indirect function call to be hidden
> >>      and optimized away when the pointer is constant.
> > 
> > Sorry, I don't understand this. Most of the improvement (at least in our
> > testing) comes from inlining the byte checks, which often fail and eliminate
> > call overhead entirely. Moving them out-of-line seems to lose most of the
> > speedup the patchset was bringing, doesn't it? Is there some concern I am
> > not seeing?
> 
> What is your benchmarking method?

Converting a 4.4 GiB Windows 10 image to qcow2. It was mentioned in v1 and v2;
are you saying they did not reach your inbox?
https://lore.kernel.org/qemu-devel/20231013155856.21475-1-mmroma...@ispras.ru/
https://lore.kernel.org/qemu-devel/20231027143704.7060-1-mmroma...@ispras.ru/

> It was my guess that most of the improvement came from performing those early
> byte checks *at all*, and that the overhead of a function call to a small out
> of line wrapper would be negligible.

qemu-img invokes buffer_is_zero in a fairly tight loop. Let us know if you
need numbers on how much the out-of-line version loses.

> By not exposing the function pointer outside the bufferiszero translation
> unit, the compiler can see when the pointer is never modified for a given
> host, and then transform the indirect branch to a direct branch.

Okay, but that does not make it necessary to move byte checks out of line.
I was preparing a rebase that does not expose the function pointer to the
inline wrapper. I was completely unaware that you're taking over the patchset.
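
A rough sketch of the arrangement described above, for illustration only:
the byte checks stay in an inline wrapper, and the wrapper calls an ordinary
out-of-line function, so no function pointer is visible to it. Every name
prefixed with sketch_ is hypothetical, the choice of first, last and middle
byte as the sampled positions is an assumption made for this sketch, and the
plain loop merely stands in for the accelerated code; none of this is the
code from the patchset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the out-of-line entry point.  In a real build
 * this would live in its own translation unit, and any function pointer
 * used to select an accelerated implementation would stay private to that
 * unit.  Here it is a plain byte loop so the sketch is self-contained.
 */
static bool sketch_buffer_is_zero_ool(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    assert(buf != NULL);
    for (size_t i = 0; i < len; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}

/*
 * Inline wrapper: sample a few bytes before making any call.  When one of
 * the sampled bytes is non-zero, the caller gets its answer without paying
 * for a function call at all.
 */
static inline bool sketch_buffer_is_zero(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    if (len == 0) {
        return true;
    }
    /* Assumed sample positions: first, last and middle byte. */
    if (p[0] || p[len - 1] || p[len / 2]) {
        return false;
    }
    /* Direct call; no function pointer is exposed to the wrapper. */
    return sketch_buffer_is_zero_ool(buf, len);
}
```

A caller in a tight loop would invoke sketch_buffer_is_zero(buf, len) on each
block; only blocks whose sampled bytes are all zero reach the out-of-line
function.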

Alexander
