On 2/15/24 11:36, Alexander Monakov wrote:
> On Thu, 15 Feb 2024, Richard Henderson wrote:
>> On 2/14/24 22:57, Alexander Monakov wrote:
>>> On Wed, 14 Feb 2024, Richard Henderson wrote:
>>>> v3: https://patchew.org/QEMU/20240206204809.9859-1-amona...@ispras.ru/
>>>>
>>>> Changes for v4:
>>>>   - Keep separate >= 256 entry point, but only keep constant length
>>>>     check inline. This allows the indirect function call to be hidden
>>>>     and optimized away when the pointer is constant.
>>>
>>> Sorry, I don't understand this. Most of the improvement (at least in our
>>> testing) comes from inlining the byte checks, which often fail and eliminate
>>> call overhead entirely. Moving them out-of-line seems to lose most of the
>>> speedup the patchset was bringing, doesn't it? Is there some concern I am
>>> not seeing?
>>
>> What is your benchmarking method?
>
> Converting a 4.4 GiB Windows 10 image to qcow2. It was mentioned in v1 and v2,
> are you saying they did not reach your inbox?
> https://lore.kernel.org/qemu-devel/20231013155856.21475-1-mmroma...@ispras.ru/
> https://lore.kernel.org/qemu-devel/20231027143704.7060-1-mmroma...@ispras.ru/

I'm saying that this is not a reproducible description of methodology.
With master, so with neither of our changes:
I tried converting an 80G win7 image that I happened to have lying about, I see
buffer_zero_avx2 with only 3.03% perf overhead. Then I tried truncating the image to 16G
to see if having the entire image in ram would help -- not yet, still only 3.4% perf
overhead. Finally, I truncated the image to 4G and saw 2.9% overhead.
So... help me out here. I would like to be able to see results that are at least vaguely
similar.
r~
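
For readers following the thread, the "inlined byte checks" under discussion can be illustrated with a minimal sketch. This is not the code from the patch series: the function names, the choice of three sample positions, and the call counter are all assumptions made for illustration. The idea is that a cheap inline test of a few bytes rejects most non-zero buffers before any function call is made, so the out-of-line scan only runs when the samples are all zero.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counter, present only to make the skipped calls observable. */
static unsigned long full_scan_calls;

/*
 * Hypothetical out-of-line full scan. It stands in for a vectorized
 * routine; a plain byte loop keeps the sketch self-contained.
 */
static bool is_zero_full_scan(const unsigned char *buf, size_t len)
{
    full_scan_calls++;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0) {
            return false;
        }
    }
    return true;
}

/*
 * Hypothetical inline wrapper: sample the first, middle and last bytes.
 * If any sample is non-zero, the answer is known without a call.
 */
static inline bool is_zero_sampled(const void *vbuf, size_t len)
{
    const unsigned char *buf = vbuf;

    if (len == 0) {
        return true;
    }
    if (buf[0] | buf[len / 2] | buf[len - 1]) {
        return false;           /* rejected inline, no call overhead */
    }
    return is_zero_full_scan(buf, len);
}
```

Whether the early rejection pays off depends on how often the sampled bytes are non-zero in the workload, which is why the benchmark input matters in the exchange above.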