Hi Jan,
On 05/06/2019 08:42, Jan Beulich wrote:
On 04.06.19 at 18:11, <julien.gr...@arm.com> wrote:
On 5/31/19 10:53 AM, Jan Beulich wrote:
According to Linux commit e75bef2a4f ("arm64: Select
ARCH_HAS_FAST_MULTIPLIER") this is a further improvement over the
variant using only bitwise operations on at least some hardware, and no
worse on other.
Suggested-by: Andrew Cooper <andrew.coop...@citrix.com>
Signed-off-by: Jan Beulich <jbeul...@suse.com>
---
RFC: To be honest I'm not fully convinced this is a win in particular in
the hweight32() case, as there's no actual shift insn which gets
replaced by the multiplication. Even for hweight64() the compiler
could emit better code and avoid the explicit shift by 32 (which it
emits at least for me).
I can see a multiplication instruction used in both hweight32() and
hweight64() with the compiler I am using.
That is for which exact implementation?
A simple call to hweight64().
What I was referring to as
"could emit better code" was the multiplication-free variant, where
the compiler fails to recognize (afaict) another opportunity to fold
a shift into an arithmetic instruction:
add x0, x0, x0, lsr #4
and x0, x0, #0xf0f0f0f0f0f0f0f
add x0, x0, x0, lsr #8
add x0, x0, x0, lsr #16
lsr x1, x0, #32
add w0, w1, w0
and w0, w0, #0xff
ret
Afaict the two marked insns (the lsr by #32 and the add following it) could be replaced by
add x0, x0, x0, lsr #32
I am not a compiler expert. Anyway this likely depends on the version of the
compiler you are using. They are becoming smarter and smarter.
With there only a sequence of add-s remaining, I'm having
difficulty seeing how the use of mul+lsr would actually help:
add x0, x0, x0, lsr #4
and x0, x0, #0xf0f0f0f0f0f0f0f
mov x1, #0x101010101010101
mul x0, x0, x1
lsr x0, x0, #56
ret
But of course I know nothing about throughput and latency
of such add-s with one of their operands shifted first. And
yes, the variant using mul is, comparing with the better optimized case,
still one insn smaller.
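
For reference, here is a rough C sketch of the two hweight64() variants being compared above. The names hweight64_shift() and hweight64_mul() are hypothetical, and the opening reduction steps are the usual bit-parallel popcount, assumed here rather than quoted from the patch, so treat this as an illustration and not the actual Xen or Linux code.

```c
#include <stdint.h>

/*
 * Hypothetical helper: shared first steps of the bit-parallel popcount.
 * On return, each byte of the result holds the number of set bits
 * (0..8) of the corresponding input byte.
 */
static uint64_t per_byte_counts(uint64_t w)
{
    w -= (w >> 1) & 0x5555555555555555ULL;
    w = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
    return (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
}

/* Variant using only shifts and adds to sum the eight byte counts. */
static unsigned int hweight64_shift(uint64_t w)
{
    w = per_byte_counts(w);
    w += w >> 8;
    w += w >> 16;
    w += w >> 32;          /* the step that could fold into one add */
    return (unsigned int)(w & 0xff);
}

/*
 * Variant using a multiplication: multiplying by 0x0101...01 adds all
 * eight byte counts into the top byte, which the shift then extracts.
 * The total is at most 64, so it cannot overflow that byte.
 */
static unsigned int hweight64_mul(uint64_t w)
{
    w = per_byte_counts(w);
    return (unsigned int)((w * 0x0101010101010101ULL) >> 56);
}
```

Both functions should return the same value for every input; which one is faster depends on the core, as discussed in the Linux commit.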
The commit message in Linux (and Robin's answer) is pretty clear. It may improve
things on some cores but does not make it worse on others.
I would expect the compiler could easily replace a multiply by a series
of shifts, but it would be more difficult to do the inverse.
Also, this has been in Linux for a year now, so I am assuming Linux
folks are happy with the changes (CCing Robin just in case I missed
anything). Therefore I am happy to give it a go on Xen as well.
In which case - can I take this as an ack, or do you want to first
pursue the discussion?
I will commit it later on with another bunch of patches.
Cheers,
--
Julien Grall