Re: [Xen-devel] [PATCH RFC 3/4] Arm64: further speed-up to hweight{32, 64}()

Julien Grall Wed, 05 Jun 2019 02:25:56 -0700

Hi Jan,

On 05/06/2019 08:42, Jan Beulich wrote:

On 04.06.19 at 18:11, <julien.gr...@arm.com> wrote:

On 5/31/19 10:53 AM, Jan Beulich wrote:

According to Linux commit e75bef2a4f ("arm64: Select
ARCH_HAS_FAST_MULTIPLIER") this is a further improvement over the
variant using only bitwise operations on at least some hardware, and no
worse on other.


Suggested-by: Andrew Cooper <andrew.coop...@citrix.com>
Signed-off-by: Jan Beulich <jbeul...@suse.com>
---
RFC: To be honest I'm not fully convinced this is a win in particular in
       the hweight32() case, as there's no actual shift insn which gets
       replaced by the multiplication. Even for hweight64() the compiler
       could emit better code and avoid the explicit shift by 32 (which it
       emits at least for me).


I can see multiplication instruction used in both hweight32() and
hweight64() with the compiler I am using.


That is for which exact implementation?


A simple call hweight64().

What I was referring to as
"could emit better code" was the multiplication-free variant, where
the compiler fails to recognize (afaict) another opportunity to fold
a shift into an arithmetic instruction:

        add     x0, x0, x0, lsr #4
        and     x0, x0, #0xf0f0f0f0f0f0f0f
        add     x0, x0, x0, lsr #8
        add     x0, x0, x0, lsr #16

        lsr     x1, x0, #32
        add     w0, w1, w0

        and     w0, w0, #0xff
        ret

Afaict the two marked insns could be replaced by

        add     x0, x0, x0, lsr #32
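
As a rough sanity check of that suggestion, here is a minimal C sketch of the two tails. The function names are made up for illustration, and the comments map each line to the instructions quoted above. It only models the arithmetic and says nothing about what a given compiler version will actually emit.

```c
#include <stdint.h>

/* Tail as emitted: lsr x1, x0, #32 ; add w0, w1, w0 ; and w0, w0, #0xff */
static unsigned int tail_emitted(uint64_t x)
{
    uint32_t lo = (uint32_t)x;          /* w0: low half of x0 */
    uint32_t hi = (uint32_t)(x >> 32);  /* w1: result of the explicit lsr */

    return (lo + hi) & 0xff;
}

/* Suggested tail: add x0, x0, x0, lsr #32 ; and w0, w0, #0xff */
static unsigned int tail_folded(uint64_t x)
{
    /* Only the low 32 bits of the 64-bit sum are looked at afterwards. */
    return (uint32_t)(x + (x >> 32)) & 0xff;
}
```

Both agree because the low 32 bits of x + (x >> 32) depend only on the low and high halves of x, which is exactly what the 32-bit add computes.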

I am not a compiler expert. Anyway, this likely depends on the version of the compiler you are using. They are becoming smarter and smarter.


With there only a sequence of add-s remaining, I'm having
difficulty seeing how the use of mul+lsr would actually help:

        add     x0, x0, x0, lsr #4
        and     x0, x0, #0xf0f0f0f0f0f0f0f
        mov     x1, #0x101010101010101
        mul     x0, x0, x1
        lsr     x0, x0, #56
        ret

But of course I know nothing about throughput and latency
of such add-s with one of their operands shifted first. And
yes, the variant using mul is, compared with the better optimized case,
still one insn smaller.
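
For readers following along, the two reductions being compared can be sketched in C roughly as below. This is an illustration only, not the code from the patch: the helper and function names are hypothetical, and the bit-trick steps before the final reduction are the usual textbook ones, assumed here rather than taken from the Xen sources.

```c
#include <assert.h>
#include <stdint.h>

/* Shared first steps: leave a population count (0..8) in every byte. */
static uint64_t byte_counts(uint64_t w)
{
    w -= (w >> 1) & 0x5555555555555555ULL;
    w = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
    return (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
}

/* Multiplication-free variant: sum the bytes with a chain of shifted adds. */
static unsigned int hweight64_shift(uint64_t w)
{
    w = byte_counts(w);
    w += w >> 8;
    w += w >> 16;
    w += w >> 32;
    return w & 0xff;    /* the low byte holds the total, at most 64 */
}

/* Multiply variant: the top byte of the product is the sum of all bytes. */
static unsigned int hweight64_mul(uint64_t w)
{
    return (byte_counts(w) * 0x0101010101010101ULL) >> 56;
}

/* Hypothetical helper: check that both variants agree for one input. */
static void check_variants(uint64_t w)
{
    assert(hweight64_shift(w) == hweight64_mul(w));
}
```

Neither variant can overflow a byte lane, since each per-byte count is at most 8 and the total is at most 64. Which of the two is faster depends on the core's multiplier and on how cheap adds with a shifted operand are, which is the open question in this thread.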

The commit message in Linux (and Robin's answer) is pretty clear. It may improve on some cores but does not make it worse on others.

I would expect the compiler could easily replace a multiply by a series
of shifts but it would be more difficult to do the inverse.

Also, this has been in Linux for a year now, so I am assuming Linux
folks are happy with the changes (CCing Robin just in case I missed
anything). Therefore I am happy to give it a go on Xen as well.


In which case - can I take this as an ack, or do you want to first
pursue the discussion?


I will commit it later on with another bunch of patches.

Cheers,

--
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
