On Sun, Oct 8, 2017 at 12:15 PM, Jochen Rollwagen <joro-2...@t-online.de> wrote: > Am 06.10.2017 um 11:57 schrieb Nicolai Hähnle: >> >> On 05.10.2017 20:59, Jochen Rollwagen wrote: >>> >>> Am 04.10.2017 um 05:59 schrieb Matt Turner: >>>> >>>> On Tue, Oct 3, 2017 at 11:01 AM, Jochen Rollwagen >>>> <joro-2...@t-online.de> wrote: >>>>> >>>>> From 4cebe50a9bade6717923e104c954f3fad75f71bb Mon Sep 17 00:00:00 2001 >>>>> From: Jochen Rollwagen <joro-2...@t-online.de> >>>>> Date: Tue, 3 Oct 2017 19:54:10 +0200 >>>>> Subject: [PATCH] Replace byte-swapping code with builtins in pack.c >>>>> >>>>> This patch replaces some code for byte-swapping in pack.c with the >>>>> builtin >>>>> functions allowing the compiler to do its optimization magic >>>>> --- >>>>> src/mesa/main/pack.c | 22 ++-------------------- >>>>> 1 file changed, 2 insertions(+), 20 deletions(-) >>>>> >>>>> diff --git a/src/mesa/main/pack.c b/src/mesa/main/pack.c >>>>> index 94a6d28..9bfde39 100644 >>>>> --- a/src/mesa/main/pack.c >>>>> +++ b/src/mesa/main/pack.c >>>>> @@ -230,26 +230,8 @@ _mesa_pack_bitmap( GLint width, GLint height, >>>>> const >>>>> GLubyte >>>>> *source, >>>>> } >>>>> } >>>>> >>>>> - >>>>> -#define SWAP2BYTE(VALUE) \ >>>>> - { \ >>>>> - GLubyte *bytes = (GLubyte *) &(VALUE); \ >>>>> - GLubyte tmp = bytes[0]; \ >>>>> - bytes[0] = bytes[1]; \ >>>>> - bytes[1] = tmp; \ >>>>> - } >>>>> - >>>>> -#define SWAP4BYTE(VALUE) \ >>>>> - { \ >>>>> - GLubyte *bytes = (GLubyte *) &(VALUE); \ >>>>> - GLubyte tmp = bytes[0]; \ >>>>> - bytes[0] = bytes[3]; \ >>>>> - bytes[3] = tmp; \ >>>>> - tmp = bytes[1]; \ >>>>> - bytes[1] = bytes[2]; \ >>>>> - bytes[2] = tmp; \ >>>>> - } >>>>> - >>>>> +#define SWAP2BYTE(VALUE) __builtin_bswap16(VALUE) >>>>> +#define SWAP4BYTE(VALUE) __builtin_bswap32(VALUE) >>>> >>>> In my experience it's much simpler to just write these as >>>> >>>> return ((x & 0xff) << 8) | ((x >> 8) & 0xff); >>>> >>>> and >>>> >>>> return ((x & 0xff) << 24) | >>>> ((x & 0xff00) << 8) | >>>> ((x & 0xff0000) >> 8) | >>>> ((x >> 24) & 0xff); >>>> >>>> and not have to deal with compiler intrinsics. Compilers will >>>> recognize these patterns and use the appropriate instructions (rol for >>>> 2-bytes and bswap for 4-bytes). >>>> >>>> You should be able to count the numbers of those instructions before >>>> and after such a patch to confirm it's working as expected. >>> >>> Fair enough. I've attached a new patch that optimizes the code on linux >>> systems only where there is byteswap.h containing (hopefully optimal) >>> functions. The other systems may be dealt with by a followup patch if >>> desired. >> >> >> This doesn't really address Matt's point that compilers are good at >> pattern matching byte swaps already. >> >> Unless you can actually show with comparisons of the assembly and/or >> benchmarks that this results in better code, your patch makes the code base >> slightly worse because you've now added two different code paths where there >> was previously only one. >> >> Cheers, >> Nicolai > > I'm afraid you're overestimating the compiler's abilities to detect such a > byte swapping pattern. > > The following c code > > #include <stdint.h> > > static uint32_t > SWAP4BYTE(uint32_t n) > { > return (n >> 24) | > ((n >> 8) & 0x0000ff00) | > ((n << 8) & 0x00ff0000) | > (n << 24); > } > > static uint32_t > builtin_SWAP4BYTE(uint32_t n) > { > return __builtin_bswap32(n); > } > > main(){ > SWAP4BYTE(321764); > buitlin_SWAP4BYTE(321764); > } > > compiles to the following assembler code on powerpc platforms with gcc > --version gcc (Ubuntu 6.2.0-3ubuntu11~12.04) 6.2.0 20160901: > > .section ".text" > .align 2 > .type SWAP4BYTE, @function > SWAP4BYTE: > stwu 1,-32(1) > stw 31,28(1) > mr 31,1 > stw 3,12(31) > lwz 9,12(31) > srwi 10,9,24 > lwz 9,12(31) > srwi 9,9,8 > rlwinm 9,9,0,16,23 > or 10,10,9 > lwz 9,12(31) > slwi 9,9,8 > rlwinm 9,9,0,8,15 > or 10,10,9 > lwz 9,12(31) > slwi 9,9,24 > or 9,10,9 > mr 3,9 > addi 11,31,32 > lwz 31,-4(11) > mr 1,11 > blr > .size SWAP4BYTE,.-SWAP4BYTE > .align 2 > .type builtin_SWAP4BYTE, @function > builtin_SWAP4BYTE: > stwu 1,-32(1) > stw 31,28(1) > mr 31,1 > stw 3,12(31) > addi 10,31,12 > lwbrx 9,0,10 > mr 3,9 > addi 11,31,32 > lwz 31,-4(11) > mr 1,11 > blr > .size builtin_SWAP4BYTE,.-builtin_SWAP4BYTE > .align 2 > .globl main > .type main, @function > > The builtin function seems slightly more optimized.
Compile with optimization, like a real build. With -O2 it produces identical assembly on PowerPC for me. _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev