https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org,
                   |                            |vmakarov at gcc dot gnu.org
           Priority|P3                          |P2
             Target|                            |x86_64-*-*

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
With just SSE2 we get only the store vectorized:

bswap:
.LFB0:
        .cfi_startproc
        bswap   %rsi
        bswap   %rdi
        movq    %rsi, %xmm0
        movq    %rdi, %xmm1
        punpcklqdq      %xmm1, %xmm0
        movaps  %xmm0, -24(%rsp)
        movq    -24(%rsp), %rax
        movq    -16(%rsp), %rdx
        ret
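
The asm above corresponds to a function along these lines (a rough
reconstruction from the GIMPLE dump further down, not necessarily the
exact reproducer from the PR; the u64 typedef is just shorthand):

typedef unsigned long long u64;

unsigned __int128 bswap (unsigned __int128 a)
{
  /* Swap the two 64-bit halves of A and byte-swap each of them,
     i.e. a full 128-bit byte swap built from two 64-bit bswaps.  */
  u64 y[2] = { __builtin_bswap64 ((u64) (a >> 64)),
               __builtin_bswap64 ((u64) a) };
  unsigned __int128 r;
  __builtin_memcpy (&r, y, sizeof r);
  return r;
}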

The

<unknown> 1 times vec_perm costs 4 in body
BIT_FIELD_REF <a_3(D), 64, 64> 1 times scalar_stmt costs 4 in body
BIT_FIELD_REF <a_3(D), 64, 0> 1 times scalar_stmt costs 4 in body

costs are what we charge for building the initial vector from the __int128,
compared to splitting it into a low/high part.
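
As a rough illustration of the two alternatives being costed (a sketch
only; _mm_set_epi64x and the scalar split below are stand-ins for the
two strategies, not the vectorizer's internal representation):

#include <immintrin.h>

/* Alternative 1: build a V2DI vector directly from the __int128
   argument (the vec_perm cost above).  */
__m128i build_vec (unsigned __int128 a)
{
  return _mm_set_epi64x ((long long) (a >> 64), (long long) a);
}

/* Alternative 2: split the __int128 into its low/high halves and keep
   them scalar (the two BIT_FIELD_REF scalar_stmt costs above).  */
void split_halves (unsigned __int128 a,
                   unsigned long long *lo, unsigned long long *hi)
{
  *lo = (unsigned long long) a;
  *hi = (unsigned long long) (a >> 64);
}

The vectorized IL looks like this: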

  <bb 2> [local count: 1073741824]:
  _8 = BIT_FIELD_REF <a_3(D), 64, 64>;
  _11 = VIEW_CONVERT_EXPR<vector(2) long long unsigned int>(a_3(D));
  _13 = VIEW_CONVERT_EXPR<vector(2) long long unsigned int>(a_3(D));
  _12 = VEC_PERM_EXPR <_11, _13, { 1, 0 }>;
  _14 = VIEW_CONVERT_EXPR<vector(16) char>(_12);
  _15 = VEC_PERM_EXPR <_14, _14, { 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11,
10, 9, 8 }>;
  _16 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(_15);
  _1 = __builtin_bswap64 (_8);
  _10 = BIT_FIELD_REF <a_3(D), 64, 0>;
  _2 = __builtin_bswap64 (_10);
  MEM <vector(2) long long unsigned int> [(long long unsigned int *)&y] = _16;
  _7 = MEM <uint128_t> [(char * {ref-all})&y];

The vectorizer doesn't realize that it could perhaps move the hi/lo swap
across the two permutes to before the store; otherwise this looks as expected.
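
For reference, the two VEC_PERM_EXPRs above compose to one whole-vector
byte reversal.  A minimal sketch with generic vectors (names are mine,
and __builtin_shufflevector needs GCC 12 or later):

typedef unsigned char v16qi __attribute__ ((vector_size (16)));

v16qi two_permutes (v16qi x)
{
  /* VEC_PERM <_11, _13, { 1, 0 }>, in byte terms: swap the 64-bit lanes.  */
  v16qi t = __builtin_shufflevector (x, x,
      8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7);
  /* VEC_PERM <_14, _14, { 7, 6, ... }>: byte-reverse each 64-bit lane.  */
  return __builtin_shufflevector (t, t,
      7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8);
}

/* Equivalent to reversing all 16 bytes in one go.  */
v16qi one_permute (v16qi x)
{
  return __builtin_shufflevector (x, x,
      15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
}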

Yes, the vectorizer doesn't account for ABI details at the function boundary,
but it's very hard to do that in a sensible way.

Practically the worst part of the generated code is

        movq    %rdi, -24(%rsp)
        movq    %rsi, -16(%rsp)
        movdqa  -24(%rsp), %xmm0

because the store will fail to forward, causing a huge performance issue.
I wonder why we fail to merge those.  We face

(insn 26 25 27 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 0)
        (reg:DI 97)) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 97)
        (nil)))
(insn 27 26 7 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 8)
        (reg:DI 98)) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 98)
        (nil)))
(note 7 27 12 2 NOTE_INSN_FUNCTION_BEG)
(insn 12 7 14 2 (set (reg:V2DI 91)
        (vec_select:V2DI (subreg:V2DI (reg/v:TI 87 [ a ]) 0)
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 0 [0])
                ]))) "t.c":6:12 7927 {*ssse3_palignrv2di_perm}
     (expr_list:REG_DEAD (reg/v:TI 87 [ a ])
        (nil)))

where

Trying 26, 27 -> 12:
   26: r87:TI#0=r97:DI
      REG_DEAD r97:DI
   27: r87:TI#8=r98:DI
      REG_DEAD r98:DI
   12: r91:V2DI=vec_select(r87:TI#0,parallel)
      REG_DEAD r87:TI
Can't combine i2 into i3

possibly because insn 27 is a partial def of r87.  We expand to

(insn 4 3 5 2 (set (reg:TI 88)
        (subreg:TI (reg:DI 89) 0)) "t.c":2:1 -1
     (nil))
(insn 5 4 6 2 (set (subreg:DI (reg:TI 88) 8)
        (reg:DI 90)) "t.c":2:1 -1
     (nil))
(insn 6 5 7 2 (set (reg/v:TI 87 [ a ])
        (reg:TI 88)) "t.c":2:1 -1
     (nil))
(note 7 6 10 2 NOTE_INSN_FUNCTION_BEG)
(insn 10 7 12 2 (set (reg:V2DI 82 [ _11 ])
        (subreg:V2DI (reg/v:TI 87 [ a ]) 0)) -1
     (nil))
(insn 12 10 13 2 (set (reg:V2DI 91)
        (vec_select:V2DI (reg:V2DI 82 [ _11 ])
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 0 [0])
                ]))) "t.c":6:12 -1
     (nil))

initially from

  _11 = VIEW_CONVERT_EXPR<vector(2) long long unsigned int>(a_3(D));

and fwprop1 still sees

(insn 26 25 27 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 0)
        (reg:DI 89 [ a ])) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 95 [ a ])
        (nil)))
(insn 27 26 7 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 8)
        (reg:DI 90 [ a+8 ])) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 96 [+8 ])
        (nil)))
(note 7 27 10 2 NOTE_INSN_FUNCTION_BEG) 
(insn 10 7 12 2 (set (reg:V2DI 82 [ _11 ])
        (subreg:V2DI (reg/v:TI 87 [ a ]) 0)) 1700 {movv2di_internal}
     (expr_list:REG_DEAD (reg/v:TI 87 [ a ])
        (nil)))

so that would be the best place to fix this up, realizing reg 87 dies
after insn 10.

Richard - I'm sure we can construct a similar case for aarch64 where
argument passing and vector mode use cause spilling?

On x86 the simplest testcase showing this is

typedef unsigned long long v2di __attribute__((vector_size(16)));
v2di bswap(__uint128_t a)
{
    return *(v2di *)&a;
}

which, compiled for aarch64, produces

bswap:
.LFB0:
        .cfi_startproc
        sub     sp, sp, #16
        .cfi_def_cfa_offset 16
        stp     x0, x1, [sp]
        ldr     q0, [sp]
        add     sp, sp, 16
        .cfi_def_cfa_offset 0
        ret

for me.  Maybe the stp x0, x1 store can forward to the ldr load though,
and I'm not sure there's another way to move x0/x1 to q0.

Providing LRA with a way to move TImode to VnmImode would of course
also avoid the spilling, but getting rid of the TImode pseudo when
it is only there as an intermediary for moving two DImode values to
V2DImode sounds like a useful transform to me.  combine is too late
since fwprop has already merged the subreg with the following shuffle
for the larger testcase.

Alternatively, LRA could be taught to spill to %xmm by somehow telling
it about the vastly increased cost of the double-spill, single-reload
sequence?  But I guess it would still need to be taught how to
reload V2DImode from a {DImode, DImode} pair in %xmm regs ...
