https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68961

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |target

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #6)
> I can only reproduce with a ppc64le cross, but not be cross, supposedly
> because my ppc64le cross is using auto-host.h which Marek sent me from a
> native configured compiler, while the be one does not.  Maybe the difference
> is whether power8 instructions are supported by assembler or something
> similar.
> 
> Looking at the be->le differences, it starts in esra:
> ...
> -Rejected (2230): not aggregate: a
> -Rejected (2231): not aggregate: aa
> -Candidate (2234): u
> -Created a replacement for u offset: 0, size: 64: u$d$0
> +Rejected (2351): not aggregate: a
> +Rejected (2352): not aggregate: aa
> +Candidate (2355): u
> +! Disqualifying u - No scalar replacements to be created.
> ...
>  pack (double a, double aa)
>  {
> -  double u$d$0;
>    union u_ld u;
>    long double _6;
>  
>    <bb 2>:
> -  u$d$0_8 = a_2(D);
> +  u.d[0] = a_2(D);
>    u.d[1] = aa_4(D);
> -  _6 = u$d$0_8;
> +  _6 = u.ld;
>    u ={v} {CLOBBER};
>    return _6;
> 
> and if we don't SRA this, we don't optimize it away.  Ah, actually, I can
> reproduce even with the ppc64be cross, if I use explicit -mlong-double-128.

Ok, I can reproduce with -mlong-double-128 and with that the SLP vectorizer
triggers and produces

pack (double a, double aa)
{
  vector(2) double * vectp.6;
  vector(2) double * vectp_u.5;
  union u_ld u;
  long double _6;
  vector(2) double vect_cst__8;

  <bb 2>:
  vect_cst__8 = {a_2(D), aa_4(D)};
  MEM[(double *)&u] = vect_cst__8;
  _6 = u.ld;
  u ={v} {CLOBBER};
  return _6;

which then, since no FRE pass runs after vectorization, is not optimized
into a register-level vector-to-long-double pun (not sure whether that would
help).  SRA would help here as well.
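
For readers without the testcase at hand, here is a rough C sketch of the
source the dumps above appear to come from.  The names union u_ld, pack, a, aa,
d and ld are taken from the GIMPLE; the member order and everything else is an
assumption, since the original t.c is not quoted in this comment.

```c
/* Sketch reconstructed from the GIMPLE dumps; the real t.c may differ. */
union u_ld
{
  long double ld;   /* read back as one 128-bit value with -mlong-double-128 */
  double d[2];      /* written as two separate 64-bit halves */
};

long double
pack (double a, double aa)
{
  union u_ld u;

  u.d[0] = a;       /* first scalar store */
  u.d[1] = aa;      /* second scalar store, adjacent to the first */
  return u.ld;      /* load covering both stores (union type punning) */
}
```

The two adjacent stores are what the SLP vectorizer merges into the single
vector store shown above, leaving a full-width load right after a full-width
store of a different type.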

With -fno-tree-slp-vectorize it gets optimized somewhere at the RTL level.
I'm not sure why RTL can handle the more complicated two-stores-one-load
sequence but not the simple load-after-store.  Before combine, the vector
code is

(note 5 0 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 5 3 2 (set (reg/v:DF 158 [ a ])
        (reg:DF 33 1 [ a ])) t.c:5 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg:DF 33 1 [ a ])
        (nil)))
(insn 3 2 4 2 (set (reg/v:DF 159 [ aa ])
        (reg:DF 34 2 [ aa ])) t.c:5 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg:DF 34 2 [ aa ])
        (nil)))
(note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
(insn 7 4 9 2 (set (reg:V2DF 160)
        (vec_concat:V2DF (reg/v:DF 158 [ a ])
            (reg/v:DF 159 [ aa ]))) t.c:7 1084 {vsx_concat_v2df}
     (expr_list:REG_DEAD (reg/v:DF 159 [ aa ])
        (expr_list:REG_DEAD (reg/v:DF 158 [ a ])
            (nil))))
(insn 9 7 13 2 (set (reg:TF 157 [ <retval> ])
        (subreg:TF (reg:V2DF 160) 0)) t.c:9 447 {*movtf_64bit_dm}
     (expr_list:REG_DEAD (reg:V2DF 160)
        (nil)))
(insn 13 9 14 2 (set (reg/i:TF 33 1)
        (reg:TF 157 [ <retval> ])) t.c:10 447 {*movtf_64bit_dm}
     (expr_list:REG_DEAD (reg:TF 157 [ <retval> ])
        (nil)))

and with the non-vector code

(insn 2 5 3 2 (set (reg/v:DF 157 [ a ])
        (reg:DF 33 1 [ a ])) t.c:5 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg:DF 33 1 [ a ])
        (nil)))
(insn 3 2 4 2 (set (reg/v:DF 158 [ aa ])
        (reg:DF 34 2 [ aa ])) t.c:5 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg:DF 34 2 [ aa ])
        (nil)))
(note 4 3 16 2 NOTE_INSN_FUNCTION_BEG)
(insn 16 4 7 2 (set (reg/v:TI 155 [ u ])
        (const_int 0 [0])) t.c:7 -1
     (nil))
(insn 7 16 8 2 (set (subreg:DF (reg/v:TI 155 [ u ]) 0)
        (reg/v:DF 157 [ a ])) t.c:7 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg/v:DF 157 [ a ])
        (nil)))
(insn 8 7 9 2 (set (subreg:DF (reg/v:TI 155 [ u ]) 8)
        (reg/v:DF 158 [ aa ])) t.c:8 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg/v:DF 158 [ aa ])
        (nil)))
(insn 9 8 13 2 (set (reg:TF 156 [ <retval> ])
        (subreg:TF (reg/v:TI 155 [ u ]) 0)) t.c:9 447 {*movtf_64bit_dm}
     (expr_list:REG_DEAD (reg/v:TI 155 [ u ])
        (nil)))
(insn 13 9 14 2 (set (reg/i:TF 33 1)
        (reg:TF 156 [ <retval> ])) t.c:10 447 {*movtf_64bit_dm}
     (expr_list:REG_DEAD (reg:TF 156 [ <retval> ])
        (nil)))


so the difference is

(insn 7 4 9 2 (set (reg:V2DF 160)
        (vec_concat:V2DF (reg/v:DF 158 [ a ])
            (reg/v:DF 159 [ aa ]))) t.c:7 1084 {vsx_concat_v2df}
     (expr_list:REG_DEAD (reg/v:DF 159 [ aa ])
        (expr_list:REG_DEAD (reg/v:DF 158 [ a ])
            (nil))))
(insn 9 7 13 2 (set (reg:TF 157 [ <retval> ])
        (subreg:TF (reg:V2DF 160) 0)) t.c:9 447 {*movtf_64bit_dm}
     (expr_list:REG_DEAD (reg:V2DF 160)
        (nil)))

vs.

(insn 16 4 7 2 (set (reg/v:TI 155 [ u ])
        (const_int 0 [0])) t.c:7 -1
     (nil))
(insn 7 16 8 2 (set (subreg:DF (reg/v:TI 155 [ u ]) 0)
        (reg/v:DF 157 [ a ])) t.c:7 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg/v:DF 157 [ a ])
        (nil)))
(insn 8 7 9 2 (set (subreg:DF (reg/v:TI 155 [ u ]) 8)
        (reg/v:DF 158 [ aa ])) t.c:8 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg/v:DF 158 [ aa ])
        (nil)))
(insn 9 8 13 2 (set (reg:TF 156 [ <retval> ])
        (subreg:TF (reg/v:TI 155 [ u ]) 0)) t.c:9 447 {*movtf_64bit_dm}
     (expr_list:REG_DEAD (reg/v:TI 155 [ u ])
        (nil)))

The combine pass forwards the argument register setup into the latter but not
the former, but in the end the backend is probably confused by the VSX
register use.

I think this should be addressed at the target level, as a user may just as
well write this code directly by replacing double d[2] in the testcase's union
with v2df d, using GCC's vector extension.
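
A minimal sketch of what such user-written code might look like, assuming the
usual vector_size typedef for v2df.  The names u_ld_vec and pack_vec are made
up for illustration and do not come from the testcase.

```c
/* Hypothetical variant of the testcase using GCC's vector extension. */
typedef double v2df __attribute__ ((vector_size (16)));

union u_ld_vec
{
  long double ld;   /* 128 bits wide with -mlong-double-128 */
  v2df d;           /* replaces double d[2] */
};

long double
pack_vec (double a, double aa)
{
  union u_ld_vec u;

  u.d = (v2df) { a, aa };   /* one vector store, no vectorizer involved */
  return u.ld;              /* same load-after-store pattern as above */
}
```

If this is right, the code should reach RTL with the same vec_concat followed
by a TF subreg of the V2DF register, which is why fixing it only on the GIMPLE
side would not cover every case.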
