https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68961
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |target

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #6)
> I can only reproduce with a ppc64le cross, but not be cross, supposedly
> because my ppc64le cross is using auto-host.h which Marek sent me from a
> native configured compiler, while the be one does not.  Maybe the difference
> is whether power8 instructions are supported by assembler or something
> similar.
> 
> Looking at the be->le differences, it starts in esra:
> ...
> -Rejected (2230): not aggregate: a
> -Rejected (2231): not aggregate: aa
> -Candidate (2234): u
> -Created a replacement for u offset: 0, size: 64: u$d$0
> +Rejected (2351): not aggregate: a
> +Rejected (2352): not aggregate: aa
> +Candidate (2355): u
> +! Disqualifying u - No scalar replacements to be created.
> ...
>  pack (double a, double aa)
>  {
> -  double u$d$0;
>    union u_ld u;
>    long double _6;
> 
>    <bb 2>:
> -  u$d$0_8 = a_2(D);
> +  u.d[0] = a_2(D);
>    u.d[1] = aa_4(D);
> -  _6 = u$d$0_8;
> +  _6 = u.ld;
>    u ={v} {CLOBBER};
>    return _6;
> 
> and if we don't SRA this, we don't optimize it away.  Ah, actually, I can
> reproduce even with the ppc64be cross, if I use explicit -mlong-double-128.

Ok, I can reproduce with -mlong-double-128 and with that the SLP vectorizer
triggers and produces

pack (double a, double aa)
{
  vector(2) double * vectp.6;
  vector(2) double * vectp_u.5;
  union u_ld u;
  long double _6;
  vector(2) double vect_cst__8;

  <bb 2>:
  vect_cst__8 = {a_2(D), aa_4(D)};
  MEM[(double *)&u] = vect_cst__8;
  _6 = u.ld;
  u ={v} {CLOBBER};
  return _6;

which then, as no FRE is running after vectorization, is not optimized to
a register vector-to-long-double punning (not sure if that would help).

SRA would help as well here.

With -fno-tree-slp-vectorize it's optimized somewhere on RTL.
Not sure why it is able to grok the more complicated two-stores-one-load but
not the load-after-store.  Before combine we have with the vector code

(note 5 0 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 5 3 2 (set (reg/v:DF 158 [ a ])
        (reg:DF 33 1 [ a ])) t.c:5 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg:DF 33 1 [ a ])
        (nil)))
(insn 3 2 4 2 (set (reg/v:DF 159 [ aa ])
        (reg:DF 34 2 [ aa ])) t.c:5 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg:DF 34 2 [ aa ])
        (nil)))
(note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
(insn 7 4 9 2 (set (reg:V2DF 160)
        (vec_concat:V2DF (reg/v:DF 158 [ a ])
            (reg/v:DF 159 [ aa ]))) t.c:7 1084 {vsx_concat_v2df}
     (expr_list:REG_DEAD (reg/v:DF 159 [ aa ])
        (expr_list:REG_DEAD (reg/v:DF 158 [ a ])
            (nil))))
(insn 9 7 13 2 (set (reg:TF 157 [ <retval> ])
        (subreg:TF (reg:V2DF 160) 0)) t.c:9 447 {*movtf_64bit_dm}
     (expr_list:REG_DEAD (reg:V2DF 160)
        (nil)))
(insn 13 9 14 2 (set (reg/i:TF 33 1)
        (reg:TF 157 [ <retval> ])) t.c:10 447 {*movtf_64bit_dm}
     (expr_list:REG_DEAD (reg:TF 157 [ <retval> ])
        (nil)))

and with the non-vector code

(insn 2 5 3 2 (set (reg/v:DF 157 [ a ])
        (reg:DF 33 1 [ a ])) t.c:5 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg:DF 33 1 [ a ])
        (nil)))
(insn 3 2 4 2 (set (reg/v:DF 158 [ aa ])
        (reg:DF 34 2 [ aa ])) t.c:5 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg:DF 34 2 [ aa ])
        (nil)))
(note 4 3 16 2 NOTE_INSN_FUNCTION_BEG)
(insn 16 4 7 2 (set (reg/v:TI 155 [ u ])
        (const_int 0 [0])) t.c:7 -1
     (nil))
(insn 7 16 8 2 (set (subreg:DF (reg/v:TI 155 [ u ]) 0)
        (reg/v:DF 157 [ a ])) t.c:7 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg/v:DF 157 [ a ])
        (nil)))
(insn 8 7 9 2 (set (subreg:DF (reg/v:TI 155 [ u ]) 8)
        (reg/v:DF 158 [ aa ])) t.c:8 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg/v:DF 158 [ aa ])
        (nil)))
(insn 9 8 13 2 (set (reg:TF 156 [ <retval> ])
        (subreg:TF (reg/v:TI 155 [ u ]) 0)) t.c:9 447 {*movtf_64bit_dm}
     (expr_list:REG_DEAD (reg/v:TI 155 [ u ])
        (nil)))
(insn 13 9 14 2 (set (reg/i:TF 33 1)
        (reg:TF 156 [ <retval> ])) t.c:10 447 {*movtf_64bit_dm}
     (expr_list:REG_DEAD (reg:TF 156 [ <retval> ])
        (nil)))

so the difference is

(insn 7 4 9 2 (set (reg:V2DF 160)
        (vec_concat:V2DF (reg/v:DF 158 [ a ])
            (reg/v:DF 159 [ aa ]))) t.c:7 1084 {vsx_concat_v2df}
     (expr_list:REG_DEAD (reg/v:DF 159 [ aa ])
        (expr_list:REG_DEAD (reg/v:DF 158 [ a ])
            (nil))))
(insn 9 7 13 2 (set (reg:TF 157 [ <retval> ])
        (subreg:TF (reg:V2DF 160) 0)) t.c:9 447 {*movtf_64bit_dm}
     (expr_list:REG_DEAD (reg:V2DF 160)
        (nil)))

vs.

(insn 16 4 7 2 (set (reg/v:TI 155 [ u ])
        (const_int 0 [0])) t.c:7 -1
     (nil))
(insn 7 16 8 2 (set (subreg:DF (reg/v:TI 155 [ u ]) 0)
        (reg/v:DF 157 [ a ])) t.c:7 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg/v:DF 157 [ a ])
        (nil)))
(insn 8 7 9 2 (set (subreg:DF (reg/v:TI 155 [ u ]) 8)
        (reg/v:DF 158 [ aa ])) t.c:8 443 {*movdf_hardfloat64}
     (expr_list:REG_DEAD (reg/v:DF 158 [ aa ])
        (nil)))
(insn 9 8 13 2 (set (reg:TF 156 [ <retval> ])
        (subreg:TF (reg/v:TI 155 [ u ]) 0)) t.c:9 447 {*movtf_64bit_dm}
     (expr_list:REG_DEAD (reg/v:TI 155 [ u ])
        (nil)))

combine forwards the argument reg setup into the latter but not the former,
but in the end the backend is probably confused by the VSX register use.

I think this should be addressed at the target level, as the user may choose
to write this code by exchanging double d[2] in the union of the testcase
with v2df d and using GCC's vector extension.
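
For reference, a minimal sketch of the two source forms being compared.  The
testcase itself is not part of this comment, so union u_ld and pack () are
reconstructed from the GIMPLE dump (only the names u_ld, d, ld, pack, a and
aa appear there; the member order is an assumption).  The names pack_array,
pack_vector and u_ld_vec are hypothetical and used only to keep both variants
in one file; the v2df typedef is an assumed spelling of the vector type
mentioned above.

```c
/* Array variant, reconstructed from the GIMPLE dump (member order assumed). */
union u_ld {
  long double ld;
  double d[2];
};

/* Two scalar stores into the union followed by one long double load. */
long double pack_array(double a, double aa)
{
  union u_ld u;
  u.d[0] = a;
  u.d[1] = aa;
  return u.ld;
}

/* GCC vector extension: a vector of two doubles (16 bytes). */
typedef double v2df __attribute__((vector_size(16)));

/* Hypothetical variant with "double d[2]" exchanged for "v2df d". */
union u_ld_vec {
  long double ld;
  v2df d;
};

/* One vector store into the union followed by one long double load,
   i.e. the shape the SLP vectorizer produces from the array variant. */
long double pack_vector(double a, double aa)
{
  union u_ld_vec u;
  u.d = (v2df){ a, aa };
  return u.ld;
}
```

Building both functions for a powerpc64 target with -O2 -mlong-double-128
should make it possible to compare the generated code for the two variants;
pack_vector does not depend on the vectorizer, so -fno-tree-slp-vectorize
would not be expected to change its shape.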