https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110473
Bug ID: 110473
Summary: vec_convert for aarch64 seems to lower to something which should be improved
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: pinskia at gcc dot gnu.org
Target Milestone: ---
Target: aarch64

Take:
```
typedef unsigned int v4si __attribute__ ((vector_size (4*sizeof(int))));
typedef unsigned short v4hi __attribute__ ((vector_size (4*sizeof(short))));

v4si f(v4si a, v4si b)
{
  v4hi t = __builtin_convertvector (a, v4hi);
  v4si t1 = __builtin_convertvector (t, v4si);
  return t1;
}
```

This gets lowered in veclower21 to:
```
  _6 = BIT_FIELD_REF <_5, 64, 0>;
  t_2 = _6;
  _7 = BIT_FIELD_REF <t_2, 16, 0>;
  _8 = (unsigned int) _7;
  _9 = BIT_FIELD_REF <t_2, 16, 16>;
  _10 = (unsigned int) _9;
  _11 = BIT_FIELD_REF <t_2, 16, 32>;
  _12 = (unsigned int) _11;
  _13 = BIT_FIELD_REF <t_2, 16, 48>;
  _14 = (unsigned int) _13;
  t1_3 = {_8, _10, _12, _14};
```

And then forwprop optimizes this to:
```
  _6 = BIT_FIELD_REF <_5, 64, 0>;
  t1_3 = (v4si) _6;
```

And then combine comes along and optimizes that to:

(insn 9 8 11 2 (set (reg:V8HI 98)
        (vec_concat:V8HI (truncate:V4HI (reg:V4SI 102))
            (const_vector:V4HI [
                    (const_int 0 [0]) repeated x4
                ]))) "/app/example.cpp":7:8 5467 {truncv4siv4hi2_vec_concatz_le}
     (expr_list:REG_DEAD (reg:V4SI 102)
        (nil)))
(note 11 9 16 2 NOTE_INSN_DELETED)
(insn 16 11 17 2 (set (reg/i:V4SI 32 v0)
        (sign_extend:V4SI (subreg:V4HI (reg:V8HI 98) 0))) "/app/example.cpp":10:1 5459 {extendv4hiv4si2}
     (expr_list:REG_DEAD (reg:V8HI 98)
        (nil)))

But the first one is basically just (truncate:V4HI (reg:V4SI 102)): because of the way the instruction works, the top part is also zeros.

So we get in the end:
        xtn     v0.4h, v0.4s
        sxtl    v0.4s, v0.4h

Why couldn't veclower just do:
```
  _6 = (v4hi)a_1(D);
  t1_3 = (v4si) _6;
```
in the first place, instead of depending on later optimizations (or at least handle the second part)?

Note: with the above lowering we might hit an issue in match.pd, where it tries to turn that into t1_3 = {0xFFFF,0xFFFF,0xFFFF,0xFFFF} & _6; (both because of TYPE_PRECISION and because it just uses wide_int_to_tree ...
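
For reference, a minimal sketch of why a mask shows up at all: with the unsigned lane types from the test case, narrowing each 32-bit lane to 16 bits and widening it again is equivalent to AND-ing each lane of the original input with 0xFFFF. The helper names (`roundtrip`, `masked`, `same_lanes`, `check_roundtrip`) are made up for this illustration, and the code only checks source-level semantics. It does not model what match.pd or veclower actually do.

```c
#include <assert.h>

typedef unsigned int v4si __attribute__ ((vector_size (4*sizeof(int))));
typedef unsigned short v4hi __attribute__ ((vector_size (4*sizeof(short))));

/* The round trip from the report: narrow each lane to 16 bits, then widen.  */
v4si roundtrip (v4si a)
{
  v4hi t = __builtin_convertvector (a, v4hi);
  return __builtin_convertvector (t, v4si);
}

/* Per-lane mask on the original 32-bit input (assumed equivalent form
   for unsigned lanes; hypothetical helper, not GCC internals).  */
v4si masked (v4si a)
{
  v4si mask = {0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF};
  return a & mask;
}

/* Hypothetical helper: returns 1 when both vectors agree on every lane.  */
int same_lanes (v4si x, v4si y)
{
  for (int i = 0; i < 4; i++)
    if (x[i] != y[i])
      return 0;
  return 1;
}

/* Hypothetical helper: aborts if the two forms disagree for input A.  */
void check_roundtrip (v4si a)
{
  assert (same_lanes (roundtrip (a), masked (a)));
}
```

Calling `check_roundtrip` on a few inputs with bits set above bit 15 (for example a lane holding 0x12345678) exercises the truncation. Note that the mask here applies to the 32-bit input, whereas the expression in the note above applies it to `_6`, which in the proposed lowering is the narrowed v4hi value.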