Hi,

I was looking into why we don't produce fmls with a scalar register as the last argument, and I found a difference in how fnma<mode>4 is described in RTL, which I think is causing the missed optimization. Look at the scalar version:

(define_insn "fnma<mode>4"
  [(set (match_operand:GPF_F16 0 "register_operand" "=w")
        (fma:GPF_F16
          (neg:GPF_F16 (match_operand:GPF_F16 1 "register_operand" "w"))
          (match_operand:GPF_F16 2 "register_operand" "w")
          (match_operand:GPF_F16 3 "register_operand" "w")))]
  "TARGET_FLOAT"
  "fmsub\\t%<s>0, %<s>1, %<s>2, %<s>3"
  [(set_attr "type" "fmac<stype>")]
)

vs the vector version:

(define_insn "fnma<mode>4"
  [(set (match_operand:VHSDF 0 "register_operand" "=w")
        (fma:VHSDF
          (match_operand:VHSDF 1 "register_operand" "w")
          (neg:VHSDF (match_operand:VHSDF 2 "register_operand" "w"))
          (match_operand:VHSDF 3 "register_operand" "0")))]
  "TARGET_SIMD"
  "fmls\\t%0.<Vtype>, %1.<Vtype>, %2.<Vtype>"
  [(set_attr "type" "neon_fp_mla_<stype><q>")]
)

Notice how the neg is in a different location in each of them: it wraps operand 1 in the scalar pattern but operand 2 in the vector pattern. What is the reason for that?

Thanks,
Andrew