https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70809
Bug ID: 70809
Summary: [AArch64] aarch64_vmls pattern should be rejected if -ffp-contract=off
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jgreenhalgh at gcc dot gnu.org
Target Milestone: ---
Target: aarch64*-*-*
Take this simple testcase:
void
foo (float * __restrict__ __attribute__ ((aligned (16))) a,
     float * __restrict__ __attribute__ ((aligned (16))) x,
     float * __restrict__ __attribute__ ((aligned (16))) y,
     float * __restrict__ __attribute__ ((aligned (16))) z)
{
  unsigned i = 0;
  for (i = 0; i < 256; i++)
    a[i] = x[i] - (y[i] * z[i]);
}
GCC for AArch64 (all versions) will generate a vectorized fmls instruction even
when given the -ffp-contract=off option (for trunk and 6 you'll need to play
with -mcpu options to find one which permits the combine through the cost
model):
(for trunk) $ gcc -O3 -ffp-contract=off -mcpu=xgene1 foo.c
<snip>
.L4:
        ldr     q2, [x9, x4]
        add     w5, w5, 1
        ldr     q1, [x8, x4]
        cmp     w5, w7
        ldr     q0, [x10, x4]
        fmls    v0.4s, v2.4s, v1.4s
        str     q0, [x6, x4]
        add     x4, x4, 16
        bcc     .L4
<snip>
The problem seems pretty clear: the aarch64_vmls<mode> pattern needs to be
tightened up so that it does not fuse multiplies and subtracts unless
-ffp-contract=fast is in effect. As written, its only condition is TARGET_SIMD:
(define_insn "aarch64_vmls<mode>"
  [(set (match_operand:VDQF 0 "register_operand" "=w")
        (minus:VDQF (match_operand:VDQF 1 "register_operand" "0")
                    (mult:VDQF (match_operand:VDQF 2 "register_operand" "w")
                               (match_operand:VDQF 3 "register_operand" "w"))))]
  "TARGET_SIMD"
  "fmls\\t%0.<Vtype>, %2.<Vtype>, %3.<Vtype>"
  [(set_attr "type" "neon_fp_mla_<Vetype>_scalar<q>")]
)
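To illustrate why the fusion is observable, here is a minimal scalar sketch of
the two roundings involved. It is not taken from GCC: the function names are
made up for this example, and the fused form is only modelled in double
precision (where the product of two floats is exact) rather than by executing
an fmls instruction. C99's fmaf from <math.h> is the library routine that
performs the real single-rounding operation.

```c
/* Unfused x - y*z: the product is rounded to float before the subtract.
   The volatile temporary keeps the compiler from contracting the two
   operations, whatever -ffp-contract setting is in effect.  */
static float
unfused_mls (float x, float y, float z)
{
  volatile float prod = y * z;
  return x - prod;
}

/* Hypothetical model of a fused multiply-subtract: the product is kept
   unrounded (a 24-bit by 24-bit product fits exactly in a double), and
   only the final result is rounded to float.  This is an illustration,
   not a bit-exact replacement for fmaf on every input.  */
static float
fused_mls_model (float x, float y, float z)
{
  double prod = (double) y * (double) z;
  return (float) ((double) x - prod);
}
```

With x = 0x1.002p+0f (1 + 2^-11) and y = z = 0x1.001p+0f (1 + 2^-12), the exact
product is 1 + 2^-11 + 2^-24, which rounds to 1 + 2^-11 in float. unfused_mls
therefore returns 0, while fused_mls_model returns -2^-24. A user who asked for
-ffp-contract=off expects the first result, not the second.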