[committed] amdgcn: Add fold_left_plus vector reductions

Andrew Stubbs Fri, 03 Jul 2020 03:12:19 -0700

This patch implements a floating-point fold_left_plus vector pattern, which gives a significant speed-up in the BabelStream "dot" benchmark.

The GCN architecture can't actually do an in-order vector reduction any more efficiently than the equivalent scalar algorithm, so this is a bit of a cheat. However, dividing the problem into threads using OpenACC or OpenMP has already broken the in-order semantics, so we may as well optimize the operation at the vector level too.

If the user has specifically sorted the input data in order to get a more correct FP result then using multiple threads is already the wrong thing to do. But if the input data is in no particular numerical order then this optimization will give a correct answer much faster, albeit possibly a slightly different one each run.
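
To illustrate why reassociation matters, here is a minimal C sketch rather than anything taken from the patch. It compares a strict in-order fold with the "reduce the vector, then add the scalar" shape that the new expander emits. The function names are hypothetical, and the pairwise tree order is an assumption chosen for illustration; the order actually used by the hardware reduction is not shown in this mail.

```c
#include <stddef.h>

/* Strict fold-left semantics: ((init + v[0]) + v[1]) + ...  */
double fold_left_plus_in_order(double init, const double *v, size_t n)
{
    double acc = init;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Associative reduction of the vector alone.  The pairwise (tree)
   order here is an assumption made for illustration only.  */
double reduc_plus_tree(const double *v, size_t n)
{
    if (n == 0)
        return 0.0;
    if (n == 1)
        return v[0];
    size_t half = n / 2;
    return reduc_plus_tree(v, half) + reduc_plus_tree(v + half, n - half);
}

/* Same shape as the expander: reduce the vector first, then add the
   scalar accumulator in one final step.  */
double fold_left_plus_reassociated(double init, const double *v, size_t n)
{
    return init + reduc_plus_tree(v, n);
}
```

With IEEE double arithmetic and the input {1e16, 1.0, -1e16, 1.0}, the two functions should disagree, because each 1.0 is absorbed by the large value it is paired with in the tree. On small integer-valued inputs such as {1.0, 2.0, 3.0, 4.0} they agree exactly.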


Andrew

amdgcn: Add fold_left_plus vector reductions

These aren't real in-order instructions, because the ISA can't do that
quickly, but a means to allow regular out-of-order reductions when that's
good enough, but the middle-end doesn't know so.

	gcc/
	* config/gcn/gcn-valu.md (fold_left_plus_<mode>): New.

diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md
index 6d7fecaa12c..26559ff765e 100644
--- a/gcc/config/gcn/gcn-valu.md
+++ b/gcc/config/gcn/gcn-valu.md
@@ -3076,6 +3076,26 @@ (define_expand "reduc_<reduc_op>_scal_<mode>"
     DONE;
   })
 
+;; Warning: This "-ffast-math" implementation converts in-order reductions
+;;          into associative reductions. It's also used where OpenMP or
+;;          OpenACC paralellization has already broken the in-order semantics.
+(define_expand "fold_left_plus_<mode>"
+ [(match_operand:<SCALAR_MODE> 0 "register_operand")
+  (match_operand:<SCALAR_MODE> 1 "gcn_alu_operand")
+  (match_operand:V_FP 2 "gcn_alu_operand")]
+  "can_create_pseudo_p ()
+   && (flag_openacc || flag_openmp
+       || flag_associative_math)"
+  {
+    rtx dest = operands[0];
+    rtx scalar = operands[1];
+    rtx vector = operands[2];
+    rtx tmp = gen_reg_rtx (<SCALAR_MODE>mode);
+
+    emit_insn (gen_reduc_plus_scal_<mode> (tmp, vector));
+    emit_insn (gen_add<scalar_mode>3 (dest, scalar, tmp));
+     DONE;
+   })
 
 (define_insn "*<reduc_op>_dpp_shr_<mode>"
   [(set (match_operand:V_1REG 0 "register_operand"   "=v")
