Re: [Mesa-dev] [PATCH] nv50/ir: optimize ADD(SHL(a, b), c) to SHLADD(a, b, c)

Samuel Pitoiset Sun, 09 Oct 2016 12:35:09 -0700


On 10/09/2016 09:28 PM, Karol Herbst wrote:

2016-10-09 13:58 GMT+02:00 Samuel Pitoiset <samuel.pitoi...@gmail.com>:



On 10/08/2016 10:04 PM, Karol Herbst wrote:


looks great, a few comments below



Thanks!


2016-10-08 21:55 GMT+02:00 Samuel Pitoiset <samuel.pitoi...@gmail.com>:


total instructions in shared programs :2286901 -> 2284473 (-0.11%)
total gprs used in shared programs    :335256 -> 335273 (0.01%)
total local used in shared programs   :31968 -> 31968 (0.00%)

                local        gpr       inst      bytes
    helped           0          41         852         852
      hurt           0          44          23          23

Signed-off-by: Samuel Pitoiset <samuel.pitoi...@gmail.com>
---
 .../drivers/nouveau/codegen/nv50_ir_peephole.cpp   | 94
++++++++++++++++++++++
 1 file changed, 94 insertions(+)

diff --git a/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
b/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
index 6efb29e..caf5d1d 100644
--- a/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
+++ b/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
@@ -2132,6 +2132,99 @@ AlgebraicOpt::visit(BasicBlock *bb)

 //
=============================================================================

+// ADD(SHL(a, b), c) -> SHLADD(a, b, c)
+class LateAlgebraicOpt : public Pass
+{
+private:
+   virtual bool visit(BasicBlock *);
+
+   void handleADD(Instruction *);
+   bool tryADDToSHLADD(Instruction *);
+
+   BuildUtil bld;
+};
+
+void
+LateAlgebraicOpt::handleADD(Instruction *add)
+{
+   Value *src0 = add->getSrc(0);
+   Value *src1 = add->getSrc(1);
+
+   if (src0->reg.file != FILE_GPR || src1->reg.file != FILE_GPR)
+      return;
+
+   if (prog->getTarget()->isOpSupported(OP_SHLADD, add->dType))
+      tryADDToSHLADD(add);
+}
+
+// ADD(SHL(a, b), c) -> SHLADD(a, b, c)
+bool
+LateAlgebraicOpt::tryADDToSHLADD(Instruction *add)
+{
+   Value *src0 = add->getSrc(0);
+   Value *src1 = add->getSrc(1);
+   Modifier mod[2];
+   Value *src;
+   int s;
+
+   if (add->saturate || add->usesFlags() || typeSizeof(add->dType) == 8)
+      return false;
+
+   if (src0->getUniqueInsn() && src0->getUniqueInsn()->op == OP_SHL)
+      s = 0;
+   else
+   if (src1->getUniqueInsn() && src1->getUniqueInsn()->op == OP_SHL)
+      s = 1;
+   else
+      return false;
+
+   src = add->getSrc(s);
+
+   if (src->getUniqueInsn()->bb != add->bb)
+      return false;



is there a reason to check for the bb here?



Those optimizations like ADD->MAD, ADD->SAD, etc, have to be local (ie.
inside the same basic block). I just apply the same rule here.

+
+   if (src->getInsn()->usesFlags() || src->getInsn()->subOp)
+      return false;
+
+   if (src->getInsn()->src(1).getFile() != FILE_IMMEDIATE)
+      return false;
+
+   mod[0] = add->src(0).mod;
+   mod[1] = add->src(1).mod;
+
+   add->op = OP_SHLADD;
+   add->setSrc(2, add->src(s ? 0 : 1));



I usually use "s ^ 1" instead.



Yeah, but 's ? 0 : 1' seems to be more used than 's ^ 1' (in peephole).

+   add->src(2).mod = mod[s];
+
+   add->setSrc(0, src->getInsn()->getSrc(0));
+   add->src(0).mod = mod[0];
+   add->setSrc(1, src->getInsn()->getSrc(1));
+   add->src(1).mod = Modifier(0);
+
+   return true;
+}
+
+bool
+LateAlgebraicOpt::visit(BasicBlock *bb)
+{
+   Instruction *next;
+
+   for (Instruction *i = bb->getEntry(); i; i = next) {
+      next = i->next;
+      switch (i->op) {
+      case OP_ADD:
+         handleADD(i);
+         break;
+      default:
+         break;
+      }
+   }



you can also just implement visit(Instruction *) and return true; then
you won't need that silly loop.



True.

+
+   return true;
+}
+
+//
=============================================================================
+
 static inline void
 updateLdStOffset(Instruction *ldst, int32_t offset, Function *fn)
 {
@@ -3436,6 +3529,7 @@ Program::optimizeSSA(int level)
    RUN_PASS(2, AlgebraicOpt, run);
    RUN_PASS(2, ModifierFolding, run); // before load propagation -> less
checks
    RUN_PASS(1, ConstantFolding, foldAll);
+   RUN_PASS(2, LateAlgebraicOpt, run);



What is the reason to add a new AlgebraicOpt pass for this?



Because a bunch of optimizations are applied in ConstantFolding which means
this one has to be done *after*. My first attempt was to do it in
AlgebraicOpt, but it's actually not the best place. :-)


Yeah but this applies to many other opts as well. I think in the end
we need to run those passes inside a loop and iterate like 2 or 3
times like I do here:
https://github.com/karolherbst/mesa/commit/e2ae247aa27ddb150ff4ccd1c0f19baa470134ef

The first line is cut of, but the benefit is around 0.5% less
instructions and a decent speed up in pixmark_piano.

Not sure if iterating multiple times is the *right* solution... I thinkthe compiler has serious lacks, and at first look this seems like aworkaround in my opinion.

But I'm definitely not a guru in compiler optimizations, so yeah maybeit makes sense to launch some passes two times.

    RUN_PASS(1, LoadPropagation, run);
    RUN_PASS(1, IndirectPropagation, run);
    RUN_PASS(2, MemoryOpt, run);
--
2.10.0

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev

Re: [Mesa-dev] [PATCH] nv50/ir: optimize ADD(SHL(a, b), c) to SHLADD(a, b, c)

Reply via email to