Re: current NEON status

Marat Zakirov Mon, 11 Aug 2014 23:45:07 -0700

Hi Vladimir!

I think you are as the main IRA contributor would be appropriate personto answer question bellow. Please confirm or refute my statement aboutunsplittable register ranges in GCC IRA.


On 07/30/2014 05:38 PM, Marat Zakirov wrote:

Hi there!

My question came from bug
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725. I found that GCCconsiders NEON register ranges as unsplittable. So any subregister maybe used only after whole chunk is dead. This issue leads to redundantspill fills which is performance trouble.
Example 1: RAL trouble

#include <arm_neon.h>
#include <inttypes.h>

extern  uint16x8x4_t m0;
extern  uint16x8x4_t m1;
extern  uint16x8x4_t m2;
extern  uint16x8x4_t m3;
extern  uint16x8_t   m4;

void foo1(uint16_t * in_ptr)
{
     uint16x8x4_t t0, t1, t2, t3;
     t0 = vld4q_u16((uint16_t *)&in_ptr[0 ]);
     t1 = vld4q_u16((uint16_t *)&in_ptr[64]);
     t2 = vld4q_u16((uint16_t *)&in_ptr[128]);
     t3 = vld4q_u16((uint16_t *)&in_ptr[192]);
     m4 = t0.val[3];
     m4 = m4 * 3;                       <<< *
     t0.val[3] = t1.val[3];
     m0 = t3;
     m1 = t2;
     m2 = t1;
     m3 = t0;
}
Here test uses all NEON registers. No spill is needed. Becausemultiplication requires one Q register which may be obtained from deadt0.val[3] subregister. But GCC makes spill if multiplication (*)exists because of issue described above.
Example 2: CSE makes trouble for IRA

#include <arm_neon.h>
#include <inttypes.h>

extern  uint16x8x4_t m0;
extern  uint16x8x4_t m1;

void foo2(uint16_t * in_ptr)
{
     uint16x8x4_t t0, t1;
     t0 = vld4q_u16((uint16_t *)&in_ptr[0 ]);
     t1 = vld4q_u16((uint16_t *)&in_ptr[64]);
     t0.val[0] *= 333;
     t0.val[1] *= 333;
     t0.val[2] *= 333;
     t0.val[3] *= 333;
     t1.val[0] *= 333;
     t1.val[1] *= 333;
     t1.val[2] *= 333;
     t1.val[3] *= 333;
     m0 = t0;
     m1 = t1;
}
Here test uses only half NEON + one Q for '333' factor. But GCC makesspills here too! Briefly speak problem is in partial CSE. GCCgenerates rtl with the listed bellow form:
Before CSE:

a = b
a0 = a0 * 3
a1 = a1 * 3
a2 = a2 * 3
a3 = a3 * 3

After:

a = b
a0 = b0 * 3
a1 = a1 * 3 <<< *
a2 = a2 * 3
a3 = a3 * 3
CSE do not substitute b1 to a1 because at the moment (*) a0 wasalready defined so actually a != b. Yes but a1 = b1, unfortunately CSEalso do not handle register-ranges parts as RA does. Strange thinghere is that even if we fix CSE, so CSE could propagateregister-ranges subregs, this will make trouble to RAL also because ofthe same reason: IRA do not handle precisely register ranges parts. Iattached a demo patch which forbids partial CSE propagation andremoves spills from Ex2. Is this patch OK? Or maybe CSE should befixed in a different way? Or maybe partial substitution is OK?
Main question: Are there any plans to fix/upgrade IRA?

--Marat

gcc/ChangeLog:

2014-07-30  Marat Zakirov  <m.zaki...@samsung.com>

	* cse.c (canon_reg): Forbid partial CSE.
	* fwprop.c (forward_propagate_and_simplify): Likewise.

diff --git a/gcc/cse.c b/gcc/cse.c
index 34f9364..a9e0442 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -2862,6 +2862,9 @@ canon_reg (rtx x, rtx insn)
 	    || ! REGNO_QTY_VALID_P (REGNO (x)))
 	  return x;
 
+        if (GET_MODE (x) == XImode)
+          return x;
+
 	q = REG_QTY (REGNO (x));
 	ent = &qty_table[q];
 	first = ent->first_reg;
diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index 547fcd6..eadc729 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1317,6 +1317,9 @@ forward_propagate_and_simplify (df_ref use, rtx def_insn, rtx def_set)
   if (!new_rtx)
     return false;
 
+  if (GET_MODE (reg) == XImode)
+    return false;
+
   return try_fwprop_subst (use, loc, new_rtx, def_insn, set_reg_equal);
 }

Re: current NEON status

Reply via email to