On 12/08/14 07:49, Andrew Pinski wrote: > On Mon, Aug 11, 2014 at 11:44 PM, Marat Zakirov <m.zaki...@samsung.com> wrote: >> Hi Vladimir! >> >> I think you are as the main IRA contributor would be appropriate person to >> answer question bellow. Please confirm or refute my statement about >> unsplittable register ranges in GCC IRA. >> >> >> On 07/30/2014 05:38 PM, Marat Zakirov wrote: >>> >>> Hi there! >>> >>> My question came from bug >>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725. I found that GCC >>> considers NEON register ranges as unsplittable. So any subregister may be >>> used only after whole chunk is dead. This issue leads to redundant spill >>> fills which is performance trouble. >>> >>> Example 1: RAL trouble >>> >>> #include <arm_neon.h> >>> #include <inttypes.h> >>> >>> extern uint16x8x4_t m0; >>> extern uint16x8x4_t m1; >>> extern uint16x8x4_t m2; >>> extern uint16x8x4_t m3; >>> extern uint16x8_t m4; >>> >>> void foo1(uint16_t * in_ptr) >>> { >>> uint16x8x4_t t0, t1, t2, t3; >>> t0 = vld4q_u16((uint16_t *)&in_ptr[0 ]); >>> t1 = vld4q_u16((uint16_t *)&in_ptr[64]); >>> t2 = vld4q_u16((uint16_t *)&in_ptr[128]); >>> t3 = vld4q_u16((uint16_t *)&in_ptr[192]); >>> m4 = t0.val[3]; >>> m4 = m4 * 3; <<< * >>> t0.val[3] = t1.val[3]; >>> m0 = t3; >>> m1 = t2; >>> m2 = t1; >>> m3 = t0; >>> } >>> >>> Here test uses all NEON registers. No spill is needed. Because >>> multiplication requires one Q register which may be obtained from dead >>> t0.val[3] subregister. But GCC makes spill if multiplication (*) exists >>> because of issue described above. >>> >>> Example 2: CSE makes trouble for IRA >>> >>> #include <arm_neon.h> >>> #include <inttypes.h> >>> >>> extern uint16x8x4_t m0; >>> extern uint16x8x4_t m1; >>> >>> void foo2(uint16_t * in_ptr) >>> { >>> uint16x8x4_t t0, t1; >>> t0 = vld4q_u16((uint16_t *)&in_ptr[0 ]); >>> t1 = vld4q_u16((uint16_t *)&in_ptr[64]); >>> t0.val[0] *= 333; >>> t0.val[1] *= 333; >>> t0.val[2] *= 333; >>> t0.val[3] *= 333; >>> t1.val[0] *= 333; >>> t1.val[1] *= 333; >>> t1.val[2] *= 333; >>> t1.val[3] *= 333; >>> m0 = t0; >>> m1 = t1; >>> } >>> >>> Here test uses only half NEON + one Q for '333' factor. But GCC makes >>> spills here too! Briefly speak problem is in partial CSE. GCC generates rtl >>> with the listed bellow form: >>> >>> Before CSE: >>> >>> a = b >>> a0 = a0 * 3 >>> a1 = a1 * 3 >>> a2 = a2 * 3 >>> a3 = a3 * 3 >>> >>> After: >>> >>> a = b >>> a0 = b0 * 3 >>> a1 = a1 * 3 <<< * >>> a2 = a2 * 3 >>> a3 = a3 * 3 >>> >>> CSE do not substitute b1 to a1 because at the moment (*) a0 was already >>> defined so actually a != b. Yes but a1 = b1, unfortunately CSE also do not >>> handle register-ranges parts as RA does. Strange thing here is that even if >>> we fix CSE, so CSE could propagate register-ranges subregs, this will make >>> trouble to RAL also because of the same reason: IRA do not handle precisely >>> register ranges parts. I attached a demo patch which forbids partial CSE >>> propagation and removes spills from Ex2. Is this patch OK? Or maybe CSE >>> should be fixed in a different way? Or maybe partial substitution is OK? >>> >>> Main question: Are there any plans to fix/upgrade IRA? >>> >>> --Marat >> >> >> >> gcc/ChangeLog: >> >> 2014-07-30 Marat Zakirov <m.zaki...@samsung.com> >> >> * cse.c (canon_reg): Forbid partial CSE. >> * fwprop.c (forward_propagate_and_simplify): Likewise. >> >> diff --git a/gcc/cse.c b/gcc/cse.c >> index 34f9364..a9e0442 100644 >> --- a/gcc/cse.c >> +++ b/gcc/cse.c >> @@ -2862,6 +2862,9 @@ canon_reg (rtx x, rtx insn) >> || ! REGNO_QTY_VALID_P (REGNO (x))) >> return x; >> >> + if (GET_MODE (x) == XImode) >> + return x; > > This patch is wrong and even more wrong. XImode is not defined in all > targets. >
Even if it were, this still wouldn't be the right fix. What if a machine had a native XImode? Then you'd be arbitrarily disabling parts of the compiler. R. > Maybe the better fix is to have lower subreg come along and split up > the moves for a = b and then a pass after reload comes along and > stitches it back together. > > Thanks, > Andrew > > >> + >> q = REG_QTY (REGNO (x)); >> ent = &qty_table[q]; >> first = ent->first_reg; >> diff --git a/gcc/fwprop.c b/gcc/fwprop.c >> index 547fcd6..eadc729 100644 >> --- a/gcc/fwprop.c >> +++ b/gcc/fwprop.c >> @@ -1317,6 +1317,9 @@ forward_propagate_and_simplify (df_ref use, rtx >> def_insn, rtx def_set) >> if (!new_rtx) >> return false; >> >> + if (GET_MODE (reg) == XImode) >> + return false; >> + >> return try_fwprop_subst (use, loc, new_rtx, def_insn, set_reg_equal); >> } >> >> >