Re: current NEON status

Marat Zakirov Tue, 12 Aug 2014 08:32:58 -0700


On 08/12/2014 12:20 PM, Richard Earnshaw wrote:

On 12/08/14 07:49, Andrew Pinski wrote:

On Mon, Aug 11, 2014 at 11:44 PM, Marat Zakirov <m.zaki...@samsung.com> wrote:

Hi Vladimir!


I think you are as the main IRA contributor would be appropriate person to
answer question bellow. Please confirm or refute my statement about
unsplittable register ranges in GCC IRA.


On 07/30/2014 05:38 PM, Marat Zakirov wrote:

Hi there!

My question came from bug
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725. I found that GCC
considers NEON register ranges as unsplittable. So any subregister may be
used only after whole chunk is dead. This issue leads to redundant spill
fills which is performance trouble.

Example 1: RAL trouble

#include <arm_neon.h>
#include <inttypes.h>

extern  uint16x8x4_t m0;
extern  uint16x8x4_t m1;
extern  uint16x8x4_t m2;
extern  uint16x8x4_t m3;
extern  uint16x8_t   m4;

void foo1(uint16_t * in_ptr)
{
      uint16x8x4_t t0, t1, t2, t3;
      t0 = vld4q_u16((uint16_t *)&in_ptr[0 ]);
      t1 = vld4q_u16((uint16_t *)&in_ptr[64]);
      t2 = vld4q_u16((uint16_t *)&in_ptr[128]);
      t3 = vld4q_u16((uint16_t *)&in_ptr[192]);
      m4 = t0.val[3];
      m4 = m4 * 3;                       <<< *
      t0.val[3] = t1.val[3];
      m0 = t3;
      m1 = t2;
      m2 = t1;
      m3 = t0;
}

Here test uses all NEON registers. No spill is needed. Because
multiplication requires one Q register which may be obtained from dead
t0.val[3] subregister. But GCC makes spill if multiplication (*) exists
because of issue described above.

Example 2: CSE makes trouble for IRA

#include <arm_neon.h>
#include <inttypes.h>

extern  uint16x8x4_t m0;
extern  uint16x8x4_t m1;

void foo2(uint16_t * in_ptr)
{
      uint16x8x4_t t0, t1;
      t0 = vld4q_u16((uint16_t *)&in_ptr[0 ]);
      t1 = vld4q_u16((uint16_t *)&in_ptr[64]);
      t0.val[0] *= 333;
      t0.val[1] *= 333;
      t0.val[2] *= 333;
      t0.val[3] *= 333;
      t1.val[0] *= 333;
      t1.val[1] *= 333;
      t1.val[2] *= 333;
      t1.val[3] *= 333;
      m0 = t0;
      m1 = t1;
}

Here test uses only half NEON + one Q for '333' factor. But GCC makes
spills here too! Briefly speak problem is in partial CSE. GCC generates rtl
with the listed bellow form:

Before CSE:

a = b
a0 = a0 * 3
a1 = a1 * 3
a2 = a2 * 3
a3 = a3 * 3

After:

a = b
a0 = b0 * 3
a1 = a1 * 3 <<< *
a2 = a2 * 3
a3 = a3 * 3

CSE do not substitute b1 to a1 because at the moment (*) a0 was already
defined so actually a != b. Yes but a1 = b1, unfortunately CSE also do not
handle register-ranges parts as RA does. Strange thing here is that even if
we fix CSE, so CSE could propagate register-ranges subregs, this will make
trouble to RAL also because of the same reason: IRA do not handle precisely
register ranges parts. I attached a demo patch which forbids partial CSE
propagation and removes spills from Ex2. Is this patch OK? Or maybe CSE
should be fixed in a different way? Or maybe partial substitution is OK?

Main question: Are there any plans to fix/upgrade IRA?

--Marat

This patch is wrong and even more wrong.  XImode is not defined in all targets.

Even if it were, this still wouldn't be the right fix.  What if a
machine had a native XImode?  Then you'd be arbitrarily disabling parts
of the compiler.

R.

Maybe the better fix is to have lower subreg come along and split up
the moves for a = b and then a pass after reload comes along and
stitches it back together.

Thanks,
Andrew

Hi all,

First, thank you to pay attention for issue which is totally untrivial.

I wrote above that my patch is only a demo which illustrates issue withregister ranges. My question is about conception not implementation.This demo patch switch off CSE partial substitution allowing nextcompiler passes substitute whole b and then throw away a at all whichdramatically decreases register pressure in the example above. It isstrange that disabling some compiler functionality gives you a betterresult. Anyway I believe that only way to really fix the issue - is toupgrade IRA pass. Upgraded version of IRA should work with parts ofrange register as they are separate ones, they should have separate LiveRanges, etc.

Could I propose appropriate version of my patch (which will do the sameconceptually - forbidding partial CSE)? If not what would be the properway to improve the code in my case?

Re: current NEON status

Reply via email to