On 2021-03-24 20:33, Richard Biener wrote:
> On Wed, 24 Mar 2021, guojiufu wrote:
>
>> On 2021-03-24 15:55, Richard Biener wrote:
>> > On Wed, Mar 24, 2021 at 3:55 AM guojiufu <guoji...@linux.ibm.com> wrote:
>> >>
>> >> On 2021-03-23 16:25, Richard Biener via Gcc wrote:
>> >> > On Tue, Mar 23, 2021 at 4:33 AM guojiufu <guoji...@imap.linux.ibm.com>
>> >> > wrote:
>> >> >>
>> >> >> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
>> >> >> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc
>> >> >> > wrote:
>> >> >> >> Better than doing loop versioning is to enhance SCEV (and thus
>> >> >> >> also dependence analysis) to track extra conditions they need
>> >> >> >> to handle cases, similar to how niter analysis computes its
>> >> >> >> 'assumptions' condition. That allows the versioning to be done
>> >> >> >> when there's an actual beneficial transform (like vectorization)
>> >> >> >> rather than just upfront for the eventual chance that there'll
>> >> >> >> be any. Ideally such a transform would then choose IVs in its
>> >> >> >> transformed copy that are analyzable w/o repeating such a
>> >> >> >> versioning exercise for the next transform.
>> >> >> >
>> >> >> > And it might be beneficial to perform some type
>> >> >> > promotion/demotion pass, either early during vectorization or
>> >> >> > separately before vectorization on a loop copy guarded with the
>> >> >> > ifns e.g. ifconv uses too.
>> >> >> > Find out what type sizes the loop uses, first try to demote
>> >> >> > computations to narrower types in the vectorized loop candidate
>> >> >> > (e.g. if something is computed in a wider type only to have the
>> >> >> > result demoted to a narrower type), then pick up the widest type
>> >> >> > size still in use in the loop (ok, this assumes we don't mix
>> >> >> > multiple vector sizes in the loop, but currently our vectorizer
>> >> >> > doesn't do that) and try to promote computations that could be
>> >> >> > promoted to that type size. We do partially something like that
>> >> >> > during vect patterns for bool types, but not other types I think.
>> >> >> >
>> >> >> > Jakub
>> >> >>
>> >> >> Thanks for the suggestions!
>> >> >>
>> >> >> Enhancing SCEV could help other optimizations and improve
>> >> >> performance in some cases.
>> >> >> One of the direct ideas of using a '64bit type' is to eliminate
>> >> >> conversions, even for some cases that are not easy to optimize
>> >> >> through ifconv/vectorization, for example:
>> >> >>
>> >> >>   unsigned int i = 0;
>> >> >>   while (a[i]>1e-3)
>> >> >>     i++;
>> >> >>
>> >> >>   unsigned int i = 0;
>> >> >>   while (p1[i] == p2[i] && p1[i] != '\0')
>> >> >>     i++;
>> >> >>
>> >> >> Or should we only do versioning on the type for this kind of loop?
>> >> >> Any suggestions?
>> >> >
>> >> > But the "optimization" resulting from such versioning is hard to
>> >> > determine upfront which means we'll pay quite a big code size cost
>> >> > for unknown questionable gain. What's the particular optimization
>> >>
>> >> Right. The code size increase is a big pain on large loops. If the
>> >> gain is not significant, this optimization may not be worthwhile.
>> >>
>> >> > in the above cases? Note that for example for
>> >> >
>> >> >   unsigned int i = 0;
>> >> >   while (a[i]>1e-3)
>> >> >     i++;
>> >> >
>> >> > you know that when 'i' wraps then the loop will not terminate.
>> >> > There's
>> >>
>> >> Thanks :) The code would be "while (a[i]>1e-3 && i < n)", so the
>> >> upper bound is checkable. Otherwise, the optimization to avoid the
>> >> zext is not applicable.
>> >>
>> >> > the address computation that is i * sizeof (T) which is done in a
>> >> > larger type to avoid overflow so we have &a + zext (i) * 8 - is
>> >> > that the operation that is 'slow' for you?
>> >>
>> >> This is the point: "zext(i)" is the instruction that I want to
>> >> eliminate, which is the direct goal of the optimization.
>> >>
>> >> Whether the gain of eliminating the 'zext' is visible, and whether
>> >> the code size increase is small enough, is a question that needs a
>> >> trade-off. It may only be acceptable if the loop is very small:
>> >> then eliminating the 'zext' helps save runtime, and the code size
>> >> increase may not be big.
>> >
>> > OK, so I indeed think that the desire to micro-optimize a 'zext' doesn't
>> > make versioning a good trade-off. The micro-architecture should better
>> > not make that totally slow (I'd expect an extra latency comparable to
>> > the multiply or add on the &a + zext(i) * 8 instruction chain).
>>
>> Agree, I understand your point. The concern is that some
>> micro-architectures do not handle this very well yet. I tested the
>> above example code:
>>   unsigned i = 0;
>>   while (a[i] > 1e-3 && i < n)
>>     i++;
>> There is a ~30% performance improvement when using "long i" instead
>> of "unsigned i" on ppc64le and x86. It seems those instructions are
>> not optimized well on some platforms. So, I'm wondering whether we
>> need to do this in GCC.
>
> On x86 I see indexed addressing modes being used which should be fine.
> Compilable testcase:
>
> unsigned foo (double *a, unsigned n)
> {
>   unsigned i = 0;
>   while (a[i] > 1e-3 && i < n)
>     i++;
>   return i;
> }
>
> ppc64le seems to do some odd unrolling/peeling or whatnot, I have a hard
> time following its assembly ... ah, -fno-unroll-loops "helps" and produces
>
> .L5:
> lfd %f0,0(%r9)
> addi %r3,%r3,1
> addi %r9,%r9,8
> rldicl %r3,%r3,0,32
> fcmpu %cr0,%f0,%f12
> bnglr %cr0
> bdnz .L5
>
> which looks pretty good to me, I suppose the rldicl is the
> zero-extension but the IVs are already 64bit and the
> zero-extension should be sunk to the loop exit instead.
Thanks so much for your quick reply!
Yes, "rldicl %r3,%r3,0,32" is sinkable, which avoids this zext and
also improves performance.
:) My test code is a little different: argument n's type is long
(sorry).
unsigned __attribute__ ((noinline)) foo (double *a, long n)
{
  unsigned i = 0;
  while (a[i] > 1e-3 && i < n)
    i++;
  return i;
}
111: L111:
39: pc={(%7:CC>=0)?simple_return:pc}
102: %3:SI=%3:SI+0x1
103: %9:DI=%3:DI<<0x3&0x7fffffff8
104: %3:DI=zero_extend(%3:SI)
105: %7:CC=cmp(%3:DI,%4:DI)
106: %0:DF=[%10:DI+%9:DI]
107: %0:CCFP=cmp(%0:DF,%12:DF)
109: pc={(%0:CCFP>0)?L111:pc}
.L14:
bgelr 7
addi 3,3,1
rldic 9,3,3,29
rldicl 3,3,0,32
cmpd 7,3,4
lfdx 0,10,9
fcmpu 0,0,12
bgt 0,.L14
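For comparison, here is a compilable sketch of the "long i" variant I
measured (the function name foo_long is mine): with a 64-bit induction
variable, no zero-extension is needed inside the loop body.

```c
/* Sketch of the "long i" variant: the induction variable is already
   64-bit, so no zero-extension (the rldicl above) is needed in the
   loop body.  */
unsigned
foo_long (double *a, long n)
{
  long i = 0;
  while (a[i] > 1e-3 && i < n)
    i++;
  return (unsigned) i;
}
```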
-------------- or for code:
unsigned __attribute__ ((noinline)) foo (double *a, unsigned n,
                                         unsigned i)
{
  while (a[i] > 1e-3 && i != n)  // if using "&& i < n" the zext is sinkable
    i++;
  return i;
}
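For contrast, a sketch of the same loop with the "&& i < n" exit test
(the name foo_lt is mine): here the exit condition bounds i, so the
zero-extension of i can be sunk to the loop exit.

```c
/* Same loop but with "i < n" instead of "i != n": the exit test
   bounds i, so the zero-extension of i is sinkable out of the loop.  */
unsigned
foo_lt (double *a, unsigned n, unsigned i)
{
  while (a[i] > 1e-3 && i < n)
    i++;
  return i;
}
```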