On 2021-03-24 20:33, Richard Biener wrote:
> On Wed, 24 Mar 2021, guojiufu wrote:
>
>> On 2021-03-24 15:55, Richard Biener wrote:
>> > On Wed, Mar 24, 2021 at 3:55 AM guojiufu <guoji...@linux.ibm.com> wrote:
>> >>
>> >> On 2021-03-23 16:25, Richard Biener via Gcc wrote:
>> >> > On Tue, Mar 23, 2021 at 4:33 AM guojiufu <guoji...@imap.linux.ibm.com>
>> >> > wrote:
>> >> >>
>> >> >> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
>> >> >> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc
>> >> >> > wrote:
>> >> >> >> Better than doing loop versioning is to enhance SCEV (and thus
>> >> >> >> also dependence analysis) to track extra conditions they need
>> >> >> >> to handle cases, similar to how niter analysis computes its
>> >> >> >> 'assumptions' condition. That allows the versioning to be done
>> >> >> >> when there's an actual beneficial transform (like vectorization)
>> >> >> >> rather than just upfront for the eventual chance that there'll
>> >> >> >> be any. Ideally such a transform would then choose IVs in its
>> >> >> >> transformed copy that are analyzable w/o repeating such a
>> >> >> >> versioning exercise for the next transform.
>> >> >> >
>> >> >> > And it might be beneficial to perform some type
>> >> >> > promotion/demotion pass, either early during vectorization or
>> >> >> > separately before vectorization on a loop copy guarded with the
>> >> >> > ifns e.g. ifconv uses too.
>> >> >> > Find out what type sizes the loop uses, first try to demote
>> >> >> > computations to narrower types in the vectorized loop candidate
>> >> >> > (e.g. if something is computed in a wider type only to have the
>> >> >> > result demoted to a narrower type), then pick up the widest type
>> >> >> > size still in use in the loop (ok, this assumes we don't mix
>> >> >> > multiple vector sizes in the loop, but currently our vectorizer
>> >> >> > doesn't do that) and try to promote computations that could be
>> >> >> > promoted to that type size. We do partially something like that
>> >> >> > during vect patterns for bool types, but not other types I think.
>> >> >> >
>> >> >> > Jakub
>> >> >>
>> >> >> Thanks for the suggestions!
>> >> >>
>> >> >> Enhancing SCEV could help other optimizations and improve
>> >> >> performance in some cases.
>> >> >> One of the direct ideas of using a '64bit type' is to eliminate
>> >> >> conversions, even for some cases that are not easy to optimize
>> >> >> through ifconv/vectorization, for example:
>> >> >>
>> >> >>   unsigned int i = 0;
>> >> >>   while (a[i]>1e-3)
>> >> >>     i++;
>> >> >>
>> >> >>   unsigned int i = 0;
>> >> >>   while (p1[i] == p2[i] && p1[i] != '\0')
>> >> >>     i++;
>> >> >>
>> >> >> Or should we only do versioning on the type for this kind of loop?
>> >> >> Any suggestions?
>> >> >
>> >> > But the "optimization" resulting from such versioning is hard to
>> >> > determine upfront which means we'll pay quite a big code size cost
>> >> > for unknown questionable gain. What's the particular optimization
>> >>
>> >> Right. The code size increase is a big pain on large loops. If the
>> >> gain is not significant, this optimization may not be worthwhile.
>> >>
>> >> > in the above cases? Note that for example for
>> >> >
>> >> >   unsigned int i = 0;
>> >> >   while (a[i]>1e-3)
>> >> >     i++;
>> >> >
>> >> > you know that when 'i' wraps then the loop will not terminate.
>> >> > There's
>> >>
>> >> Thanks :) The code would be "while (a[i]>1e-3 && i < n)", so the
>> >> upper bound is checkable. Otherwise, the optimization to avoid the
>> >> zext is not applicable.
>> >>
>> >> > the address computation that is i * sizeof (T) which is done in a
>> >> > larger type to avoid overflow so we have &a + zext (i) * 8 - is
>> >> > that the operation that is 'slow' for you?
>> >>
>> >> This is the point: "zext(i)" is the instruction that I want to
>> >> eliminate, which is the direct goal of the optimization.
>> >>
>> >> Whether the gain of eliminating the 'zext' is visible, and whether
>> >> the code size increase is small enough, is a question that needs a
>> >> trade-off. It may only be acceptable if the loop is very small:
>> >> then eliminating the 'zext' helps save runtime, and the code size
>> >> increase may not be big.
>> >
>> > OK, so I indeed think that the desire to micro-optimize a 'zext' doesn't
>> > make versioning a good trade-off. The micro-architecture should better
>> > not make that totally slow (I'd expect an extra latency comparable to
>> > the multiply or add on the &a + zext(i) * 8 instruction chain).
>>
>> Agree, I understand your point. The concern is that some
>> micro-architectures do not handle this very well yet. I tested the
>> above example code:
>>   unsigned i = 0;
>>   while (a[i] > 1e-3 && i < n)
>>     i++;
>> There is a ~30% performance improvement when using "long i" instead
>> of "unsigned i" on ppc64le and x86. It seems those instructions are
>> not optimized well on some platforms. So, I'm wondering whether we
>> need to do this in GCC.
>
> On x86 I see indexed addressing modes being used which should be fine.
> Compilable testcase:
>
> unsigned foo (double *a, unsigned n)
> {
>   unsigned i = 0;
>   while (a[i] > 1e-3 && i < n)
>     i++;
>   return i;
> }
>
> ppc64le seems to do some odd unrolling/peeling or whatnot, I have a hard
> time following its assembly ... ah, -fno-unroll-loops "helps" and produces
>
> .L5:
> lfd %f0,0(%r9)
> addi %r3,%r3,1
> addi %r9,%r9,8
> rldicl %r3,%r3,0,32
> fcmpu %cr0,%f0,%f12
> bnglr %cr0
> bdnz .L5
>
> which looks pretty good to me, I suppose the rldicl is the
> zero-extension but the IVs are already 64bit and the
> zero-extension should be sunk to the loop exit instead.
Thanks so much for your quick reply!
Yes, "rldicl %r3,%r3,0,32" is sinkable, which avoids this zext and
also improves performance.
:) My test code is a little different: argument n's type is long
(sorry).
unsigned __attribute__ ((noinline)) foo (double *a, long n)
{
  unsigned i = 0;
  while (a[i] > 1e-3 && i < n)
    i++;
  return i;
}
111: L111:
39: pc={(%7:CC>=0)?simple_return:pc}
102: %3:SI=%3:SI+0x1
103: %9:DI=%3:DI<<0x3&0x7fffffff8
104: %3:DI=zero_extend(%3:SI)
105: %7:CC=cmp(%3:DI,%4:DI)
106: %0:DF=[%10:DI+%9:DI]
107: %0:CCFP=cmp(%0:DF,%12:DF)
109: pc={(%0:CCFP>0)?L111:pc}
.L14:
bgelr 7
addi 3,3,1
rldic 9,3,3,29
rldicl 3,3,0,32
cmpd 7,3,4
lfdx 0,10,9
fcmpu 0,0,12
bgt 0,.L14
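For comparison, here is a compilable sketch of the "long i" variant I
measured (the function name foo_long is mine): with a 64-bit induction
variable, no zero-extension is needed inside the loop body.

```c
/* Sketch of the "long i" variant: the induction variable is already
   64-bit, so no zero-extension (the rldicl above) is needed in the
   loop body.  */
unsigned
foo_long (double *a, long n)
{
  long i = 0;
  while (a[i] > 1e-3 && i < n)
    i++;
  return (unsigned) i;
}
```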
-------------- or for code:
unsigned __attribute__ ((noinline)) foo (double *a, unsigned n,
                                         unsigned i)
{
  while (a[i] > 1e-3 && i != n)  // if using "&& i < n" the zext is sinkable
    i++;
  return i;
}
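For contrast, a sketch of the same loop with the "&& i < n" exit test
(the name foo_lt is mine): here the exit condition bounds i, so the
zero-extension of i can be sunk to the loop exit.

```c
/* Same loop but with "i < n" instead of "i != n": the exit test
   bounds i, so the zero-extension of i is sinkable out of the loop.  */
unsigned
foo_lt (double *a, unsigned n, unsigned i)
{
  while (a[i] > 1e-3 && i < n)
    i++;
  return i;
}
```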