[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

rguenther at suse dot de Mon, 01 Jun 2020 23:36:15 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398


--- Comment #37 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 2 Jun 2020, guojiufu at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398
> 
> --- Comment #36 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
> (In reply to Jakub Jelinek from comment #10)
> > If the compiler knew say from PGO that pos is usually a multiple of certain
> > power of two and that the loop usually iterates many times (I guess the
> > latter can be determined from comparing the bb count of the loop itself and
> > its header), it could emit something like:
> > static int func2(int max, int pos, unsigned char *cur)
> > {
> >   unsigned char *p = cur + pos;
> >   int len = 0;
> >   if (max > 32 && (pos & 7) == 0)
> >     {
> >       int l = ((1 - ((uintptr_t) cur)) & 7) + 1;
> >       while (++len != l)
> >         if (p[len] != cur[len])
> >           goto end;
> >       unsigned long long __attribute__((may_alias)) *p2 = (unsigned long
> > long *) &p[len];
> >       unsigned long long __attribute__((may_alias)) *cur2 = (unsigned long
> > long *) &cur[len];
> >       while (len + 8 < max)
> >         {
> >           if (*p2++ != *cur2++)
> >             break;
> >           len += 8;
> >         }
> >       --len;
> >     }
> >   while (++len != max)
> >     if (p[len] != cur[len])
> >       break;
> > end:
> >   return cur[len];
> > }
> > 
> > or so (untested).  Of course, it could be done using SIMD too if there is a
> > way to terminate the loop if any of the elts is different and could be done
> > in that case at 16 or 32 or 64 characters at a time etc.
> > But, without knowing that pos is typically some power of two this would just
> > waste code size, dealing with the unaligned cases would be more complicated
> > (one can't read the next elt until proving that the current one is all
> > equal), so it would need to involve some rotations (or permutes for SIMD).
> 
> Unaligned reading is supported on some platforms already, and reading
> multi-bytes(64/128bits) takes far less cost than reading 8bits multi-times,
> extremely, dword reading may cost the same cycles as byte reading.
> As the above discussions, there are still a few kinds of stuff need to take
> care of.  I’m wondering if we could introduce this as a compiler optimization
> in some circumstances.

The challenge is as always to not regress cases too much where it ends
up not profitable rather than just making it work for SPEC ;)  Luckily
chasing SPEC numbers isn't our primary objective.

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

Reply via email to