On Wed, Aug 6, 2025 at 9:40 AM Jakub Jelinek <ja...@redhat.com> wrote:
>
> On Wed, Aug 06, 2025 at 08:48:55AM +0200, Richard Biener via Gcc wrote:
> > For loops the canonical place to perform such optimization is the loop
> > distribution pass which already recognizes
> > memcpy but also strlen (strcmp is more like strlen).
> >
> > For straight-line code there's also a bugreport about supporting
> > vectorization of such
> > sequences from the basic-block vectorizer.  Passes with related transforms 
> > are
> > ifcombine (it now also does limited load merging), store-merging and phiopt.
>
> I think store-merging doesn't have useful infrastructure for such kind of
> optimization.  This optimization needs to analyze many short basic blocks.
> Maybe also reassoc, though that one is for the case where
> some conditions can use |/& and others ||/&& in separate basic blocks.
> Or perhaps widening mul, though that one is too late for vectorization.
>
> The question is what the pass should be able to pattern recognize, whether
> just non-inlined functions which perform the per-member comparison (and
> which, whether just all equality or all non-equality comparisons (note, one
> would need to handle both in any case and just verify they jump to the same
> destination with the same return values), or also <=> cases (for unsigned
> char members or little-endian larger unsigned vars)), or also when they are
> inlined and say stored etc.
>
> Some food for thought, have a look at say
> -O3 -fkeep-inline-functions -fdump-tree-all
> #include <compare>
>
> struct S { unsigned char a, b, c, d; unsigned e; bool operator== (const S &) 
> const = default; };
> struct T { unsigned char a, b, c, d; unsigned e; auto operator<=> (const T &) 
> const = default; };
> void foo (const S *a, const S *b, int *c) { *c = *a == *b ? 42 : -13; }
> void bar (const T *a, const T *b, std::strong_ordering *c) { *c = *a <=> *b; }
> bool baz (const S *a, const S *b) {
>   if (a->a != b->a) return false;
>   if (a->b != b->b) return false;
>   if (a->c != b->c) return false;
>   if (a->d != b->d) return false;
>   if (a->e != b->e) return false;
>   return true;
> }
> std::strong_ordering qux (const T *a, const T *b) {
>   if (auto c = a->a <=> b->a; c != 0) return c;
>   if (auto c = a->b <=> b->b; c != 0) return c;
>   if (auto c = a->c <=> b->c; c != 0) return c;
>   if (auto c = a->d <=> b->d; c != 0) return c;
>   return a->e <=> b->e;
> }
> void freddy ();
> void garply (const S *a, const S *b) { if (*a == *b) freddy (); }
> void corge (const T *a, const T *b) { if ((*a <=> *b) < 0) freddy (); }
> In some cases it will be something like:
>   <bb 2> :
>   _1 = this_13(D)->a;
>   _2 = _14(D)->a;
>   if (_1 == _2)
>     goto <bb 4>; [INV]
>   else
>     goto <bb 3>; [INV]
>
>   <bb 3> :
>   // predicted unlikely by early return (on trees) predictor.
>   goto <bb 12>; [INV]
>
>   <bb 4> :
>   _3 = this_13(D)->b;
>   _4 = _14(D)->b;
>   if (_3 == _4)
>     goto <bb 6>; [INV]
>   else
>     goto <bb 5>; [INV]
>
>   <bb 5> :
>   // predicted unlikely by early return (on trees) predictor.
>   goto <bb 12>; [INV]
> ...
>   <bb 12> :
>   # _11 = PHI <0(5), 1(10), 0(3), 0(11), 0(9), 0(7)>
>   return _11;
> i.e. end up with a PHI and sometimes with the forwarder bbs, sometimes
> without.  I see that fnsplit + some IPA optimizations for the
> S::operator== case split the last part into a separate function and
> something during IPA? questionably moved the loads earlier:
> ...
>   <bb 4> [local count: 268435456]:
>   _5 = this_9(D)->c;
>   _6 = _10(D)->c;
>   if (_5 == _6)
>     goto <bb 5>; [50.00%]
>   else
>     goto <bb 9>; [50.00%]
>
>   <bb 5> [local count: 134217728]:
>   _12 = MEM[(unsigned char *)this_9(D) + 3B];
>   _13 = MEM[(unsigned int *)this_9(D) + 4B];
>   _14 = MEM[(unsigned char *)_10(D) + 3B];
>   _15 = MEM[(unsigned int *)_10(D) + 4B];
>   if (_12 == _14)
>     goto <bb 6>; [50.00%]
>   else
>     goto <bb 8>; [50.00%]
>
>   <bb 6> [local count: 67108864]:
>   if (_13 == _15)
>     goto <bb 8>; [50.00%]
>   else
>     goto <bb 7>; [50.00%]
>
>   <bb 7> [local count: 33554432]:
>
>   <bb 8> [local count: 134217728]:
>   # _20 = PHI <0(5), 1(6), 0(7)>
>
>   <bb 9> [local count: 1073741824]:
>   # _7 = PHI <0(3), 0(4), 0(2), _20(8)>
>   return _7;
>
> But in other cases it won't be about PHIs but just about where exactly it
> jumps (in particular, have a look at garply and corge).
> The spaceship cases are even more difficult to pattern recognize and to
> transform that to memcmp it needs to be TYPE_UNSIGNED char (or
> unsigned short, unsigned int, unsigned long etc. for little endian)
> comparison.
>
> And as I've tried to mention in the PR, I'm actually not sure when
> the vectorization of memcmp is a safe thing to do vs. pointers to
> incompletely allocated structs.
> Consider say
> struct U { char a; char b[63]; }; where the b is actually poor man's
> flexible array member (because e.g. standard C++ doesn't have that),
> where allocations are done with offsetof (U, b) + n sizes or so.
> So, the fact that say b[3] is mapped doesn't mean b[4] will be mapped.
> Now, if you do some comparison of such structs (== or <=>), if say
> b[3] is always different between them, the original source will never
> read b[4], but perhaps vectorized memcmp might want to and crash.
>
> This isn't a problem when both arguments of the memcmp are pointers
> to variables, just when they are pointers to something where we don't
> know the size.  It can be vectorized if we make sure to always align
> it, but that might result in too large code.

If the accesses can possibly fault then it's like vectorization of
early breaks - you'd either need alignment guarantees or something
like first-fault-loads (aka loads that do not fault).

Depending on how exactly pointers are formed you can't use
the guarantee that a pointer has to point to a valid object to
ignore ordering either, so for char *, p[0] || p[-1] cannot be
re-ordered (you might argue p[1] || p[0] can because you have
that pointer to p[0])

Richard.

>
> > >   2.  When a struct A has a struct B field and we compare 2 A instances, 
> > > when getting to this pass, A::operator== still contains a call to 
> > > B::operator==, making it hard to merge the comparisons of B together with 
> > > those of A. What is weird to me is that even after einline and IPA inline 
> > > passes it seems it is still there as a call, instead of being inlined.
> >
> > I'd not do this as an early pass (and I'd not do it as a separate pass 
> > anyway).
> >
> > > These are of course 2 cases that could be handled by implementing special 
> > > cases (e.g. detecting the loop structure, recursing into the called 
> > > function, then coming back and merging anyway), but I feel as though 
> > > tackling the problems in this way will result in infinite complexity as I 
> > > discover more cases. That's why I was hoping for some feedback on this, 
> > > does the general approach I am taking seem logical for GCC, should I 
> > > handle these as special cases, is there some other way of going about it?
> > > Any tips very much appreciated, thank you!
> >
> > As for the case with the call I'd figure why we do not inline.
>
>         Jakub
>

Reply via email to