On Wed, Aug 6, 2025 at 9:40 AM Jakub Jelinek <ja...@redhat.com> wrote: > > On Wed, Aug 06, 2025 at 08:48:55AM +0200, Richard Biener via Gcc wrote: > > For loops the canonical place to perform such optimization is the loop > > distribution pass which already recognizes > > memcpy but also strlen (strcmp is more like strlen). > > > > For straight-line code there's also a bugreport about supporting > > vectorization of such > > sequences from the basic-block vectorizer. Passes with related transforms > > are > > ifcombine (it now also does limited load merging), store-merging and phiopt. > > I think store-merging doesn't have useful infrastructure for such kind of > optimization. This optimization needs to analyze many short basic blocks. > Maybe also reassoc, though that one is for the case where > some conditions can use |/& and others ||/&& in separate basic blocks. > Or perhaps widening mul, though that one is too late for vectorization. > > The question is what the pass should be able to pattern recognize, whether > just non-inlined functions which perform the per-member comparison (and > which, whether just all equality or all non-equality comparisons (note, one > would need to handle both in any case and just verify they jump to the same > destination with the same return values), or also <=> cases (for unsigned > char members or little-endian larger unsigned vars)), or also when they are > inlined and say stored etc. > > Some food for thought, have a look at say > -O3 -fkeep-inline-functions -fdump-tree-all > #include <compare> > > struct S { unsigned char a, b, c, d; unsigned e; bool operator== (const S &) > const = default; }; > struct T { unsigned char a, b, c, d; unsigned e; auto operator<=> (const T &) > const = default; }; > void foo (const S *a, const S *b, int *c) { *c = *a == *b ? 42 : -13; } > void bar (const T *a, const T *b, std::strong_ordering *c) { *c = *a <=> *b; } > bool baz (const S *a, const S *b) { > if (a->a != b->a) return false; > if (a->b != b->b) return false; > if (a->c != b->c) return false; > if (a->d != b->d) return false; > if (a->e != b->e) return false; > return true; > } > std::strong_ordering qux (const T *a, const T *b) { > if (auto c = a->a <=> b->a; c != 0) return c; > if (auto c = a->b <=> b->b; c != 0) return c; > if (auto c = a->c <=> b->c; c != 0) return c; > if (auto c = a->d <=> b->d; c != 0) return c; > return a->e <=> b->e; > } > void freddy (); > void garply (const S *a, const S *b) { if (*a == *b) freddy (); } > void corge (const T *a, const T *b) { if ((*a <=> *b) < 0) freddy (); } > In some cases it will be something like: > <bb 2> : > _1 = this_13(D)->a; > _2 = _14(D)->a; > if (_1 == _2) > goto <bb 4>; [INV] > else > goto <bb 3>; [INV] > > <bb 3> : > // predicted unlikely by early return (on trees) predictor. > goto <bb 12>; [INV] > > <bb 4> : > _3 = this_13(D)->b; > _4 = _14(D)->b; > if (_3 == _4) > goto <bb 6>; [INV] > else > goto <bb 5>; [INV] > > <bb 5> : > // predicted unlikely by early return (on trees) predictor. > goto <bb 12>; [INV] > ... > <bb 12> : > # _11 = PHI <0(5), 1(10), 0(3), 0(11), 0(9), 0(7)> > return _11; > i.e. end up with a PHI and sometimes with the forwarder bbs, sometimes > without. I see that fnsplit + some IPA optimizations for the > S::operator== case split the last part into a separate function and > something during IPA? questionably moved the loads earlier: > ... > <bb 4> [local count: 268435456]: > _5 = this_9(D)->c; > _6 = _10(D)->c; > if (_5 == _6) > goto <bb 5>; [50.00%] > else > goto <bb 9>; [50.00%] > > <bb 5> [local count: 134217728]: > _12 = MEM[(unsigned char *)this_9(D) + 3B]; > _13 = MEM[(unsigned int *)this_9(D) + 4B]; > _14 = MEM[(unsigned char *)_10(D) + 3B]; > _15 = MEM[(unsigned int *)_10(D) + 4B]; > if (_12 == _14) > goto <bb 6>; [50.00%] > else > goto <bb 8>; [50.00%] > > <bb 6> [local count: 67108864]: > if (_13 == _15) > goto <bb 8>; [50.00%] > else > goto <bb 7>; [50.00%] > > <bb 7> [local count: 33554432]: > > <bb 8> [local count: 134217728]: > # _20 = PHI <0(5), 1(6), 0(7)> > > <bb 9> [local count: 1073741824]: > # _7 = PHI <0(3), 0(4), 0(2), _20(8)> > return _7; > > But in other cases it won't be about PHIs but just about where exactly it > jumps (in particular, have a look at garply and corge). > The spaceship cases are even more difficult to pattern recognize and to > transform that to memcmp it needs to be TYPE_UNSIGNED char (or > unsigned short, unsigned int, unsigned long etc. for little endian) > comparison. > > And as I've tried to mention in the PR, I'm actually not sure when > the vectorization of memcmp is a safe thing to do vs. pointers to > incompletely allocated structs. > Consider say > struct U { char a; char b[63]; }; where the b is actually poor man's > flexible array member (because e.g. standard C++ doesn't have that), > where allocations are done with offsetof (U, b) + n sizes or so. > So, the fact that say b[3] is mapped doesn't mean b[4] will be mapped. > Now, if you do some comparison of such structs (== or <=>), if say > b[3] is always different between them, the original source will never > read b[4], but perhaps vectorized memcmp might want to and crash. > > This isn't a problem when both arguments of the memcmp are pointers > to variables, just when they are pointers to something where we don't > know the size. It can be vectorized if we make sure to always align > it, but that might result in too large code.
If the accesses can possibly fault then it's like vectorization of early breaks - you'd either need alignment guarantees or something like first-fault-loads (aka loads that do not fault). Depending on how exactly pointers are formed you can't use the guarantee that a pointer has to point to a valid object to ignore ordering either, so for char *, p[0] || p[-1] cannot be re-ordered (you might argue p[1] || p[0] can because you have that pointer to p[0]) Richard. > > > > 2. When a struct A has a struct B field and we compare 2 A instances, > > > when getting to this pass, A::operator== still contains a call to > > > B::operator==, making it hard to merge the comparisons of B together with > > > those of A. What is weird to me is that even after einline and IPA inline > > > passes it seems it is still there as a call, instead of being inlined. > > > > I'd not do this as an early pass (and I'd not do it as a separate pass > > anyway). > > > > > These are of course 2 cases that could be handled by implementing special > > > cases (e.g. detecting the loop structure, recursing into the called > > > function, then coming back and merging anyway), but I feel as though > > > tackling the problems in this way will result in infinite complexity as I > > > discover more cases. That's why I was hoping for some feedback on this, > > > does the general approach I am taking seem logical for GCC, should I > > > handle these as special cases, is there some other way of going about it? > > > Any tips very much appreciated, thank you! > > > > As for the case with the call I'd figure why we do not inline. > > Jakub >