On Wed, Aug 06, 2025 at 08:48:55AM +0200, Richard Biener via Gcc wrote: > For loops the canonical place to perform such optimization is the loop > distribution pass which already recognizes > memcpy but also strlen (strcmp is more like strlen). > > For straight-line code there's also a bugreport about supporting > vectorization of such > sequences from the basic-block vectorizer. Passes with related transforms are > ifcombine (it now also does limited load merging), store-merging and phiopt.
I think store-merging doesn't have useful infrastructure for such kind of optimization. This optimization needs to analyze many short basic blocks. Maybe also reassoc, though that one is for the case where some conditions can use |/& and others ||/&& in separate basic blocks. Or perhaps widening mul, though that one is too late for vectorization. The question is what the pass should be able to pattern recognize, whether just non-inlined functions which perform the per-member comparison (and which, whether just all equality or all non-equality comparisons (note, one would need to handle both in any case and just verify they jump to the same destination with the same return values), or also <=> cases (for unsigned char members or little-endian larger unsigned vars)), or also when they are inlined and say stored etc. Some food for thought, have a look at say -O3 -fkeep-inline-functions -fdump-tree-all #include <compare> struct S { unsigned char a, b, c, d; unsigned e; bool operator== (const S &) const = default; }; struct T { unsigned char a, b, c, d; unsigned e; auto operator<=> (const T &) const = default; }; void foo (const S *a, const S *b, int *c) { *c = *a == *b ? 42 : -13; } void bar (const T *a, const T *b, std::strong_ordering *c) { *c = *a <=> *b; } bool baz (const S *a, const S *b) { if (a->a != b->a) return false; if (a->b != b->b) return false; if (a->c != b->c) return false; if (a->d != b->d) return false; if (a->e != b->e) return false; return true; } std::strong_ordering qux (const T *a, const T *b) { if (auto c = a->a <=> b->a; c != 0) return c; if (auto c = a->b <=> b->b; c != 0) return c; if (auto c = a->c <=> b->c; c != 0) return c; if (auto c = a->d <=> b->d; c != 0) return c; return a->e <=> b->e; } void freddy (); void garply (const S *a, const S *b) { if (*a == *b) freddy (); } void corge (const T *a, const T *b) { if ((*a <=> *b) < 0) freddy (); } In some cases it will be something like: <bb 2> : _1 = this_13(D)->a; _2 = _14(D)->a; if (_1 == _2) goto <bb 4>; [INV] else goto <bb 3>; [INV] <bb 3> : // predicted unlikely by early return (on trees) predictor. goto <bb 12>; [INV] <bb 4> : _3 = this_13(D)->b; _4 = _14(D)->b; if (_3 == _4) goto <bb 6>; [INV] else goto <bb 5>; [INV] <bb 5> : // predicted unlikely by early return (on trees) predictor. goto <bb 12>; [INV] ... <bb 12> : # _11 = PHI <0(5), 1(10), 0(3), 0(11), 0(9), 0(7)> return _11; i.e. end up with a PHI and sometimes with the forwarder bbs, sometimes without. I see that fnsplit + some IPA optimizations for the S::operator== case split the last part into a separate function and something during IPA? questionably moved the loads earlier: ... <bb 4> [local count: 268435456]: _5 = this_9(D)->c; _6 = _10(D)->c; if (_5 == _6) goto <bb 5>; [50.00%] else goto <bb 9>; [50.00%] <bb 5> [local count: 134217728]: _12 = MEM[(unsigned char *)this_9(D) + 3B]; _13 = MEM[(unsigned int *)this_9(D) + 4B]; _14 = MEM[(unsigned char *)_10(D) + 3B]; _15 = MEM[(unsigned int *)_10(D) + 4B]; if (_12 == _14) goto <bb 6>; [50.00%] else goto <bb 8>; [50.00%] <bb 6> [local count: 67108864]: if (_13 == _15) goto <bb 8>; [50.00%] else goto <bb 7>; [50.00%] <bb 7> [local count: 33554432]: <bb 8> [local count: 134217728]: # _20 = PHI <0(5), 1(6), 0(7)> <bb 9> [local count: 1073741824]: # _7 = PHI <0(3), 0(4), 0(2), _20(8)> return _7; But in other cases it won't be about PHIs but just about where exactly it jumps (in particular, have a look at garply and corge). The spaceship cases are even more difficult to pattern recognize and to transform that to memcmp it needs to be TYPE_UNSIGNED char (or unsigned short, unsigned int, unsigned long etc. for little endian) comparison. And as I've tried to mention in the PR, I'm actually not sure when the vectorization of memcmp is a safe thing to do vs. pointers to incompletely allocated structs. Consider say struct U { char a; char b[63]; }; where the b is actually poor man's flexible array member (because e.g. standard C++ doesn't have that), where allocations are done with offsetof (U, b) + n sizes or so. So, the fact that say b[3] is mapped doesn't mean b[4] will be mapped. Now, if you do some comparison of such structs (== or <=>), if say b[3] is always different between them, the original source will never read b[4], but perhaps vectorized memcmp might want to and crash. This isn't a problem when both arguments of the memcmp are pointers to variables, just when they are pointers to something where we don't know the size. It can be vectorized if we make sure to always align it, but that might result in too large code. > > 2. When a struct A has a struct B field and we compare 2 A instances, > > when getting to this pass, A::operator== still contains a call to > > B::operator==, making it hard to merge the comparisons of B together with > > those of A. What is weird to me is that even after einline and IPA inline > > passes it seems it is still there as a call, instead of being inlined. > > I'd not do this as an early pass (and I'd not do it as a separate pass > anyway). > > > These are of course 2 cases that could be handled by implementing special > > cases (e.g. detecting the loop structure, recursing into the called > > function, then coming back and merging anyway), but I feel as though > > tackling the problems in this way will result in infinite complexity as I > > discover more cases. That's why I was hoping for some feedback on this, > > does the general approach I am taking seem logical for GCC, should I handle > > these as special cases, is there some other way of going about it? > > Any tips very much appreciated, thank you! > > As for the case with the call I'd figure why we do not inline. Jakub