https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398
--- Comment #10 from Jakub Jelinek <jakub at gcc dot gnu.org> --- If the compiler knew say from PGO that pos is usually a multiple of certain power of two and that the loop usually iterates many times (I guess the latter can be determined from comparing the bb count of the loop itself and its header), it could emit something like: static int func2(int max, int pos, unsigned char *cur) { unsigned char *p = cur + pos; int len = 0; if (max > 32 && (pos & 7) == 0) { int l = ((1 - ((uintptr_t) cur)) & 7) + 1; while (++len != l) if (p[len] != cur[len]) goto end; unsigned long long __attribute__((may_alias)) *p2 = (unsigned long long *) &p[len]; unsigned long long __attribute__((may_alias)) *cur2 = (unsigned long long *) &cur[len]; while (len + 8 < max) { if (*p2++ != *cur2++) break; len += 8; } --len; } while (++len != max) if (p[len] != cur[len]) break; end: return cur[len]; } or so (untested). Of course, it could be done using SIMD too if there is a way to terminate the loop if any of the elts is different and could be done in that case at 16 or 32 or 64 characters at a time etc. But, without knowing that pos is typically some power of two this would just waste code size, dealing with the unaligned cases would be more complicated (one can't read the next elt until proving that the current one is all equal), so it would need to involve some rotations (or permutes for SIMD).