https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398
--- Comment #17 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to rguent...@suse.de from comment #16)
> But you can't elide the checks in the peeled copies and for 4-times
> unrolling you have most cases exiting on the first or fourth check.

See comment #8 for an example of how it should be unrolled (it needs a simple
check at entry and a trailing loop as well, of course).

> Duff's device simply merges the prologue iterations for unrolling
> with the loop body so I don't see why it can't be used.  It's
>
>   switch (n % 4)
>   {
>   loop:
>     iter
>     n--;
>   case 3:
>     iter
>     n--;
>   case 2:
>     iter
>     n--;
>   case 1:
>     iter
>     n--;
>   case 0:
>     if (n != 0)
>       goto loop;
>   }
>
> its cost is mainly the computed jump into the loop body.  Then
> you have a four-fold reduction in branches without the overhead
> of having another three loop body copies in the prologue with
> retained early exit checks.

Duff's device is a bad idea given it adds extra checks and dependencies that
aren't necessary if you unroll properly. There is never a need to merge the
trailing loop into the unrolled copy, and neither should we peel off 3
iterations for no gain.
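
For readers without comment #8 at hand, the shape being argued for (one check at entry, a 4x unrolled body, a separate trailing loop) might look roughly like the sketch below. This is not the code from comment #8. The loop itself (a byte-wise match-length search with an early exit) and the names match_len_simple and match_len_unrolled are assumptions chosen only for illustration.

```c
#include <stddef.h>

/* Reference loop: one bound check and one early-exit check per element.
   Hypothetical example loop, not taken from the bug report. */
size_t match_len_simple(const unsigned char *a, const unsigned char *b,
                        size_t n)
{
    size_t i = 0;
    while (i < n && a[i] == b[i])
        i++;
    return i;
}

/* 4x unrolled variant: a single entry check, no peeled prologue and no
   computed jump into the body. The early-exit checks are kept in every
   copy; only the bound check is reduced to once per four elements. */
size_t match_len_unrolled(const unsigned char *a, const unsigned char *b,
                          size_t n)
{
    size_t i = 0;

    /* Entry check: the unrolled body needs at least 4 elements. */
    if (n >= 4) {
        size_t limit = n - 3;   /* i < limit implies i + 3 < n */
        do {
            if (a[i]     != b[i])     return i;
            if (a[i + 1] != b[i + 1]) return i + 1;
            if (a[i + 2] != b[i + 2]) return i + 2;
            if (a[i + 3] != b[i + 3]) return i + 3;
            i += 4;
        } while (i < limit);
    }

    /* Trailing loop: handles the remaining 0..3 elements. */
    while (i < n && a[i] == b[i])
        i++;
    return i;
}
```

In this sketch the remainder is handled after the unrolled body rather than before it, so nothing has to compute n % 4 or jump into the middle of the loop, which are the extra steps the Duff's device form above needs.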