https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398
--- Comment #16 from rguenther at suse dot de <rguenther at suse dot de> --- On Tue, 8 Jan 2019, wilco at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398 > > --- Comment #15 from Wilco <wilco at gcc dot gnu.org> --- > (In reply to rguent...@suse.de from comment #14) > > On Mon, 7 Jan 2019, wilco at gcc dot gnu.org wrote: > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398 > > > > > > --- Comment #13 from Wilco <wilco at gcc dot gnu.org> --- > > > So to add some real numbers to the discussion, the average number of > > > iterations > > > is 4.31. Frequency stats (16 includes all iterations > 16 too): > > > > > > 1: 29.0 > > > 2: 4.2 > > > 3: 1.0 > > > 4: 36.7 > > > 5: 8.7 > > > 6: 3.4 > > > 7: 3.0 > > > 8: 2.6 > > > 9: 2.1 > > > 10: 1.9 > > > 11: 1.6 > > > 12: 1.2 > > > 13: 0.9 > > > 14: 0.8 > > > 15: 0.7 > > > 16: 2.1 > > > > > > So unrolling 4x is perfect for this loop. Note the official xz version has > > > optimized this loop since 2014(!) using unaligned accesses: > > > https://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/common/memcmplen.h > > > > I guess if we'd have some data to guide then classical unrolling using > > duffs device would be best here? Because peeling will increase the > > number of dynamic branches and likely the actual distribution of > > #iterations isn't so that they will be well predicted? > > Duff's device is very slow on any CPU with branch prediction. It can't be used > here given you don't know the number of iterations in advance (only the > maximum). > > Unrolling reduces the number of branches given the loop branch is only taken > once every N iterations. The loop overhead is significant here: 3 out of 7 > instructions, with 4x unrolling that reduces to 3 in 19. But you can't elide the checks in the peeled copies and for 4-times unrolling you have most cases exiting on the first or fourth check. Duffs device simply merges the prologue iterations for unrolling with the loop body so I don't see why it can't be used. It's switch (n % 4) { loop: iter n--; case 3: iter n--; case 2: iter n-- case 1: iter n--; case 0: if (n != 0) goto loop; } it's cost is mainly the computed jump into the loop body. Then you have a four-fold reduction in branches without the overhead of having another three loop body copies in the prologue with retained early exit checks.