https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398
--- Comment #15 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to rguent...@suse.de from comment #14)
> On Mon, 7 Jan 2019, wilco at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398
> >
> > --- Comment #13 from Wilco <wilco at gcc dot gnu.org> ---
> > So to add some real numbers to the discussion, the average number of
> > iterations is 4.31. Frequency stats (16 includes all iterations > 16 too):
> >
> > 1: 29.0
> > 2: 4.2
> > 3: 1.0
> > 4: 36.7
> > 5: 8.7
> > 6: 3.4
> > 7: 3.0
> > 8: 2.6
> > 9: 2.1
> > 10: 1.9
> > 11: 1.6
> > 12: 1.2
> > 13: 0.9
> > 14: 0.8
> > 15: 0.7
> > 16: 2.1
> >
> > So unrolling 4x is perfect for this loop. Note the official xz version has
> > optimized this loop since 2014(!) using unaligned accesses:
> > https://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/common/memcmplen.h
>
> I guess if we'd have some data to guide then classical unrolling using
> duffs device would be best here? Because peeling will increase the
> number of dynamic branches and likely the actual distribution of
> #iterations isn't so that they will be well predicted?

Duff's device is very slow on any CPU with branch prediction, and it can't be
used here anyway, since the number of iterations is not known in advance (only
the maximum is). Unrolling reduces the number of branches because the loop
branch is taken only once every N iterations. The loop overhead is significant
here: 3 out of 7 instructions; with 4x unrolling that drops to 3 out of 19.
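
For illustration only (this is not the code from the PR or from liblzma), a
minimal sketch of the baseline match-length loop and a 4x unrolled variant;
the function names and exact loop shape are assumptions:

#include <stdint.h>
#include <string.h>

/* Baseline match-length loop (roughly the shape being discussed):
   about 7 instructions per iteration, 3 of which are loop overhead.  */
static uint32_t
memcmplen_simple (const uint8_t *buf1, const uint8_t *buf2,
                  uint32_t len, uint32_t limit)
{
    while (len < limit && buf1[len] == buf2[len])
        ++len;
    return len;
}

/* 4x unrolled variant: the backward branch is taken once per 4
   comparisons, so the loop overhead shrinks from 3 of 7 instructions
   to roughly 3 of 19.  Assumes len <= limit on entry.  */
static uint32_t
memcmplen_unroll4 (const uint8_t *buf1, const uint8_t *buf2,
                   uint32_t len, uint32_t limit)
{
    while (limit - len >= 4)
    {
        if (buf1[len] != buf2[len])
            return len;
        if (buf1[len + 1] != buf2[len + 1])
            return len + 1;
        if (buf1[len + 2] != buf2[len + 2])
            return len + 2;
        if (buf1[len + 3] != buf2[len + 3])
            return len + 3;
        len += 4;
    }

    /* Tail: fewer than 4 bytes left before the limit.  */
    while (len < limit && buf1[len] == buf2[len])
        ++len;

    return len;
}

And a hedged sketch of the unaligned-access approach used by the xz
memcmplen.h linked above (again illustrative, not the actual xz code): compare
8 bytes at a time and locate the first differing byte with a
count-trailing-zeros on the XOR.  This assumes a little-endian target and that
reading a few bytes past `limit` is safe (liblzma guarantees extra space in
its buffers for this):

static uint32_t
memcmplen_word (const uint8_t *buf1, const uint8_t *buf2,
                uint32_t len, uint32_t limit)
{
    while (len < limit)
    {
        uint64_t a, b;
        memcpy (&a, buf1 + len, sizeof a);   /* unaligned 64-bit loads */
        memcpy (&b, buf2 + len, sizeof b);

        uint64_t x = a ^ b;
        if (x != 0)
        {
            /* Index of the first mismatching byte (little-endian).  */
            len += (uint32_t) __builtin_ctzll (x) / 8;
            return len < limit ? len : limit;
        }

        len += 8;
    }

    return limit;
}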