On Tue, Mar 21, 2017 at 06:29:10AM -0700, Matthew Wilcox wrote:
> > Unrolling the loop could help a bit on old powerpc32s that don't have branch
> > units, but on those processors the main driver is the time spent to do the
> > effective write to memory, and the operations necessary to unroll the loop
> > are not worth the cycle added by the branch.
> >
> > On more modern powerpc32s, the branch unit implies that branches have a zero
> > cost.
>
> Fair enough. I'm just surprised it was worth unrolling the loop on
> powerpc64 and not on powerpc32 -- see mem_64.S.

On modern, bigger CPUs we can do at most one loop iteration per cycle, but
multiple stores per cycle. Many old or small CPUs, on the other hand, have
only one load/store unit, so unrolling gains them little. There are other
issues, but that is the biggest difference.


Segher
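
To illustrate the point about iterations versus stores per cycle, here is a rough C sketch (not the kernel's mem_64.S; the function names are made up for this example). With one store per iteration, the loop branch caps throughput at one store per cycle. With four stores per iteration, a core with more than one store unit can, in principle, retire several stores for each trip around the loop. A core with a single load/store unit is store-bound either way. Note that an optimizing compiler may unroll or vectorize either loop on its own, so treat this as a picture of the idea rather than a benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: one store per loop iteration.
 * Throughput is bounded by how fast the loop itself can iterate. */
void fill_words_simple(uint32_t *dst, uint32_t value, size_t n)
{
    assert(dst != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        dst[i] = value;
}

/* Hypothetical helper: four independent stores per loop iteration.
 * On a core with several store units these can issue in parallel;
 * on a core with one load/store unit they still go out one at a time. */
void fill_words_unrolled4(uint32_t *dst, uint32_t value, size_t n)
{
    assert(dst != NULL || n == 0);
    size_t i = 0;

    /* Main unrolled body: runs while at least four words remain. */
    for (; n - i >= 4; i += 4) {
        dst[i]     = value;
        dst[i + 1] = value;
        dst[i + 2] = value;
        dst[i + 3] = value;
    }

    /* Tail: handle the remaining zero to three words. */
    for (; i < n; i++)
        dst[i] = value;
}
```

Both functions write exactly the same memory; only the number of stores per branch differs, which is the trade-off being discussed above.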