On Tue, Apr 10, 2012 at 4:53 PM, Michael Matz <m...@suse.de> wrote: > Hi, > > this patch speeds up polyhedrons protein on Bulldozer quite a bit. The > things is that in this testcase cshift is called with a very short length > (<=3) and that the shift amount always is less than the length. > Surprisingly the division instruction takes up considerable amount of > time, so much that it makes sense to guard it, when the shift is in bound. > > Here's some oprofile of _gfortrani_cshift0_i4 (total 31020 cycles): > > 23 0.0032 : caf00: idiv %r13 > 13863 1.9055 : caf03: lea (%rdx,%r13,1),%r12 > > I.e. despite the memory shuffling one third of the cshift cycles are that > division. With the patch the time for protein drops from 0m21.367s to > 0m20.547s on this Bulldozer machine. I've checked that it has no adverse > effect on older AMD or Intel cores (0:44.30elapsed vs 0:44.00elapsed, > still an improvement). > > Regstrapped on x86_64-linux. Okay for trunk? > > > Ciao, > Michael. > > * m4/cshift0.m4 (cshift0_'rtype_code`): Guard use of modulo. > > * generated/cshift0_c10.c: Regenerated. > * generated/cshift0_c16.c: Regenerated. > * generated/cshift0_c4.c: Regenerated. > * generated/cshift0_c8.c: Regenerated. > * generated/cshift0_i16.c: Regenerated. > * generated/cshift0_i1.c: Regenerated. > * generated/cshift0_i2.c: Regenerated. > * generated/cshift0_i4.c: Regenerated. > * generated/cshift0_i8.c: Regenerated. > * generated/cshift0_r10.c: Regenerated. > * generated/cshift0_r16.c: Regenerated. > * generated/cshift0_r4.c: Regenerated. > * generated/cshift0_r8.c: Regenerated. > > Index: m4/cshift0.m4 > =================================================================== > --- m4/cshift0.m4 (revision 186272) > +++ m4/cshift0.m4 (working copy) > @@ -98,9 +98,13 @@ cshift0_'rtype_code` ('rtype` *ret, cons > rptr = ret->base_addr; > sptr = array->base_addr; > > - shift = len == 0 ? 0 : shift % (ptrdiff_t)len; > - if (shift < 0) > - shift += len; > + /* Avoid the costly modulo for trivially in-bound shifts. */ > + if (shift < 0 || shift >= len) > + { > + shift = len == 0 ? 0 : shift % (ptrdiff_t)len; > + if (shift < 0) > + shift += len; > + } > > while (rptr) > {
This is OK. Do you think it would be worthwhile to do this transformation in the middle end too, based on profile information for values? IIRC value-prof handles constant divmod but not ranges for modulo operations. Steven