Toon Moene wrote:
Toon Moene wrote:
Tim Prince wrote:
> If you want those, you must request them with -mtune=barcelona.
OK, so it is an alignment issue (with -mtune=barcelona):
.L6:
movups 0(%rbp,%rax), %xmm0
movups (%rbx,%rax), %xmm1
incl %ecx
addps %xmm1, %xmm0
movaps %xmm0, (%r8,%rax)
addq $16, %rax
cmpl %r10d, %ecx
jb .L6
Once this problem is solved (or at least once we have determined how it
could be solved), we go on to the next one: the extraneous induction
variable %ecx.
There are two ways to deal with it:
1. Eliminate it in favor of the other induction variable, %rax, which
counts in the same direction (upwards, in steps of 16), and compare
%rax against its own precomputed limit instead.
or:
2. Count %ecx down from %r10d to zero (which eliminates %r10d as a
loop-carried register).
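
To make the two options concrete, here is a rough C sketch of the loop shapes involved. It is only an illustration, not compiler output: the function names and the add4 helper are made up for this example, add4 stands in for the movups/addps/movaps sequence, and the offset advances by 4 floats where the assembly advances %rax by 16 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: adds one group of four floats, standing in for
   the movups/addps/movaps sequence in the loop above. */
static void add4(float *c, const float *a, const float *b, size_t off)
{
    for (size_t k = 0; k < 4; k++)
        c[off + k] = a[off + k] + b[off + k];
}

/* Shape of the loop as generated: "off" plays the role of %rax, "cnt"
   the role of %ecx, and "niters" the role of %r10d. */
void add_two_ivs(float *c, const float *a, const float *b, unsigned niters)
{
    assert(a && b && c);
    size_t off = 0;
    unsigned cnt = 0;
    if (niters == 0)
        return;
    do {
        add4(c, a, b, off);
        cnt++;                  /* incl %ecx      */
        off += 4;               /* addq $16, %rax */
    } while (cnt < niters);     /* cmpl %r10d, %ecx ; jb */
}

/* Option 1: drop the counter and test the offset against its own limit. */
void add_offset_only(float *c, const float *a, const float *b, unsigned niters)
{
    assert(a && b && c);
    size_t limit = (size_t)niters * 4;
    for (size_t off = 0; off < limit; off += 4)
        add4(c, a, b, off);
}

/* Option 2: count down to zero, so the bound need not stay live in a
   register across iterations. */
void add_count_down(float *c, const float *a, const float *b, unsigned niters)
{
    assert(a && b && c);
    size_t off = 0;
    for (unsigned cnt = niters; cnt != 0; cnt--) {
        add4(c, a, b, off);
        off += 4;
    }
}
```

All three variants compute the same result; they differ only in how many values have to stay in registers across iterations and in what the loop-closing test compares.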
g77 avoided this by coding counted do loops with a separate loop counter
counting down to zero - not so with gfortran (quoting):
/* Translate the simple DO construct. This is where the loop variable
has integer type and step +-1. We can't use this in the general case
because integer overflow and floating point errors could give
incorrect results.
We translate a do loop from:
DO dovar = from, to, step
body
END DO
to:
[Evaluate loop bounds and step]
dovar = from;
if ((step > 0) ? (dovar <= to) : (dovar >= to))
{
for (;;)
{
body;
cycle_label:
cond = (dovar == to);
dovar += step;
if (cond) goto end_label;
}
}
end_label:
This helps the optimizers by avoiding the extra induction variable
used in the general case. */
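
For comparison, the translation quoted above and a count-down form can be sketched in plain C as follows. This is my own illustration, not code from either front end: simple_do and countdown_do are made-up names, the loop body just records the values of dovar in a caller-supplied buffer, and the trip count uses the usual Fortran formula (to - from + step) / step, clamped at zero. Both functions assume the bounds are far enough from the integer limits that dovar += step cannot overflow.

```c
#include <assert.h>

/* The "simple DO" form from the quoted comment: only valid for a step
   of +1 or -1, and it needs no separate counter. Stores each value of
   dovar in out[] and returns the number of iterations. */
int simple_do(int from, int to, int step, int *out)
{
    assert(step == 1 || step == -1);
    int n = 0;
    int dovar = from;
    if ((step > 0) ? (dovar <= to) : (dovar >= to)) {
        for (;;) {
            out[n++] = dovar;           /* body */
            int cond = (dovar == to);
            dovar += step;
            if (cond)
                break;                  /* goto end_label */
        }
    }
    return n;
}

/* A count-down form: the trip count is computed once, and the loop
   test is a comparison of the counter against zero. Works for any
   nonzero step. */
int countdown_do(int from, int to, int step, int *out)
{
    assert(step != 0);
    int n = 0;
    int dovar = from;
    int count = (to - from + step) / step;  /* <= 0 means zero trips */
    for (; count > 0; count--) {
        out[n++] = dovar;               /* body */
        dovar += step;
    }
    return n;
}
```

For a step of +1 or -1 the two forms visit the same values; the difference is that the first tests dovar against "to" on every iteration, while the second tests a separate counter against zero.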
So either we teach the Fortran front end this trick, or we teach the
loop optimizer the trick of flipping the sense of an (otherwise
unused) induction variable ...
This would have paid off more frequently in i386 mode, where there is a
possibility of integer register pressure in loops small enough for such
an optimization to succeed.
This seems to be among the types of optimizations envisioned for
run-time binary interpretation systems.