On Friday, 23 March 2018 at 00:39:13 UTC, Cecil Ward wrote:
On Thursday, 22 March 2018 at 22:16:16 UTC, Iain Buclaw wrote:
https://bugzilla.gdcproject.org/show_bug.cgi?id=288
--- Comment #1 from Iain Buclaw <ibuc...@gdcproject.org> ---
See the long list of useless conditional jumps towards the
end of the first function in the asm output (whose demangled
name is test.t1(int)).
Well, you'd never use -O3 if you care about speed anyway. :-)
And they are not useless jumps, it's just the foreach loop
unrolled in its entirety. You can see that it's a feature of
the gcc-7 series and later; regardless of the target, they
all produce the same unrolled loop.
https://explore.dgnu.org/g/vD3N4Y
It might be a nice experiment to add pragma(ivdep) and
pragma(unroll) support
to give you more control.
https://gcc.gnu.org/onlinedocs/gcc/Loop-Specific-Pragmas.html
I wouldn't hold my breath though (this is not strictly a bug).
Agreed. It is possibly not a bug, in that I don't see that the
code is dysfunctional, though I haven't looked through it all.
But since the backend is doing optimisation here with the
unrolling, producing sub-optimal output for this admittedly
weird code is imho a bug, in the sense that the goal of
_optimisation_ is not achieved.
No, I understand this is nothing to do with D, and I understand
that this is unrolling.
But notice that the targets of the jumps are all the same
location, and the sequence finishes off with an unconditional
jump to that same location.
Not quite; if you look a little closer, some jump to other
branches hidden in between.
I feel this is just a quirk of unrolling, in part, but that's
not all: the jumps don't make sense.
cmp #n / jxx L3
cmp #m / jxx L3
jmp L3
is what we have, so it all basically does absolutely nothing,
unless cmp #1 / cmp #2 / cmp #3 / cmp #4 is an incredibly bad
way of testing ( x >= 1 && x <= 4 ); with 30-odd tests it isn't
very funny.
If you compile with -fdump-tree-optimized=stdout you will see
that it's the middle-end that has lowered the code to a series of
if jumps.
The backend consumer doesn't really have any chance of
improving it.
I know this is merely debug-only code, but am wondering what
else might happen if you are misguided enough to use the crazy
-O3 with unrolled loops that have conditionals in them.
My other complaint about the GCC back-end's code generation is that
it (sometimes) doesn't go for jump-less movcc-style operations
when it can. For x86/x64, LDC sometimes generates jump-free
code using conditional instructions where GCC does not.
I can fix the problem with GDC by using a single & instead of
&&, which happens to be legal here. Sometimes in the past I
have needed to take steps to make sure that I can do such an
operator-substitution trick in order to get jump-free, far
faster code: faster in cases where the alternatives are
extremely short (and side-effect-free) and branch-prediction
failure is a certainty.
You could also try compiling with -O2. I couldn't really see
this in your given example, but honestly, if you want to optimize
really aggressively you must be willing to coax the compiler in
strange ways anyway.
I don't know if there are ways in which the backend could try
to ascertain whether the results of a given unrolling are
really bad. In some cases they could be bad because the code is
too long and causes problems with code-cache size, or won't
fit into a loop buffer. A CPU-specific check on the generated
code size would need to be carried out, at least in all cases
where there is still a loop left (as opposed to full unrolling
of a known trip count), as every kind of AMD and Intel
processor is different, as Agner Fog warns us. Here though I
didn't even explicitly ask for unrolling, so you might harshly
say that it is the compiler's job to work out whether it is
actually an anti-optimisation, regardless of the possible
reasons why the result may be bad news, never mind just based
on total generated code size not fitting into some per-CPU
limit.
Well again, from past experience -O3 doesn't really care about
code size or cache lines so much. All optimization passes that
lower the code this way do so during the SSA transformations,
irrespective of what is being targeted.
My reason for reporting this was to inquire about how loop
unrolling behaves in later versions of the back end, to ask
about jump generation vs jump-free alternatives (LDC showing
the correct way to do things), and to ask if there are any
sub-optimality nasties lurking in code that does not merely
come down to driving an assert.
I would hope for an optimisation that handles the case of
dense-packed cmp #1 / cmp #2 / cmp #3 / cmp #4 etc, especially
with no holes, where _all the jumps go to the same target_, so
it can get reduced down into a two-test range check, a huge
optimisation. I would also hope that a conditional jump
followed by an unconditional jump to the same target could be
spotted and handled too. (Peephole general low-level
optimisation then? i.e. jxx L1 / jmp L1 => jmp L1.)
But is it really sub-optimal what -O3 is doing? Have you
benchmarked it? I certainly didn't.
---
bool IsPowerOf2(T)(T x) pure nothrow @safe @nogc
out ( ret )
{
    assert( ret == true || ret == false );
    debug
    {
        bool b = false;
        foreach( s; 0 .. 8 * x.sizeof )
            b = b || ( x == (1uL << s) );
        assert( ret == b );
    }
}
body
{
    return ( ( x & (x - 1) ) == 0 ) & (x > 0);
}

bool t1( int x )
{
    return IsPowerOf2( x );
}
bool t1( int x )
{
return IsPowerOf2( x );
}
void main(string[] args)
{
    int test()
    {
        return t1(cast(int)args.length);
    }

    import std.datetime.stopwatch;
    import std.stdio;

    auto r = benchmark!(test)(10000000);
    writeln(r);
}
---
[54 ms, 299 μs, and 5 hnsecs]: gdc -O3 -fdebug --param
max-peel-branches=32
[168 ms, 71 μs, and 1 hnsec]: gdc -O3 -fdebug --param
max-peel-branches=24
Looks like, despite your complaint, it is 3x faster with what
you call "strange nonsensical code generation". :-)
Perhaps this is all being generated too late: optimisations
have already happened, and the opportunities those optimisers
provide have been and gone. Would it be possible for the
backend to include a number of repeat optimisation passes of
certain kinds after unrolled code is generated, or doesn't it
work like that?
I had a quick look, and there is a control parameter for this -
'--param max-peel-branches', the default value is 32, reducing it
to 24 is enough to ensure that the complete unrolling never
happens, but see benchmarks for why changing this may not be a
good idea.
Anyway, this is not for you but for that particular backend, I
suspect. I was wondering if someone could have a word and pass
it on to the relevant people. I think it's worth making -O3
more generally usable, rather than crazy because it features
good ideas gone bad.
No; as a language implementer, our job is only to guarantee
that the semantics never break. Anything else is not relevant
to us.
In this case, though, it looks like everything is fine and
there's nothing to report.