RE: kernel + gcc 4.1 = several problems

Zou, Nanhai Wed, 03 Jan 2007 19:09:52 -0800

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Linus Torvalds
> Sent: January 4, 2007 0:04
> To: Grzegorz Kulewski
> Cc: Alan; Mikael Pettersson; [EMAIL PROTECTED];
> [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED];
> linux-kernel@vger.kernel.org; [EMAIL PROTECTED]
> Subject: Re: kernel + gcc 4.1 = several problems
> 
> 
> 
> On Wed, 3 Jan 2007, Grzegorz Kulewski wrote:
> >
> > Could you explain why CMOV is pointless now? Are there any benchmarks
> > proving that?
> 
> CMOV (and, more generically, any "predicated instruction") tends to
> generally be a bad idea on an aggressively out-of-order CPU. It doesn't
> always have to be horrible, but in practice it is seldom very nice, and
> (as usual) on the P4 it can be really quite bad.
> 
> On a P4, I think a cmov basically takes 10 cycles.
> 
> But even ignoring the usual P4 "I suck at things that aren't totally
> normal", cmov is actually not a great idea. You can always replace it by
> 
>               j<negated condition> forward
>               mov ..., %reg
>       forward:
> 
> and assuming the branch is AT ALL predictable (and 95+% of all branches
> are), the branch-over will actually be a LOT better for a CPU.
> 
> Why? Because branches can be predicted, and when they are predicted they
> basically go away. They go away on many levels, too. Not just the branch
> itself, but the _conditional_ for the branch goes away as far as the
> critical path of code is concerned: the CPU still has to calculate it and
> check it, but from a performance angle it "doesn't exist any more",
> because it's not holding anything else up (well, you want to do it in
> _some_ reasonable time, but the point stands..)
> 
> Similarly, whichever side of the branch wasn't taken goes away. Again, in
> an out-of-order machine with register renaming, this means that even if
> the branch isn't taken above, and you end up executing all the non-branch
> instructions, because you now UNCONDITIONALLY over-write the register, the
> old data in the register is now DEAD, so now all the OTHER writes to that
> register are off the critical path too!
> 
> So the end result is that with a conditional branch, on a good CPU, the
> _only_ part of the code that is actually performance-sensitive is the
> actual calculation of the value that gets used!
> 
> In contrast, if you use a predicated instruction, ALL of it is on the
> critical path. Calculating the conditional is on the critical path.
> Calculating the value that gets used is obviously ALSO on the critical
> path, but so is the calculation for the value that DOESN'T get used too.
> So the cmov - rather than speeding things up - actually slows things down,
> because it makes more code be dependent on each other.
> 
> So here's the basic rule:
> 
>  - cmov is sometimes nice for code density. It's not a big win, but it
>    certainly can be a win.
> 
>  - if you KNOW the branch is totally unpredictable, cmov is often good for
>    performance. But a compiler almost never knows that, and even if you
>    train it with input data and profiling, remember that not very many
>    branches _are_ totally unpredictable, so even if you were to know that
>    something is unpredictable, it's going to be very rare.
> 
>  - on a P4, branch mispredictions are expensive, but so is cmov, so all
>    the above is to some degree exaggerated. On nicer microarchitectures
>    (the Intel Core 2 in particular is something I have to say is very nice
>    indeed), the difference will be a lot less noticeable. The loss from
>    cmov isn't very big (it's not as sucky as P4), but neither is the win
>    (branch misprediction isn't that expensive either).
> 
> Here's an example program that you can test and time yourself.
> 
> On my Core 2, I get
> 
>       [EMAIL PROTECTED] ~]$ gcc -DCMOV -Wall -O2 t.c
>       [EMAIL PROTECTED] ~]$ time ./a.out
>       600000000
> 
>       real    0m0.194s
>       user    0m0.192s
>       sys     0m0.000s
> 
>       [EMAIL PROTECTED] ~]$ gcc -Wall -O2 t.c
>       [EMAIL PROTECTED] ~]$ time ./a.out
>       600000000
> 
>       real    0m0.167s
>       user    0m0.168s
>       sys     0m0.000s
> 
> ie the cmov is quite a bit slower. Maybe I did something wrong. But note
> how cmov not only is slower, it's fundamentally more limited too (ie the
> branch-over can actually do a lot of things cmov simply cannot do).
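
The test program t.c mentioned in the quoted text is not reproduced in this
message. What follows is only a rough, portable C sketch of that kind of
comparison, not the original program. The names choose_branch, choose_select
and sum_choices are made up for illustration, and the constants 5 and 7 are
arbitrary. A compiler is free to turn either function into a branch or a
cmov, so look at the generated assembly (gcc -S) before reading anything
into timings.

```c
/* Branch-over form: start with b, conditionally overwrite it with a. */
unsigned long choose_branch(unsigned long i, unsigned long a, unsigned long b)
{
    unsigned long result = b;
    if (i & 1UL)
        result = a;     /* the "mov" that the branch would jump over */
    return result;
}

/* Branchless form: both a and b feed the result, as they would with cmov. */
unsigned long choose_select(unsigned long i, unsigned long a, unsigned long b)
{
    unsigned long mask = -(i & 1UL);    /* all ones if bit 0 is set, else 0 */
    return (a & mask) | (b & ~mask);
}

/* Sum the chosen value over all iterations; use_select picks the variant. */
unsigned long sum_choices(unsigned long iterations, int use_select)
{
    unsigned long sum = 0;
    for (unsigned long i = 0; i < iterations; i++)
        sum += use_select ? choose_select(i, 5, 7) : choose_branch(i, 5, 7);
    return sum;
}
```

Calling sum_choices(n, 0) and sum_choices(n, 1) from a small main() and timing
each run gives the same sum for both variants, so any difference in run time
comes from how the selection is done.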



Hi,
cmov will stall on eflags in your test program.
I think you will see the benefit of cmov if you can manage to put some
instructions that do NOT modify eflags between the testl and the cmov.
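
A minimal sketch of that arrangement, using GCC-style inline assembly on x86,
could look like the following. It is an assumption about how the instructions
might be laid out, not a patch to t.c: the function choose_spaced and the
counter it bumps are hypothetical, and the lea merely stands in for whatever
independent, flag-preserving work is really available. On other compilers or
architectures it falls back to plain C. It has not been timed, so any speedup
is something to measure, not to assume.

```c
/*
 * Hypothetical helper: pick a if bit 0 of i is set, else b, and bump
 * *counter as unrelated work placed between the flag test and the cmov.
 */
unsigned long choose_spaced(unsigned long i, unsigned long a,
                            unsigned long b, unsigned long *counter)
{
    unsigned long result = b;
    unsigned long tmp = *counter;

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__("test $1, %[i]\n\t"             /* sets eflags from i & 1 */
            "lea 1(%[tmp]), %[tmp]\n\t"     /* tmp += 1; lea leaves eflags alone */
            "cmovne %[a], %[result]"        /* result = a if bit 0 was set */
            : [result] "+r"(result), [tmp] "+&r"(tmp)
            : [i] "r"(i), [a] "r"(a)
            : "cc");
#else
    /* Portable fallback with the same result; no control over scheduling. */
    if (i & 1UL)
        result = a;
    tmp = tmp + 1;
#endif

    *counter = tmp;
    return result;
}
```

The lea is used for the increment because, unlike add or inc, it does not
write eflags, so the flags set by test are still intact when the cmov reads
them.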

Thanks
Zou Nan hai
