------- Comment #36 from whaley at cs dot utsa dot edu 2006-08-06 15:03 -------
Paola,

Thanks for working on this. We are making progress, but I have some mixed results. I timed the assemblies you provided directly. To make this more reproducible, I added a target "asgexe" that builds the same benchmark from assembly source instead of C.

I ran on the Athlon-64X2, where your new assembly ran *faster* than gcc 3 for double precision. However, you still lost for single precision. I believe the reason is that you still have more fmuls/fmull (fmul from memory) than gcc 3 does:

>animal>fgrep -i fmuls smm_4.s | wc
>     240     480    4051
>animal>fgrep -i fmuls smm_asg.s | wc
>      60     120    1020
>animal>fgrep -i fmuls smm_3.s | wc
>       0       0       0
>animal>fgrep -i fmull dmm_4.s | wc
>     100     200    1739
>animal>fgrep -i fmull dmm_asg.s | wc
>      20      40     360
>animal>fgrep -i fmuls dmm_3.s | wc
>       0       0       0

I haven't really scoped out the dmm diff, but in single precision anyway, these dreaded fmuls are in the inner loop, and this is probably why you are still losing. I'm guessing your peephole is missing some cases, and for some reason is missing more under single. Any ideas?

As for your assembly actually beating gcc 3 for double, my guess is that it is due to some other optimization that gcc 4 has, and you will beat it by even more once the final fmull are removed . . .

On the P4e, your double precision code is faster than stock gcc 4, but still slower than gcc 3. Again, I suspect the remaining fmull. Then comes the thing I cannot explain at all: your single precision results are horrible. gcc 3 gets 1991 MFLOPS, gcc 4 gets 1664, and the assembly you sent gets 34! No chance the mixed fld/fmuls is causing stack overflow, I guess? I think this might account for such a catastrophic drop . . . That's about the only WAG I've got for this behavior.

Anyway, I think the first order of business may be to get your peephole to grab all the cases, and see if that makes you win everywhere on Athlon, and if it makes single precision P4e better, and we can go from there . . .
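
For anyone wanting to repeat the fgrep counts above without a shell, here is a minimal Python sketch. It is only an approximation of `fgrep -i <mnemonic> file | wc -l` (the first column of the wc output): it counts lines that contain the mnemonic as a case-insensitive substring. The helper names `count_lines` and `count_in_file` and the sample assembly text are made up for illustration and are not taken from the attached files.

```python
def count_lines(asm_text, mnemonic):
    """Count lines containing `mnemonic` as a case-insensitive substring.

    This mirrors the line count of `fgrep -i mnemonic | wc`, so searching
    for "fmul" also matches "fmuls" and "fmull", just as fgrep would.
    """
    needle = mnemonic.lower()
    return sum(1 for line in asm_text.splitlines() if needle in line.lower())


def count_in_file(path, mnemonic):
    """Same count, reading the assembly source from a file (hypothetical helper)."""
    with open(path, "r", errors="replace") as f:
        return count_lines(f.read(), mnemonic)


# Hypothetical x87 fragment, only to exercise the counter.
sample = (
    "\tflds\t(%eax)\n"
    "\tfmuls\t4(%eax)\n"
    "\tfmul\t%st(1), %st\n"
    "\tFMULS\t8(%eax)\n"
    "\tfmull\t(%ebx)\n"
)

fmuls_count = count_lines(sample, "fmuls")
fmull_count = count_lines(sample, "fmull")
```

Calling something like `count_in_file("smm_asg.s", "fmuls")` on each assembly file should give the same first-column numbers as the fgrep runs, assuming the files are plain text.
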
If you do that, attach the assemblies again, and I'll redo timings. Also, if you could attach (not put in comment) the patch, it'd be nice to get the compiler, so I could test x86-64 code on Athlon, etc.

Thanks,
Clint

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827
