------- Comment #19 from whaley at cs dot utsa dot edu 2006-06-26 00:55 -------

Thanks for the info. I'm sorry to hear that no performance regression tests are done, but I guess it kind of explains why these problems reoccur :)

As to not unrolling, the fully unrolled case is almost always commandingly better whenever I've looked at it. After your note, I just tried on my P4, using ATLAS's P4 kernel, and I get (ku is inner loop unrolling, and nb=40, so 40 is fully unrolled):

   GCC 4 ku=1  : 1.65Gflop
   GCC 4 ku=40 : 1.84Gflop
   GCC 3 ku=1  : 1.90Gflop
   GCC 3 ku=40 : 2.19Gflop

This is throwing the -funroll-loops flag. BTW, gcc 4 w/o the -funroll-loops (ku=1) is indeed slower, at roughly 1.54 . . .

Anyway, I've never found the performance of gcc ku=1 competitive with ku=<fully unrolled> on any machine. Even in assembly, I have to fully unroll the inner loop to get near peak on all Intel machines. On the Opteron, you can get within 5% or so with a rolled loop in assembly, but I've not gotten a C code to do that. I think the gcc unrolling probably defaults to something like 4 or 8 (guess from performance, not verified): unrolling all the way (the loop is over a compile-time constant) is the way to go . . .

When you said competitive, did you mean that gcc 4 ku=1 was competitive with gcc 4 ku=40 or gcc 3 ku=1? If the latter, I find it hard to believe unless you use SSE for gcc 4 and something unexpected happens. Even so, if you are using SSE try it with the single precision kernel, where SSE cannot compete with the x87 unit (even the broken one in gcc 4).

Thanks,
Clint

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827
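
For readers who have not seen the kernels, the ku=1 versus ku=40 distinction above can be sketched roughly as follows. This is not ATLAS's P4 kernel: the names NB, MAC1 through MAC40, dot_rolled and dot_unrolled are made up for illustration, and a plain dot product stands in for the inner loop of the real matrix-multiply kernel. The only point it shows is that with a compile-time trip count the loop can be expanded completely in the source, rather than leaving the unroll factor to -funroll-loops.

```c
#include <assert.h>

#define NB 40  /* hypothetical block size, chosen to match nb=40 above */

/* ku=1: the inner loop is left rolled; any unrolling is up to the compiler. */
static double dot_rolled(const double *a, const double *b)
{
    double sum = 0.0;
    for (int k = 0; k < NB; k++)
        sum += a[k] * b[k];
    return sum;
}

/* ku=NB: the same 40 multiply-adds written out by macro expansion,
 * so no loop counter or branch remains in the source. */
#define MAC1(k)  sum += a[(k)] * b[(k)];
#define MAC4(k)  MAC1(k) MAC1((k) + 1) MAC1((k) + 2) MAC1((k) + 3)
#define MAC8(k)  MAC4(k) MAC4((k) + 4)
#define MAC40(k) MAC8(k) MAC8((k) + 8) MAC8((k) + 16) MAC8((k) + 24) MAC8((k) + 32)

/* Guard against changing NB without updating the expansion. */
static_assert(NB == 40, "MAC40 expands exactly 40 terms");

static double dot_unrolled(const double *a, const double *b)
{
    double sum = 0.0;
    MAC40(0)
    return sum;
}
```

Both functions accumulate in the same order, so they return identical results; only the loop structure handed to the compiler differs. Timing them says nothing about the numbers quoted above, which come from the real kernel.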