Feature request: Globalize symbol
Hi! When working with unit tests I frequently need to override a function or variable in a shared library. This works just as I want for global symbols, but if the symbol is local (declared static) I have to modify the source (removing the static via a STATIC preprocessor define) to make it work.

The setup is as follows:

  app/Makefile
  app/src.c
  app/checktests/Makefile
  app/checktests/tests.c

In the application Makefile I have a target to compile the application as a shared library. This target is invoked from the checktests Makefile, and the library is then linked with the tests. So I compile the application source under test from scratch and can control the flags. (To mess even less with the application under test I may change the setup in the future to include the application Makefile in a wrapper Makefile instead of adding a shared library target to it.)

All this makes it possible to override any global symbol in src.c by defining the symbol in the tests.c file. What I miss is the possibility to override local symbols in a similar manner, without touching the source.

This problem could be fixed by adding some options to gcc to globalize symbols. My proposal is the following new options:

  -fglobalize-symbol=SYMBOLNAME
  -fglobalize-symbols=FILENAME
  -fglobalize-all-symbols

The first option makes the symbol SYMBOLNAME global. The second option makes all symbols in FILENAME global. The third option makes all symbols global. The globalization should apply to all symbols that are visible at file scope but not globally visible, e.g. both functions declared 'static' and variables declared 'static' (outside functions). The attribute '__attribute__ ((visibility ("hidden")))' could be overridden too, but for my purposes I don't have the need for this.

Waiting hopefully,
/HUGO
Re: gcse pass: expression hash table
On Wed, 23 Feb 2005, James E Wilson wrote:
> Tarun Kawatra wrote:
> > During expression hash table construction in the gcse pass (gcc
> > version 3.4.1), expressions like a*b do not get included into the
> > expression hash table. Such expressions occur in PARALLEL along
> > with clobbers.
>
> You didn't mention the target, or exactly what the mult looks like.

Target is i386 and the mult instruction looks like the following in RTL:

  (insn 22 21 23 1 (parallel [
              (set (reg/v:SI 62 [ c ])
                  (mult:SI (reg:SI 66 [ a ])
                      (reg:SI 67 [ b ])))
              (clobber (reg:CC 17 flags))
          ]) 172 {*mulsi3_1} (nil)
      (nil))

> However, this isn't hard to answer just by using the source.
> hash_scan_set calls want_to_cse_p calls can_assign_to_reg_p calls
> added_clobbers_hard_reg_p which presumably returns true, which
> prevents the optimization. This makes sense. If the pattern clobbers
> a hard reg, then we can't safely insert it at any place in the
> function. It might be clobbering the hard reg at a point where it
> holds a useful value.

If that is the reason, then even a plus expression (shown below) should not be subjected to PRE, as it also clobbers a hard register (CC). But it is being subjected to PRE. The multiplication expression, while it looks the same, does not even get into the hash table.

  (insn 35 34 36 1 (parallel [
              (set (reg/v:SI 74 [ c ])
                  (plus:SI (reg:SI 78 [ a ])
                      (reg:SI 79 [ b ])))
              (clobber (reg:CC 17 flags))
          ]) 138 {*addsi_1} (nil)
      (nil))

-tarun

While looking at this, I noticed can_assign_to_reg_p does something silly. It uses "FIRST_PSEUDO_REGISTER * 2" to try to generate a test pseudo register, but this can fail if a target has fewer than 4 registers, or if the set of virtual registers increases in the future. This should probably be LAST_VIRTUAL_REGISTER + 1, as used in another recent patch.
Re: C++ math optimization problem...
On Wed, 23 Feb 2005 10:36:07 -0800, Benjamin Redelings I <[EMAIL PROTECTED]> wrote:
> Hi,
> I have a C++ program that runs slower under 4.0 CVS than 3.4. So, I am
> trying to make some test-cases that might help deduce the reason.
> However, when I reduced this testcase sufficiently, it began behaving
> badly under BOTH 3.4 and 4.0, but I guess I should start with the
> most reduced case first.
>
> Basically, the code just does a lot of multiplies and adds. However,
> if I take the main loop outside of an if-block, it goes 5x faster.
> Also, if I implement an array as 'double*' instead of 'vector<double>'
> it also goes 5x faster. Using valarray<double> instead of
> vector<double> does not give any improvement.

I'm sure this is an aliasing problem. The compiler cannot deduce that storing to result does not affect d. Otherwise the generated code looks reasonable. What is interesting, though, is that removing the if makes the compiler recognize that result and d do not alias. In fact, the alias analysis seems to be confused by the scope of d: moving it outside of the if fixes the problem, too. Maybe Diego can shed some light on this effect.

The testcase looks like

  #include <cstdio>
  #include <cstdlib>
  #include <vector>

  const int OUTER = 10;
  const int INNER = 1000;

  using namespace std;

  int main(int argn, char *argv[])
  {
    int s = atoi(argv[1]);
    double result;
    {
      vector<double> d(INNER); // move outside of this scope to fix

      // initialize d
      for (int i = 0; i < INNER; i++)
        d[i] = double(1+i) / INNER;

      // calc result
      result = 0;
      for (int i = 0; i < OUTER; ++i)
        for (int j = 1; j < INNER; ++j)
          result += d[j]*d[j-1] + d[j-1];
    }
    printf("result = %f\n", result);
    return 0;
  }
Suggestion: Different exit code for ICE
Regressions that cause ICEs on invalid code often go unnoticed in the testsuite, since regular errors and ICEs both match { dg-error "" }. See for example g++.dg/parse/error16.C, which ICEs since yesterday, but the testsuite still reports "PASS":

  Executing on host: /Work/reichelt/gccbuild/src-4.0/build/gcc/testsuite/../g++
    -B/Work/reichelt/gccbuild/src-4.0/build/gcc/testsuite/../
    /Work/reichelt/gccbuild/src-4.0/gcc/gcc/testsuite/g++.dg/parse/error16.C
    -nostdinc++
    -I/home/reichelt/Work/gccbuild/src-4.0/build/i686-pc-linux-gnu/libstdc++-v3/include/i686-pc-linux-gnu
    -I/home/reichelt/Work/gccbuild/src-4.0/build/i686-pc-linux-gnu/libstdc++-v3/include
    -I/home/reichelt/Work/gccbuild/src-4.0/gcc/libstdc++-v3/libsupc++
    -I/home/reichelt/Work/gccbuild/src-4.0/gcc/libstdc++-v3/include/backward
    -I/home/reichelt/Work/gccbuild/src-4.0/gcc/libstdc++-v3/testsuite
    -fmessage-length=0 -ansi -pedantic-errors -Wno-long-long -S -o error16.s
    (timeout = 300)
  /Work/reichelt/gccbuild/src-4.0/gcc/gcc/testsuite/g++.dg/parse/error16.C:8: error: redefinition of 'struct A::B'
  /Work/reichelt/gccbuild/src-4.0/gcc/gcc/testsuite/g++.dg/parse/error16.C:5: error: previous definition of 'struct A::B'
  /Work/reichelt/gccbuild/src-4.0/gcc/gcc/testsuite/g++.dg/parse/error16.C:8: internal compiler error: tree check: expected class 'type', have 'exceptional' (error_mark) in cp_parser_class_specifier, at cp/parser.c:12407
  Please submit a full bug report,
  with preprocessed source if appropriate.
  See <http://gcc.gnu.org/bugs.html> for instructions.
  compiler exited with status 1
  output is:
  /Work/reichelt/gccbuild/src-4.0/gcc/gcc/testsuite/g++.dg/parse/error16.C:8: error: redefinition of 'struct A::B'
  /Work/reichelt/gccbuild/src-4.0/gcc/gcc/testsuite/g++.dg/parse/error16.C:5: error: previous definition of 'struct A::B'
  /Work/reichelt/gccbuild/src-4.0/gcc/gcc/testsuite/g++.dg/parse/error16.C:8: internal compiler error: tree check: expected class 'type', have 'exceptional' (error_mark) in cp_parser_class_specifier, at cp/parser.c:12407
  Please submit a full bug report,
  with preprocessed source if appropriate.
  See <http://gcc.gnu.org/bugs.html> for instructions.
  PASS: g++.dg/parse/error16.C (test for errors, line 5)
  PASS: g++.dg/parse/error16.C (test for errors, line 8)
  PASS: g++.dg/parse/error16.C (test for excess errors)

(Btw, Mark, I think the regression was caused by your patch for PR c++/20152, could you please have a look?)

The method used right now is to not use "" in the last error message, but that's forgotten too often. This calls for a more robust method, IMHO.

One way would be to make the testsuite smarter and have it recognize typical ICE patterns itself. This can indeed be done (I, for example, use it to monitor the testcases in Bugzilla, and Phil borrowed the patterns for his regression tester).

An easier way, IMHO, would be to return a different error code when encountering an ICE. There are only a couple of places in diagnostic.c and errors.c where we now have "exit (FATAL_EXIT_CODE);". We could return an (appropriately defined) ICE_ERROR_CODE instead. The testsuite would then just have to check the return value.

What do you think?

Regards,
Volker
Re: gcse pass: expression hash table
On Feb 24, 2005 11:13 AM, Tarun Kawatra <[EMAIL PROTECTED]> wrote:
> > > Such expressions occur in PARALLEL along with clobbers.
> >
> > You didn't mention the target, or exactly what the mult looks like.
>
> Target is i386 and the mult instruction looks like the following in RTL
>
> (insn 22 21 23 1 (parallel [
>             (set (reg/v:SI 62 [ c ])
>                 (mult:SI (reg:SI 66 [ a ])
>                     (reg:SI 67 [ b ])))
>             (clobber (reg:CC 17 flags))
>         ]) 172 {*mulsi3_1} (nil)
>     (nil))

Hmm, does GCSE look into stuff in PARALLELs at all? From gcse.c:

  1804:Single sets in a PARALLEL could be handled, but it's an extra complication
  1805:that isn't dealt with right now. The trick is handling the CLOBBERs that
  1806:are also in the PARALLEL. Later.

IIRC it is one of those things that worked on the cfg-branch or the rtlopt-branch (and probably on the hammer-branch) but that never got merged to mainline. Honza knows more about it, I think...

Gr.
Steven
Is 'mfcr' a legal opcode for RS6000 RIOS1?
In the crossgcc list there was a problem with gcc-3.4 generating the opcode 'mfcr' with '-mcpu=power' for the second created multilib, when the GCC target is 'rs6000-ibm-aix4.3'. The other multilibs produced by default are for '-pthread', '-mcpu=powerpc' and '-maix64'... The AIX users can judge whether all these are normally required, but when the builder also used '--without-threads', the first seems pointless or may even clash with something. Building no multilibs at all using '--disable-multilib' is of course possible...

But what is the case with 'mfcr' and POWER? A bug in GNU as (the Linux binutils-2.15.94.0.2.2 was tried) or in GCC (both gcc-3.3.5 and gcc-3.4.3 were tried)?
Inlining and estimate_num_insns
Hi!

I'm looking at improving the inlining heuristics at the moment, especially by questioning estimate_num_insns. All uses of that function assume it to return a size cost, not a computation cost; is that correct? If so, why do we penalize f.i. EXACT_DIV_EXPR compared to MULT_EXPR?

Also, for the simple function

  double foo1(double x)
  {
    return x;
  }

we return 4 as a cost, because we have

  double tmp = x;
  return tmp;

and count the move cost (MODIFY_EXPR) twice. We could fix this by not walking (i.e. ignoring) RETURN_EXPR.

Also, INSNS_PER_CALL is rather high (10); what is this choice based on? Wouldn't it be better to at least make it proportional to the argument chain length? Or, even more advanced, to the move cost of the arguments?

Finally, is there a set of testcases that can be used as a metric on whether improvements really are improvements?

Thanks,
Richard.

--
Richard Guenther
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/
Re: Inlining and estimate_num_insns
On Feb 24, 2005 01:58 PM, Richard Guenther <[EMAIL PROTECTED]> wrote:
> I'm looking at improving inlining heuristics at the moment,
> especially by questioning the estimate_num_insns.

Good. There is lots of room for improvement there.

> All uses of that function assume it to return a size cost, not a
> computation cost - is that correct?

Yes.

> If so, why do we penalize f.i. EXACT_DIV_EXPR compared to MULT_EXPR?

Dunno. Because divide usually results in more insns per tree?

> Also, for the simple function
>
>   double foo1(double x)
>   {
>     return x;
>   }
>
> we return 4 as a cost, because we have
>
>   double tmp = x;
>   return tmp;
>
> and count the move cost (MODIFY_EXPR) twice. We could fix this
> by not walking (i.e. ignoring) RETURN_EXPR.

That would be a good idea if all estimate_num_insns ever sees is GIMPLE. Are you sure that is the case? (I think it is, but I'm not sure.)

> Also, INSNS_PER_CALL is rather high (10) - what is this choice
> based on?

History. That's what it was in the old heuristics.

> Wouldn't it be better to at least make it proportional
> to the argument chain length? Or even more advanced to the move
> cost of the arguments?

That is what the RTL inliner used to do. The problem now is that you don't know what gets passed in a register and what is passed on the stack.

> Finally, is there a set of testcases that can be used as a metric
> on whether improvements are improvements?

What I did in early 2003 was to add a mini-pass at the start of rest_of_compilation that just counted the number of real insns created for the current_function_decl, i.e. something like

  int num_insn = 0;
  rtx insn;
  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
    if (INSN_P (insn))
      num_insn++;

and then compare the result with the estimate of the tree inliner. The results were quite discouraging at the time, which is why Honza rewrote the size estimate. No idea how well or poorly we do today ;-)

Gr.
Steven
Re: Inlining and estimate_num_insns
On Thu, 24 Feb 2005, Steven Bosscher wrote:
> On Feb 24, 2005 01:58 PM, Richard Guenther <[EMAIL PROTECTED]> wrote:
> > If so, why do we penalize f.i. EXACT_DIV_EXPR
> > compared to MULT_EXPR?
>
> Dunno. Because divide usually results in more insns per tree?

Well, I don't know, but ia32 fdiv and fmul are certainly of the same size ;) Of course for f.i. ia64 inlined FP divide this is not true, which asks for target-dependent size estimates. So, pragmatically, we should rather count tree nodes than try to second-guess what the target-specific cost is.

> > Also, for the simple function
> >
> >   double foo1(double x)
> >   {
> >     return x;
> >   }
> >
> > we return 4 as a cost, because we have
> >
> >   double tmp = x;
> >   return tmp;
> >
> > and count the move cost (MODIFY_EXPR) twice. We could fix this
> > by not walking (i.e. ignoring) RETURN_EXPR.
>
> That would be a good idea if all estimate_num_insns ever sees
> is GIMPLE. Are you sure that is the case (I think it is, but
> I'm not sure).

Also for GENERIC, at least for what the C and C++ frontends are generating. What is discouraging at the moment is that we do not remove the "abstraction penalty" of

  inline int foo1(void) { return 0; }
  int foo(void) { return foo1(); }

Currently we have a cost of 2 for foo1 and a cost of 5 for foo with foo1 inlined. With RETURN_EXPR ignored we get to 1 for foo1 and 2 for foo with foo1 inlined. I'll think about how to get that down to 1.

Richard.

--
Richard Guenther
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/
Re: __register_frame_info and unwinding shared libraries
Andrew Haley writes:
> Jakub Jelinek writes:
> > > > While I still like using dl_iterate_phdr instead of
> > > > __register_frame_info_bases for totally aesthetic reasons, there
> > > > have been changes made to the dl_iterate_phdr interface since the
> > > > gcc support was written that would allow the dl_iterate_phdr
> > > > results to be cached.
> > >
> > > That would be nice. Also, we could fairly easily build a tree of
> > > nodes, one for each loaded object, then we wouldn't be doing a linear
> > > search through them. We could do that lazily, so it wouldn't kick in
> > > 'til needed.
> >
> > Here is a rough patch for what you can do.
>
> Thanks very much. I'm working on it.

OK, I've roughed out a very simple patch and it certainly seems to improve things. Here's the before:

  samples  cum. samples  %        cum. %   app name         symbol name
  17962    17962         25.8164  25.8164  libgcc_s.so.1    _Unwind_IteratePhdrCallback
  7019     24981         10.0882  35.9046  libc-2.3.3.so    dl_iterate_phdr
  6966     31947         10.0121  45.9167  libgcc_s.so.1    read_encoded_value_with_base
  3756     35703          5.3984  51.3151  libgcj.so.6.0.0  GC_mark_from
  3643     39346          5.2360  56.5511  libgcc_s.so.1    search_object
  2032     41378          2.9205  59.4717  libgcc_s.so.1    __i686.get_pc_thunk.bx
  1555     42933          2.2350  61.7066  libgcj.so.6.0.0  _Jv_MonitorExit
  1413     44346          2.0309  63.7375  libgcj.so.6.0.0  _Jv_MonitorEnter
  1288     45634          1.8512  65.5887  libgcj.so.6.0.0  java::util::IdentityHashMap::hash(java::lang::Object*)

And here's the after:

  samples  cum. samples  %        cum. %   app name         symbol name
  7020      7020         14.7674  14.7674  libgcc_s.so.1    read_encoded_value_with_base
  3808     10828          8.0106  22.7780  libgcc_s.so.1    _Unwind_IteratePhdrCallback
  3680     14508          7.7413  30.5194  libgcj.so.6.0.0  GC_mark_from
  3463     17971          7.2849  37.8042  libgcc_s.so.1    search_object
  1587     19558          3.3385  41.1427  libgcj.so.6.0.0  _Jv_MonitorExit
  1577     21135          3.3174  44.4601  libc-2.3.3.so    dl_iterate_phdr
  1288     22423          2.7095  47.1696  libgcj.so.6.0.0  _Jv_MonitorEnter
  1230     23653          2.5875  49.7570  libgcj.so.6.0.0  java::util::IdentityHashMap::hash(java::lang::Object*)

So, the time spent unwinding before was about 50% of the total runtime, and after about 28%. I measured a miss rate of 0.006% with 27 cache entries used.

Still, 28% is a heavy overhead. I think it's because we're doing a great deal of class lookups, and each of those does a stack trace as a security check. I'll look at caching security contexts in libgcj.

Andrew.
Benchmark of gcc 4.0
I ran, for my personal pleasure (since I am a number cruncher), the Scimark2 tests on my P4 Linux machine. I tested GCC 4.0 (today's CVS) vs. GCC 3.4.1 vs. Intel's ICC 8.1. For GCC, I used in both cases the flags

  -march=pentium4 -mfpmath=sse -O3 -fomit-frame-pointer -ffast-math

Should it be of some interest, for ICC I used

  -ipo -tpp7 -xW -align -Zp16 -O3

The results were surprisingly bad, and this is why I am writing this message:

                    GCC 4.0   GCC 3.4.1      ICC
  Composite Score:   270.51      345.28   430.47
  FFT Mflops:        192.10      203.77   206.66
  SOR Mflops:        257.61      252.88   258.30
  MC Mflops:          58.61       67.96   312.13
  matmult Mflops:    376.64      557.75   564.97
  LU Mflops:         467.58      644.03   810.29

I leave aside any personal comments, except that, being involved in Monte Carlo calculations, I would love it if GCC were not outperformed by a factor of ~4.5 in MC by ICC. I also would like to ask whether you see anything wrong with those benchmarks and/or you have suggestions to improve them.

Thanks,
Biagio

--
=====================================
Biagio Lucini
Institut Fuer Theoretische Physik
ETH Hoenggerberg
CH-8093 Zuerich - Switzerland
Tel. +41 (0)1 6332562
=====================================
Re: Is 'mfcr' a legal opcode for RS6000 RIOS1?
> Kai Ruottu writes:

Kai> In the crossgcc list was a problem with gcc-3.4 generating the opcode
Kai> 'mfcr' with '-mcpu=power' for the second created multilib, when the
Kai> GCC target is 'rs6000-ibm-aix4.3'. The other multilibs produced as
Kai> default are for '-pthread', '-mcpu=powerpc' and '-maix64'... The AIX
Kai> users could judge if all these are normally required, but when the
Kai> builder also used the '--without-threads', the first sounds being vain
Kai> or even clashing with something. Building no multilibs using
Kai> '--disable-multilib' of course is possible...

Kai> But what is the case with the 'mfcr' and POWER ? Bug in GNU as (the
Kai> Linux binutils-2.15.94.0.2.2 was tried) or in GCC (both gcc-3.3.5 and
Kai> gcc-3.4.3 were tried) ?

First, the AIX assembler is recommended on AIX. This is mentioned in the platform-specific installation information. The use of the GNU assembler on AIX probably is the source of your problems.

The mfcr instruction has existed since the original POWER architecture; it always is valid. The instruction was updated in POWER4 and later chips to accept an optional operand specifying which field to move. That variant only is enabled for processors that support the instruction; it is not enabled for -mcpu=power.

David
Bug in tree-inline.c:estimate_num_insns_1?
Hi!

In estimate_num_insns_1 we currently have:

      /* Recognize assignments of large structures and constructors of
         big arrays.  */
      case INIT_EXPR:
      case MODIFY_EXPR:
        x = TREE_OPERAND (x, 0);
        /* FALLTHRU */
      case TARGET_EXPR:
      case CONSTRUCTOR:
        {
          HOST_WIDE_INT size;
          ...

Shouldn't TARGET_EXPR be moved up before x = TREE_OPERAND (x, 0); ?

Richard.

--
Richard Guenther
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/
Re: Inlining and estimate_num_insns
> Hi!
>
> I'm looking at improving inlining heuristics at the moment,
> especially by questioning the estimate_num_insns. All uses
> of that function assume it to return a size cost, not a computation
> cost - is that correct? If so, why do we penalize f.i. EXACT_DIV_EXPR
> compared to MULT_EXPR?

Well, not really. At least for inlining the idea of cost is mixed: if the function is either slow or big, inlining is not a good idea. For post-inline in the CFG world, I plan to disambiguate these, but in the current implementation both quantities seemed so raw that doing something more precise with them didn't seem to make much sense. But I have a patch for separating code/size computations for tree-profiling around on my notebook (I believe), so I can pass you one in case you want to help with tuning this. We can do pretty close estimation there, since we can build an estimated profile: we know that a function with a loop takes longer, while a tree-like function can be fast even if it is big.

> Also, for the simple function
>
>   double foo1(double x)
>   {
>     return x;
>   }
>
> we return 4 as a cost, because we have
>
>   double tmp = x;
>   return tmp;
>
> and count the move cost (MODIFY_EXPR) twice. We could fix this
> by not walking (i.e. ignoring) RETURN_EXPR.

That would work, yes. I was also thinking about ignoring MODIFY_EXPR for var = var, as those likely get propagated later.

> Also, INSNS_PER_CALL is rather high (10) - what is this choice
> based on? Wouldn't it be better to at least make it proportional
> to the argument chain length? Or even more advanced to the move
> cost of the arguments?

Probably. The choice of constant is completely arbitrary. It is not too high cycle-count-wise (at least an Athlon spends over 10 cycles per call), but I never experimented with different values of this. There are two copies of this constant (I believe), one in tree-inline, the other in cgraphunit, that need to be kept in sync. I have to clean this up.

> Finally, is there a set of testcases that can be used as a metric
> on whether improvements are improvements?

This is the major problem here. I use a combination of SPEC (for C benchmarks), Gerald's application and tramp3d, but all of these have very different behaviour and thus they hardly cover "common cases". If someone can come up with a more reasonable testing method, I would be very happy; so far I simply test on all of those, and when the results seem to be a win in all three tests (or at least no loss), I apply the patch.

Honza
Re: Bug in tree-inline.c:estimate_num_insns_1?
On Feb 24, 2005, at 10:07 AM, Richard Guenther wrote:

> Hi!
>
> In estimate_num_insns_1 we currently have:
>
>       /* Recognize assignments of large structures and constructors of
>          big arrays.  */
>       case INIT_EXPR:
>       case MODIFY_EXPR:
>         x = TREE_OPERAND (x, 0);
>         /* FALLTHRU */
>       case TARGET_EXPR:
>       case CONSTRUCTOR:
>         {
>           HOST_WIDE_INT size;
>           ...
>
> shouldn't TARGET_EXPR be moved up before x = TREE_OPERAND (x, 0); ?

TARGET_EXPR is not in GIMPLE at all, so it really does not matter.

-- Pinski
Re: Bug in tree-inline.c:estimate_num_insns_1?
On Thu, 24 Feb 2005 10:13:11 -0500, Andrew Pinski <[EMAIL PROTECTED]> wrote:
> On Feb 24, 2005, at 10:07 AM, Richard Guenther wrote:
>
> > Hi!
> >
> > In estimate_num_insns_1 we currently have:
> >
> >       /* Recognize assignments of large structures and constructors of
> >          big arrays.  */
> >       case INIT_EXPR:
> >       case MODIFY_EXPR:
> >         x = TREE_OPERAND (x, 0);
> >         /* FALLTHRU */
> >       case TARGET_EXPR:
> >       case CONSTRUCTOR:
> >         {
> >           HOST_WIDE_INT size;
> >           ...
> >
> > shouldn't TARGET_EXPR be moved up before x = TREE_OPERAND (x, 0); ?
>
> TARGET_EXPR is not in gimple at all so really does not matter.

Then how do I get bitten by this? I guess cgraph gets fed GENERIC.

Richard.
Re: Benchmark of gcc 4.0
Biagio Lucini wrote:
> I ran, for my personal pleasure (since I am a number cruncher), the
> Scimark2 tests on my P4 Linux machine. I tested GCC 4.0 (today's CVS)
> vs. GCC 3.4.1 vs. Intel's ICC 8.1. For GCC, I used in both cases the
> flags
>
>   -march=pentium4 -mfpmath=sse -O3 -fomit-frame-pointer -ffast-math
>
> Should it be of some interest, for ICC I used
>
>   -ipo -tpp7 -xW -align -Zp16 -O3
>
> The results were surprisingly bad, and this is why I am writing this
> message:
>
>                     GCC 4.0   GCC 3.4.1      ICC
>   Composite Score:   270.51      345.28   430.47
>   FFT Mflops:        192.10      203.77   206.66
>   SOR Mflops:        257.61      252.88   258.30
>   MC Mflops:          58.61       67.96   312.13
>   matmult Mflops:    376.64      557.75   564.97
>   LU Mflops:         467.58      644.03   810.29
>
> I leave aside any personal comments, except that, being involved in
> Monte Carlo calculations, I would love it if GCC were not outperformed
> by a factor of ~4.5 in MC by ICC. I also would like to ask whether you
> see anything wrong with those benchmarks and/or you have suggestions
> to improve them.

Thanks for reporting this, although it would be more useful if you made some analysis of what is wrong with gcc. For example, icc reports loop vectorization. Or maybe it is a memory hierarchy optimization, or usage of a better standard function, like the random function (I am not familiar with MC). Usually vectorization is the reason for such a big difference. People in the gcc community work on vectorization, although I don't know when it will be used for x86. We have no such resources as Intel has (several hundred engineers working mainly on optimizations for only 3 of their architectures).

As for gcc4 vs. gcc3.4, the degradation on the x86 architecture is most probably because of the higher register pressure created by the more aggressive SSA optimizations in gcc4. The current register allocator does not deal well with this problem, so code generated by gcc4 can be worse for architectures with few registers. For architectures with many registers (like ia64), gcc4 generates better code than gcc3.4. Again, the gcc community works on the register allocator problem too.

Vlad
Change in treelang maintainership
It is my pleasure to announce that the steering committee has appointed James A. Morrison maintainer of our treelang frontend; Jim has been working on that for some time now. We'd also like to take the opportunity and thank Tim Josling for the time and effort he has spent on this frontend. Please adjust the MAINTAINERS file accordingly, Jim. Happy hacking! Gerald
Re: Benchmark of gcc 4.0
> For GCC, I used in both cases the flags
>
>   -march=pentium4 -mfpmath=sse -O3 -fomit-frame-pointer -ffast-math

> As for gcc4 vs. gcc3.4, the degradation on the x86 architecture is
> most probably because of the higher register pressure created by the
> more aggressive SSA optimizations in gcc4.

Try these five combinations:

  -O2 -fomit-frame-pointer -ffast-math
  -O2 -fomit-frame-pointer -ffast-math -fno-tree-pre
  -O2 -fomit-frame-pointer -ffast-math -fno-tree-pre -fno-gcse
  -O3 -fomit-frame-pointer -ffast-math -fno-tree-pre
  -O3 -fomit-frame-pointer -ffast-math -fno-tree-pre -fno-gcse

You may also want to try -mfpmath=sse,387, in case your benchmarks use sin, cos and other transcendental functions that GCC knows about when using 387 instructions.

Paolo
Re: Benchmark of gcc 4.0
I just got interested and did a test myself, comparing gcc 4.0 (-O2 -funroll-loops -D__NO_MATH_INLINES -ffast-math -march=pentium4 -mfpmath=sse -ftree-vectorize) and icc 9.0 beta (-O3 -xW -ip):

                          gcc 4.0    icc 9.0
  Composite Score:         543.65     609.20
  FFT Mflops:              313.71     318.29
  SOR Mflops:              441.96     426.32
  MonteCarlo Mflops:       105.68      71.20
  Sparse matmult Mflops:   574.88     891.65
  LU Mflops:              1282.00    1338.56

which looks not too bad ;)

Richard.
Re: Benchmark of gcc 4.0
On Thursday 24 February 2005 16.52, Paolo Bonzini wrote:
> Try these five combinations:
> [...]
> -O3 -fomit-frame-pointer -ffast-math -fno-tree-pre
> [...]

This + 387 math is the combination with the larger impact: it raises MC to around 80, but the composite is still 279 (vs. ~345 for GCC 3.4). I will test on amd64, just to see whether there is any difference.

Thanks,
Biagio

--
=====================================
Biagio Lucini
Institut Fuer Theoretische Physik
ETH Hoenggerberg
CH-8093 Zuerich - Switzerland
Tel. +41 (0)1 6332562
=====================================
Re: Suggestion: Different exit code for ICE
> Regressions that cause ICEs on invalid code often go unnoticed in the
> testsuite, since regular errors and ICEs both match { dg-error "" }.
> See for example g++.dg/parse/error16.C, which ICEs since yesterday,
> but the testsuite still reports "PASS":
>
>   [testsuite log showing two redefinition errors followed by
>    "internal compiler error: tree check: expected class 'type', have
>    'exceptional' (error_mark) in cp_parser_class_specifier, at
>    cp/parser.c:12407", compiler exiting with status 1, and
>    PASS results for lines 5 and 8 and for excess errors]
>
> (Btw, Mark, I think the regression was caused by your patch for
> PR c++/20152, could you please have a look?)
>
> The method used right now is to not use "" in the last error message,
> but that's forgotten too often.
>
> This calls for a more robust method IMHO.
> One way would be to make the testsuite smarter and make it recognize
> typical ICE patterns itself. This can indeed be done (I for example
> use it to monitor the testcases in Bugzilla, Phil borrowed the patterns
> for his regression tester).
>
> An easier way IMHO would be to return a different error code when
> encountering an ICE. That's only a couple of places in diagnostic.c
> and errors.c where we now have "exit (FATAL_EXIT_CODE);".
> We could return an (appropriately defined) ICE_ERROR_CODE instead.
> The testsuite would then just have to check the return value.
>
> What do you think?

That would certainly be a Good Thing. As far as I know, regular errors return exit code 1. I have a few suggestions on that:

a) use a testsuite that supports regexps and match a 1 exit code against /^Please submit a full bug report/

b) make it return a different exit code (say -127 or even 2 ;-).

c) make a separate function for ICEs and make _that_ return an exit code indicating an ICE. There would be a disadvantage: as with any code moving, there would still be some code that didn't call that function.

Samuel Lauber
Re: Benchmark of gcc 4.0
On Thu, 24 Feb 2005 17:09:46 +0100, Biagio Lucini <[EMAIL PROTECTED]> wrote:
> On Thursday 24 February 2005 16.52, Paolo Bonzini wrote:
> > > Try these five combinations:
> > > [...]
> > > -O3 -fomit-frame-pointer -ffast-math -fno-tree-pre
> > [...]
> This + 387 math is the one with the larger impact: it raises MC to around 80,
> but composite is still 279 (vs. ~345 for GCC 3.4). I will test on amd64,
> just to see whether there is any difference.

I think the Intel compiler with -ipo will inline Random_nextDouble, which should explain the difference you see. The best options for gcc I found were compiling and linking via

gcc-4.0 -O3 -funroll-loops -D__NO_MATH_INLINES -ffast-math -march=pentium4 -mfpmath=sse -ftree-vectorize -onestep -o scimark2 scimark2.c FFT.c kernel.c Stopwatch.c Random.c SOR.c SparseCompRow.c array.c MonteCarlo.c LU.c -lm -fomit-frame-pointer -finline-functions

Note that gcc with -onestep still cannot inline across unit boundaries.

Richard.
Re: Benchmark of gcc 4.0
On Thursday 24 February 2005 17.06, Richard Guenther wrote:
> I just got interested and did a test myself. Comparing gcc 4.0 (-O2
> -funroll-loops -D__NO_MATH_INLINES -ffast-math -march=pentium4
> -mfpmath=sse -ftree-vectorize)
> and icc 9.0 beta (-O3 -xW -ip):
>
>                          gcc 4.0    icc 9.0
> Composite Score:          543.65     609.20
> FFT Mflops:               313.71     318.29
> SOR Mflops:               441.96     426.32
> MonteCarlo Mflops:        105.68      71.20
> Sparse matmult Mflops:    574.88     891.65
> LU Mflops:               1282.00    1338.56
>
> which looks not too bad ;)
>
> Richard.

Hi Richard, thanks a lot for your test. I have redone it the way you suggest, and I do find:

                         GCC 4.0    ICC 8.1    GCC 3.4.1
Composite Score:          330.18     384.53      361.55
FFT Mflops:               206.66     193.80      206.66
SOR Mflops:               264.91     398.13      253.55
MC Mflops:                 63.91      61.29       67.45
Sparse matmult Mflops:    348.60     436.91      469.79
LU Mflops:                767.04     832.52      810.29

I would leave aside ICC 8.1 because (as I have shown in my previous message) I can choose other flags and get a speed rise of about 50%. I would take your optimisation flags for GCC over mine, since they increase the composite score of both (which is what matters to me). Even so, there is at least one place where - if I can say that - we have a regression. Ready to test again, Biagio
--
=
Biagio Lucini
Institut Fuer Theoretische Physik
ETH Hoenggerberg
CH-8093 Zuerich - Switzerland
Tel. +41 (0)1 6332562
=
Re: gcse pass: expression hash table
On Wed, 23 Feb 2005, James E Wilson wrote:

Tarun Kawatra wrote: During expression hash table construction in the gcse pass (gcc version 3.4.1), expressions like a*b do not get included in the expression hash table. Such expressions occur in a PARALLEL along with clobbers.

You didn't mention the target, or exactly what the mult looks like. However, this isn't hard to answer just by using the source. hash_scan_set calls want_to_cse_p calls can_assign_to_reg_p calls added_clobbers_hard_reg_p which presumably returns true, which prevents the optimization. This makes sense. If the pattern clobbers a hard reg, then we can't safely insert it at any place in the function. It might be clobbering the hard reg at a point where it holds a useful value.

While looking at this, I noticed can_assign_to_reg_p does something silly.

I could not find this function anywhere in the gcc 3.4.1 source, although FIRST_PSEUDO_REGISTER * 2 is used with make_insn_raw directly in want_to_gcse_p, as follows:

if (test_insn == 0)
  {
    test_insn
      = make_insn_raw (gen_rtx_SET (VOIDmode,
                                    gen_rtx_REG (word_mode,
                                                 FIRST_PSEUDO_REGISTER * 2),
                                    const0_rtx));
    NEXT_INSN (test_insn) = PREV_INSN (test_insn) = 0;
  }

It uses "FIRST_PSEUDO_REGISTER * 2" to try to generate a test pseudo register, but this can fail if a target has less than 4 registers, or if the set of virtual registers increases in the future. This should probably be LAST_VIRTUAL_REGISTER + 1 as used in another recent patch.

I could not get this point. -tarun
Re: Suggestion: Different exit code for ICE
On Thu, Feb 24, 2005 at 11:46:20AM +0100, Volker Reichelt wrote: > Regressions that cause ICE's on invalid code often go unnoticed in the > testsuite, since regular errors and ICE's both match { dg-error "" }. > See for example g++.dg/parse/error16.C which ICE's since yesterday, > but the testsuite still reports "PASS": > [snip] > > This calls for a more robust method IMHO. > One way would be to make the testsuite smarter and make it recognize > typical ICE patterns itself. This can indeed be done (I for example > use it to monitor the testcases in Bugzilla, Phil borrowed the patterns > for his regression tester). > > An easier way IMHO would be to return a different error code when > encountering an ICE. That's only a couple of places in diagnostic.c > and errors.c where we now have "exit (FATAL_EXIT_CODE);". > We could return an (appropriately defined) ICE_ERROR_CODE instead. > The testsuite would then just have to check the return value. > > What do you think? I don't think that it's appropriate for any test to use { dg-error "" }; there should always be some substring of the expected message there. If the message changes then tests need to be updated, but that's better than not noticing when the message changes unexpectedly or, worse yet, the message is for an ICE. A quick count, however, shows that 1022 tests use { dg-error "" }. Given that, using and detecting a different error code for an ICE is an excellent idea. Janis
Seeking patch for bug in lifetime of __cur in deque::_M_fill_initialize (powerpc dw2 EH gcc 3.4.2)
Is there a patch for the following problem? I am having problems with _M_fill_initialize in deque on the powerpc version compiled at -O2.

template <typename _Tp, typename _Alloc>
void deque<_Tp, _Alloc>::
_M_fill_initialize(const value_type& __value)
{
  _Map_pointer __cur;
  try
    {
      for (__cur = this->_M_impl._M_start._M_node;
           __cur < this->_M_impl._M_finish._M_node; ++__cur)
        std::uninitialized_fill(*__cur, *__cur + _S_buffer_size(), __value);
      /*** HERE ***/
      std::uninitialized_fill(this->_M_impl._M_finish._M_first,
                              this->_M_impl._M_finish._M_cur, __value);
    }
  catch(...)
    {
      std::_Destroy(this->_M_impl._M_start, iterator(*__cur, __cur));
      __throw_exception_again;
    }
}

The test code is reproduced below. The assembler output of the salient part of _M_fill_initialize is:

.L222:
        lwz 3,0(31)
        mr 5,30
        addi 6,1,8
        addi 4,3,512
.LEHB3:
        bl _ZSt24__uninitialized_fill_auxIP9TestClassS0_EvT_S2_RKT0_12__false_type
        lwz 0,36(29)
        addi 31,31,4
        cmplw 7,0,31
        bgt+ 7,.L222
        li 31,0
.L240:

Examining the output, .L240 corresponds to /*** HERE ***/. When the for() loop terminates, it appears __cur is in r31 and is zapped with 0. I suspect the optimizer has marked __cur as dead at this point. An exception caught whilst executing the 2nd uninitialized_fill() attempts to use __cur, but is thwarted because the value has been lost. I'm working with a port of dw2 based EH for powerpc VxWorks on gcc 3.4.2. The compiler builds and is working. I have ported the EH test from STLport, and most of the tests run. (BTW the ported EH tests run to completion on cygwin.)
Earl

-

bash-2.05b$ /gnu/local/bin/powerpc-wrs-vxworks-g++ -v -S -O2 bug.cpp
Reading specs from /gnu/local/lib/gcc/powerpc-wrs-vxworks/3.4.2/specs
Configured with: ../gcc-3.4.2/configure --target=powerpc-wrs-vxworks --disable-libstdcxx-pch --disable-shared --with-included-gettext --with-gnu-as --with-gnu-ld --with-ld=powerpc-wrs-vxworks-ld --with-as=powerpc-wrs-vxworks-as --exec-prefix=/gnu/local --prefix=/gnu/local --enable-languages=c,c++
Thread model: vxworks
gcc version 3.4.2
/gnu/local/libexec/gcc/powerpc-wrs-vxworks/3.4.2/cc1plus.exe -quiet -v -DCPU_FAMILY=PPC -D__ppc -D__EABI__ -DCPU=PPC604 -D__hardfp bug.cpp -mcpu=604 -mstrict-align -quiet -dumpbase bug.cpp -auxbase bug -O2 -version -o bug.s
ignoring nonexistent directory "/gnu/local/lib/gcc/powerpc-wrs-vxworks/3.4.2/../../../../powerpc-wrs-vxworks/sys-include"
ignoring nonexistent directory "*CYGWIN1512PATH"
#include "..." search starts here:
#include <...> search starts here:
 /gnu/local/lib/gcc/powerpc-wrs-vxworks/3.4.2/../../../../include/c++/3.4.2
 /gnu/local/lib/gcc/powerpc-wrs-vxworks/3.4.2/../../../../include/c++/3.4.2/powerpc-wrs-vxworks
 /gnu/local/lib/gcc/powerpc-wrs-vxworks/3.4.2/../../../../include/c++/3.4.2/backward
 /gnu/local/lib/gcc/powerpc-wrs-vxworks/3.4.2/include
 /gnu/local/lib/gcc/powerpc-wrs-vxworks/3.4.2/../../../../powerpc-wrs-vxworks/include
End of search list.
GNU C++ version 3.4.2 (powerpc-wrs-vxworks) compiled by GNU C version 3.4.1 (cygming special).
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
bug.cpp:6: warning: inline function `TestClass::TestClass()' used but never defined
bug.cpp:9: warning: inline function `TestClass::~TestClass()' used but never defined
bug.cpp:8: warning: inline function `TestClass::TestClass(const TestClass&)' used but never defined
bug.cpp:11: warning: inline function `TestClass& TestClass::operator=(const TestClass&)' used but never defined

-

#include <deque>

class TestClass
{
public:
  inline TestClass();
  inline TestClass( int value );
  inline TestClass( const TestClass& rhs );
  inline ~TestClass();
  inline TestClass& operator=( const TestClass& rhs );
  inline int value() const;
  inline TestClass operator!() const;
  bool operator==( const TestClass& rhs ) const;
  bool operator<( const TestClass& rhs ) const;
protected:
  static inline unsigned int get_random(unsigned range = UINT_MAX);
private:
  inline void Init( int value );
};

template class std::deque<TestClass>;
Re: Suggestion: Different exit code for ICE
Janis Johnson wrote: On Thu, Feb 24, 2005 at 11:46:20AM +0100, Volker Reichelt wrote: Regressions that cause ICE's on invalid code often go unnoticed in the testsuite, since regular errors and ICE's both match { dg-error "" }. See for example g++.dg/parse/error16.C which ICE's since yesterday, but the testsuite still reports "PASS": [snip] This calls for a more robust method IMHO. One way would be to make the testsuite smarter and make it recognize typical ICE patterns itself. This can indeed be done (I for example use it to monitor the testcases in Bugzilla, Phil borrowed the patterns for his regression tester). An easier way IMHO would be to return a different error code when encountering an ICE. That's only a couple of places in diagnostic.c and errors.c where we now have "exit (FATAL_EXIT_CODE);". We could return an (appropriately defined) ICE_ERROR_CODE instead. The testsuite would then just have to check the return value. What do you think? I don't think that it's appropriate for any test to use { dg-error "" }; I actually disagree; I think that sometimes it's important to know that there's some kind of diagnostic, but trying to match the wording seems like overkill to me. I don't feel that strongly about it, but I don't see anything wrong with the empty string. the message is for an ICE. A quick count, however, shows that 1022 tests use { dg-error "" }. Given that, using and detecting a different error code for an ICE is an excellent idea. I definitely agree. I think that would be great. -- Mark Mitchell CodeSourcery, LLC [EMAIL PROTECTED] (916) 791-8304
Re: [wwwdocs] CVS annotate brings me to GNATS
On Sat, 11 Dec 2004, Gerald Pfeifer wrote: >>> http://gcc.gnu.org/cgi-bin/cvsweb.cgi/old-gcc/PROBLEMS?annotate=1.1 >> The thing matching "PR" must be a little overzealous :) > Yup. I think I know how to fix this and hope to do it in the next few > days (after some other technical issues have been clarified). Fixed now, with the following change to httpd.conf. (I believe we actually might be able to remove the Rewrite... stuff.) Sorry for the delay, various things happened in between... Gerald

  # Support short URLs for referring to PRs.
  RewriteCond %{QUERY_STRING} ([0-9]+)$
- RewriteRule PR http://gcc.gnu.org/bugzilla/show_bug.cgi?id=%1 [R]
+ RewriteRule ^PR http://gcc.gnu.org/bugzilla/show_bug.cgi?id=%1 [R]
  RedirectMatch ^/PR([0-9]+)$ http://gcc.gnu.org/bugzilla/show_bug.cgi?id=$1
  include /etc/httpd/conf/spamblock
Quick 4.0 status update
Those of you who read my status reports closely will recognize that today is the day I announced as the day on which I would create the 4.0 release branch. I still plan to do that sometime today, where "today" is generously defined as "before I go to sleep tonight here in California". I've received a lot of good proposals for 4.1, and am working on ordering them as best I can. I'll be posting that information later today -- before I create the branch. FYI, -- Mark Mitchell CodeSourcery, LLC [EMAIL PROTECTED] (916) 791-8304
Re: Inlining and estimate_num_insns
Jan Hubicka wrote: Also, for the simple function double foo1(double x) { return x; } we return 4 as a cost, because we have double tmp = x; return tmp; and count the move cost (MODIFY_EXPR) twice. We could fix this by not walking (i.e. ignoring) RETURN_EXPR. That would work, yes. I was also thinking about ignoring MODIFY_EXPR for var = var as those likely gets propagated later. This looks like a good idea. In fact going even further and ignoring all assigns to DECL_IGNORED_P allows us to have the same size estimates for all functions down the inlining chain for int foo(int x) { return x; } int foo1(int x) { return foo(x); } ... and for the equivalent with int foo(void) { return 0; } and all related functions. Which is what we want. Of course ignoring all stores to artificial variables may have other bad side-effects. This results in a tramp3d-v3 performance increase from 1m56s to 27s (leafify brings us down to 23.5s). Note that we still assign reasonable cost to memory stores: inline void foo(double *x) { *x = 1.0; } has a cost of 2, and double y; void bar(void) { foo(&y); } too, if foo is inlined. Nice. Patch attached (with some unrelated stuff that just is cleanup) for you to play. Any thoughts on this radical approach? A testcase that could be pessimized by this? Of course default inlining limits would need to be adjusted if we do this. Richard. 
Index: cgraphunit.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/cgraphunit.c,v
retrieving revision 1.93
diff -c -3 -p -r1.93 cgraphunit.c
*** cgraphunit.c	21 Feb 2005 14:39:46 -	1.93
--- cgraphunit.c	24 Feb 2005 19:04:18 -
*** Software Foundation, 59 Temple Place - S
*** 190,197 ****
  #include "function.h"
  #include "tree-gimple.h"

- #define INSNS_PER_CALL 10
-
  static void cgraph_expand_all_functions (void);
  static void cgraph_mark_functions_to_output (void);
  static void cgraph_expand_function (struct cgraph_node *);
--- 190,195 ----

Index: tree-inline.h
===================================================================
RCS file: /cvs/gcc/gcc/gcc/tree-inline.h,v
retrieving revision 1.14
diff -c -3 -p -r1.14 tree-inline.h
*** tree-inline.h	8 Nov 2004 22:40:09 -	1.14
--- tree-inline.h	24 Feb 2005 19:04:18 -
*** bool tree_inlinable_function_p (tree);
*** 29,34 ****
--- 29,35 ----
  tree copy_tree_r (tree *, int *, void *);
  void clone_body (tree, tree, void *);
  tree save_body (tree, tree *, tree *);
+ int estimate_move_cost (tree type);
  int estimate_num_insns (tree expr);

  /* 0 if we should not perform inlining.  */
*** int estimate_num_insns (tree expr);
*** 38,41 ****
--- 39,47 ----
  extern int flag_inline_trees;
+
+ /* Instructions per call.  Used in estimate_num_insns and in the
+    inliner to account for removed calls.  */
+
+ #define INSNS_PER_CALL 10
+
  #endif /* GCC_TREE_INLINE_H */

Index: tree-inline.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/tree-inline.c,v
retrieving revision 1.170
diff -c -3 -p -r1.170 tree-inline.c
*** tree-inline.c	27 Jan 2005 14:36:17 -	1.170
--- tree-inline.c	24 Feb 2005 19:04:19 -
*** inlinable_function_p (tree fn)
*** 1165,1170 ****
--- 1165,1189 ----
    return inlinable;
  }

+ /* Estimate the number of instructions needed for a move of
+    the specified type.  */
+
+ int
+ estimate_move_cost (tree type)
+ {
+   HOST_WIDE_INT size;
+
+   if (VOID_TYPE_P (type))
+     return 0;
+
+   size = int_size_in_bytes (type);
+
+   if (size < 0 || size > MOVE_MAX_PIECES * MOVE_RATIO)
+     return INSNS_PER_CALL;
+   else
+     return ((size + MOVE_MAX_PIECES - 1) / MOVE_MAX_PIECES);
+ }
+
  /* Used by estimate_num_insns.  Estimate number of instructions seen
     by given statement.  */
*** estimate_num_insns_1 (tree *tp, int *wal
*** 1245,1266 ****
      /* Recognize assignments of large structures
         and constructors of big arrays.  */
- case INIT_EXPR:
  case MODIFY_EXPR:
      x = TREE_OPERAND (x, 0);
      /* FALLTHRU */
- case TARGET_EXPR:
  case CONSTRUCTOR:
!     {
!       HOST_WIDE_INT size;
!
!       size = int_size_in_bytes (TREE_TYPE (x));
!
!       if (size < 0 || size > MOVE_MAX_PIECES * MOVE_RATIO)
!         *count += 10;
!       else
!         *count += ((size + MOVE_MAX_PIECES - 1) / MOVE_MAX_PIECES);
!     }
      break;
      /* Assign cost of 1 to usual operations.  */
--- 1264,1278 ----
      /* Recognize assignments of large structures
         and constructors of big arrays.  */
  case MODIFY_EXPR:
+     if (DECL_P (TREE_OPERAND (x, 0)) && DECL_IGNORED_P (TREE_OPERAND (x, 0)))
+       break;
+ case INIT_EXPR:
+ case TARGET_EXPR:
      x = TREE_OPERAND (x, 0);
      /* FALLTHRU */
  case CONSTRUCTOR:
!     *count += estimate_move_cost
Re: Inlining and estimate_num_insns
On Thu, 24 Feb 2005 20:05:37 +0100, Richard Guenther <[EMAIL PROTECTED]> wrote:
> Jan Hubicka wrote:
> >>Also, for the simple function
> >>
> >>double foo1(double x)
> >>{
> >>return x;
> >>}
> >>
> >>we return 4 as a cost, because we have
> >>
> >> double tmp = x;
> >> return tmp;
> >>
> >>and count the move cost (MODIFY_EXPR) twice. We could fix this
> >>by not walking (i.e. ignoring) RETURN_EXPR.
> >
> > That would work, yes. I was also thinking about ignoring MODIFY_EXPR
> > for var = var as those likely gets propagated later.
>
> This looks like a good idea. In fact going even further and ignoring
> all assigns to DECL_IGNORED_P allows us to have the same size estimates
> for all functions down the inlining chain for

Note that this behavior also more closely matches the counting of gcc 3.4, which has a cost of zero for inline int foo(void) { return 0; } and a cost of one for int bar(void) { return foo(); }, while with the patch we have zero for foo and zero for bar. For inline void foo(double *x) { *x = 1.0; } double y; void bar(void) { foo(&y); } 3.4 has 3 and 5 after inlining; with the patch we get 2 and 2. For inline double foo(double x) { return x*x; } inline double foo1(double x) { return foo(x); } double foo2(double x) { return foo1(x); } 3.4 has 1, 2 and 3; with the patch we get 1, 1 and 1. For a random collection of C files out of scimark2 we get:

        3.4                  4.0                   4.0 patched
SOR     54, 10               125, 26               63, 14
FFT     44, 11, 200, 59      65, 10, 406, 111      51, 10, 243, 71

so apart from a constant factor 4.0 patched goes back to 3.4 behavior (at least it doesn't show weird numbers). Given that we didn't change inlining limits between 3.4 and 4.0 that looks better anyway. And of course the testcases above show we are better at removing abstraction penalty. Richard.
Re: gcse pass: expression hash table
On Thu, 2005-02-24 at 02:13, Tarun Kawatra wrote: > If that is the reason, then even plus expression (shown below) should not > be subjected to PRE as it also clobbers a hard register(CC). But it is being > subjected to PRE. Multiplication expression while it looks same does not > get even in hash table. My assumption here was that if I gave you a few pointers, you would try to debug the problem yourself. If you want someone else to debug it for you, then you need to give much better info. See for instance http://gcc.gnu.org/bugs.html which gives info on how to properly report a bug. I have the target and gcc version, but I need a testcase, compiler options, and perhaps other info. How do you know that adds are getting optimized? Did you judge this by looking at one of the dump files, or looking at the assembly output? Maybe you are looking at the wrong thing, or misunderstanding what you are looking at? You need to give more details here. If I try compiling a trivial example with -O2 -da -S for both IA-64 and x86, and then looking at the .gcse dump file, I see that both the multiply and the add are in the hash table dump for the IA-64, but neither are in the hash table dump for the x86. The reason why is as I explained, the added_clobbers_hard_reg_p call returns true for both on x86, but not on IA-64. If you are seeing something different, then you need to give more details. Perhaps you are looking at a different part of gcse than I am. -- Jim Wilson, GNU Tools Support, http://www.SpecifixInc.com
Re: gcse pass: expression hash table
On Thu, 2005-02-24 at 09:20, Tarun Kawatra wrote: > On Wed, 23 Feb 2005, James E Wilson wrote: > > While looking at this, I noticed can_assign_to_reg_p does something silly. > ^^^ > I could not find this function anywhere in gcc > 3.4.1 source. I was looking at current gcc sources. >> but this can fail if a target has less than 4 registers > I could not get this point. Don't worry about that, you don't need to understand this bit. -- Jim Wilson, GNU Tools Support, http://www.SpecifixInc.com
Re: gcse pass: expression hash table
On Thu, 2005-02-24 at 03:15, Steven Bosscher wrote: > On Feb 24, 2005 11:13 AM, Tarun Kawatra <[EMAIL PROTECTED]> wrote: > Does GCSE look into stuff in PARALLELs at all? From gcse.c: Shrug. The code in hash_scan_set seems to be doing something reasonable. The problem I saw wasn't with finding expressions to gcse, it was with inserting them later. The insertion would create a cc reg clobber, so we don't bother adding it to the hash table. I didn't look any further, but it seemed reasonable that if it isn't in the hash table, then it isn't going to be optimized. It seems that switching the x86 backend from using cc0 to using a cc hard register has effectively crippled the RTL gcse pass for it. -- Jim Wilson, GNU Tools Support, http://www.SpecifixInc.com
-Wfatal-errors=n
From here: http://gcc.gnu.org/ml/gcc/2005-02/msg00923.html I so want this. I've created a bugzilla entry for this as an enhancement so this does not get lost. http://gcc.gnu.org/bugzilla/show_bug.cgi?id=20201 -benjamin
Re: gcse pass: expression hash table
On Thu, 24 Feb 2005, James E Wilson wrote:

On Thu, 2005-02-24 at 03:15, Steven Bosscher wrote: On Feb 24, 2005 11:13 AM, Tarun Kawatra <[EMAIL PROTECTED]> wrote: Does GCSE look into stuff in PARALLELs at all? From gcse.c:

Shrug. The code in hash_scan_set seems to be doing something reasonable. The problem I saw wasn't with finding expressions to gcse, it was with inserting them later. The insertion would create a cc reg clobber, so we don't bother adding it to the hash table. I didn't look any further, but it seemed reasonable that if it isn't in the hash table, then it isn't going to be optimized.

You are write here that if some expr doesn't get into hash table, it will not get optimized. But since plus expressions on x86 also clobber CC, as shown below:

(insn 40 61 42 2 (parallel [
        (set (reg/v:SI 74 [ c ])
            (plus:SI (reg:SI 86)
                (reg:SI 85)))
        (clobber (reg:CC 17 flags))
    ]) 138 {*addsi_1} (nil)
    (nil))

why does the same reasoning not apply to plus expressions? Why will their insertion later not create any problems?

Actually I am trying to extend the PRE implementation so that it performs strength reduction as well. It requires multiplication expressions to get into the hash table. I am debugging the code to find where the differences between the two kinds of expressions occur. Will let you all know if I find anything interesting. If you know this already, please share it with me. Thanks -tarun

It seems that switching the x86 backend from using cc0 to using a cc hard register has effectively crippled the RTL gcse pass for it.
Re: gcse pass: expression hash table
On Feb 24, 2005, at 3:55 PM, Tarun Kawatra wrote: Actually I am trying to extend PRE implementation so that it performs strength reduction as well. it requires multiplication expressions to get into hash table. Why do you want to do that? Strength reduction is done already in loop.c. Thanks, Andrew Pinski
Re: gcse pass: expression hash table
On Thu, 2005-02-24 at 15:59 -0500, Andrew Pinski wrote:
> On Feb 24, 2005, at 3:55 PM, Tarun Kawatra wrote:
> > Actually I am trying to extend PRE implementation so that it performs
> > strength reduction as well. it requires multiplication expressions to
> > get into hash table.
>
> Why do you want to do that?
> Strength reduction is done already in loop.c.

Generally, PRE based strength reduction also includes straight-line code strength reduction. Non-SSA based ones don't do much better in terms of redundancy elimination, but the SSA based ones can eliminate many more redundancies when you integrate strength reduction into them. IE given something like:

b = a * b;
if (argc)
  {
    a = a + 1;
  }
else
  {
    a = a + 2;
  }
c = a * b;

It will remove the second multiply in favor of additions at the site of the changes of a. --Dan
Re: gcse pass: expression hash table
On Thu, 24 Feb 2005, James E Wilson wrote:

On Thu, 2005-02-24 at 03:15, Steven Bosscher wrote: On Feb 24, 2005 11:13 AM, Tarun Kawatra <[EMAIL PROTECTED]> wrote: Does GCSE look into stuff in PARALLELs at all? From gcse.c:

Shrug. The code in hash_scan_set seems to be doing something reasonable. The problem I saw wasn't with finding expressions to gcse, it was with inserting them later. The insertion would create a cc reg clobber, so we don't bother adding it to the hash table. I didn't look any further, but it seemed reasonable that if it isn't in the hash table, then it isn't going to be optimized.

This is with reference to my latest mail. I found that when inserting plus-type expressions, the expressions inserted do not contain the clobber of CC, even if it is there in the original instruction. For example, for the instruction

(insn 40 61 42 2 (parallel [
        (set (reg/v:SI 74 [ c ])
            (plus:SI (reg:SI 86)
                (reg:SI 85)))
        (clobber (reg:CC 17 flags))
    ]) 138 {*addsi_1} (nil))

the instruction inserted is

(insn 72 64 36 2 (set (reg:SI 87)
        (plus:SI (reg:SI 86 [ a ])
            (reg:SI 85 [ b ]))) 134 {*lea_1} (nil)
    (nil))

That is, it converts addsi_1 to lea_1. -tarun

> It seems that switching the x86 backend from using cc0 to using a cc hard register has effectively crippled the RTL gcse pass for it.
Re: gcse pass: expression hash table
On Thu, 2005-02-24 at 12:55, Tarun Kawatra wrote:
> You are write here that if some expr doesn't get into hash table, it will
> not get optimized.

That was an assumption on my part. You shouldn't take it as the literal truth. I'm not an expert on all implementation details of the gcse.c pass.

> But since plus expressions on x86 also clobber CC as
> shown below
> then why the same reasoning does not apply to plus expressions. Why will
> there insertion later will not create any problems?

Obviously, plus expressions will have the same problem. That is why I question whether plus expressions are properly getting optimized. Since you haven't provided any example that shows that they are being optimized, or pointed me at anything in the gcse.c file I can look at, there isn't anything more I can do to help you. All I can do is tell you that you need to give more details, or debug the problem yourself.

> Actually I am trying to extend PRE implementation so that it performs
> strength reduction as well. it requires multiplication expressions to get
> into hash table.

Current sources have a higher level intermediate language (gimple) and SSA based optimization passes that operate on them. This includes a tree-ssa-pre.c pass. It might be more useful to extend this to do strength reduction than to try to extend the RTL gcse pass.

> I am debugging the code to find where the differences for the two kind of
> expressions occur.
> Will let you all know if I found anything interesting.

Good.

> If you know this already please share with me.

It is unlikely that anyone already knows this info offhand. -- Jim Wilson, GNU Tools Support, http://www.SpecifixInc.com
Re: gcse pass: expression hash table
On Thursday 24 February 2005 21:16, James E Wilson wrote:
> On Thu, 2005-02-24 at 03:15, Steven Bosscher wrote:
> > On Feb 24, 2005 11:13 AM, Tarun Kawatra <[EMAIL PROTECTED]> wrote:
> > Does GCSE look into stuff in PARALLELs at all? From gcse.c:
>
> Shrug. The code in hash_scan_set seems to be doing something
> reasonable.
>
> The problem I saw wasn't with finding expressions to gcse, it was with
> inserting them later. The insertion would create a cc reg clobber, so
> we don't bother adding it to the hash table. I didn't look any further,
> but it seemed reasonable that if it isn't in the hash table, then it
> isn't going to be optimized.
>
> It seems that switching the x86 backend from using cc0 to using a cc
> hard register has effectively crippled the RTL gcse pass for it.

Not that it matters so much. GCSE does more harm than good for lots of code (including SPEC - the mean for int and fp goes *up* if you disable GCSE for x86*). The problem indeed appears to be inserting the expressions. I am quite sure there was a patch to allow GCSE to do more with PARALLELs, but I can't find it anywhere. I did stumble into this mail: http://gcc.gnu.org/ml/gcc/2003-07/msg02064.html:

" - My code for GCSE on parallels that is actually in cfg branch only and first half of the changes went into mainline (basic code motion infrastructure) "

In one of the replies, rth said: "I'm not sure these are worthwhile long term. I expect the rtl GCSE optimizer to collapse to almost nothing with the tree-ssa merge." Which probably explains why these bits were never merged from the cfg-branch for GCC 3.4. Ah, archeology, so much fun. Gr. Steven
Re: gcse pass: expression hash table
On Thursday 24 February 2005 21:59, Andrew Pinski wrote: > On Feb 24, 2005, at 3:55 PM, Tarun Kawatra wrote: > > Actually I am trying to extend PRE implementation so that it performs > > strength reduction as well. it requires multiplication expressions to > > get into hash table. > > Why do you want to do that? > Strength reduction is done already in loop.c. First, that's a different kind of strength reduction. Second, we'd like to blow away loop.c so replacing it would not be a bad thing ;-) But the kind of strength reduction PRE can do is something different. Didn't Dan already have patches for that in the old tree SSAPRE, and some ideas on how to do it in GVN-PRE? Gr. Steven
Re: gcse pass: expression hash table
On Thu, 24 Feb 2005, Andrew Pinski wrote:

On Feb 24, 2005, at 3:55 PM, Tarun Kawatra wrote: Actually I am trying to extend PRE implementation so that it performs strength reduction as well. it requires multiplication expressions to get into hash table.

Why do you want to do that? Strength reduction is done already in loop.c.

We may then get rid of the loop optimization pass if the optimizations captured by the extended PRE approach are comparable to those of loop.c. Maybe not all of them, but this approach can capture straight-line code strength reduction (which need not depend on any loop, unlike induction-variable based optimization). -tarun

Thanks, Andrew Pinski
Re: gcse pass: expression hash table
On Thu, 24 Feb 2005, James E Wilson wrote: On Thu, 2005-02-24 at 12:55, Tarun Kawatra wrote: You are write here that if some expr doesn't get into hash table, it will ^^ right. -tarun not get optimized. That was an assumption on my part. You shouldn't take it as the literal truth. I'm not an expert on all implementation details of the gcse.c pass. But since plus expressions on x86 also clobber CC as shown below then why the same reasoning does not apply to plus expressions. Why will there insertion later will not create any problems? Obviously, plus expressions will have the same problem. That is why I question whether plus expressions are properly getting optimized. Since you haven't provided any example that shows that they are being optimized, or pointed me at anything in the gcse.c file I can look at, there isn't anything more I can do to help you. All I can do is tell you that you need to give more details, or debug the problem yourself. Actually I am trying to extend PRE implementation so that it performs strength reduction as well. it requires multiplication expressions to get into hash table. Current sources have a higher level intermediate language (gimple) and SSA based optimization passes that operate on them. This includes a tree-ssa-pre.c pass. It might be more useful to extend this to do strength reduction that to try to extend the RTL gcse pass. I am debugging the code to find where the differences for the two kind of expressions occur. Will let you all know if I found anything interesting. Good. If you know this already please share with me. It is unlikely that anyone already knows this info offhand.
Re: gcse pass: expression hash table
On Thu, 2005-02-24 at 22:28 +0100, Steven Bosscher wrote: > On Thursday 24 February 2005 21:59, Andrew Pinski wrote: > > On Feb 24, 2005, at 3:55 PM, Tarun Kawatra wrote: > > > Actually I am trying to extend PRE implementation so that it performs > > > strength reduction as well. it requires multiplication expressions to > > > get into hash table. > > > > Why do you want to do that? > > Strength reduction is done already in loop.c. > > First, that's a different kind of strength reduction. Second, > we'd like to blow away loop.c so replacing it would not be a > bad thing ;-) But the kind of strength reduction PRE can do > is something different. Didn't Dan already have patches for > that in the old tree SSAPRE, and some ideas on how to do it > in GVN-PRE? yes and yes. :)
mthumb in specs file
Is it possible by hacking the specs file to change the target for arm-elf-gcc from -marm to -mthumb? I tried a few obvious things like changing marm in *multilib_defaults to mthumb, but this did not have the desired effect. Please cc me in your reply. Thanks! Shaun
Re: gcse pass: expression hash table
> My assumption here was that if I gave you a few pointers, you would try to
> debug the problem yourself. If you want someone else to debug it for you,
> then you need to give much better info. See for instance
> http://gcc.gnu.org/bugs.html which gives info on how to properly report a
> bug. I have the target and gcc version, but I need a testcase, compiler
> options, and perhaps other info.

I will take this into consideration from now on. The test case I am using (for the multiplication expression) is:

#include <stdio.h>

void foo();

int main()
{
    foo();
}

void foo()
{
    int a, b, c;
    int cond;

    scanf(" %d %d %d", &a, &b, &cond);
    if (cond)
        c = a * b;
    c = a * b;
    printf("Value of C is %d", c);
}

and for plus, a*b is replaced by a+b everywhere. I am compiling it as

    gcc --param max-gcse-passes=2 -dF -dG -O3 filename.c

The reason for max-gcse-passes=2 is that in the first pass, a+b kind of expressions use different sets of pseudo registers at the first and second occurrence of a+b. After one gcse pass, both become the same (because of the intermediate constant/copy propagation passes), and then a+b gets optimized, as can be seen from the dumps filename.c.07.addressof and filename.c.08.gcse.

A part of the expression hash table for the program containing plus is:

Expression hash table (11 buckets, 11 entries)
Index 0 (hash value 3)
  (plus:SI (reg/f:SI 20 frame) (const_int -4 [0xfffc]))
Index 8 (hash value 1)
  (mem/f:SI (plus:SI (reg/f:SI 20 frame) (const_int -8 [0xfff8])) [2 b+0 S4 A32])
Index 9 (hash value 6)
  (plus:SI (reg:SI 78 [ a ]) (reg:SI 79 [ b ]))
Index 10 (hash value 10)
  (plus:SI (reg:SI 80 [ a ]) (reg:SI 81 [ b ]))

This clearly shows that the clobbering of CC in a+b is being ignored, so the expressions that need to be inserted will not contain the clobber of CC.

> How do you know that adds are getting optimized? Did you judge this by
> looking at one of the dump files, or looking at the assembly output?

I am looking at the dump files.

> Maybe you are looking at the wrong thing, or misunderstanding what you are
> looking at? You need to give more details here.

Regards,
-tarun
-Ttext with -mthumb causes relocation truncated to fit
When -Ttext is used in combination with -mthumb it causes a "relocation truncated to fit" message. What does this mean, and how do I fix it?

Please cc me in your reply. Thanks,
Shaun

$ arm-elf-gcc --version | head -1
arm-elf-gcc (GCC) 3.4.0
$ cat hello.c
int main()
{
	return 0;
}
$ arm-elf-gcc -Ttext 0x200 -mthumb hello.c
/opt/pathport/lib/gcc/arm-elf/3.4.0/thumb/crtbegin.o(.init+0x0): In function `$t':
: relocation truncated to fit: R_ARM_THM_PC22 frame_dummy
/opt/pathport/lib/gcc/arm-elf/3.4.0/../../../../arm-elf/lib/thumb/crt0.o(.text+0x9a):../../../../../../../gcc-3.4.0/newlib/libc/sys/arm/crt0.S:200: relocation truncated to fit: R_ARM_THM_PC22 _init
/opt/pathport/lib/gcc/arm-elf/3.4.0/thumb/crtend.o(.init+0x0): In function `$t':
: relocation truncated to fit: R_ARM_THM_PC22 __do_global_ctors_aux
collect2: ld returned 1 exit status
Re: C++ math optimization problem...
Hello,

Regarding the testcase I mentioned before, I have been checking out the Intel compiler to see if it would generate better code. Interestingly enough, it displays EXACTLY the same run-times as gcc for the two tests (0.2s for math in if-block, 1.0s for math out of if-block). So this is rather strange. Shall I file a PR if it doesn't become clear what is going on?

thanks,
-BenRI

#include <cstdio>
#include <cstdlib>
#include <vector>

const int OUTER = 10;
const int INNER = 1000;

using namespace std;

int main(int argn, char *argv[])
{
    int s = atoi(argv[1]);
    double result;
    {
        vector<double> d(INNER); // move outside of this scope to fix

        // initialize d
        for (int i = 0; i < INNER; i++)
            d[i] = double(1+i) / INNER;

        // calc result
        result = 0;
        for (int i = 0; i < OUTER; ++i)
            for (int j = 1; j < INNER; ++j)
                result += d[j]*d[j-1] + d[j-1];
    }
    printf("result = %f\n", result);
    return 0;
}

P.S. Um, is the gcc listserv intelligent enough not to send you all a second copy of this e-mail?
Re: -Ttext with -mthumb causes relocation truncated to fit
On Thu, Feb 24, 2005 at 03:23:53PM -0800, Shaun Jackman wrote: > When -Ttext is used in combination with -mthumb it causes a relocation > truncated to fit message. What does this mean, and how do I fix it? > > Please cc me in your reply. Thanks, > Shaun Don't use -Ttext with an ELF toolchain; use a linker script instead. -- Daniel Jacobowitz CodeSourcery, LLC
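[Editor's note: a minimal sketch of the linker-script route Daniel suggests. The section layout and names below are illustrative only; a real script for this newlib/arm-elf toolchain needs the full contents of the default armelf.xc, as the next thread shows.]

```
/* hello.ld -- illustrative sketch, not a drop-in replacement */
ENTRY(_start)
SECTIONS
{
  . = 0x200;                 /* place code at 0x200, as -Ttext 0x200 did */
  .text : { *(.text*) }
  .data : { *(.data*) }
  .bss  : { *(.bss*) }
}
```

In practice, copying the default script (armelf.xc) and editing only the start address keeps the rest of the layout intact.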
Specifying a linker script from the specs file
I have had no trouble specifying the linker script using the -T switch to gcc. I am now trying to specify the linker script from a specs file like so:

%rename link old_link
*link:
-Thello.ld%s %(old_link)

gcc complains though about linking Thumb code against ARM libraries -- I've specified -mthumb to gcc --

/opt/pathport/lib/gcc/arm-elf/3.4.0/../../../../arm-elf/bin/ld: /opt/pathport/arm-elf/lib/libc.a(memset.o)(memset): warning: interworking not enabled.

Why does the above specs snippet cause gcc to forget it's linking against thumb libraries?

Please cc me in your reply. Thanks,
Shaun

$ arm-elf-gcc --version | head -1
arm-elf-gcc (GCC) 3.4.0
$ cat hello.c
int main()
{
	return 0;
}
$ cat hello.specs
%rename link old_link
*link:
-Thello.ld%s %(old_link)
$ diff /opt/pathport/arm-elf/lib/ldscripts/armelf.xc hello.ld
12c12
< PROVIDE (__executable_start = 0x8000); . = 0x8000;
---
> PROVIDE (__executable_start = 0x200); . = 0x200;
181c181
<   .stack 0x8 :
---
>   .stack 0x2100 :
$ arm-elf-gcc -mthumb -Thello.ld hello.c
$ arm-elf-gcc -mthumb -specs=hello.specs hello.c 2>&1 | head -1
/opt/pathport/lib/gcc/arm-elf/3.4.0/../../../../arm-elf/bin/ld: /opt/pathport/arm-elf/lib/libc.a(memset.o)(memset): warning: interworking not enabled.
what's the proper way to configure/enable/disable C exception handling?
In attempting to configure a target limited to 32-bit C type support, it became obvious that exception support seems to be unconditionally required, and defaults to assuming target support for 64-bit data types even when the target is not configured to support types this large.

- Is this intentional/necessary for C language compilation?
- If not, what's the recommended way to specify the configuration to either eliminate the necessity, or select an exception model which doesn't require 64-bit target type support?
- Might forcing sjlj exceptions help? With what consequences?
- Or might it be best for me to attempt to refine the baseline exception data structure definitions to be more aware of the target's supported type sizes?
- If so, which target configuration header files or facilities would be the officially most ideal/correct ones to use to convey the target's supported type size configuration to the exception handling implementation files?

Any insight/recommendations would be appreciated. Thanks,
-paul-
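[Editor's note: for reference, the configure-time switch that forces the setjmp/longjmp model does exist (--enable-sjlj-exceptions); whether it sidesteps the 64-bit type assumptions is exactly the open question above. A sketch — the target triplet, prefix, and source path are placeholders:]

```
# Build a cross compiler with sjlj exceptions forced on.
../gcc/configure --target=arm-elf --prefix=/opt/cross \
    --enable-languages=c --enable-sjlj-exceptions
make
```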
GNU INTERCAL front-end for GCC?
I am thinking of adding a front-end for INTERCAL to GCC. INTERCAL is an esoteric programming language that was created in 1972 with the goal of having nothing in common with other languages (see http://catb.org/~esr/intercal). There is a C implementation of INTERCAL (called C-INTERCAL) available there. I think it would be a good project(1) as a front-end(2) to GCC.

Samuel Lauber

(1) -> Don't say that I'm crazy.
(2) -> Some of us would like DO .1 <- #0 to be translated into movl $0, v1
Re: Benchmark of gcc 4.0
Hello!

I just got interested and did a test myself, comparing gcc 4.0 (-O2 -funroll-loops -D__NO_MATH_INLINES -ffast-math -march=pentium4 -mfpmath=sse -ftree-vectorize) and icc 9.0 beta (-O3 -xW -ip).

Here are the results of scimark with '-O3 -march=pentium4 -mfpmath=... -funroll-loops -ftree-vectorize -ffast-math -D__NO_MATH_INLINES -fomit-frame-pointer' and various -mfpmath settings:

-mfpmath=sse:
Composite Score:         664.47
FFT             Mflops:  371.12 (N=1024)
SOR             Mflops:  511.13 (100 x 100)
MonteCarlo      Mflops:  130.94
Sparse matmult  Mflops:  856.68 (N=1000, nz=5000)
LU              Mflops: 1452.48 (M=100, N=100)

-mfpmath=387:
Composite Score:         624.14
FFT             Mflops:  391.09 (N=1024)
SOR             Mflops:  465.45 (100 x 100)
MonteCarlo      Mflops:  188.38
Sparse matmult  Mflops:  811.59 (N=1000, nz=5000)
LU              Mflops: 1264.20 (M=100, N=100)

-mfpmath=sse,387:
Composite Score:         665.51
FFT             Mflops:  372.70 (N=1024)
SOR             Mflops:  509.78 (100 x 100)
MonteCarlo      Mflops:  148.72
Sparse matmult  Mflops:  832.20 (N=1000, nz=5000)
LU              Mflops: 1464.16 (M=100, N=100)

I think that the results will be even better once PR18463 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18463) is fixed. The LU benchmark is one of the testcases where these problems were found. You can check the asm code for sequences like:

	leal	0(,%ecx,8), %edx
	movsd	(%ebx,%edx), %xmm0

instead of:

	movsd	(%ebx,%ecx,8), %xmm0

Uros.