[Bug tree-optimization/54717] New: Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

             Bug #: 54717
           Summary: Runtime regression: polyhedron test "rnflow" degraded
    Classification: Unclassified
           Product: gcc
           Version: 4.8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: sergos@gmail.com

commit 024fee2c369096e6fe6cde620243df5843893004
Author: rguenth
Date:   Thu Sep 13 12:43:58 2012 +

    2012-09-13  Richard Guenther

        * tree-ssa-sccvn.h (enum vn_kind): New.
        (vn_get_stmt_kind): Likewise.
        * tree-ssa-sccvn.c (vn_get_stmt_kind): New function, adjust ADDR_EXPR handling.
        (visit_use): Use it.
        * tree-ssa-pre.c (compute_avail): Likewise, simplify further.
        * gcc.dg/tree-ssa/ssa-fre-37.c: New testcase.

    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@191253 138bc75d-0d04-0410-961f-82ee72b054a4

caused a 20% degradation on polyhedron's "rnflow"

commit 780bedc1ccae5ae85fb99afed8a1ac1cc598121b
    Geometric Mean Execution Time = 18.28 seconds

commit 024fee2c369096e6fe6cde620243df5843893004
    Geometric Mean Execution Time = 24.82 seconds

compilation options used:
    gfortran -march=native -ffast-math -funroll-loops -O3 -ftree-vectorize %n.f90 -static -o %n
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

--- Comment #3 from Sergey Ostanevich 2012-09-26 15:11:38 UTC ---
Adding -### gives the following (only the options part is shown):

/export/users/syostane/pb11/gcc120914/libexec/gcc/x86_64-unknown-linux-gnu/4.8.0/f951 air.f90 "-march=corei7" -mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mno-avx -mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase -mno-rdseed -mno-prfchw -mno-adx --param "l1-cache-size=32" --param "l1-cache-line-size=64" --param "l2-cache-size=12288" "-mtune=corei7" -quiet -dumpbase air.f90 -auxbase air -fintrinsic-modules-path /export/users/syostane/pb11/gcc120914/lib/gcc/x86_64-unknown-linux-gnu/4.8.0/finclude -o /tmp/ccmW82c1.s
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

--- Comment #5 from Sergey Ostanevich 2012-09-26 20:07:26 UTC ---
For 093t.pre I see the following missing in the cptrf2 function; the first version is good, the second is degraded:

***
*** 8947,8966
  goto ;
  :
- pretmp_325 = (integer(kind=8)) ival2_80;
- pretmp_326 = pretmp_325 + -1;
- pretmp_327 = *xxtrt_25(D)[pretmp_326];
  :
  # ival2_136 = PHI
  # ival2_140 = PHI
- # prephitmp_328 = PHI
  _137 = (integer(kind=8)) ival2_136;
  _138 = _137 + -1;
  _139 = *xxtrt_25(D)[_138];
  _141 = (integer(kind=8)) ival2_140;
  _142 = _141 + -1;
! _143 = prephitmp_328;
  if (_139 < _143) goto ; else
--- 8838,8853
  goto ;
  :
  :
  # ival2_136 = PHI
  # ival2_140 = PHI
  _137 = (integer(kind=8)) ival2_136;
  _138 = _137 + -1;
  _139 = *xxtrt_25(D)[_138];
  _141 = (integer(kind=8)) ival2_140;
  _142 = _141 + -1;
! _143 = *xxtrt_25(D)[_142];
  if (_139 < _143) goto ; else
***

More surprising to me is that the first difference already shows up in 020t.inline_param1:

***
*** 16790,16794
  calls:
  dtrti2/26 function not considered for inlining
! loop depth: 0 freq:1000 size: 9 time: 18 callee size:82 stack:28
  dtrsm/21 function not considered for inlining
  loop depth: 0 freq:1000 size:16 time: 25 callee size:324 stack: 4
--- 16790,16794
  calls:
  dtrti2/26 function not considered for inlining
! loop depth: 0 freq:1000 size: 9 time: 18 callee size:81 stack:28
  dtrsm/21 function not considered for inlining
  loop depth: 0 freq:1000 size:16 time: 25 callee size:324 stack: 4
***
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

--- Comment #9 from Sergey Ostanevich 2012-10-08 08:55:25 UTC ---
Thanks for the reduced test, Dominique!
I see that the vectorizer did not manage to generate MIN after the change. It also looks pretty similar to what I posted at first: no prephitmp was created for the xxtrt_[] access.

> ival2_15 = _85 < prephitmp_266 ? ival2_10 : iva
> prephitmp_237 = MIN_EXPR <_85, prephitmp_266>;
---
< _86 = (integer(kind=8)) ival2_14;
< _87 = _86 + -1;
< _88 = *xxtrt_46(D)[_87];
< ival2_15 = _85 < _88 ? ival2_10 : ival2_14;

I suspect that one of the iterators you removed, possibly VEC_iterate, traversed more than the one you created. Could that be the case?

I also double-checked that for the reduced test MIN is not generated and does not appear in the assembly. PMU measurements (VTune) confirm that the basic blocks missing MIN account for the difference in clocks.
[Bug target/49206] [4.5/4.6/4.7 Regression] RA failure in spill_failure, at reload1.c:2113
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49206

--- Comment #3 from Sergey Ostanevich 2011-08-22 16:37:54 UTC ---
Is it right that while() is an infinite loop? At least some phases can rely on this?
[Bug c/50315] Regression on Atom after fix #49958
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50315

Sergey Ostanevich changed:

           What    |Removed                     |Added
                 CC|                            |sergos.gnu at gmail dot com

--- Comment #4 from Sergey Ostanevich 2011-09-07 13:56:30 UTC ---
Richard,

Would it be a good idea to have a two's-complement architecture hook? On x86 we can reassociate, since the architecture itself always behaves as two's complement. Introducing such a flag could help with this particular reassociation and with another one that Ilya Enkovich implemented recently.

What's your opinion?
[Bug middle-end/50315] Regression on Atom after fix #49958
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50315

--- Comment #7 from Sergey Ostanevich 2011-09-15 11:24:27 UTC ---
Richard,

I believe your text should read as

> So you can go from (a +no b) +no c to a +no (b + c), dropping overflow knowledge on re-association.

Let me also rephrase what Joseph said, just to be sure I got the idea: we have to preserve the overflow semantics at the GIMPLE level to avoid possible problems during translation into RTL.

Suppose a calculation does not overflow in 32 bits with a particular evaluation order, and we can use either 32-bit or 64-bit operations to perform it. After reassociation in GIMPLE we can introduce an overflow in 32 bits, which leads to a wrong result if 64-bit operations are used. If the translator were aware of this situation it could avoid the error, but providing that information to the translator requires too much effort, or may even be impossible.

Is that right?
[Bug target/50572] New: unstable performance on Atom due to loop alignment
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50572

             Bug #: 50572
           Summary: unstable performance on Atom due to loop alignment
    Classification: Unclassified
           Product: gcc
           Version: 4.7.0
            Status: UNCONFIRMED
          Severity: major
          Priority: P3
         Component: target
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: sergos@gmail.com

After monitoring Atom performance on trunk for some period of time, I figured out that we have a significant (up to 15%) instability because of loop alignment.

Currently for Atom we have the following alignments:

    {&atom_cost, 16, 7, 16, 7, 16}

for

    struct ptt
    {
      const struct processor_costs *cost;   /* Processor costs */
      const int align_loop;                 /* Default alignments. */
      const int align_loop_max_skip;
      const int align_jump;
      const int align_jump_max_skip;
      const int align_func;
    };

This means we try to align loops to 16 bytes, but only if that takes no more than 7 bytes of padding. This 'if' is the source of the instability. For a reduction loop I observed an almost twofold slowdown because the loop did not fit into 16 bytes after being aligned by 8.
I used the -falign-loops=16 option to measure the code size impact using -m32 -O2 -msse2 -mfpmath=sse -ffast-math -march=atom for SPEC2000:

SPEC2000     .text section size
Test         Aligned    Current    Increase   % increase
wupwise       630324     630084         240      0.04%
swim_         602612     602548          64      0.01%
mgrid_        608388     608212         176      0.03%
applu_        641684     641412         272      0.04%
mesa_         941444     938116        3328      0.35%
galgel_       813508     811764        1744      0.21%
art_          437572     437412         160      0.04%
equake_       442228     442084         144      0.03%
facerec       694948     694596         352      0.05%
ammp_         561428     560292        1136      0.20%
lucas_        663236     662948         288      0.04%
fma3d_       1565348    1560228        5120      0.33%
sixtrac      1537844    1534228        3616      0.24%
apsi_         719172     718340         832      0.12%
gzip_         480452     480020         432      0.09%
vpr_          548164     547156        1008      0.18%
cc1_         1554052    1546532        7520      0.49%
mcf_          434036     433908         128      0.03%
crafty_       592084     590836        1248      0.21%
parser_       509476     508276        1200      0.24%
eon_         1189348    1188852         496      0.04%
perlbmk       894292     891268        3024      0.34%
gap_          845636     841124        4512      0.54%
vortex_       969988     968788        1200      0.12%
bzip2_        472596     472260         336      0.07%
twolf_        607140     605044        2096      0.35%

Would it be acceptable to put -falign-loops=16 under -mtune=atom for O2?
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

--- Comment #12 from Sergey Ostanevich 2012-11-14 18:56:22 UTC ---
Actually, it is not. I found that PRE did not collect a memory access within the loop, which later caused the missing vectorization.

Here is the dump before the commit (good one) and after it (bad one):

  pretmp_263 = (integer(kind=8)) ival2_82;
  pretmp_264 = pretmp_263 + -1;
  pretmp_265 = *xxtrt_46(D)[pretmp_264];
  :
  # ival2_10 = PHI
  # ival2_14 = PHI
  # prephitmp_266 = PHI
  _83 = (integer(kind=8)) ival2_10;
  _84 = _83 + -1;
  _85 = *xxtrt_46(D)[_84];
  _86 = (integer(kind=8)) ival2_14;
  _87 = _86 + -1;
  _88 = prephitmp_266;
  if (_85 < _88) goto ; else goto ;
  :
  goto ;
  :
  :
  # ival2_15 = PHI
  # prephitmp_237 = PHI <_88(90), _85(29)>
  ival2_89 = ival2_10 + -1;
  if (ival2_10 == ipos1_12) goto ; else goto ;
  :
  goto ;
-
  :
  :
  # ival2_10 = PHI
  # ival2_14 = PHI
  _83 = (integer(kind=8)) ival2_10;
  _84 = _83 + -1;
  _85 = *xxtrt_46(D)[_84];
  _86 = (integer(kind=8)) ival2_14;
  _87 = _86 + -1;
  _88 = *xxtrt_46(D)[_87];
  if (_85 < _88) goto ; else goto ;
  :
  goto ;
  :
  :
  # ival2_15 = PHI
  ival2_89 = ival2_10 + -1;
  if (ival2_10 == ipos1_12) goto ; else goto ;
  :
  goto ;
-

So for the loop starting at bb 28 you can see that the xxtrt_46 access was not put into a pretmp. A possible reason is exactly what Richard mentioned: extra candidates were collected, and this one became less anticipatable.

Skipping partial partial redundancy for expression {array_ref,mem_ref<0B>,xxtrt_46(D)}@.MEM_30(D) (0165) not partially anticipated on any to be optimized for speed edges
---
Found partial partial redundancy for expression {array_ref,mem_ref<0B>,xxtrt_46(D)}@.MEM_30(D) (0165)
Created phi prephitmp_237 = PHI <_88(90), _85(29)> in block 30
[Bug rtl-optimization/64286] New: Redundant extend removal ignores vector element type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64286

            Bug ID: 64286
           Summary: Redundant extend removal ignores vector element type
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: major
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: sergos.gnu at gmail dot com

Created attachment 34266
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=34266&action=edit
reproducer, taken from public sources

The problem is reproducible starting with 4.9 and also on trunk.

Line 29 contains a load into a V16QI vector

29:p2 = _mm_loadu_si128((__m128i *) (s - 3 * p));

which is later used at

60:work = _mm_or_si128(_mm_subs_epu8(p2, p1), _mm_subs_epu8(p1, p2));

and later zero-extended into a V16HI vector

151: p256_2 = _mm256_cvtepu8_epi16(p2);

At pass 217 (split2) we have:

(insn 207 204 209 2 (set (reg:V16QI 21 xmm0 [447])
        (mem:V16QI (plus:DI (reg/f:DI 6 bp)
                (const_int -114 [0xff8e])) [0 S16 A16])) GCC_Bug.p.c:2609 1136 {*movv16qi_internal}
     (expr_list:REG_EQUIV (mem:V16QI (plus:DI (reg/f:DI 20 frame)
                (const_int -66 [0xffbe])) [0 S16 A16])
        (nil)))
...
(insn 236 235 238 2 (set (reg:V16QI 22 xmm1 [462])
        (us_minus:V16QI (reg:V16QI 23 xmm2 [450])
            (reg:V16QI 21 xmm0 [447]))) GCC_Bug.p.c:2925 2096 {*sse2_ussubv16qi3}
     (nil))
...
(and a number of other operations with xmm0 as V16QI)

(insn 871 869 873 2 (set (reg:V16HI 21 xmm0 [orig:573 D.17673 ] [573])
        (zero_extend:V16HI (reg:V16QI 21 xmm0 [447]))) GCC_Bug.p.c:5280 2521 {avx2_zero_extendv16qiv16hi2}
     (nil))

After that REE reports:

---
Trying to eliminate extension:
(insn 871 869 873 2 (set (reg:V16HI 21 xmm0 [orig:573 D.17673 ] [573])
        (zero_extend:V16HI (reg:V16QI 21 xmm0 [447]))) GCC_Bug.p.c:5280 2521 {avx2_zero_extendv16qiv16hi2}
     (nil))
Tentatively merged extension with definition :
(insn 207 204 209 2 (set (reg:V16HI 21 xmm0)
        (zero_extend:V16HI (mem:V16QI (plus:DI (reg/f:DI 6 bp)
                    (const_int -114 [0xff8e])) [0 S16 A16]))) GCC_Bug.p.c:2609 -1
     (nil))
deferring rescan insn with uid = 207.
All merges were successful.
Eliminated the extension.
-

That renders all V16QI insns using xmm0 invalid.

The test should be compiled with

gcc -O2 GCC_Bug_min.c -mavx2

and run on an AVX2-enabled platform.

Correct output:
Is valid: 1

Incorrect output:
Is valid: 0