http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54394
             Bug #: 54394
           Summary: fatigue2 -flto run time regression
    Classification: Unclassified
           Product: gcc
           Version: 4.8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: jamb...@gcc.gnu.org
                CC: rgue...@gcc.gnu.org
              Host: x86_64-linux-gnu
            Target: x86_64-linux-gnu

Revision 190346 caused a large run time regression of the fatigue2
Polyhedron benchmark when run with -Ofast -flto.  On an x86_64-linux box,
the run time went from 150 seconds to 215 seconds, and there is a similar
percentage increase on my i686-linux desktop.

The commit leading to that revision is:

2012-08-13  Richard Guenther  <rguent...@suse.de>

    * basic-block.h (struct basic_block): Remove loop_depth
    member, move flags and index members next to each other.
    * cfgloop.h (bb_loop_depth): New inline function.
    * cfghooks.c (split_block): Do not set loop_depth.
    (duplicate_block): Likewise.
    * cfgloop.c (flow_loop_nodes_find): Likewise.
    (flow_loops_find): Likewise.
    (add_bb_to_loop): Likewise.
    (remove_bb_from_loops): Likewise.
    * cfgrtl.c (force_nonfallthru_and_redirect): Likewise.
    * gimple-streamer-in.c (input_bb): Do not stream loop_depth.
    * gimple-streamer-out.c (output_bb): Likewise.
    * bt-load.c: Include cfgloop.h.
    (migrate_btr_defs): Use bb_loop_depth.
    * cfg.c (dump_bb_info): Likewise.
    * final.c (compute_alignments): Likewise.
    * ira.c (update_equiv_regs): Likewise.
    * tree-ssa-copy.c (init_copy_prop): Likewise.
    * tree-ssa-dom.c (loop_depth_of_name): Likewise.
    * tree-ssa-forwprop.c: Include cfgloop.h.
    (forward_propagate_addr_expr): Use bb_loop_depth.
    * tree-ssa-pre.c (insert_into_preds_of_block): Likewise.
    * tree-ssa-sink.c (select_best_block): Likewise.
    * ipa-inline-analysis.c: Include cfgloop.h.
    (estimate_function_body_sizes): Use bb_loop_depth.
    * Makefile.in (tree-ssa-forwprop.o): Depend on $(CFGLOOP_H).
    (ipa-inline-analysis.o): Likewise.
    (bt-load.o): Likewise.

    * gcc.dg/tree-prof/update-loopch.c: Adjust.

I believe the patch was not supposed to alter compiler output in any
(significant) way.  However, the inlining decisions are different (file 1
is the dump before the patch, file 2 the dump with it):

In file 1: extra inlining into function MAIN__.2477/17
  Function __computer_time_m_MOD_computer_time/13 inlined 1 times (as opposed to 0 times)
  Function __perdida_m_MOD_perdida/16 inlined 1 times (as opposed to 0 times)

In file 2: extra inlining into function MAIN__.2477/17
  Function __free_input_MOD_convert_lower_case/9 inlined 1 times (as opposed to 0 times)
  Function __free_input_MOD_convert_lower_case.part.2.2390/62 inlined 1 times (as opposed to 0 times)
  Function __read_input_m_MOD_read_input/12 inlined 1 times (as opposed to 0 times)

In file 2: extra un-inlined function __perdida_m_MOD_perdida/16
  Callers: 1, Callees: 27, Inlinees: 0

In file 1: extra un-inlined function __read_input_m_MOD_read_input.constprop.0/122
  Originally a clone of __read_input_m_MOD_read_input/12
  Callers: 1, Callees: 530, Inlinees: 22

At the same time, this does not seem to be an LTO issue, because the
inline dump of the compilation (as opposed to linking) before the patch
contains the lines:

  __perdida_m_MOD_perdida/9 function not considered for inlining
    loop depth: 2 freq:53666 size:21 time: 30 callee size: 0 stack: 0

which the patch changes to:

  __perdida_m_MOD_perdida/9 function not considered for inlining
    loop depth: 0 freq:53666 size:21 time: 30 callee size: 0 stack: 0

LTO only makes the heuristics inline perdida as a function called just
once.  Loop depth 0 makes the candidate look not beneficial/cold even
when we know there are no other callers.

Loop depth is zero because, at the time of inlining analysis,
bb->loop_father is NULL.  So it seems we need to compute loops at the
beginning of inline summary generation?
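
To illustrate the suspected mechanism, here is a minimal, self-contained
C sketch.  It is a toy model and not GCC's code: the toy_loop and
toy_basic_block types and the toy_* helpers are hypothetical stand-ins,
and the sketch assumes that depth is counted from a function-level root
loop at depth 0.  It only shows that a depth derived on demand from
loop_father reads as 0 whenever loop information has not been attached
to the block.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the loop tree and basic block; only the
   fields needed for this illustration are modeled.  */
struct toy_loop
{
  struct toy_loop *outer;        /* enclosing loop, NULL for the root */
};

struct toy_basic_block
{
  struct toy_loop *loop_father;  /* innermost containing loop, or NULL
                                    when loops have not been computed */
};

/* Nesting depth of LOOP: 0 for the root pseudo-loop spanning the whole
   function, 1 for an outermost real loop, and so on (assumption).  */
static inline int
toy_loop_depth (const struct toy_loop *loop)
{
  int depth = 0;
  for (; loop->outer != NULL; loop = loop->outer)
    depth++;
  return depth;
}

/* Depth derived on demand from the loop tree.  Without loop
   information there is nothing to walk, so the answer is 0.  */
static inline int
toy_bb_loop_depth (const struct toy_basic_block *bb)
{
  return bb->loop_father ? toy_loop_depth (bb->loop_father) : 0;
}

/* Hypothetical helper standing in for "compute loops": attach BB to
   LOOP.  A block may be attached only once in this toy model.  */
static inline void
toy_set_loop_father (struct toy_basic_block *bb, struct toy_loop *loop)
{
  assert (bb->loop_father == NULL);
  bb->loop_father = loop;
}
```

For a block nested inside two loops, toy_bb_loop_depth returns 0 before
toy_set_loop_father is called and 2 afterwards, which mirrors the change
from "loop depth: 2" to "loop depth: 0" in the dumps above.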