On 07/31/2016 07:42 PM, Segher Boessenkool wrote:
Deciding what blocks should run with a certain component active so that
the total cost of executing the prologues (and epilogues) is optimal, is
not a computationally feasible problem.
Really? It's just a dataflow problem is it not and one that ought to
converge very quickly I'd think. Or is it more a function of having to
run it on so many independent components? I'm still pondering the value
of having every GPR be an independent component :-)
ISTM this adds a fair amount of complexity to the implementation in that
prologues and epilogues for a particular component can run more than
once. Can you give a concrete example where this happens so that we can
all understand it better?
If we keep this aspect of the implementation it seems like a note in the
developer section would be in order. It's certainly non-intuitive.
I only glanced over the code that seems related to this aspect of the
implementation. If we decide to go forward, I'd like to look at it
again more closely.
Now all that is left is inserting prologues and epilogues on all edges
that jump into resp. out of the "active" set of blocks. Often we need
to insert some components' prologues (or epilogues) on all edges into
(or out of) a block. In theory cross-jumping can unify all such, but
in practice that often fails; besides, that is a lot of work. So in
this case we insert the prologue and epilogue components at the "head"
or "tail" of a block, instead.
Cross jumping is rather simplistic, so I'm not surprised that it doesn't
catch all this stuff.
As a final optimisation, if a block needs a prologue and its immediate
dominator has the block as a post-dominator, the dominator gets the
prologue as well.
So why not just put it in the idom and not in the dominated block?
Doesn't this just end up running that component's prologue twice?
2016-06-07 Segher Boessenkool <seg...@kernel.crashing.org>
* function.c (thread_prologue_and_epilogue_insns): Recompute the
live info. Call try_shrink_wrapping_separate. Compute the
prologue_seq afterwards, if it has possibly changed. Compute the
split_prologue_seq and epilogue_seq later, too.
* shrink-wrap.c: #include cfgbuild.h.
(dump_components): New function.
(struct sw): New struct.
(SW): New function.
(init_separate_shrink_wrap): New function.
(fini_separate_shrink_wrap): New function.
(place_prologue_for_one_component): New function.
(spread_components): New function.
(disqualify_problematic_components): New function.
(emit_common_heads_for_components): New function.
(emit_common_tails_for_components): New function.
(insert_prologue_epilogue_for_components): New function.
(try_shrink_wrapping_separate): New function.
* shrink-wrap.h: Declare try_shrink_wrapping_separate.
---
gcc/function.c | 15 +-
gcc/shrink-wrap.c | 715 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
gcc/shrink-wrap.h | 1 +
3 files changed, 729 insertions(+), 2 deletions(-)
diff --git a/gcc/function.c b/gcc/function.c
index bba0705..390e9a6 100644
--- a/gcc/function.c
+++ b/gcc/function.c
@@ -5912,6 +5912,12 @@ make_epilogue_seq (void)
void
thread_prologue_and_epilogue_insns (void)
{
+ if (optimize > 1)
+ {
+ df_live_add_problem ();
+ df_live_set_all_dirty ();
+ }
Perhaps conditional on separate shrink wrapping?
@@ -5922,9 +5928,7 @@ thread_prologue_and_epilogue_insns (void)
edge entry_edge = single_succ_edge (ENTRY_BLOCK_PTR_FOR_FN (cfun));
edge orig_entry_edge = entry_edge;
- rtx_insn *split_prologue_seq = make_split_prologue_seq ();
rtx_insn *prologue_seq = make_prologue_seq ();
- rtx_insn *epilogue_seq = make_epilogue_seq ();
/* Try to perform a kind of shrink-wrapping, making sure the
prologue/epilogue is emitted only around those parts of the
@@ -5932,6 +5936,13 @@ thread_prologue_and_epilogue_insns (void)
try_shrink_wrapping (&entry_edge, prologue_seq);
+ df_analyze ();
+ try_shrink_wrapping_separate (entry_edge->dest);
+ if (crtl->shrink_wrapped_separate)
+ prologue_seq = make_prologue_seq ();
Perhaps push the df_analyze call into try_shrink_wrapping_separate?
ANd if it was successful, do you need to update the DF information? Is
that perhaps the cause of some of the issues we're seeing with DCE,
regrename, the scheduler?
diff --git a/gcc/shrink-wrap.c b/gcc/shrink-wrap.c
index b85b1c3..643e375 100644
--- a/gcc/shrink-wrap.c
+++ b/gcc/shrink-wrap.c
@@ -34,6 +34,7 @@ along with GCC; see the file COPYING3. If not see
#include "output.h"
#include "tree-pass.h"
#include "cfgrtl.h"
+#include "cfgbuild.h"
#include "params.h"
#include "bb-reorder.h"
#include "shrink-wrap.h"
@@ -1006,3 +1007,717 @@ try_shrink_wrapping (edge *entry_edge, rtx_insn
*prologue_seq)
BITMAP_FREE (bb_with);
free_dominance_info (CDI_DOMINATORS);
}
+
+/* Separate shrink-wrapping
+
+ Instead of putting all of the prologue and epilogue in one spot, we
+ can put parts of it in places where those components are executed less
+ frequently. The following code does this, for prologue and epilogue
+ components that can be put in more than one location, and where those
+ components can be executed more than once (the epilogue component will
+ always be executed before the prologue component is executed a second
+ time).
+
+ What exactly is a component is target-dependent. The more usual
+ components are simple saves/restores to/from the frame of callee-saved
+ registers. This code treats components abstractly (as an sbitmap),
+ letting the target handle all details.
+
+ Prologue components are placed in such a way that for every component
+ the prologue is executed as infrequently as possible. We do this by
+ walking the dominator tree, comparing the cost of placing a prologue
+ component before a block to the sum of costs determined for all subtrees
+ of that block.
+
+ From this placement, we then determine for each component all blocks
+ where at least one of this block's dominators (including itself) will
+ get a prologue inserted. That then is how the components are placed.
+ We could place the epilogue components a bit smarter (we can save a
+ bit of code size sometimes); this is a possible future improvement.
+
+ Prologues and epilogues are preferably placed into a block, either at
+ the beginning or end of it, if it is needed for all predecessor resp.
+ successor edges; or placed on the edge otherwise.
+
+ If the placement of any prologue/epilogue leads to a situation we cannot
+ handle (for example, an abnormal edge would need to be split, or some
+ targets want to use some specific registers that may not be available
+ where we want to put them), separate shrink-wrapping for the components
+ in that prologue/epilogue is aborted. */
+
+
+/* Print the sbitmap COMPONENTS to the DUMP_FILE if not empty, with the
+ label LABEL. */
+static void
+dump_components (const char *label, sbitmap components)
+{
+ if (bitmap_empty_p (components))
+ return;
+
+ fprintf (dump_file, " [%s", label);
+
+ for (unsigned int j = 0; j < components->n_bits; j++)
+ if (bitmap_bit_p (components, j))
+ fprintf (dump_file, " %u", j);
+
+ fprintf (dump_file, "]");
Consider allowing the target to provide a mapping from the component to
a symbolic name of some kind and using that in the dumps. Related,
consider using an enum rather than magic constants in the target bits (I
noted seeing component #0 as a magic constant in the ppc implementation).
+
+/* A helper function for accessing the pass-specific info. */
+static inline struct sw *
+SW (basic_block bb)
+{
+ gcc_assert (bb->aux);
+ return (struct sw *) bb->aux;
+}
+
+/* Create the pass-specific data structures for separately shrink-wrapping
+ with components COMPONENTS. */
+static void
+init_separate_shrink_wrap (sbitmap components)
So it seems like there's a toplevel list of components that's owned by
the target and state at each block components that are owned by the
generic code. That's fine. Just make sure we doc that the toplevel
list of components is allocated by the backend (and where does it get
freed?)
Consider using auto_sbitmap rather than manually managing
allocation/releasing of the per-block structures. In fact, can't all of
SW become a class and we lose the explicit init/fini routines in favor
of a ctor/dtor?
+
+/* Place code for prologues and epilogues for COMPONENTS where we can put
+ that code at the start of basic blocks. */
+static void
+emit_common_heads_for_components (sbitmap components)
+{
+ sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
+ sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
+ sbitmap tmp = sbitmap_alloc (SBITMAP_SIZE (components));
+
+ basic_block bb;
+ FOR_EACH_BB_FN (bb, cfun)
+ {
+ /* Find which prologue resp. epilogue components are needed for all
+ predecessor edges to this block. */
Nit. Avoid ".resp". I actually had to look that abbreviation up :-)
Being a bit more verbose in the comments would be preferred over ".resp".
Having looked at this in more detail now, I'm wondering if we want to
talk at all in the docs about when selection vs disqualifying happens.
ISTM that we build the set of components we want to insert for each
edge/block. When that's complete, we then prune those results with the
disqualifying sets.
For the PPC R0 vs LR is the only thing that causes disqualification
right? Can't that be handled when we build the set of components we
want to insert for each edge/block? Is there some advantage to handling
disqualifications after all the potential insertion points have been
handled?
Jeff