Re: Generic hwloop support library

Richard Sandiford Sat, 02 Jul 2011 06:03:47 -0700

I suppose my main question about this is: did you consider extending
the cfgloop machinery instead of writing new loop discovery code?

Also, loop discovery seems to be based on the liveness of the iterator,
but what happens if the iterator is a general register that is spilled
for part of the loop?  Obviously it's best if that doesn't happen, but
if it does, it looks like you might have a "loop" in which the recorded
head isn't actually considered to be a member of the loop.

I just wondered if it would be better to discover the loop
"normally" and then analyse the use of the iterator.

Bernd Schmidt <ber...@codesourcery.com> writes:
> +typedef struct loop_info *loop_info;

That'll fall foul of -Wc++-compat.

I wonder whether we should use a more specialised name, such as
"hw_doloop", and similarly in the function names.  "loop_info"
and "bb_in_loop" sound more like things that belong in cfgloop.

> +struct hw_doloop_hooks
> +{
> +  /* Examine INSN.  If it is a suitable doloop_end pattern, return the
> +     iteration register.  Otherwise, return NULL_RTX.  */
> +  rtx (*end_pattern_reg) (rtx insn);
> +  /* Called after the hw-doloop pass has finished reordering the basic block
> +     structure.  The target can defer some of its initializations to this
> +     hook if necessary.  This function should return true if the CFG has been
> +     changed and loops must be rediscovered.  */
> +  bool (*after_reorder) (void);
> +  /* Optimize LOOP.  Return true if successful, false if the loop should be
> +     marked bad.  */
> +  bool (*opt) (loop_info loop);
> +  /* Handle a loop that was marked bad for any reason.  This could be used to
> +     split the doloop_end pattern.  */
> +  void (*fail) (loop_info loop);
> +};

For the record, I wondered at first whether these should be target hooks,
but I agree they shouldn't.  md_reorg is special enough that this new
infrastructure really does need to be a subroutine of md_reorg rather
than a new pass.  Having separate hooks is more flexible in that case,
because it allows the hooks to rely on data that is set up by md_reorg
beforehand.

However, I think there needs to be a bit more overview commentary
somewhere.  The main comment is above reorg_loops():

> +/* Run from machine_dependent_reorg, this pass looks for doloop_end
> +   insns, analyzes the loop structure, and tries to rewrite the RTL of
> +   these loops so that the target can create whatever hardware looping
> +   instructions are available.
> +   
> +   DO_REORDER should be set to true if we should try to use the
> +   reorder_loops function to ensure the loop end occurs after the loop
> +   start.  HOOKS is used to pass in target specific hooks; see
> +   hw-doloop.h.  */

I think the combination of this comment and the comments in the hooks
ought to be self-contained enough that a back-end writer could tell from
them alone how to call the subroutine, and how the hooks need to be
defined.  Would you mind expanding it a bit?  E.g. I found the
after_reorder comment a bit hard to follow because there's no single
comment that describes what reordering is done, and how the backend
might want to react to it.  I think it would also help to describe
(broadly) what kind of hardware instruction you're trying to support.

The current code seems to assume that the iterator is a single hard
register.  It might be worth stating (and asserting?)  that somewhere.

I also wondered whether the comment ought to be at the head of the file,
but given that it's a subroutine rather than a stand-alone pass,
there's not really a clear answer...

> +/* Return true if BB is part of LOOP.  */
> +bool
> +bb_in_loop_p (loop_info loop, basic_block bb)

Silly nitlet 1, but the file seems a bit inconsistent about the
number of lines between the comment and the start of function.
(0 here and a few other places, but mostly 1).

> +/* Scan the blocks of LOOP (and its inferiors), and record things such as
> +   hard registers set, jumps out of and within the loop.  */
> +
> +static void
> +scan_loop (loop_info loop)
> +{
> +  unsigned ix;
> +  basic_block bb;
> +
> +  if (loop->head == NULL)
> +    {
> +      gcc_assert (loop->bad);
> +      return;
> +    }
> +  CLEAR_HARD_REG_SET (loop->regs_set_in_loop);
> +  loop->iter_reg_used = false;
> +  loop->jumps_within = false;
> +  loop->jumps_outof = false;
> +  loop->has_call = false;
> +  loop->has_asm = false;
> +  loop->iter_reg_used_outside = false;
> +  loop->timode_insns = 0;

Since the initialisation is split over two places, the zero initialisations
above and the mixed initialisations here:

> +  loop->tail = tail_bb;
> +  loop->loop_end = tail_insn;
> +  loop->last_insn = NULL_RTX;
> +  loop->iter_reg = reg;
> +  loop->depth = loop->length = 0;
> +  loop->visited = false;
> +  loop->bad = false;
> +  loop->outer = NULL;
> +  loop->loops = NULL;
> +  loop->incoming = VEC_alloc (edge, gc, 2);
> +  loop->end_label = NULL_RTX;
> +  loop->start_label = JUMP_LABEL (tail_insn);

it might be better to allocate using XCNEW instead.

> +       if (GET_MODE (insn) == TImode)
> +         loop->timode_insns++;

I assume this is used by the C6X port?  The information is only going
to be meaningful to ports like C6X that treat scheduling as a subroutine
of md_reorg, so I wonder whether it belongs here.  If it stays, I think
there should be a comment above the field definition saying that it is
only meaningful in that case.

> +       CLEAR_HARD_REG_SET (set_this_insn);
> +       note_stores (PATTERN (insn), record_hard_reg_sets,
> +                    &set_this_insn);
> +          for (x = REG_NOTES (insn); x; x = XEXP (x, 1))
> +            if (REG_NOTE_KIND (x) == REG_INC)
> +           record_hard_reg_sets (XEXP (x, 0), NULL, &set_this_insn);
> +
> +       if (CALL_P (insn))
> +         {
> +           loop->has_call = true;
> +           for (x = CALL_INSN_FUNCTION_USAGE (insn); x; x = XEXP (x, 1))
> +             if (GET_CODE (XEXP (x, 0)) == CLOBBER)
> +               record_hard_reg_sets (XEXP (XEXP (x, 0), 0), NULL,
> +                                     &set_this_insn);
> +           
> +         }

I think this should be iterating over DF_INSN_DEFS instead.  If that
isn't accurate for some reason, it needs to be fixed...

> +  if (EDGE_COUNT (tail_bb->succs) != 2)
> +    {
> +      loop->bad = true;
> +      loop->head = loop->successor = NULL;

What's the significance of clearing "head" and "successor" as well as
setting "bad"?

> +  works = VEC_alloc (basic_block,heap,20);

Missing spaces.

> +           if (!VEC_space (basic_block, works, 1))
> +             {
> +               if (dwork)
> +                 {
> +                   VEC_block_remove (basic_block, works, 0, dwork);
> +                   dwork = 0;
> +                 }
> +               else
> +                 VEC_reserve (basic_block, heap, works, 1);
> +             }
> +           VEC_quick_push (basic_block, works, succ);

I'm being dense, sorry, but could you walk me through what this
is doing (i.e. why it isn't a simple VEC_safe_push)?

> +      for (pass = 0, retry = 1; retry && pass < 2; pass++)
> +     {
> +       edge e;
> +       edge_iterator ei;
> +       bool first = true;
> +       retry = 0;
> +
> +       FOR_EACH_EDGE (e, ei, loop->incoming)
> +         {
> +           if (first)
> +             {
> +               loop->incoming_src = e->src;
> +               loop->incoming_dest = e->dest;
> +               first = false;
> +             }
> +           else
> +             {
> +               if (e->dest != loop->incoming_dest)
> +                 loop->incoming_dest = NULL;
> +               if (e->src != loop->incoming_src)
> +                 loop->incoming_src = NULL;
> +             }
> +           if (loop->incoming_src == NULL && loop->incoming_dest == NULL)
> +             {
> +               if (pass == 0)
> +                 {
> +                   if (dump_file)
> +                     fprintf (dump_file,
> +                              ";; retrying loop %d with forwarder blocks\n",
> +                              loop->loop_no);
> +                   retry = 1;
> +                   break;
> +                 }
> +               loop->bad = 1;
> +               if (dump_file)
> +                 fprintf (dump_file,
> +                          ";; can't find suitable entry for loop %d\n",
> +                          loop->loop_no);
> +               goto out;
> +             }
> +         }
> +       if (retry)
> +         {
> +           retry = 0;
> +           FOR_EACH_EDGE (e, ei, loop->incoming)
> +             {
> +               if (forwarder_block_p (e->src))
> +                 {
> +                   edge e2;
> +                   edge_iterator ei2;
> +
> +                   if (dump_file)
> +                     fprintf (dump_file,
> +                              ";; Adding forwarder block %d to loop %d and 
> retrying\n",
> +                              e->src->index, loop->loop_no);
> +                   VEC_safe_push (basic_block, heap, loop->blocks, e->src);
> +                   bitmap_set_bit (loop->block_bitmap, e->src->index);
> +                   FOR_EACH_EDGE (e2, ei2, e->src->preds)
> +                     VEC_safe_push (edge, gc, loop->incoming, e2);
> +                   VEC_unordered_remove (edge, loop->incoming, ei.index);
> +                   retry = 1;
> +                   break;
> +                 }
> +             }
> +           if (!retry)
> +             {
> +               if (dump_file)
> +                 fprintf (dump_file, ";; No forwarder blocks found\n");
> +               loop->bad = 1;
> +             }
> +         }
> +     }
> +    }

I think this would be easier to follow if the discovery part of the
loop_incoming loop was split out:

static bool
process_incoming_edges_p (loop_info loop)
{
  edge e;
  edge_iterator ei;
  bool first = true;

  FOR_EACH_EDGE (e, ei, loop->incoming)
    {
      if (first)
        {
          loop->incoming_src = e->src;
          loop->incoming_dest = e->dest;
          first = false;
        }
      else
        {
          if (e->dest != loop->incoming_dest)
            loop->incoming_dest = NULL;
          if (e->src != loop->incoming_src)
            loop->incoming_src = NULL;
          if (loop->incoming_src == NULL && loop->incoming_dest == NULL)
            return false;
        }
    }
  return true;
}

static bool
add_forwarder_block (loop_info loop)
{
  edge e;
  edge_iterator ei;

  FOR_EACH_EDGE (e, ei, loop->incoming)
    if (forwarder_block_p (e->src))
      {
        edge e2;
        edge_iterator ei2;

        if (dump_file)
          fprintf (dump_file,
                   ";; Adding forwarder block %d to loop %d and retrying\n",
                   e->src->index, loop->loop_no);
        VEC_safe_push (basic_block, heap, loop->blocks, e->src);
        bitmap_set_bit (loop->block_bitmap, e->src->index);
        FOR_EACH_EDGE (e2, ei2, e->src->preds)
          VEC_safe_push (edge, gc, loop->incoming, e2);
        VEC_unordered_remove (edge, loop->incoming, ei.index);
      }
  return false;
}

and then:

    if (!process_incoming_edges_p (loop))
      {
        /* Try again with forwarder blocks.  */
        if (dump_file)
          fprintf (dump_file,
                   ";; retrying loop %d with forwarder blocks\n",
                   loop->loop_no);

        if (!add_forwarder_block (loop))
          {
            if (dump_file)
              fprintf (dump_file, ";; No forwarder blocks found\n");
            loop->bad = 1;
          }
        else if (!process_incoming_edges_p (loop))
          {
            if (dump_file)
              fprintf (dump_file,
                       ";; can't find suitable entry for loop %d\n",
                       loop->loop_no);
            loop->bad = 1;
          }
      }

Could you add commentary explaining what this case is trying to handle?
It seems like adding the forwarder adds more incoming_dests,
so presumably the idea is to unify the incoming_srcs.  Something like
(in insn order):

    +---- B1
    |     |
    |     V
    | +-- B2      
    | |
    | |   L1 <-+
    | |   |    |
    | |   V    |
    | +-> L2   |
    |     |    |
    |     V    |
    +---> L3   |
          |    |
          V    |
          L4 --+
          |
          V
         ...

So we'd then consider B2 to be part of the loop.  Is that right?
If so, how does it help?  B2 still seems a bit of an odd-man-out,
even if we pretend it's part of the loop.

> +      /* There's a degenerate case we can handle - an empty loop consisting
> +      of only a back branch.  Handle that by deleting the branch.  */
> +      insn = JUMP_LABEL (tail);
> +      while (insn && !NONDEBUG_INSN_P (insn))
> +     insn = NEXT_INSN (insn);
> +      if (insn == tail)
> +     {
> +       if (dump_file)
> +         {
> +           fprintf (dump_file, ";; degenerate loop ending at\n");
> +           print_rtl_single (dump_file, tail);
> +         }
> +       delete_insn_and_edges (tail);
> +       continue;
> +     }

Do we really need to handle that here?  It sounds like something that
should be optimised by an earlier pass.  Don't you also need to check
that the iterator isn't live on fallthrough?

> +      bb->aux = loop;

Where is this used?  Is it provided for the backend hooks?
If so, that doesn't seem to be documented.

> +       bitmap_and (tmp_bitmap, other->block_bitmap, loop->block_bitmap);
> +       if (bitmap_empty_p (tmp_bitmap))
> +         continue;
> +       if (bitmap_equal_p (tmp_bitmap, other->block_bitmap))
> +         {
> +           other->outer = loop;
> +           VEC_safe_push (loop_info, heap, loop->loops, other);
> +         }
> +       else if (bitmap_equal_p (tmp_bitmap, loop->block_bitmap))
> +         {
> +           loop->outer = other;
> +           VEC_safe_push (loop_info, heap, other->loops, loop);
> +         }

->outer doesn't seem to be used anywhere, and the choice of loop
looks somewhat arbitrary.

We can avoid the temporary here with:

        if (!bitmap_intersect_p (other->block_bitmap, loop->block_bitmap))
          continue;
        if (!bitmap_intersect_compl_p (other->block_bitmap,
                                       loop->block_bitmap))
          {
            other->outer = loop;
            VEC_safe_push (loop_info, heap, loop->loops, other);
          }
        else if (!bitmap_intersect_compl_p (loop->block_bitmap,
                                            other->block_bitmap))
          {
            loop->outer = other;
            VEC_safe_push (loop_info, heap, other->loops, loop);
          }

> +#define BB_AUX_INDEX(BB) ((unsigned)(BB)->aux)

I think this'll produce warnings on targets with sizeof(int) !=
sizeof (void *).

> +      /* Recreate an index for basic blocks that represents their order.  */
> +      for (bb = ENTRY_BLOCK_PTR->next_bb, index = 0;
> +        bb != EXIT_BLOCK_PTR;
> +        bb = bb->next_bb, index++)
> +     bb->aux = (PTR) index;

I think this would be clearer with FOR_EACH_BB.  It seems a shame
to recompute this for every loop though, especially if no changes
are needed.

> +      if (BB_AUX_INDEX (loop->head) < BB_AUX_INDEX (loop->tail))
> +     continue;

<=?  A self loop isn't a problem.

> +      FOR_EACH_EDGE (e, ei, loop->head->succs)
> +     {
> +       if (bitmap_bit_p (loop->block_bitmap, e->dest->index)
> +           && BB_AUX_INDEX (e->dest) < BB_AUX_INDEX (loop->tail))
> +         {
> +           basic_block start_bb = e->dest;
> +           basic_block start_prev_bb = start_bb->prev_bb;
> +
> +           if (dump_file)
> +             fprintf (dump_file, ";; Moving block %d before block %d\n",
> +                      loop->head->index, start_bb->index);
> +           loop->head->prev_bb->next_bb = loop->head->next_bb;
> +           loop->head->next_bb->prev_bb = loop->head->prev_bb;
> +
> +           loop->head->prev_bb = start_prev_bb;
> +           loop->head->next_bb = start_bb;
> +           start_prev_bb->next_bb = start_bb->prev_bb = loop->head;
> +           break;
> +         }
> +     }

How confident are we at this stage that the loop can use the hardware
instruction (i.e. won't be marked bad)?  It might be nice to have a
comment about the impact of this change in terms of branching,
both when we can and can't do the actual loop optimisation.

Is there no possibility of running the optimisation in cfglayout mode
instead?  It seems from this and the forwarder block stuff above as
though it might make things easier, but maybe not.

> +  /* Every loop contains in its list of inner loops every loop nested inside
> +     it, even if there are intermediate loops.  This works because we're 
> doing
> +     a depth-first search here and never visit a loop more than once.  */
> +  for (ix = 0; VEC_iterate (loop_info, loop->loops, ix, inner); ix++)
> +    {
> +      optimize_loop (inner, hooks);

Perhaps use a worklist here, to avoid O(loops)-deep recursion?

> +  df_live_add_problem ();
> +  df_live_set_all_dirty ();
> +  df_analyze ();
> +
> +  bitmap_obstack_initialize (&stack);
> +
> +  if (dump_file)
> +    fprintf (dump_file, ";; Find loops, first pass\n\n");
> +
> +  loops = discover_loops (&stack, hooks);

Is it a required part of the interface that we call things like
after_reorder even if there are no loops?  If so, I think that
needs to be stated, otherwise we should early out if loops is null.

> +  if (do_reorder)
> +    {
> +      reorder_loops (loops);
> +      free_loops (loops);
> +
> +      if (dump_file)
> +     fprintf (dump_file, ";; Find loops, second pass\n\n");
> +
> +      loops = discover_loops (&stack, hooks);
> +    }
> +
> +  if (hooks->after_reorder)
> +    {
> +      if (hooks->after_reorder ())
> +     {
> +       free_loops (loops);
> +       if (dump_file)
> +         fprintf (dump_file, ";; Find loops after target after_reorder\n\n");
> +
> +       loops = discover_loops (&stack, hooks);
> +     }
> +    }

>From a naive reading, this looks a bit inefficient: the loops created
after the reorder aren't passed to the after_reorder hook, and could
be immediately discarded if after_reorder returns true.  Maybe just
set loops to null if they become invalid, and recreate them with:

  if (!loops)
    loops = discover_loops ();

Depends on the answer to the early-out question above though.

Richard

Re: Generic hwloop support library

Reply via email to