Hi Segher,

Thanks a lot for your comments!

on 2019/4/25 下午8:16, Segher Boessenkool wrote:
 
> Does it create worse code now?  What we have before your patch isn't
> so super either (it has an sldi in the loop, it has two mtctr too).
> Maybe you can show the generated code?

It's a good question! From the generated codes for the core loop, the 
code after my patch doesn't have bdnz to leverage hardware CTR, it has
extra cmpld and branch instead, looks worse.  But I wrote a tiny case
to invoke the foo and evaluated the running time, they are equal.

* Measured time:
  After:
    real 199.47
    user 198.35
    sys 1.11
  Before:
    real 199.19
    user 198.56
    sys 0.62

* Command I used to compile:
${local_gcc} ${case_src}/20050830-1.c  -fno-diagnostics-show-caret 
-fno-diagnostics-show-line-numbers -fdiagnostics-color=never  -O2 
-ffat-lto-objects -fno-ident -c

* main file (main.c):
extern void foo();
#define LOOPNUM 10000
#define N 1000000*256
int a[N];
int main() {
  for(int i = 0; i < LOOPNUM; i++) {
    foo(N);
  }
}


* Generated code sequence:

Before my patch:

        cmpwi 0,3,511
        blelr 0
        addi 10,3,-256
        addi 9,3,-512
        addis 8,2,.LC0@toc@ha
        ld 8,.LC0@toc@l(8)
        cmpwi 0,10,256
        rldicl 9,9,56,40
        sldi 3,3,2
        addi 9,9,1
        mtctr 9
        add 8,3,8
        li 10,42
        blt 0,.L7     # jump to L7 it's less than 256
.L3:                  # core loop
        stw 10,0(8)   
        addi 8,8,-1024
        bdnz .L3
        blr
.L7:
        li 9,1        # special case, iteration only 1
        mtctr 9
        b .L3

After my patch:

        cmpwi 0,3,511
        blelr 0
        addis 7,2,.LC0@toc@ha
        ld 7,.LC0@toc@l(7)
        addi 10,3,-512
        sldi 9,3,2
        rlwinm 10,10,0,0,23
        li 8,42
        subf 10,10,3
        sldi 10,10,2
        addi 6,7,-1024
        add 9,9,7
        add 10,10,6
        .p2align 4,,15
.L3:                      # core loop
        stw 8,0(9)
        addi 9,9,-1024
        cmpld 0,9,10      # cmp 
        beqlr 0           # if eq, return
        stw 8,0(9)     
        addi 9,9,-1024   
        cmpld 0,9,10      # cmp again 
        bne 0,.L3         # if ne, jump to L3.
        blr

------------------------

I practiced whether we can adjust the decision made in ivopts.
For one compare iv use with zero cost, if one iv cand whose base
is from memory has costly elim_cost before adjust_setup_cost,
it's possible to make later analysis unable to find it's finite,
then we try to avoid it.

The diff looks like:

--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -5126,6 +5126,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data,
   tree *control_var, *bound_cst;
   enum tree_code comp = ERROR_MARK;
   struct iv_use *use = group->vuses[0];
+  bool dont_elim_p = false;

   /* Extract condition operands.  */
   rewrite_type = extract_cond_operands (data, use->stmt, &control_var,
@@ -5152,6 +5153,16 @@ determine_group_iv_cost_cond (struct ivopts_data *data,
          inv_expr_elim = get_loop_invariant_expr (data, bound);
          bitmap_clear (inv_vars_elim);
        }
+
+      /* zero cost use makes it easier to select memory based iv cand
+         for replacement of non memory based iv and its use.   But if
+         the setup sequence are too costly, loop iv analysis can NOT
+         easily figure out it's finite, it's possible to stop the
+         low-overhead loop transformation and get unexpected code.  */
+      if (use->zero_cost_p && cand->iv->base_object && !use->iv->base_object
+         && elim_cost.cost >= 30)
+       dont_elim_p = true;
+
       /* The bound is a loop invariant, so it will be only computed
         once.  */
       elim_cost.cost = adjust_setup_cost (data, elim_cost.cost);
@@ -5184,7 +5195,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data,
   express_cost += bound_cost;

   /* Choose the better approach, preferring the eliminated IV. */
-  if (elim_cost <= express_cost)
+  if (elim_cost <= express_cost && !dont_elim_p)
     {


I was thinking whether this zero cost change just exposed
an existing problem, then this kind of check should be for all 
cases not only for zero cost use, similar to 
expression_expensive_p?  But why doesn't it happen before?
Need more investigation.

> 
>> Btw, this is for GCC10.
> 
> *Phew* ;-)
> 
> 
> Some trivial comments:
> 
>> +static bool
>> +invalid_insn_for_doloop_p (struct loop *loop)
>> +{
>> +  basic_block *body = get_loop_body (loop);
>> +  unsigned num_nodes = loop->num_nodes;
>> +  gimple_stmt_iterator gsi;
>> +  unsigned i;
>> +
>> +  for (i = 0; i < num_nodes; i++)
> 
>   for (unsigned i = 0; i < num_nodes; i++)
> 
> (and maybe you can just say loop->num_nodes here; I don't know if that
> generates worse code, or if that even matters).

Good idea, will fix it.

> 
>> +        if (dump_file && (dump_flags & TDF_DETAILS))
>> +          fprintf (
>> +            dump_file,
>> +            "predict doloop failure due to finding computed jump.\n");
> 
> We don't normally end lines in (.  There are other solutions to why you
> did that here; you can use string pasting, to break the string into two,
> or factor out (some part of) the loop body to reduce indentation.
> 

Will adjust it.

> 
> Segher
> 

Reply via email to