Hi Segher, Thanks a lot for your comments!
on 2019/4/25 下午8:16, Segher Boessenkool wrote: > Does it create worse code now? What we have before your patch isn't > so super either (it has an sldi in the loop, it has two mtctr too). > Maybe you can show the generated code? It's a good question! From the generated codes for the core loop, the code after my patch doesn't have bdnz to leverage hardware CTR, it has extra cmpld and branch instead, looks worse. But I wrote a tiny case to invoke the foo and evaluated the running time, they are equal. * Measured time: After: real 199.47 user 198.35 sys 1.11 Before: real 199.19 user 198.56 sys 0.62 * Command I used to compile: ${local_gcc} ${case_src}/20050830-1.c -fno-diagnostics-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -O2 -ffat-lto-objects -fno-ident -c * main file (main.c): extern void foo(); #define LOOPNUM 10000 #define N 1000000*256 int a[N]; int main() { for(int i = 0; i < LOOPNUM; i++) { foo(N); } } * Generated code sequence: Before my patch: cmpwi 0,3,511 blelr 0 addi 10,3,-256 addi 9,3,-512 addis 8,2,.LC0@toc@ha ld 8,.LC0@toc@l(8) cmpwi 0,10,256 rldicl 9,9,56,40 sldi 3,3,2 addi 9,9,1 mtctr 9 add 8,3,8 li 10,42 blt 0,.L7 # jump to L7 it's less than 256 .L3: # core loop stw 10,0(8) addi 8,8,-1024 bdnz .L3 blr .L7: li 9,1 # special case, iteration only 1 mtctr 9 b .L3 After my patch: cmpwi 0,3,511 blelr 0 addis 7,2,.LC0@toc@ha ld 7,.LC0@toc@l(7) addi 10,3,-512 sldi 9,3,2 rlwinm 10,10,0,0,23 li 8,42 subf 10,10,3 sldi 10,10,2 addi 6,7,-1024 add 9,9,7 add 10,10,6 .p2align 4,,15 .L3: # core loop stw 8,0(9) addi 9,9,-1024 cmpld 0,9,10 # cmp beqlr 0 # if eq, return stw 8,0(9) addi 9,9,-1024 cmpld 0,9,10 # cmp again bne 0,.L3 # if ne, jump to L3. blr ------------------------ I practiced whether we can adjust the decision made in ivopts. For one compare iv use with zero cost, if one iv cand whose base is from memory has costly elim_cost before adjust_setup_cost, it's possible to make later analysis unable to find it's finite, then we try to avoid it. The diff looks like: --- a/gcc/tree-ssa-loop-ivopts.c +++ b/gcc/tree-ssa-loop-ivopts.c @@ -5126,6 +5126,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data, tree *control_var, *bound_cst; enum tree_code comp = ERROR_MARK; struct iv_use *use = group->vuses[0]; + bool dont_elim_p = false; /* Extract condition operands. */ rewrite_type = extract_cond_operands (data, use->stmt, &control_var, @@ -5152,6 +5153,16 @@ determine_group_iv_cost_cond (struct ivopts_data *data, inv_expr_elim = get_loop_invariant_expr (data, bound); bitmap_clear (inv_vars_elim); } + + /* zero cost use makes it easier to select memory based iv cand + for replacement of non memory based iv and its use. But if + the setup sequence are too costly, loop iv analysis can NOT + easily figure out it's finite, it's possible to stop the + low-overhead loop transformation and get unexpected code. */ + if (use->zero_cost_p && cand->iv->base_object && !use->iv->base_object + && elim_cost.cost >= 30) + dont_elim_p = true; + /* The bound is a loop invariant, so it will be only computed once. */ elim_cost.cost = adjust_setup_cost (data, elim_cost.cost); @@ -5184,7 +5195,7 @@ determine_group_iv_cost_cond (struct ivopts_data *data, express_cost += bound_cost; /* Choose the better approach, preferring the eliminated IV. */ - if (elim_cost <= express_cost) + if (elim_cost <= express_cost && !dont_elim_p) { I was thinking whether this zero cost change just exposed an existing problem, then this kind of check should be for all cases not only for zero cost use, similar to expression_expensive_p? But why doesn't it happen before? Need more investigation. > >> Btw, this is for GCC10. > > *Phew* ;-) > > > Some trivial comments: > >> +static bool >> +invalid_insn_for_doloop_p (struct loop *loop) >> +{ >> + basic_block *body = get_loop_body (loop); >> + unsigned num_nodes = loop->num_nodes; >> + gimple_stmt_iterator gsi; >> + unsigned i; >> + >> + for (i = 0; i < num_nodes; i++) > > for (unsigned i = 0; i < num_nodes; i++) > > (and maybe you can just say loop->num_nodes here; I don't know if that > generates worse code, or if that even matters). Good idea, will fix it. > >> + if (dump_file && (dump_flags & TDF_DETAILS)) >> + fprintf ( >> + dump_file, >> + "predict doloop failure due to finding computed jump.\n"); > > We don't normally end lines in (. There are other solutions to why you > did that here; you can use string pasting, to break the string into two, > or factor out (some part of) the loop body to reduce indentation. > Will adjust it. > > Segher >