On Tue, Apr 24, 2012 at 7:43 PM, Aldy Hernandez <al...@redhat.com> wrote:
> On 04/13/12 03:46, Richard Guenther wrote:
>>
>> On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez <al...@redhat.com> wrote:
>
> Richard.  Thanks so much for reviewing and providing an alternative
> approach, which AFAICT provides superior results.
>
>> A similar effect could be obtained by keeping a flag whether we entered
>> the path that stored the value before the transform.  Thus, do
>>
>>   lsm = g2;  // technically not needed, if you also handle loads
>>   flag = false;
>>   for (...)
>>     {
>>       if (g1)
>>         {
>>           if (flag)
>>             g2 = lsm;
>>         }
>>       else
>>         {
>>           lsm = 0;
>>           flag = true;
>>         }
>>     }
>>   if (flag)
>>     g2 = lsm;
>
> I have implemented this in the attached patch, and it seems to be
> generating better code than my other if-changed approach.  It also
> avoids generating unnecessary loads that may trap.
>
> For the original testcase:
>
> int g_1 = 1;
> int g_2 = 0;
>
> int func_1(void)
> {
>   int l;
>   for (l = 0; l < 1234; l++)
>     {
>       if (g_1)
>         return l;
>       else
>         g_2 = 0;
>     }
>   return 999;
> }
>
> After all optimizations are done, we are now generating the following
> for the flags approach:
>
> new:
> func_1:
>         movl    g_1(%rip), %edx
>         xorl    %eax, %eax
>         testl   %edx, %edx
>         je      .L5
>         rep
>         ret
> .L5:
>         movl    $0, g_2(%rip)
>         movw    $999, %ax
>         ret
>
> Which is significantly better than the unmodified trunk of:
>
> func_1:
>         movl    g_1(%rip), %edx
>         movl    g_2(%rip), %eax
>         testl   %edx, %edx
>         je      .L5
>         movl    %eax, g_2(%rip)
>         xorl    %eax, %eax
>         ret
> .L5:
>         movl    $0, g_2(%rip)
>         movl    $999, %eax
>         ret
>
> And in other less contrived cases, we generate correct code that
> avoids writes on code paths that did not have them.  For example, in
> Hans' register promotion example:
>
> int count;
>
> struct obj {
>     int data;
>     struct obj *next;
> } *q;
>
> void func()
> {
>   struct obj *p;
>   for (p = q; p; p = p->next)
>     if (p->data > 0)
>       count++;
> }
>
> Under the new memory model we should avoid promoting "count" to a
> register and unilaterally writing to it upon exiting the loop.
>
> With the attached patch, we now generate:
>
> func:
> .LFB0:
>         movq    q(%rip), %rax
>         xorl    %ecx, %ecx
>         movl    count(%rip), %edx
>         testq   %rax, %rax
>         je      .L1
> .L9:
>         movl    (%rax), %esi
>         testl   %esi, %esi
>         jle     .L3
>         addl    $1, %edx
>         movl    $1, %ecx
> .L3:
>         movq    8(%rax), %rax
>         testq   %rax, %rax
>         jne     .L9
>         testb   %cl, %cl
>         jne     .L12
> .L1:
>         rep
>         ret
> .L12:
>         movl    %edx, count(%rip)
>         ret
>
> Which is as expected.
>
> I don't understand your suggestion of:
>
>> lsm = g2; // technically not needed, if you also handle loads
>>
>> that would allow to extend this to cover the cases where the access
>> may eventually trap (if you omit the load before the loop).
>
> Can't I just remove the lsm=g2?  What's this about also handling loads?
>
>> Not sure which is more efficient (you can eventually combine flag
>> variables for different store locations, combining the if-changed
>> compare is not so easy I guess).
>
> So far, I see better code generated with the flags approach, so...
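
(As an aside, a rough source-level sketch of what the flag-based
transform amounts to for func_1 above; this is a hand-written
illustration, and the temporaries g_2_lsm and g_2_written are made-up
names, not the patch's actual output:

int func_1 (void)
{
  int l;
  int g_2_lsm;                 /* store-motion temporary for g_2 */
  _Bool g_2_written = 0;       /* set on any path that stores g_2 */
  int ret = 999;

  for (l = 0; l < 1234; l++)
    {
      if (g_1)
        {
          ret = l;
          break;               /* loop exit: flush below if needed */
        }
      else
        {
          g_2_lsm = 0;
          g_2_written = 1;
        }
    }

  if (g_2_written)             /* store only if the loop really stored */
    g_2 = g_2_lsm;
  return ret;
}

The store to g_2 now happens only on exits from iterations that
actually stored, which is what the generated assembly above reflects.)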
>>> 1. No pass can figure out the equality (or inequality) of g_2_lsm
>>> and g_2.  In the PR, Richard mentions that both FRE/PRE can figure
>>> this out, but they are not run after store motion.  DOM should also
>>> be able to, but apparently does not :(.  Tips?
>>
>> Well.  Schedule FRE after loop opts ...
>>
>> DOM is not prepared to handle loads/stores with differing VOPs - it
>> was never updated really after the alias-improvements merge.
>>
>>> 2. The GIMPLE_CONDs I create in this patch are currently causing
>>> problems with complex floats/doubles.  do_jump is complaining that
>>> it can't compare them, so I assume it is too late (in
>>> tree-ssa-loop-im.c) to compare complex values since complex lowering
>>> has already happened (??).  Is there a more canonical way of
>>> creating a GIMPLE_COND that may contain complex floats at this
>>> stage?
>>
>> As you are handling redundant stores you want a bitwise comparison
>> anyway, but yes, complex compares need to be lowered.  But as said,
>> you want a bitwise comparison, not a value-comparison.  You can try
>> using
>
> I can ignore all this.  I have implemented both alternatives, with a
> target hook as suggested, but I'm thinking of removing it altogether
> and just leaving the flags approach.
>
>> Few comments on the patch itself.
>>
>> +  new_bb = split_edge (ex);
>> +  then_bb = create_empty_bb (new_bb);
>> +  if (current_loops && new_bb->loop_father)
>> +    add_bb_to_loop (then_bb, new_bb->loop_father);
>> +
>> +  gsi = gsi_start_bb (new_bb);
>> +  t1 = make_rename_temp (TREE_TYPE (ref->mem), NULL);
>>
>> Hmm, why do you need to re-load the value?  Don't you have both the
>> value as it was loaded before the loop and the new value (the lsm tmp
>> reg) already?  Why not simply remember the SSA name used?  Ah,
>> because store motion also uses make_rename_temp.  If you don't want
>> to change that (which would be good but inconvenient because of the
>> required PHI insertions), I'd change the code that performs the load
>> in the pre-header to do
>>
>>   SSA_NAME = ref->mem;
>>   rename-lsm-tmp = SSA_NAME;
>>
>> so you can remember SSA_NAME and re-use it here.  That probably
>> solves your optimization issue, too.
>
> Fixed.
>
>> +  for (gsi = gsi_start_phis (old_dest); !gsi_end_p (gsi); gsi_next (&gsi))
>> +    {
>> +      gimple phi = gsi_stmt (gsi);
>> +      unsigned i;
>> +
>> +      for (i = 0; i < gimple_phi_num_args (phi); i++)
>> +        if (gimple_phi_arg_edge (phi, i)->src == new_bb)
>> +          {
>> +            tree arg = gimple_phi_arg_def (phi, i);
>> +            add_phi_arg (phi, arg, then_old_edge, UNKNOWN_LOCATION);
>> +            update_stmt (phi);
>> +          }
>> +    }
>>
>> Hmm.  This is of course doing unnecessary work in case there are
>> multiple moved stores.  In fact splitting the exit edge is only
>> necessary if there is more than one predecessor.  Otherwise you can
>> simply split the exit block after inserting the new conditional after
>> its labels.
>
> Is this still the case with the current patch?  If so, I'm a bit
> confused as to what you want here.
>
>> +  if (opts->x_flag_tm)
>> +    {
>> +      if (opts->x_flag_non_call_exceptions)
>> +        sorry ("transactional memory is not supported with non-call exceptions");
>> +
>> +      set_param_value ("allow-load-data-races", 0,
>> +                       opts->x_param_values, opts_set->x_param_values);
>> +      set_param_value ("allow-store-data-races", 0,
>> +                       opts->x_param_values, opts_set->x_param_values);
>>
>> Unless the user explicitly set those params?  Which means using
>> params wasn't a very good idea ... ideally you'd diagnose an error
>> when the user would mix -fgnu-tm and -fallow-store-data-races.  So -
>> can we remove allow-load-data-races (we are never going to implement
>> that) and make allow-store-data-races a proper -f flag please?
>
> Sorry, that was not meant for public consumption.
> I have rewritten this to either take the param value as is, and in the
> case of flag_tm, only restrict the optimization if the loop is inside
> an actual transaction (see the changes to trans-mem.c, cause I know
> you'll hate bb_in_transaction() :-)).
>
>> +    }
>>
>> And that should be a separate patch
>
> I can certainly reimplement --param=allow-{load,store}-data-races as
> proper -f flags.  I don't care either way.  But I remember Ian Taylor
> having a different opinion.  If y'all agree, I can implement whatever.
>
>> It would be nice if you could think about LIM/SM for possibly
>> trapping loads/stores while you are doing this work.
>
> We are no longer adding additional trapping loads (as with the
> if-changed approach).  Is this something else you'd like me to
> concentrate on?
>
> Please take a look at the attached patch.  I tested the flags
> alternative on a full bootstrap, and there's only one regression which
> I will look at, but I'd like to know if the current WIP is aligned
> with what you had in mind.
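
(Similarly, for Hans' register promotion example quoted earlier, the
flag-based promotion of "count" corresponds roughly to the following
source-level sketch; the names count_lsm and count_written are made up
for illustration and are not taken from the patch:

void func (void)
{
  struct obj *p;
  int count_lsm = count;      /* load once before the loop */
  _Bool count_written = 0;    /* set only if the loop writes count */

  for (p = q; p; p = p->next)
    if (p->data > 0)
      {
        count_lsm++;
        count_written = 1;
      }

  if (count_written)          /* flush only on paths that stored */
    count = count_lsm;
}

This matches the generated assembly shown above, where %ecx plays the
role of count_written.)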
+  /* ?? Perhaps we should cache this somewhere in the BB, or are
+     multiple levels of empty BB's common. ?? */
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      int res = bb_in_transaction_1 (e->src);
+      if (res != -1)
+        return (bool) res;
+    }
+  return false;

That's weird code - if the predecessors are not all in (or all not in)
a transaction you return a random value.  So it seems this is somewhat
fragile.  If bb status can be derived from looking at a single stmt,
why is the transaction-ness not recorded as a basic-block flag then?
Thus, why do we have a bit in gimple stmts?  And why can nobody merge a
transaction and a non-transaction basic block, making your predicate
above wrong because only the 2nd stmt is in a transaction?

+  bool single_exit_p = single_pred_p (ex->dest);

That's a strange variable name ;)

+/* Helper function for execute_sm.  On every path that sets REF, set
+   an appropriate flag indicating the store.  */
+
+static tree
+execute_sm_if_changed_flag_set (struct loop *loop, mem_ref_p ref)
+{
...
+  FOR_EACH_VEC_ELT (mem_ref_loc_p, locs, i, loc)
+    {
+      gimple_stmt_iterator gsi;
+      gimple stmt;
+
+      gsi = gsi_start_bb (gimple_bb (loc->stmt));
+      for (; !gsi_end_p (gsi); gsi_next (&gsi))
+        if (gsi_stmt (gsi) == loc->stmt)

This does not seem to do it on every path but on every REF setter.  And
instead of the innermost loop just use gsi_for_stmt (loc->stmt).

+  /* Emit the load code into the latch, so that we are sure it will
+     be processed after all dependencies.  */
+  latch_edge = loop_latch_edge (loop);
+  load = gimple_build_assign (mem_ssa, unshare_expr (ref->mem));
   lim_data = init_lim_data (load);
   lim_data->max_loop = loop;
   lim_data->tgt_loop = loop;
+  gsi_insert_on_edge (latch_edge, load);
+  load = gimple_build_assign (tmp_var, mem_ssa);
+  lim_data = init_lim_data (load);
+  lim_data->max_loop = loop;
+  lim_data->tgt_loop = loop;
+  gsi_insert_on_edge (latch_edge, load);

The 2nd assign is no longer a "load", so I suppose you don't need any
lim_data for it?

+  else if (store == STORE_IF_CHANGED_FLAG)
+    execute_sm_if_changed (ex, ref->mem, mem_ssa, tmp_var,
+                           store_flag);

And this can pass NULL instead of mem_ssa?

Overall this looks reasonable.  With also handling trapping things I
meant that we should be able to apply store-motion to

int a[256];
void foo (int *p)
{
  int i;
  for (i = 0; i < 256; i++)
    if (flag)
      *p = a[i];
}

but we cannot, because with the transform

  lsm = *p;
  for (i = 0; i < 256; ++i)
    if (flag)
      lsm = a[i];
  *p = lsm;

even when the store is properly conditional, the load is not (and you
do not change that).  The unconditional load may trap here.  What I
meant to say is that we do not need the load at all unless there is
also a load involved inside the loop - if we use the flag-based
approach.  If there is also a load inside the loop we'd need to
perform the load there (or not handle this case).

But overall this looks nice!

Thanks,
Richard.

> Thanks again.
> Aldy
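
(For concreteness, the flag-based form of the foo() example above,
which needs no load of *p at all and so introduces no new trap, would
look roughly like this; a hand-written sketch, not the patch's output,
with made-up names lsm and stored, and with the undeclared "flag" from
the example turned into a global for self-containment:

int a[256];
int flag;               /* stands in for the condition in the example */

void foo (int *p)
{
  int i, lsm;
  _Bool stored = 0;

  for (i = 0; i < 256; i++)
    if (flag)
      {
        lsm = a[i];     /* value that would have been stored to *p */
        stored = 1;
      }

  if (stored)           /* *p is written only if the loop stored */
    *p = lsm;
}

No pre-loop read of *p is needed because nothing in the loop reads it.)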