Forget to mention that the alias check merger can reduce the number of checks from 7 to 2 for this example:
struct A { int *base; int offset; int offset2; int offset3; int offset4; int offset5; int offset6; int offset7; int offset8; }; void foo (struct A * ar1, struct A* ar2) { int i; for (i = 0; i < 10000; i++) { ar1->base[i] = 2*ar2->base[i] + ar2->offset + ar2->offset2 + ar2->offset3 + ar2->offset4 + ar2->offset5 + ar2->offset6; /* + ar2->offset7 + ar2->offset8;*/ } } thanks, Cong On Wed, Oct 2, 2013 at 2:18 PM, Xinliang David Li <davi...@google.com> wrote: > On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener <rguent...@suse.de> wrote: >> On Tue, 1 Oct 2013, Cong Hou wrote: >> >>> When alias exists between data refs in a loop, to vectorize it GCC >>> does loop versioning and adds runtime alias checks. Basically for each >>> pair of data refs with possible data dependence, there will be two >>> comparisons generated to make sure there is no aliasing between them >>> in each iteration of the vectorized loop. If there are many such data >>> refs pairs, the number of comparisons can be very large, which is a >>> big overhead. >>> >>> However, in some cases it is possible to reduce the number of those >>> comparisons. For example, for the following loop, we can detect that >>> b[0] and b[1] are two consecutive member accesses so that we can >>> combine the alias check between a[0:100]&b[0] and a[0:100]&b[1] into >>> checking a[0:100]&b[0:2]: >>> >>> void foo(int*a, int* b) >>> { >>> for (int i = 0; i < 100; ++i) >>> a[i] = b[0] + b[1]; >>> } >>> >>> Actually, the requirement of consecutive memory accesses is too >>> strict. For the following loop, we can still combine the alias checks >>> between a[0:100]&b[0] and a[0:100]&b[100]: >>> >>> void foo(int*a, int* b) >>> { >>> for (int i = 0; i < 100; ++i) >>> a[i] = b[0] + b[100]; >>> } >>> >>> This is because if b[0] is not in a[0:100] and b[100] is not in >>> a[0:100] then a[0:100] cannot be between b[0] and b[100]. We only need >>> to check a[0:100] and b[0:101] don't overlap. >>> >>> More generally, consider two pairs of data refs (a, b1) and (a, b2). >>> Suppose addr_b1 and addr_b2 are basic addresses of data ref b1 and b2; >>> offset_b1 and offset_b2 (offset_b1 < offset_b2) are offsets of b1 and >>> b2, and segment_length_a, segment_length_b1, and segment_length_b2 are >>> segment length of a, b1, and b2. Then we can combine the two >>> comparisons into one if the following condition is satisfied: >>> >>> offset_b2- offset_b1 - segment_length_b1 < segment_length_a >>> >>> >>> This patch detects those combination opportunities to reduce the >>> number of alias checks. It is tested on an x86-64 machine. >> >> Apart from the other comments you got (to which I agree) the patch >> seems to do two things, namely also: >> >> + /* Extract load and store statements on pointers with zero-stride >> + accesses. */ >> + if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) >> + { >> >> which I'd rather see in a separate patch (and done also when >> the loop doesn't require versioning for alias). > > yes. > >> >> Also combining the alias checks in vect_create_cond_for_alias_checks >> is nice but doesn't properly fix the use of the >> vect-max-version-for-alias-checks param > > Yes. The handling of this should be moved to > 'vect_prune_runtime_alias_test_list' to avoid premature decisions. > > > >>which currently inhibits >> vectorization of the HIMENO benchmark by default (and make us look bad >> compared to LLVM). > > Here is a small reproducible: > > struct A { > int *base; > int offset; > int offset2; > int offset3; > int offset4; > int offset5; > int offset6; > int offset7; > int offset8; > }; > > void foo (struct A * ar1, struct A* ar2) > { > int i; > for (i = 0; i < 10000; i++) > { > ar1->base[i] = 2*ar2->base[i] + ar2->offset + ar2->offset2 > + ar2->offset3 + ar2->offset4 + ar2->offset5 + ar2->offset6; /* + > ar2->offset7 + ar2->offset8;*/ > } > } > > GCC trunk won't vectorize it at O2 due to the limit. > > > There is another problem we should be tracking: GCC no longer > vectorize the loop (with large > --param=vect-max-version-for-alias-checks=40) when -fno-strict-alias > is specified. However with additional runtime alias check, the loop > should be vectorizable. > > David > > >> >> So I believe this merging should be done incrementally when >> we collect the DDRs we need to test in vect_mark_for_runtime_alias_test. >> >> Thanks for working on this, >> Richard. >> >>> >>> thanks, >>> Cong >>> >>> >>> >>> Index: gcc/tree-vect-loop-manip.c >>> =================================================================== >>> --- gcc/tree-vect-loop-manip.c (revision 202662) >>> +++ gcc/tree-vect-loop-manip.c (working copy) >>> @@ -19,6 +19,10 @@ You should have received a copy of the G >>> along with GCC; see the file COPYING3. If not see >>> <http://www.gnu.org/licenses/>. */ >>> >>> +#include <vector> >>> +#include <utility> >>> +#include <algorithm> >>> + >>> #include "config.h" >>> #include "system.h" >>> #include "coretypes.h" >>> @@ -2248,6 +2252,74 @@ vect_vfa_segment_size (struct data_refer >>> return segment_length; >>> } >>> >>> +namespace >>> +{ >>> + >>> +/* struct dr_addr_with_seg_len >>> + >>> + A struct storing information of a data reference, including the data >>> + ref itself, its basic address, the access offset and the segment length >>> + for aliasing checks. */ >>> + >>> +struct dr_addr_with_seg_len >>> +{ >>> + dr_addr_with_seg_len (data_reference* d, tree addr, tree off, tree len) >>> + : dr (d), basic_addr (addr), offset (off), seg_len (len) {} >>> + >>> + data_reference* dr; >>> + tree basic_addr; >>> + tree offset; >>> + tree seg_len; >>> +}; >>> + >>> +/* Operator == between two dr_addr_with_seg_len objects. >>> + >>> + This equality operator is used to make sure two data refs >>> + are the same one so that we will consider to combine the >>> + aliasing checks of those two pairs of data dependent data >>> + refs. */ >>> + >>> +bool operator == (const dr_addr_with_seg_len& d1, >>> + const dr_addr_with_seg_len& d2) >>> +{ >>> + return operand_equal_p (d1.basic_addr, d2.basic_addr, 0) >>> + && operand_equal_p (d1.offset, d2.offset, 0) >>> + && operand_equal_p (d1.seg_len, d2.seg_len, 0); >>> +} >>> + >>> +typedef std::pair <dr_addr_with_seg_len, dr_addr_with_seg_len> >>> + dr_addr_with_seg_len_pair_t; >>> + >>> + >>> +/* Operator < between two dr_addr_with_seg_len_pair_t objects. >>> + >>> + This operator is used to sort objects of dr_addr_with_seg_len_pair_t >>> + so that we can combine aliasing checks during one scan. */ >>> + >>> +bool operator < (const dr_addr_with_seg_len_pair_t& p1, >>> + const dr_addr_with_seg_len_pair_t& p2) >>> +{ >>> + const dr_addr_with_seg_len& p11 = p1.first; >>> + const dr_addr_with_seg_len& p12 = p1.second; >>> + const dr_addr_with_seg_len& p21 = p2.first; >>> + const dr_addr_with_seg_len& p22 = p2.second; >>> + >>> + if (p11.basic_addr != p21.basic_addr) >>> + return p11.basic_addr < p21.basic_addr; >>> + if (p12.basic_addr != p22.basic_addr) >>> + return p12.basic_addr < p22.basic_addr; >>> + if (TREE_CODE (p11.offset) != INTEGER_CST >>> + || TREE_CODE (p21.offset) != INTEGER_CST) >>> + return p11.offset < p21.offset; >>> + if (int_cst_value (p11.offset) != int_cst_value (p21.offset)) >>> + return int_cst_value (p11.offset) < int_cst_value (p21.offset); >>> + if (TREE_CODE (p12.offset) != INTEGER_CST >>> + || TREE_CODE (p22.offset) != INTEGER_CST) >>> + return p12.offset < p22.offset; >>> + return int_cst_value (p12.offset) < int_cst_value (p22.offset); >>> +} >>> + >>> +} >>> >>> /* Function vect_create_cond_for_alias_checks. >>> >>> @@ -2292,20 +2364,51 @@ vect_create_cond_for_alias_checks (loop_ >>> if (may_alias_ddrs.is_empty ()) >>> return; >>> >>> + >>> + /* Basically, for each pair of dependent data refs store_ptr_0 >>> + and load_ptr_0, we create an expression: >>> + >>> + ((store_ptr_0 + store_segment_length_0) <= load_ptr_0) >>> + || (load_ptr_0 + load_segment_length_0) <= store_ptr_0)) >>> + >>> + for aliasing checks. However, in some cases we can decrease >>> + the number of checks by combining two checks into one. For >>> + example, suppose we have another pair of data refs store_ptr_0 >>> + and load_ptr_1, and if the following condition is satisfied: >>> + >>> + load_ptr_0 < load_ptr_1 && >>> + load_ptr_1 - load_ptr_0 - load_segment_length_0 < >>> store_segment_length_0 >>> + >>> + (this condition means, in each iteration of vectorized loop, >>> + the accessed memory of store_ptr_0 cannot be between the memory >>> + of load_ptr_0 and load_ptr_1.) >>> + >>> + we then can use only the following expression to finish the >>> + alising checks between store_ptr_0 & load_ptr_0 and >>> + store_ptr_0 & load_ptr_1: >>> + >>> + ((store_ptr_0 + store_segment_length_0) <= load_ptr_0) >>> + || (load_ptr_1 + load_segment_length_1 <= store_ptr_0)) >>> + >>> + Note that we only consider that load_ptr_0 and load_ptr_1 have the >>> + same basic address. */ >>> + >>> + std::vector<dr_addr_with_seg_len_pair_t> ddrs_with_seg_len; >>> + >>> + /* First, we collect all data ref pairs for aliasing checks. */ >>> + >>> FOR_EACH_VEC_ELT (may_alias_ddrs, i, ddr) >>> { >>> struct data_reference *dr_a, *dr_b; >>> gimple dr_group_first_a, dr_group_first_b; >>> - tree addr_base_a, addr_base_b; >>> tree segment_length_a, segment_length_b; >>> gimple stmt_a, stmt_b; >>> - tree seg_a_min, seg_a_max, seg_b_min, seg_b_max; >>> >>> dr_a = DDR_A (ddr); >>> stmt_a = DR_STMT (DDR_A (ddr)); >>> dr_group_first_a = GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt_a)); >>> if (dr_group_first_a) >>> - { >>> + { >>> stmt_a = dr_group_first_a; >>> dr_a = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt_a)); >>> } >>> @@ -2314,20 +2417,11 @@ vect_create_cond_for_alias_checks (loop_ >>> stmt_b = DR_STMT (DDR_B (ddr)); >>> dr_group_first_b = GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt_b)); >>> if (dr_group_first_b) >>> - { >>> + { >>> stmt_b = dr_group_first_b; >>> dr_b = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt_b)); >>> } >>> >>> - addr_base_a >>> - = fold_build_pointer_plus (DR_BASE_ADDRESS (dr_a), >>> - size_binop (PLUS_EXPR, DR_OFFSET (dr_a), >>> - DR_INIT (dr_a))); >>> - addr_base_b >>> - = fold_build_pointer_plus (DR_BASE_ADDRESS (dr_b), >>> - size_binop (PLUS_EXPR, DR_OFFSET (dr_b), >>> - DR_INIT (dr_b))); >>> - >>> if (!operand_equal_p (DR_STEP (dr_a), DR_STEP (dr_b), 0)) >>> length_factor = scalar_loop_iters; >>> else >>> @@ -2335,24 +2429,149 @@ vect_create_cond_for_alias_checks (loop_ >>> segment_length_a = vect_vfa_segment_size (dr_a, length_factor); >>> segment_length_b = vect_vfa_segment_size (dr_b, length_factor); >>> >>> + dr_addr_with_seg_len_pair_t dr_with_seg_len_pair >>> + (dr_addr_with_seg_len >>> + (dr_a, DR_BASE_ADDRESS (dr_a), >>> + size_binop (PLUS_EXPR, DR_OFFSET (dr_a), DR_INIT (dr_a)), >>> + segment_length_a), >>> + dr_addr_with_seg_len >>> + (dr_b, DR_BASE_ADDRESS (dr_b), >>> + size_binop (PLUS_EXPR, DR_OFFSET (dr_b), DR_INIT (dr_b)), >>> + segment_length_b)); >>> + >>> + if (dr_with_seg_len_pair.first.basic_addr > >>> + dr_with_seg_len_pair.second.basic_addr) >>> + std::swap (dr_with_seg_len_pair.first, dr_with_seg_len_pair.second); >>> + >>> + ddrs_with_seg_len.push_back (dr_with_seg_len_pair); >>> + } >>> + >>> + /* Second, we sort the collected data ref pairs so that we can scan >>> + them once to combine all possible aliasing checks. */ >>> + >>> + std::sort (ddrs_with_seg_len.begin(), ddrs_with_seg_len.end()); >>> + >>> + /* Remove duplicate data ref pairs. */ >>> + ddrs_with_seg_len.erase (std::unique (ddrs_with_seg_len.begin(), >>> + ddrs_with_seg_len.end()), >>> + ddrs_with_seg_len.end()); >>> + >>> + /* We then scan the sorted dr pairs and check if we can combine >>> + alias checks of two neighbouring dr pairs. */ >>> + >>> + for (size_t i = 1; i < ddrs_with_seg_len.size (); ++i) >>> + { >>> + dr_addr_with_seg_len& dr_a1 = ddrs_with_seg_len[i-1].first; >>> + dr_addr_with_seg_len& dr_b1 = ddrs_with_seg_len[i-1].second; >>> + dr_addr_with_seg_len& dr_a2 = ddrs_with_seg_len[i].first; >>> + dr_addr_with_seg_len& dr_b2 = ddrs_with_seg_len[i].second; >>> + >>> + if (dr_a1 == dr_a2) >>> + { >>> + if (dr_b1.basic_addr != dr_b2.basic_addr >>> + || TREE_CODE (dr_b1.offset) != INTEGER_CST >>> + || TREE_CODE (dr_b2.offset) != INTEGER_CST) >>> + continue; >>> + >>> + int diff = int_cst_value (dr_b2.offset) - >>> + int_cst_value (dr_b1.offset); >>> + >>> + gcc_assert (diff > 0); >>> + >>> + if (diff <= vect_factor >>> + || (TREE_CODE (dr_b1.seg_len) == INTEGER_CST >>> + && diff - int_cst_value (dr_b1.seg_len) < vect_factor) >>> + || (TREE_CODE (dr_b1.seg_len) == INTEGER_CST >>> + && TREE_CODE (dr_a1.seg_len) == INTEGER_CST >>> + && diff - int_cst_value (dr_b1.seg_len) < >>> + int_cst_value (dr_a1.seg_len))) >>> + { >>> + if (dump_enabled_p ()) >>> + { >>> + dump_printf_loc >>> + (MSG_NOTE, vect_location, >>> + "combining two runtime checks for data references "); >>> + dump_generic_expr (MSG_NOTE, TDF_SLIM, DR_REF (dr_b1.dr)); >>> + dump_printf (MSG_NOTE, " and "); >>> + dump_generic_expr (MSG_NOTE, TDF_SLIM, DR_REF (dr_b2.dr)); >>> + dump_printf (MSG_NOTE, "\n"); >>> + } >>> + >>> + dr_b1.seg_len = size_binop (PLUS_EXPR, >>> + dr_b2.seg_len, size_int (diff)); >>> + ddrs_with_seg_len.erase (ddrs_with_seg_len.begin () + i); >>> + --i; >>> + } >>> + } >>> + else if (dr_b1 == dr_b2) >>> + { >>> + if (dr_a1.basic_addr != dr_a2.basic_addr >>> + || TREE_CODE (dr_a1.offset) != INTEGER_CST >>> + || TREE_CODE (dr_a2.offset) != INTEGER_CST) >>> + continue; >>> + >>> + int diff = int_cst_value (dr_a2.offset) - >>> + int_cst_value (dr_a1.offset); >>> + >>> + gcc_assert (diff > 0); >>> + >>> + if (diff <= vect_factor >>> + || (TREE_CODE (dr_a1.seg_len) == INTEGER_CST >>> + && diff - int_cst_value (dr_a1.seg_len) < vect_factor) >>> + || (TREE_CODE (dr_a1.seg_len) == INTEGER_CST >>> + && TREE_CODE (dr_b1.seg_len) == INTEGER_CST >>> + && diff - int_cst_value (dr_a1.seg_len) < >>> + int_cst_value (dr_b1.seg_len))) >>> + { >>> + if (dump_enabled_p ()) >>> + { >>> + dump_printf_loc >>> + (MSG_NOTE, vect_location, >>> + "combining two runtime checks for data references "); >>> + dump_generic_expr (MSG_NOTE, TDF_SLIM, DR_REF (dr_a1.dr)); >>> + dump_printf (MSG_NOTE, " and "); >>> + dump_generic_expr (MSG_NOTE, TDF_SLIM, DR_REF (dr_a2.dr)); >>> + dump_printf (MSG_NOTE, "\n"); >>> + } >>> + >>> + dr_a1.seg_len = size_binop (PLUS_EXPR, >>> + dr_a2.seg_len, size_int (diff)); >>> + ddrs_with_seg_len.erase (ddrs_with_seg_len.begin () + i); >>> + --i; >>> + } >>> + } >>> + } >>> + >>> + for (size_t i = 0, s = ddrs_with_seg_len.size (); i < s; ++i) >>> + { >>> + const dr_addr_with_seg_len& dr_a = ddrs_with_seg_len[i].first; >>> + const dr_addr_with_seg_len& dr_b = ddrs_with_seg_len[i].second; >>> + tree segment_length_a = dr_a.seg_len; >>> + tree segment_length_b = dr_b.seg_len; >>> + >>> + tree addr_base_a >>> + = fold_build_pointer_plus (dr_a.basic_addr, dr_a.offset); >>> + tree addr_base_b >>> + = fold_build_pointer_plus (dr_b.basic_addr, dr_b.offset); >>> + >>> if (dump_enabled_p ()) >>> { >>> dump_printf_loc (MSG_NOTE, vect_location, >>> - "create runtime check for data references "); >>> - dump_generic_expr (MSG_NOTE, TDF_SLIM, DR_REF (dr_a)); >>> + "create runtime check for data references "); >>> + dump_generic_expr (MSG_NOTE, TDF_SLIM, DR_REF (dr_a.dr)); >>> dump_printf (MSG_NOTE, " and "); >>> - dump_generic_expr (MSG_NOTE, TDF_SLIM, DR_REF (dr_b)); >>> - dump_printf (MSG_NOTE, "\n"); >>> + dump_generic_expr (MSG_NOTE, TDF_SLIM, DR_REF (dr_b.dr)); >>> + dump_printf (MSG_NOTE, "\n"); >>> } >>> >>> - seg_a_min = addr_base_a; >>> - seg_a_max = fold_build_pointer_plus (addr_base_a, segment_length_a); >>> - if (tree_int_cst_compare (DR_STEP (dr_a), size_zero_node) < 0) >>> + tree seg_a_min = addr_base_a; >>> + tree seg_a_max = fold_build_pointer_plus (addr_base_a, >>> segment_length_a); >>> + if (tree_int_cst_compare (DR_STEP (dr_a.dr), size_zero_node) < 0) >>> seg_a_min = seg_a_max, seg_a_max = addr_base_a; >>> >>> - seg_b_min = addr_base_b; >>> - seg_b_max = fold_build_pointer_plus (addr_base_b, segment_length_b); >>> - if (tree_int_cst_compare (DR_STEP (dr_b), size_zero_node) < 0) >>> + tree seg_b_min = addr_base_b; >>> + tree seg_b_max = fold_build_pointer_plus (addr_base_b, >>> segment_length_b); >>> + if (tree_int_cst_compare (DR_STEP (dr_b.dr), size_zero_node) < 0) >>> seg_b_min = seg_b_max, seg_b_max = addr_base_b; >>> >>> part_cond_expr = >>> @@ -2477,6 +2696,81 @@ vect_loop_versioning (loop_vec_info loop >>> adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi)); >>> } >>> >>> + /* Extract load and store statements on pointers with zero-stride >>> + accesses. */ >>> + if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo)) >>> + { >>> + >>> + /* In the loop body, we iterate each statement to check if it is a >>> load >>> + or store. Then we check the DR_STEP of the data reference. If >>> + DR_STEP is zero, then we will hoist the load statement to the loop >>> + preheader, and move the store statement to the loop exit. */ >>> + >>> + for (gimple_stmt_iterator si = gsi_start_bb (loop->header); >>> + !gsi_end_p (si); ) >>> + { >>> + gimple stmt = gsi_stmt (si); >>> + stmt_vec_info stmt_info = vinfo_for_stmt (stmt); >>> + struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info); >>> + >>> + >>> + if (dr && integer_zerop (DR_STEP (dr))) >>> + { >>> + if (DR_IS_READ (dr)) >>> + { >>> + if (dump_file) >>> + { >>> + fprintf (dump_file, >>> + "Hoist the load to outside of the loop:\n"); >>> + print_gimple_stmt (dump_file, stmt, 0, >>> + TDF_VOPS|TDF_MEMSYMS); >>> + } >>> + >>> + basic_block preheader = loop_preheader_edge (loop)->src; >>> + gimple_stmt_iterator si_dst = gsi_last_bb (preheader); >>> + gsi_move_after (&si, &si_dst); >>> + } >>> + else >>> + { >>> + gimple_stmt_iterator si_dst = >>> + gsi_last_bb (single_exit (loop)->dest); >>> + gsi_move_after (&si, &si_dst); >>> + } >>> + continue; >>> + } >>> + else if (!dr) >>> + { >>> + bool hoist = true; >>> + for (size_t i = 0; i < gimple_num_ops (stmt); i++) >>> + { >>> + tree op = gimple_op (stmt, i); >>> + if (TREE_CODE (op) == INTEGER_CST >>> + || TREE_CODE (op) == REAL_CST) >>> + continue; >>> + if (TREE_CODE (op) == SSA_NAME) >>> + { >>> + gimple def = SSA_NAME_DEF_STMT (op); >>> + if (def == stmt >>> + || gimple_nop_p (def) >>> + || !flow_bb_inside_loop_p (loop, gimple_bb (def))) >>> + continue; >>> + } >>> + hoist = false; >>> + break; >>> + } >>> + >>> + if (hoist) >>> + { >>> + basic_block preheader = loop_preheader_edge (loop)->src; >>> + gimple_stmt_iterator si_dst = gsi_last_bb (preheader); >>> + gsi_move_after (&si, &si_dst); >>> + continue; >>> + } >>> + } >>> + gsi_next (&si); >>> + } >>> + } >>> + >>> /* End loop-exit-fixes after versioning. */ >>> >>> if (cond_expr_stmt_list) >>> Index: gcc/ChangeLog >>> =================================================================== >>> --- gcc/ChangeLog (revision 202663) >>> +++ gcc/ChangeLog (working copy) >>> @@ -1,3 +1,8 @@ >>> +2013-10-01 Cong Hou <co...@google.com> >>> + >>> + * tree-vect-loop-manip.c (vect_create_cond_for_alias_checks): Combine >>> + alias checks if it is possible to amortize the runtime overhead. >>> + >>> >>> >> >> -- >> Richard Biener <rguent...@suse.de> >> SUSE / SUSE Labs >> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746 >> GF: Jeff Hawn, Jennifer Guild, Felix Imend