On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: > On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.ch...@gmail.com> wrote: >> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <bin.ch...@arm.com> wrote: >>> Hi, >>> This is the main patch rewriting loop distribution in order to handle hmmer. >>> It improves loop distribution by versioning loop under runtime alias check >>> conditions. >>> As described in comments, the patch basically implements distribution in >>> the following >>> steps: >>> >>> 1) Seed partitions with specific type statements. For now we support >>> two types seed statements: statement defining variable used outside >>> of loop; statement storing to memory. >>> 2) Build reduced dependence graph (RDG) for loop to be distributed. >>> The vertices (RDG:V) model all statements in the loop and the edges >>> (RDG:E) model flow and control dependencies between statements. >>> 3) Apart from RDG, compute data dependencies between memory references. >>> 4) Starting from seed statement, build up partition by adding depended >>> statements according to RDG's dependence information. Partition is >>> classified as parallel type if it can be executed paralleled; or as >>> sequential type if it can't. Parallel type partition is further >>> classified as different builtin kinds if it can be implemented as >>> builtin function calls. >>> 5) Build partition dependence graph (PG) based on data dependencies. >>> The vertices (PG:V) model all partitions and the edges (PG:E) model >>> all data dependencies between every partitions pair. In general, >>> data dependence is either compilation time known or unknown. In C >>> family languages, there exists quite amount compilation time unknown >>> dependencies because of possible alias relation of data references. >>> We categorize PG's edge to two types: "true" edge that represents >>> compilation time known data dependencies; "alias" edge for all other >>> data dependencies. >>> 6) Traverse subgraph of PG as if all "alias" edges don't exist. Merge >>> partitions in each strong connected component (SCC) correspondingly. >>> Build new PG for merged partitions. >>> 7) Traverse PG again and this time with both "true" and "alias" edges >>> included. We try to break SCCs by removing some edges. Because >>> SCCs by "true" edges are all fused in step 6), we can break SCCs >>> by removing some "alias" edges. It's NP-hard to choose optimal >>> edge set, fortunately simple approximation is good enough for us >>> given the small problem scale. >>> 8) Collect all data dependencies of the removed "alias" edges. Create >>> runtime alias checks for collected data dependencies. >>> 9) Version loop under the condition of runtime alias checks. Given >>> loop distribution generally introduces additional overhead, it is >>> only useful if vectorization is achieved in distributed loop. We >>> version loop with internal function call IFN_LOOP_DIST_ALIAS. If >>> no distributed loop can be vectorized, we simply remove distributed >>> loops and recover to the original one. >>> >>> Also, there are some more to improve in the future (which isn't difficult I >>> think): >>> TODO: >>> 1) We only distribute innermost loops now. This pass should handle >>> loop >>> nests in the future. >>> 2) We only fuse partitions in SCC now. A better fusion algorithm is >>> desired to minimize loop overhead, maximize parallelism and maximize >>> >>> Bootstrap and test on x86_64 and AArch64. Is it OK? >>> >> Trivial updated due to changes in previous patches. Also fixed issues >> mentioned by Kugan. > Rebased V3 for changes in previous patches. Bootstap and test on > x86_64 and aarch64.
why is ldist-12.c no longer distributed? your comment says it doesn't expose more "parallelism" but the point is to reduce memory bandwith requirements which it clearly does. Likewise for -13.c, -14.c. -4.c may be a questionable case but the wording of "parallelism" still confuses me. Can you elaborate on that. Now onto the patch: + Loop distribution is the dual of loop fusion. It separates statements + of a loop (or loop nest) into multiple loops (or loop nests) with the + same loop header. The major goal is to separate statements which may + be vectorized from those that can't. This pass implements distribution + in the following steps: misses the goal of being a memory stream optimization, not only a vectorization enabler. distributing a loop can also reduce register pressure. You introduce ldist_alias_id in struct loop (probably in 01/n which I didn't look into yet). If you don't use that please introduce it separately. + /* Be conservative. If data references are not well analyzed, + or the two data references have the same base address and + offset, add dependence and consider it alias to each other. + In other words, the dependence can not be resolved by + runtime alias check. */ + if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2) + || !DR_OFFSET (dr1) || !DR_OFFSET (dr2) + || !DR_INIT (dr1) || !DR_INIT (dr2) + || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1)) + || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2)) + || res == 0) ISTR a helper that computes whether we can handle a runtime alias check for a specific case? + /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */ + if (flag_tree_loop_vectorize) + { so at this point I'd condition the whole runtime-alias check generating on flag_tree_loop_vectorize. You seem to support versioning w/o that here but in other places disable versioning w/o flag_tree_loop_vectorize. That looks somewhat inconsistent... + /* Don't version loop if any partition is builtin. */ + for (i = 0; partitions->iterate (i, &partition); ++i) + { + if (partition->kind != PKIND_NORMAL) + break; + } why's that? Do you handle the case where only a subset of partitions require a runtime alias check to be generated? Thus from a loop generate three loops, only two of them being versioned? The complication would be that both the runtime alias and non-alias versions would be "transformed". Or do we rely on recursively distributing the distribution result, thus if we have partitions that can be handled w/o runtime alias check fuse the remaining partitions and recurse on those? Otherwise the patch looks good to me. Thanks, Richard. > Thanks, > bin >> >> Thanks, >> bin >> >> 2017-06-17 Bin Cheng <bin.ch...@arm.com> >> >> * tree-loop-distribution.c: Add general explanantion on the pass. >> (pg_add_dependence_edges): New parameter. handle alias data >> dependence specially and record it in the parameter if asked. >> (struct pg_vdata, pg_edata, pg_edge_callback_data): New structs. >> (init_partition_graph_vertices, add_partition_graph_edge): New. >> (pg_skip_alias_edge, free_partition_graph_edata_cb): New. >> (free_partition_graph_vdata, build_partition_graph): New. >> (sort_partitions_by_post_order, merge_dep_scc_partitions): New. >> (pg_collect_alias_ddrs, break_alias_scc_partitions): New. >> (data_ref_segment_size, latch_dominated_by_data_ref): New. >> (compute_alias_check_pairs, version_loop_by_alias_check): New. >> (version_for_distribution_p, finalize_partitions): New. >> (distribute_loop): Handle alias data dependence specially. Factor >> out loop fusion code as functions and call these functions. >> >> gcc/testsuite/ChangeLog >> 2017-06-17 Bin Cheng <bin.ch...@arm.com> >> >> * gcc.dg/tree-ssa/ldist-4.c: Adjust test string. >> * gcc.dg/tree-ssa/ldist-12.c: Ditto. >> * gcc.dg/tree-ssa/ldist-13.c: Ditto. >> * gcc.dg/tree-ssa/ldist-14.c: Ditto.