On Wed, Jun 7, 2017 at 11:03 AM, Richard Biener <richard.guent...@gmail.com> wrote: > On Fri, Jun 2, 2017 at 1:51 PM, Bin Cheng <bin.ch...@arm.com> wrote: >> Hi, >> This is the main patch of the change. It improves loop distribution by >> versioning loop under >> runtime alias check conditions, as well as better partition fusion. As >> described in comments, >> the patch basically implements distribution in the following steps: >> >> 1) Seed partitions with specific type statements. For now we support >> two types seed statements: statement defining variable used outside >> of loop; statement storing to memory. >> 2) Build reduced dependence graph (RDG) for loop to be distributed. >> The vertices (RDG:V) model all statements in the loop and the edges >> (RDG:E) model flow and control dependences between statements. >> 3) Apart from RDG, compute data dependences between memory references. >> 4) Starting from seed statement, build up partition by adding depended >> statements according to RDG's dependence information. Partition is >> classified as parallel type if it can be executed parallelly; or as >> sequential type if it can't. Parallel type partition is further >> classified as different builtin kinds if it can be implemented as >> builtin function calls. >> 5) Build partition dependence graph (PG) based on data dependences. >> The vertices (PG:V) model all partitions and the edges (PG:E) model >> all data dependences between every partitions pair. In general, >> data dependence is either compilation time known or unknown. In C >> family languages, there exists quite amount compilation time unknown >> dependences because of possible alias relation of data references. >> We categorize PG's edge to two types: "true" edge that represents >> compilation time known data dependences; "alias" edge for all other >> data dependences. >> 6) Traverse subgraph of PG as if all "alias" edges don't exist. Merge >> partitions in each strong connected commponent (SCC) correspondingly. >> Build new PG for merged partitions. >> 7) Traverse PG again and this time with both "true" and "alias" edges >> included. We try to break SCCs by removing some edges. Because >> SCCs by "true" edges are all fused in step 6), we can break SCCs >> by removing some "alias" edges. It's NP-hard to choose optimal >> edge set, fortunately simple approximation is good enough for us >> given the small problem scale. >> 8) Collect all data dependences of the removed "alias" edges. Create >> runtime alias checks for collected data dependences. >> 9) Version loop under the condition of runtime alias checks. Given >> loop distribution generally introduces additional overhead, it is >> only useful if vectorization is achieved in distributed loop. We >> version loop with internal function call IFN_LOOP_DIST_ALIAS. If >> no distributed loop can be vectorized, we simply remove distributed >> loops and recover to the original one. >> >> Also, there are some more to improve in the future (which shouldn't be >> difficult): >> TODO: >> 1) We only distribute innermost loops now. This pass should handle loop >> nests in the future. >> 2) We only fuse partitions in SCC now. A better fusion algorithm is >> desired to minimize loop overhead, maximize parallelism and maximize >> >> This patch also fixes couple of latent bugs in the original implementation. > > It would be nice to split this patch up, for example fixing the latent > bugs first > (so we can backport that part). I only remember the major one in which topological order is needed when sorting statements, dominance order is not enough. I will try to split it as much as possible, but the change falls in a big chunk to some extend.
> > You now compute _all_ dependences in the loop while I originally designed loop > distribution to do less dependence computation by only computing dependences > between datarefs in different partitions (and delaying that until > after partition fusing > required by things not requiring dependence info). > Please do not remove this optimization. Right, I did that and cache it in hash table because we need to query dependence for two references multiple times. I will restore the on-demand computing behavior. Thanks, bin > > Richard. > >> After this change, kernel loop in hmmer can be distributed and vectorized as >> a result. >> This gives obvious performance improvement. There is still inefficient code >> generation >> issue which I will try to fix in loop split. Apart from this, the next >> opportunity in hmmer >> is to eliminate number of dead stores under proper alias information. >> Bootstrap and test at O2/O3 on x86_64 and AArch64. is it OK? >> >> Thanks, >> bin >> 2017-05-31 Bin Cheng <bin.ch...@arm.com> >> >> * cfgloop.h (struct loop): New field ldist_alias_id. >> * cfgloopmanip.c (lv_adjust_loop_entry_edge): Refine comment for >> new internal function. >> * internal-fn.c (expand_LOOP_DIST_ALIAS): New function. >> * internal-fn.def (IFN_LOOP_DIST_ALIAS): New internal function. >> * tree-loop-distribution.c: Add general explanantion on the pass. >> Include header file. >> (struct ddr_entry, struct ddr_entry_hasher): New structs. >> (ddr_entry_hasher::hash, ddr_entry_hasher::equal): New functions. >> (bb_top_order_index, bb_top_order_index_size): New static vars. >> (bb_top_order_cmp): New function. >> (stmts_from_loop): Get basic blocks in topological order. Don't >> free data references. >> (build_rdg): New parameter pointing to vector of data references. >> Store data references in it. >> (enum partition_type): New enum. >> (enum partition_kind, struct partition): Add comments. New fields. >> (partition_alloc, partition_free): Handle new fields of partition. >> (enum fuse_type, fuse_message): New enum and varaible. >> (partition_merge_into): New parameter. Update implementation wrto >> new fields of partition. >> (generate_loops_for_partition): Set ldist_alias_id information. >> (record_ddr, get_ddr, possible_data_dep_cycle_p): New functions. >> (build_rdg_partition_for_vertex): New parameter. Compute type info >> for partition. >> (classify_partition): New parameter. Only classify partition as >> reduction if the reduction is not included by all partitions. >> Retrieve cached ddr, rather than compute it on the fly. >> (ref_base_address): Delete. >> (similar_memory_accesses): Rename to ... >> (share_memory_accesses): ... this. >> (rdg_build_partitions): New parameter and update uses. >> (pg_add_dependence_edges): New parameter. Store ddr in parameter >> vector if it could be resolved by runtime alias check. >> (rdg_compute_data_dependence): New function. >> (struct pg_vdata, pg_edata, pg_edge_callback_data): New structs. >> (init_partition_graph_vertices, add_partition_graph_edge): New. >> (pg_skip_alias_edge, free_partition_graph_edata_cb): New. >> (free_partition_graph_vdata, build_partition_graph): New. >> (sort_partitions_by_post_order, merge_dep_scc_partitions): New. >> (pg_collect_alias_ddrs, break_alias_scc_partitions): New. >> (data_ref_segment_size, latch_dominated_by_data_ref): New. >> (compute_alias_check_pairs, version_loop_by_alias_check): New. >> (version_for_distribution_p, finalize_partitions): New. >> (distribute_loop): Update uses. Break SCC and version loop >> with runtime alias checks. >> (pass_loop_distribution::execute): Compute topological order for >> basic blocks. Update uses. >> * tree-vectorizer.c (vect_loop_dist_alias_call): New. >> (fold_loop_dist_alias_call): New. >> (vectorize_loops): Fold IFN_LOOP_DIST_ALIAS call depending on >> successful vectorization or not. >> >> gcc/testsuite/ChangeLog >> 2017-05-31 Bin Cheng <bin.ch...@arm.com> >> >> * gcc.dg/tree-ssa/ldist-4.c: Adjust test string. >> * gcc.dg/tree-ssa/ldist-12.c: Ditto. >> * gcc.dg/tree-ssa/ldist-13.c: Ditto. >> * gcc.dg/tree-ssa/ldist-14.c: Ditto.