On Mon, Jun 5, 2017 at 5:21 AM, Kugan Vivekanandarajah
<kugan.vivekanandara...@linaro.org> wrote:
> Hi Bin,
>
> Thanks for posting the patch. I haven't looked in detail yet but have
> couple of quick questions.
>
> 1. Shouldn’t the run time alias check for versioning happen only when
> vectorisation is enabled? You seems to be using the
> IFN_LOOP_DIST_ALIAS when vectoring but seems to be versioning
> otherwise too.
It's to keep behavior consistent when user explicitly specify the
option at for example O2 optimization level.  It shouldn't matter much
since in most cases vectorizer is enabled along with distribution.
One case it could be useful is in test cases, i.e, we can test
distribution without depending on vectorization.

>
> 2. As I understand, loop distribution will be mostly beneficial when
> enabling transformations succeeds.  Like resulting loop being
> vectorized.  I am wondering if we should only version when the loop’s
> control flow is simple that it can be vectorizable. I know that you
> are adding the internal function for this to some extend but for some
> cases we should be able to say while in loop distribution itself that
> the control flow will not result in loop being vectorized.
So far distribution is mainly for vectorization, though it could be
beneficial to other parallelization optimization.  OTOH, cost model on
control flow graph only may not be enough, for example, we may still
want to distribute case consisting of a vectorizable partition and a
complicated CFG partition.  One possible change is we may want to
restrict the transformation to case with limited number partitions,
which is more or less for compilation time concern.  That being said,
we don't have a cost model now.

>
> Btw, did you run Spec2006 with this? Any notable changes ?
Hmmer.

Thanks,
bin
>
> Thanks,
> Kugan
>
>
>
>
> On 2 June 2017 at 21:51, Bin Cheng <bin.ch...@arm.com> wrote:
>> Hi,
>> This is the main patch of the change.  It improves loop distribution by 
>> versioning loop under
>> runtime alias check conditions, as well as better partition fusion.  As 
>> described in comments,
>> the patch basically implements distribution in the following steps:
>>
>>      1) Seed partitions with specific type statements.  For now we support
>>         two types seed statements: statement defining variable used outside
>>         of loop; statement storing to memory.
>>      2) Build reduced dependence graph (RDG) for loop to be distributed.
>>         The vertices (RDG:V) model all statements in the loop and the edges
>>         (RDG:E) model flow and control dependences between statements.
>>      3) Apart from RDG, compute data dependences between memory references.
>>      4) Starting from seed statement, build up partition by adding depended
>>         statements according to RDG's dependence information.  Partition is
>>         classified as parallel type if it can be executed parallelly; or as
>>         sequential type if it can't.  Parallel type partition is further
>>         classified as different builtin kinds if it can be implemented as
>>         builtin function calls.
>>      5) Build partition dependence graph (PG) based on data dependences.
>>         The vertices (PG:V) model all partitions and the edges (PG:E) model
>>         all data dependences between every partitions pair.  In general,
>>         data dependence is either compilation time known or unknown.  In C
>>         family languages, there exists quite amount compilation time unknown
>>         dependences because of possible alias relation of data references.
>>         We categorize PG's edge to two types: "true" edge that represents
>>         compilation time known data dependences; "alias" edge for all other
>>         data dependences.
>>      6) Traverse subgraph of PG as if all "alias" edges don't exist.  Merge
>>         partitions in each strong connected commponent (SCC) correspondingly.
>>         Build new PG for merged partitions.
>>      7) Traverse PG again and this time with both "true" and "alias" edges
>>         included.  We try to break SCCs by removing some edges.  Because
>>         SCCs by "true" edges are all fused in step 6), we can break SCCs
>>         by removing some "alias" edges.  It's NP-hard to choose optimal
>>         edge set, fortunately simple approximation is good enough for us
>>         given the small problem scale.
>>      8) Collect all data dependences of the removed "alias" edges.  Create
>>         runtime alias checks for collected data dependences.
>>      9) Version loop under the condition of runtime alias checks.  Given
>>         loop distribution generally introduces additional overhead, it is
>>         only useful if vectorization is achieved in distributed loop.  We
>>         version loop with internal function call IFN_LOOP_DIST_ALIAS.  If
>>         no distributed loop can be vectorized, we simply remove distributed
>>         loops and recover to the original one.
>>
>> Also, there are some more to improve in the future (which shouldn't be 
>> difficult):
>>    TODO:
>>      1) We only distribute innermost loops now.  This pass should handle loop
>>         nests in the future.
>>      2) We only fuse partitions in SCC now.  A better fusion algorithm is
>>         desired to minimize loop overhead, maximize parallelism and maximize
>>
>> This patch also fixes couple of latent bugs in the original implementation.
>>
>> After this change, kernel loop in hmmer can be distributed and vectorized as 
>> a result.
>> This gives obvious performance improvement.  There is still inefficient code 
>> generation
>> issue which I will try to fix in loop split.  Apart from this, the next 
>> opportunity in hmmer
>> is to eliminate number of dead stores under proper alias information.
>> Bootstrap and test at O2/O3 on x86_64 and AArch64.  is it OK?
>>
>> Thanks,
>> bin
>> 2017-05-31  Bin Cheng  <bin.ch...@arm.com>
>>
>>         * cfgloop.h (struct loop): New field ldist_alias_id.
>>         * cfgloopmanip.c (lv_adjust_loop_entry_edge): Refine comment for
>>         new internal function.
>>         * internal-fn.c (expand_LOOP_DIST_ALIAS): New function.
>>         * internal-fn.def (IFN_LOOP_DIST_ALIAS): New internal function.
>>         * tree-loop-distribution.c: Add general explanantion on the pass.
>>         Include header file.
>>         (struct ddr_entry, struct ddr_entry_hasher): New structs.
>>         (ddr_entry_hasher::hash, ddr_entry_hasher::equal): New functions.
>>         (bb_top_order_index, bb_top_order_index_size): New static vars.
>>         (bb_top_order_cmp): New function.
>>         (stmts_from_loop): Get basic blocks in topological order.  Don't
>>         free data references.
>>         (build_rdg): New parameter pointing to vector of data references.
>>         Store data references in it.
>>         (enum partition_type): New enum.
>>         (enum partition_kind, struct partition): Add comments.  New fields.
>>         (partition_alloc, partition_free): Handle new fields of partition.
>>         (enum fuse_type, fuse_message): New enum and varaible.
>>         (partition_merge_into): New parameter.  Update implementation wrto
>>         new fields of partition.
>>         (generate_loops_for_partition): Set ldist_alias_id information.
>>         (record_ddr, get_ddr, possible_data_dep_cycle_p): New functions.
>>         (build_rdg_partition_for_vertex): New parameter.  Compute type info
>>         for partition.
>>         (classify_partition): New parameter.  Only classify partition as
>>         reduction if the reduction is not included by all partitions.
>>         Retrieve cached ddr, rather than compute it on the fly.
>>         (ref_base_address): Delete.
>>         (similar_memory_accesses): Rename to ...
>>         (share_memory_accesses): ... this.
>>         (rdg_build_partitions): New parameter and update uses.
>>         (pg_add_dependence_edges): New parameter.  Store ddr in parameter
>>         vector if it could be resolved by runtime alias check.
>>         (rdg_compute_data_dependence): New function.
>>         (struct pg_vdata, pg_edata, pg_edge_callback_data): New structs.
>>         (init_partition_graph_vertices, add_partition_graph_edge): New.
>>         (pg_skip_alias_edge, free_partition_graph_edata_cb): New.
>>         (free_partition_graph_vdata, build_partition_graph): New.
>>         (sort_partitions_by_post_order, merge_dep_scc_partitions): New.
>>         (pg_collect_alias_ddrs, break_alias_scc_partitions): New.
>>         (data_ref_segment_size, latch_dominated_by_data_ref): New.
>>         (compute_alias_check_pairs, version_loop_by_alias_check): New.
>>         (version_for_distribution_p, finalize_partitions): New.
>>         (distribute_loop): Update uses.  Break SCC and version loop
>>         with runtime alias checks.
>>         (pass_loop_distribution::execute): Compute topological order for
>>         basic blocks.  Update uses.
>>         * tree-vectorizer.c (vect_loop_dist_alias_call): New.
>>         (fold_loop_dist_alias_call): New.
>>         (vectorize_loops): Fold IFN_LOOP_DIST_ALIAS call depending on
>>         successful vectorization or not.
>>
>> gcc/testsuite/ChangeLog
>> 2017-05-31  Bin Cheng  <bin.ch...@arm.com>
>>
>>         * gcc.dg/tree-ssa/ldist-4.c: Adjust test string.
>>         * gcc.dg/tree-ssa/ldist-12.c: Ditto.
>>         * gcc.dg/tree-ssa/ldist-13.c: Ditto.
>>         * gcc.dg/tree-ssa/ldist-14.c: Ditto.

Reply via email to