Committed, thanks Jeff.

Pan

-----Original Message-----
From: Gcc-patches <gcc-patches-bounces+pan2.li=intel....@gcc.gnu.org> On Behalf 
Of Jeff Law via Gcc-patches
Sent: Sunday, June 11, 2023 12:49 AM
To: juzhe.zh...@rivai.ai; gcc-patches@gcc.gnu.org
Cc: kito.ch...@sifive.com; pal...@rivosinc.com; rdapp....@gmail.com
Subject: Re: [PATCH V3] RISC-V: Rework Phase 5 && Phase 6 of VSETVL PASS



On 6/9/23 17:11, juzhe.zh...@rivai.ai wrote:
> From: Juzhe-Zhong <juzhe.zh...@rivai.ai>
> 
> Address comments from Jeff.
> 
> This patch is to rework Phase 5 && Phase 6 of VSETVL PASS since Phase 
> 5 && Phase 6 are quite messy and cause some bugs discovered by my 
> downstream auto-vectorization test-generator.
> 
> Before this patch.
> 
> Phase 5 is cleanup_insns is the function remove AVL operand dependency from 
> each RVV instruction.
> E.g. vadd.vv (use a5), after Phase 5, ====> vadd.vv (use const_int 0). 
> Since "a5" is used in "vsetvl" instructions and after the correct 
> "vsetvl" instructions are inserted, each RVV instruction doesn't need AVL 
> operand "a5" anymore. Then, we remove this operand dependency helps for the 
> following scheduling PASS.
> 
> Phase 6 is propagate_avl do the following 2 things:
> 1. Local && Global user vsetvl instructions optimization.
>     E.g.
>        vsetvli a2, a2, e8, mf8   ======> Change it into vsetvli a2, a2, e32, 
> mf2
>        vsetvli zero,a2, e32, mf2  ======> eliminate 2. Optimize user 
> vsetvl from "vsetvl a2,a2" into "vsetvl zero,a2" if "a2" is not used by any 
> instructions.
> Since from Phase 1 ~ Phase 4 which inserts "vsetvli" instructions base 
> on LCM which change the CFG, I re-new a new RTL_SSA framework (which is more 
> expensive than just using DF) for Phase 6 and optmize user vsetvli base on 
> the new RTL_SSA.
> 
> There are 2 issues in Phase 5 && Phase 6:
> 1. local_eliminate_vsetvl_insn was introduced by @kito which can do better 
> local user vsetvl optimizations better than
>     Phase 6 do, such approach doesn't need to re-new the RTL_SSA framework. 
> So the local user vsetvli instructions optimizaiton
>     in Phase 6 is redundant and should be removed.
> 2. A bug discovered by my downstream auto-vectorization test-generator (I 
> can't put the test in this patch since we are missing autovec
>     patterns for it so we can't use the upstream GCC directly reproduce such 
> issue but I will remember put it back after I support the
>     necessary autovec patterns). Such bug is causing by using RTL_SSA re-new 
> framework. The issue description is this:
>     
> Before Phase 6:
>     ...
>     insn1: vsetlvi a3, 17 <========== generated by SELECT_VL auto-vec pattern.
>     slli a4,a3,3
>     ...
>     insn2: vsetvli zero, a3, ...
>     load (use const_int 0, before Phase 5, it's using a3, but the use of "a3" 
> is removed in Phase 5)
>     ...
> 
> In Phase 6, we iterate to insn2, then get the def of "a3" which is the insn1.
> insn2 is the vsetvli instruction inserted in Phase 4 which is not 
> included in the RLT_SSA framework even though we renew it (I didn't take a 
> look at it and I don't think we need to now).
> Base on this situation, the def_info of insn2 has the information 
> "set->single_nondebug_insn_use ()"
> which return true. Obviously, this information is not correct, since insn1 
> has aleast 2 uses:
> 1). slli a4,a3,3 2).insn2: vsetvli zero, a3, ... Then, the test 
> generated by my downstream test-generator execution test failed.
> 
> Conclusion of RTL_SSA framework:
> Before this patch, we initialize RTL_SSA 2 times. One is at the 
> beginning of the VSETVL PASS which is absolutely correct, the other is re-new 
> after Phase 4 (LCM) has incorrect information that causes bugs.
> 
> Besides, we don't like to initialize RTL_SSA second time it seems to be a 
> waste since we just need to do a little optimization.
> 
> Base on all circumstances I described above, I rework and reorganize Phase 5 
> && Phase 6 as follows:
> 1. Phase 5 is called ssa_post_optimization which is doing the optimization 
> base on the RTL_SSA information (The RTL_SSA is initialized
>     at the beginning of the VSETVL PASS, no need to re-new it again). This 
> phase includes 3 optimizaitons:
>     1). local_eliminate_vsetvl_insn we already have (no change).
>     2). global_eliminate_vsetvl_insn ---> new optimizaiton splitted from 
> orignal Phase 6 but with more powerful and reliable implementation.
>        E.g.
>        void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) {
>          size_t avl;
>          if (m > 100)
>            avl = __riscv_vsetvl_e16mf4(vl << 4);
>          else
>            avl = __riscv_vsetvl_e32mf2(vl >> 8);
>          for (size_t i = 0; i < m; i++) {
>            vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl);
>            v0 = __riscv_vadd_vv_i8mf8 (v0, v0, avl);
>            __riscv_vse8_v_i8mf8(out + i, v0, avl);
>          }
>        }
> 
>        This example failed to global user vsetvl optimize before this patch:
>        f:
>                li      a5,100
>                bleu    a3,a5,.L2
>                slli    a2,a2,4
>                vsetvli a4,a2,e16,mf4,ta,mu
>        .L3:
>                li      a5,0
>                vsetvli zero,a4,e8,mf8,ta,ma
>        .L5:
>                add     a6,a0,a5
>                add     a2,a1,a5
>                vle8.v  v1,0(a6)
>                addi    a5,a5,1
>                vadd.vv v1,v1,v1
>                vse8.v  v1,0(a2)
>                bgtu    a3,a5,.L5
>        .L10:
>                ret
>        .L2:
>                beq     a3,zero,.L10
>                srli    a2,a2,8
>                vsetvli a4,a2,e32,mf2,ta,mu
>                j       .L3
>        With this patch:
>        f:
>                li      a5,100
>                bleu    a3,a5,.L2
>                slli    a2,a2,4
>                vsetvli zero,a2,e8,mf8,ta,ma
>        .L3:
>                li      a5,0
>        .L5:
>                add     a6,a0,a5
>                add     a2,a1,a5
>                vle8.v  v1,0(a6)
>                addi    a5,a5,1
>                vadd.vv v1,v1,v1
>                vse8.v  v1,0(a2)
>                bgtu    a3,a5,.L5
>        .L10:
>                ret
>        .L2:
>                beq     a3,zero,.L10
>                srli    a2,a2,8
>                vsetvli zero,a2,e8,mf8,ta,ma
>                j       .L3
> 
>     3). Remove AVL operand dependency of each RVV instructions.
> 
> 2. Phase 6 is called df_post_optimization: Optimize "vsetvl a3,a2...." into 
> Optimize "vsetvl zero,a2...." base on
>     dataflow analysis of new CFG (new CFG is created by LCM). The reason we 
> need to do use new CFG and after Phase 5:
>     ...
>     vsetvl a3, a2...
>     vadd.vv (use a3)
>     If we don't have Phase 5 which removes the "a3" use in vadd.vv, we will 
> fail to optimize vsetvl a3,a2 into vsetvl zero,a2.
>     
>     This patch passed all tests in rvv.exp with ONLY peformance && codegen 
> improved (no performance decline and no bugs including my
>     downstream tests).
> 
> gcc/ChangeLog:
> 
>          * config/riscv/riscv-vsetvl.cc (available_occurrence_p): Enhance 
> user vsetvl optimization.
>          (vector_insn_info::parse_insn): Add rtx_insn parse.
>          (pass_vsetvl::local_eliminate_vsetvl_insn): Enhance user vsetvl 
> optimization.
>          (get_first_vsetvl): New function.
>          (pass_vsetvl::global_eliminate_vsetvl_insn): Ditto.
>          (pass_vsetvl::cleanup_insns): Remove it.
>          (pass_vsetvl::ssa_post_optimization): New function.
>          (has_no_uses): Ditto.
>          (pass_vsetvl::propagate_avl): Remove it.
>          (pass_vsetvl::df_post_optimization): New function.
>          (pass_vsetvl::lazy_vsetvl): Rework Phase 5 && Phase 6.
>          * config/riscv/riscv-vsetvl.h: Adapt declaration.
> 
> gcc/testsuite/ChangeLog:
> 
>          * gcc.target/riscv/rvv/vsetvl/vsetvl-16.c: Adapt test.
>          * gcc.target/riscv/rvv/vsetvl/vsetvl-2.c: Ditto.
>          * gcc.target/riscv/rvv/vsetvl/vsetvl-3.c: Ditto.
>          * gcc.target/riscv/rvv/vsetvl/vsetvl-21.c: New test.
>          * gcc.target/riscv/rvv/vsetvl/vsetvl-22.c: New test.
>          * gcc.target/riscv/rvv/vsetvl/vsetvl-23.c: New test.
OK
jeff

Reply via email to