Committed, thanks Jeff. Pan
-----Original Message----- From: Gcc-patches <gcc-patches-bounces+pan2.li=intel....@gcc.gnu.org> On Behalf Of Jeff Law via Gcc-patches Sent: Sunday, June 11, 2023 12:49 AM To: juzhe.zh...@rivai.ai; gcc-patches@gcc.gnu.org Cc: kito.ch...@sifive.com; pal...@rivosinc.com; rdapp....@gmail.com Subject: Re: [PATCH V3] RISC-V: Rework Phase 5 && Phase 6 of VSETVL PASS On 6/9/23 17:11, juzhe.zh...@rivai.ai wrote: > From: Juzhe-Zhong <juzhe.zh...@rivai.ai> > > Address comments from Jeff. > > This patch is to rework Phase 5 && Phase 6 of VSETVL PASS since Phase > 5 && Phase 6 are quite messy and cause some bugs discovered by my > downstream auto-vectorization test-generator. > > Before this patch. > > Phase 5 is cleanup_insns is the function remove AVL operand dependency from > each RVV instruction. > E.g. vadd.vv (use a5), after Phase 5, ====> vadd.vv (use const_int 0). > Since "a5" is used in "vsetvl" instructions and after the correct > "vsetvl" instructions are inserted, each RVV instruction doesn't need AVL > operand "a5" anymore. Then, we remove this operand dependency helps for the > following scheduling PASS. > > Phase 6 is propagate_avl do the following 2 things: > 1. Local && Global user vsetvl instructions optimization. > E.g. > vsetvli a2, a2, e8, mf8 ======> Change it into vsetvli a2, a2, e32, > mf2 > vsetvli zero,a2, e32, mf2 ======> eliminate 2. Optimize user > vsetvl from "vsetvl a2,a2" into "vsetvl zero,a2" if "a2" is not used by any > instructions. > Since from Phase 1 ~ Phase 4 which inserts "vsetvli" instructions base > on LCM which change the CFG, I re-new a new RTL_SSA framework (which is more > expensive than just using DF) for Phase 6 and optmize user vsetvli base on > the new RTL_SSA. > > There are 2 issues in Phase 5 && Phase 6: > 1. local_eliminate_vsetvl_insn was introduced by @kito which can do better > local user vsetvl optimizations better than > Phase 6 do, such approach doesn't need to re-new the RTL_SSA framework. > So the local user vsetvli instructions optimizaiton > in Phase 6 is redundant and should be removed. > 2. A bug discovered by my downstream auto-vectorization test-generator (I > can't put the test in this patch since we are missing autovec > patterns for it so we can't use the upstream GCC directly reproduce such > issue but I will remember put it back after I support the > necessary autovec patterns). Such bug is causing by using RTL_SSA re-new > framework. The issue description is this: > > Before Phase 6: > ... > insn1: vsetlvi a3, 17 <========== generated by SELECT_VL auto-vec pattern. > slli a4,a3,3 > ... > insn2: vsetvli zero, a3, ... > load (use const_int 0, before Phase 5, it's using a3, but the use of "a3" > is removed in Phase 5) > ... > > In Phase 6, we iterate to insn2, then get the def of "a3" which is the insn1. > insn2 is the vsetvli instruction inserted in Phase 4 which is not > included in the RLT_SSA framework even though we renew it (I didn't take a > look at it and I don't think we need to now). > Base on this situation, the def_info of insn2 has the information > "set->single_nondebug_insn_use ()" > which return true. Obviously, this information is not correct, since insn1 > has aleast 2 uses: > 1). slli a4,a3,3 2).insn2: vsetvli zero, a3, ... Then, the test > generated by my downstream test-generator execution test failed. > > Conclusion of RTL_SSA framework: > Before this patch, we initialize RTL_SSA 2 times. One is at the > beginning of the VSETVL PASS which is absolutely correct, the other is re-new > after Phase 4 (LCM) has incorrect information that causes bugs. > > Besides, we don't like to initialize RTL_SSA second time it seems to be a > waste since we just need to do a little optimization. > > Base on all circumstances I described above, I rework and reorganize Phase 5 > && Phase 6 as follows: > 1. Phase 5 is called ssa_post_optimization which is doing the optimization > base on the RTL_SSA information (The RTL_SSA is initialized > at the beginning of the VSETVL PASS, no need to re-new it again). This > phase includes 3 optimizaitons: > 1). local_eliminate_vsetvl_insn we already have (no change). > 2). global_eliminate_vsetvl_insn ---> new optimizaiton splitted from > orignal Phase 6 but with more powerful and reliable implementation. > E.g. > void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) { > size_t avl; > if (m > 100) > avl = __riscv_vsetvl_e16mf4(vl << 4); > else > avl = __riscv_vsetvl_e32mf2(vl >> 8); > for (size_t i = 0; i < m; i++) { > vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl); > v0 = __riscv_vadd_vv_i8mf8 (v0, v0, avl); > __riscv_vse8_v_i8mf8(out + i, v0, avl); > } > } > > This example failed to global user vsetvl optimize before this patch: > f: > li a5,100 > bleu a3,a5,.L2 > slli a2,a2,4 > vsetvli a4,a2,e16,mf4,ta,mu > .L3: > li a5,0 > vsetvli zero,a4,e8,mf8,ta,ma > .L5: > add a6,a0,a5 > add a2,a1,a5 > vle8.v v1,0(a6) > addi a5,a5,1 > vadd.vv v1,v1,v1 > vse8.v v1,0(a2) > bgtu a3,a5,.L5 > .L10: > ret > .L2: > beq a3,zero,.L10 > srli a2,a2,8 > vsetvli a4,a2,e32,mf2,ta,mu > j .L3 > With this patch: > f: > li a5,100 > bleu a3,a5,.L2 > slli a2,a2,4 > vsetvli zero,a2,e8,mf8,ta,ma > .L3: > li a5,0 > .L5: > add a6,a0,a5 > add a2,a1,a5 > vle8.v v1,0(a6) > addi a5,a5,1 > vadd.vv v1,v1,v1 > vse8.v v1,0(a2) > bgtu a3,a5,.L5 > .L10: > ret > .L2: > beq a3,zero,.L10 > srli a2,a2,8 > vsetvli zero,a2,e8,mf8,ta,ma > j .L3 > > 3). Remove AVL operand dependency of each RVV instructions. > > 2. Phase 6 is called df_post_optimization: Optimize "vsetvl a3,a2...." into > Optimize "vsetvl zero,a2...." base on > dataflow analysis of new CFG (new CFG is created by LCM). The reason we > need to do use new CFG and after Phase 5: > ... > vsetvl a3, a2... > vadd.vv (use a3) > If we don't have Phase 5 which removes the "a3" use in vadd.vv, we will > fail to optimize vsetvl a3,a2 into vsetvl zero,a2. > > This patch passed all tests in rvv.exp with ONLY peformance && codegen > improved (no performance decline and no bugs including my > downstream tests). > > gcc/ChangeLog: > > * config/riscv/riscv-vsetvl.cc (available_occurrence_p): Enhance > user vsetvl optimization. > (vector_insn_info::parse_insn): Add rtx_insn parse. > (pass_vsetvl::local_eliminate_vsetvl_insn): Enhance user vsetvl > optimization. > (get_first_vsetvl): New function. > (pass_vsetvl::global_eliminate_vsetvl_insn): Ditto. > (pass_vsetvl::cleanup_insns): Remove it. > (pass_vsetvl::ssa_post_optimization): New function. > (has_no_uses): Ditto. > (pass_vsetvl::propagate_avl): Remove it. > (pass_vsetvl::df_post_optimization): New function. > (pass_vsetvl::lazy_vsetvl): Rework Phase 5 && Phase 6. > * config/riscv/riscv-vsetvl.h: Adapt declaration. > > gcc/testsuite/ChangeLog: > > * gcc.target/riscv/rvv/vsetvl/vsetvl-16.c: Adapt test. > * gcc.target/riscv/rvv/vsetvl/vsetvl-2.c: Ditto. > * gcc.target/riscv/rvv/vsetvl/vsetvl-3.c: Ditto. > * gcc.target/riscv/rvv/vsetvl/vsetvl-21.c: New test. > * gcc.target/riscv/rvv/vsetvl/vsetvl-22.c: New test. > * gcc.target/riscv/rvv/vsetvl/vsetvl-23.c: New test. OK jeff