Hi, Robin.

>> Actually no assert necessary, just a comment like:
>> /* As NPATTERNS is always a power of two, we can ... */

Ok.
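To make the reasoning behind that comment concrete: for a power-of-two
NPATTERNS, masking with -NPATTERNS rounds each index down to a multiple
of NPATTERNS, which is exactly what the old vsrl/vmul pair computed.
A minimal standalone C check of that identity (my own sketch, not part
of the patch):

  #include <assert.h>

  int
  main (void)
  {
    /* -NPATTERNS == ~(NPATTERNS - 1) for power-of-two NPATTERNS, so the
       AND clears the low log2 (NPATTERNS) bits, i.e. it performs
       (i >> log2 (NPATTERNS)) * NPATTERNS in a single operation.  */
    for (int npatterns = 1; npatterns <= 64; npatterns <<= 1)
      for (int i = 0; i < 256; i++)
        assert ((i / npatterns) * npatterns == (i & -npatterns));
    return 0;
  }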
>> My immediate idea would have been to fall back to the first
>> approach, i.e. create the "0x00030002..." constant and then
>>
>>   vid.v   v4
>>   vand.vi v4, v4, -4
>>   vadd.vv v4, v4, v6
>>
>> It's one more vector instruction though so possibly worse from a
>> latency standpoint.
>>
>> Rest looks good to me.

I have tried this solution and discussed it with my colleague in depth;
it clearly consumes one more vector insn. One may argue that this patch
needs one more scalar insn (slli), but that is relatively cheaper than a
vector insn, so I prefer this patch's solution.

I have addressed the comments and added the requested comment; I will
send V2.
Thanks.

juzhe.zh...@rivai.ai

From: Robin Dapp
Date: 2023-06-20 15:55
To: Juzhe-Zhong; gcc-patches
CC: rdapp.gcc; kito.cheng; kito.cheng; palmer; palmer; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Optimize codegen of VLA SLP

Hi Juzhe,

> Case 1:
>
> void
> f (uint8_t *restrict a, uint8_t *restrict b)
> {
>   for (int i = 0; i < 100; ++i)
>     {
>       a[i * 8] = b[i * 8 + 37] + 1;
>       a[i * 8 + 1] = b[i * 8 + 37] + 2;
>       a[i * 8 + 2] = b[i * 8 + 37] + 3;
>       a[i * 8 + 3] = b[i * 8 + 37] + 4;
>       a[i * 8 + 4] = b[i * 8 + 37] + 5;
>       a[i * 8 + 5] = b[i * 8 + 37] + 6;
>       a[i * 8 + 6] = b[i * 8 + 37] + 7;
>       a[i * 8 + 7] = b[i * 8 + 37] + 8;
>     }
> }
>
> We need to generate the stepped vector:
> NPATTERNS = 8.
> { 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8 }
>
> Before this patch:
>   vid.v   v4         ;; {0,1,2,3,4,5,6,7,8,...}
>   vsrl.vi v4,v4,3    ;; {0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,...}
>   li      a3,8       ;; {8}
>   vmul.vx v4,v4,a3   ;; {0,0,0,0,0,0,0,0,8,8,8,8,8,8,8,8,...}
>
> After this patch:
>   vid.v   v4         ;; {0,1,2,3,4,5,6,7,8,...}
>   vand.vi v4,v4,-8   ;; (-NPATTERNS) {0,0,0,0,0,0,0,0,8,8,8,8,8,8,8,8,...}

This is a nice improvement. Even though we're in the SLP realm I would
still add an assert that documents that we're indeed operating with
pow2_p (NPATTERNS) and some comment as to why we can use AND. Sure,
we're doing exact_log2 et al. later anyway; this is just to make things
clearer.

> Before this patch:
>   li      a6,134221824
>   slli    a6,a6,5
>   addi    a6,a6,3      ;; 64-bit: 0x0003000200010000
>   vmv.v.x v6,a6        ;; { 3, 2, 1, 0, ... }
>   vid.v   v4           ;; { 0, 1, 2, 3, 4, 5, 6, 7, ... }
>   vsrl.vi v4,v4,2      ;; { 0, 0, 0, 0, 1, 1, 1, 1, ... }
>   li      a3,4         ;; {4}
>   vmul.vx v4,v4,a3     ;; { 0, 0, 0, 0, 4, 4, 4, 4, ... }
>   vadd.vv v4,v4,v6     ;; { 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, ... }
>
> After this patch:
>   li      a3,-536875008
>   slli    a3,a3,4
>   addi    a3,a3,1
>   slli    a3,a3,16
>   vmv.v.x v2,a3        ;; { 3, 1, -1, -3, ... }
>   vid.v   v4           ;; { 0, 1, 2, 3, 4, 5, 6, 7, ... }
>   vadd.vv v4,v4,v2     ;; { 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, ... }

My immediate idea would have been to fall back to the first approach,
i.e. create the "0x00030002..." constant

>   li      a6,134221824
>   slli    a6,a6,5
>   addi    a6,a6,3      ;; 64-bit: 0x0003000200010000
>   vmv.v.x v6,a6        ;; { 3, 2, 1, 0, ... }

and then

  vid.v   v4
  vand.vi v4, v4, -4
  vadd.vv v4, v4, v6

It's one more vector instruction though so possibly worse from a
latency standpoint.

Rest looks good to me.

Regards
 Robin
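As a final sanity check of the case-2 discussion above: the patch's
single-constant lowering (vid.v plus one vadd.vv with the
{ 3, 1, -1, -3, ... } constant) and the vand/vadd fallback compute the
same within-group reversal { 3, 2, 1, 0, 7, 6, 5, 4, ... }. An
element-wise scalar C sketch of that equivalence (the array names are
illustrative only, not from the patch):

  #include <assert.h>

  int
  main (void)
  {
    for (int i = 0; i < 256; i++)
      {
        /* Desired series: reverse each group of four indices.  */
        int want = (i & -4) + 3 - (i % 4);

        /* Patch: vid.v {0,1,2,...} plus the { 3, 1, -1, -3 } constant.  */
        static const int c_patch[4] = { 3, 1, -1, -3 };
        int patch = i + c_patch[i % 4];

        /* Fallback: vand.vi with -4, then vadd.vv with { 3, 2, 1, 0 }.  */
        static const int c_fallback[4] = { 3, 2, 1, 0 };
        int fallback = (i & -4) + c_fallback[i % 4];

        assert (patch == want && fallback == want);
      }
    return 0;
  }

Both reduce to (i & -4) + 3 - (i % 4); the fallback simply spends an
extra vector AND to get there, which is the latency concern raised
above.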