Hi, Robin.

>> Actually no assert necessary, just a comment like:
>> /* As NPATTERNS is always a power of two, we can ... */

Ok.
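To make the reasoning behind that comment concrete: for a power-of-two
NPATTERNS, masking with -NPATTERNS rounds each index down to a multiple
of NPATTERNS, which is exactly what the old vsrl/vmul pair computed.
A minimal standalone C check of that identity (my own sketch, not part
of the patch):

  #include <assert.h>

  int
  main (void)
  {
    /* -NPATTERNS == ~(NPATTERNS - 1) for power-of-two NPATTERNS, so the
       AND clears the low log2 (NPATTERNS) bits, i.e. it performs
       (i >> log2 (NPATTERNS)) * NPATTERNS in a single operation.  */
    for (int npatterns = 1; npatterns <= 64; npatterns <<= 1)
      for (int i = 0; i < 256; i++)
        assert ((i / npatterns) * npatterns == (i & -npatterns));
    return 0;
  }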
>> My immediate idea would have been to fall back to the first
>> approach, i.e. create the "0x00030002..." constant and then
>>
>>   vid.v   v4
>>   vand.vi v4, v4, -4
>>   vadd.vv v4, v4, v6
>>
>> It's one more vector instruction though so possibly worse from a
>> latency standpoint.
>>
>> Rest looks good to me.

I have tried this solution and discussed it with my colleague in depth;
it clearly consumes one more vector insn. One may argue that this patch
needs one more scalar insn (slli), but that is relatively cheaper than a
vector insn, so I prefer this patch's solution.

I have addressed the comments and added the requested comment; I will
send V2.
Thanks.

juzhe.zh...@rivai.ai

From: Robin Dapp
Date: 2023-06-20 15:55
To: Juzhe-Zhong; gcc-patches
CC: rdapp.gcc; kito.cheng; kito.cheng; palmer; palmer; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Optimize codegen of VLA SLP

Hi Juzhe,

> Case 1:
>
> void
> f (uint8_t *restrict a, uint8_t *restrict b)
> {
>   for (int i = 0; i < 100; ++i)
>     {
>       a[i * 8] = b[i * 8 + 37] + 1;
>       a[i * 8 + 1] = b[i * 8 + 37] + 2;
>       a[i * 8 + 2] = b[i * 8 + 37] + 3;
>       a[i * 8 + 3] = b[i * 8 + 37] + 4;
>       a[i * 8 + 4] = b[i * 8 + 37] + 5;
>       a[i * 8 + 5] = b[i * 8 + 37] + 6;
>       a[i * 8 + 6] = b[i * 8 + 37] + 7;
>       a[i * 8 + 7] = b[i * 8 + 37] + 8;
>     }
> }
>
> We need to generate the stepped vector:
> NPATTERNS = 8.
> { 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8 }
>
> Before this patch:
>   vid.v   v4         ;; {0,1,2,3,4,5,6,7,8,...}
>   vsrl.vi v4,v4,3    ;; {0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,...}
>   li      a3,8       ;; {8}
>   vmul.vx v4,v4,a3   ;; {0,0,0,0,0,0,0,0,8,8,8,8,8,8,8,8,...}
>
> After this patch:
>   vid.v   v4         ;; {0,1,2,3,4,5,6,7,8,...}
>   vand.vi v4,v4,-8   ;; (-NPATTERNS) {0,0,0,0,0,0,0,0,8,8,8,8,8,8,8,8,...}

This is a nice improvement. Even though we're in the SLP realm I would
still add an assert that documents that we're indeed operating with
pow2_p (NPATTERNS) and some comment as to why we can use AND. Sure,
we're doing exact_log2 et al. later anyway; this is just to make things
clearer.

> Before this patch:
>   li      a6,134221824
>   slli    a6,a6,5
>   addi    a6,a6,3      ;; 64-bit: 0x0003000200010000
>   vmv.v.x v6,a6        ;; { 3, 2, 1, 0, ... }
>   vid.v   v4           ;; { 0, 1, 2, 3, 4, 5, 6, 7, ... }
>   vsrl.vi v4,v4,2      ;; { 0, 0, 0, 0, 1, 1, 1, 1, ... }
>   li      a3,4         ;; {4}
>   vmul.vx v4,v4,a3     ;; { 0, 0, 0, 0, 4, 4, 4, 4, ... }
>   vadd.vv v4,v4,v6     ;; { 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, ... }
>
> After this patch:
>   li      a3,-536875008
>   slli    a3,a3,4
>   addi    a3,a3,1
>   slli    a3,a3,16
>   vmv.v.x v2,a3        ;; { 3, 1, -1, -3, ... }
>   vid.v   v4           ;; { 0, 1, 2, 3, 4, 5, 6, 7, ... }
>   vadd.vv v4,v4,v2     ;; { 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, ... }

My immediate idea would have been to fall back to the first approach,
i.e. create the "0x00030002..." constant

>   li      a6,134221824
>   slli    a6,a6,5
>   addi    a6,a6,3      ;; 64-bit: 0x0003000200010000
>   vmv.v.x v6,a6        ;; { 3, 2, 1, 0, ... }

and then

  vid.v   v4
  vand.vi v4, v4, -4
  vadd.vv v4, v4, v6

It's one more vector instruction though so possibly worse from a
latency standpoint.

Rest looks good to me.

Regards
 Robin
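As a final sanity check of the case-2 discussion above: the patch's
single-constant lowering (vid.v plus one vadd.vv with the
{ 3, 1, -1, -3, ... } constant) and the vand/vadd fallback compute the
same within-group reversal { 3, 2, 1, 0, 7, 6, 5, 4, ... }. An
element-wise scalar C sketch of that equivalence (the array names are
illustrative only, not from the patch):

  #include <assert.h>

  int
  main (void)
  {
    for (int i = 0; i < 256; i++)
      {
        /* Desired series: reverse each group of four indices.  */
        int want = (i & -4) + 3 - (i % 4);

        /* Patch: vid.v {0,1,2,...} plus the { 3, 1, -1, -3 } constant.  */
        static const int c_patch[4] = { 3, 1, -1, -3 };
        int patch = i + c_patch[i % 4];

        /* Fallback: vand.vi with -4, then vadd.vv with { 3, 2, 1, 0 }.  */
        static const int c_fallback[4] = { 3, 2, 1, 0 };
        int fallback = (i & -4) + c_fallback[i % 4];

        assert (patch == want && fallback == want);
      }
    return 0;
  }

Both reduce to (i & -4) + 3 - (i % 4); the fallback simply spends an
extra vector AND to get there, which is the latency concern raised
above.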