This point is selected not by LCM but by Phase 3 (backward fusion and
propagation of VL/VTYPE demand info), which I introduced into the VSETVL pass
to enhance LCM and improve the performance of vsetvl instructions.
This patch suppresses Phase 3's overly aggressive backward fusion and
propagation to the top of the function when the AVL has no defining
instruction (i.e. the AVL is an immediate in the range 0 ~ 31, since the
vsetivli instruction takes an immediate value instead of a register).
You may ask why we need Phase 3 to do the job.
There are many situations that pure LCM fails to optimize. Here is a simple
case to demonstrate it:
void f (void * restrict in, void * restrict out, int n, int m, int cond)
{
  size_t vl = 101;
  for (size_t j = 0; j < m; j++){
    if (cond) {
      for (size_t i = 0; i < n; i++)
        {
          vint8mf8_t v = __riscv_vle8_v_i8mf8 (in + i + j, vl);
          __riscv_vse8_v_i8mf8 (out + i, v, vl);
        }
    } else {
      for (size_t i = 0; i < n; i++)
        {
          vint32mf2_t v = __riscv_vle32_v_i32mf2 (in + i + j, vl);
          v = __riscv_vadd_vv_i32mf2 (v,v,vl);
          __riscv_vse32_v_i32mf2 (out + i, v, vl);
        }
    }
  }
}
You can see:
The first inner loop needs vsetvli e8 mf8 for vle+vse.
The second inner loop needs vsetvli e32 mf2 for vle+vadd+vse.
Without Phase 3 (i.e. handled only by LCM, Phase 4), we end up with:
outerloop:
...
vsetvli e8mf8
inner loop 1:
....
vsetvli e32mf2
inner loop 2:
....
However, with Phase 3, the vsetvli e32 mf2 of inner loop 2 is fused into the
vsetvli e8 mf8 of inner loop 1, and we end up with this result after
Phase 3:
outerloop:
...
inner loop 1:
vsetvli e32mf2
....
inner loop 2:
vsetvli e32mf2
....
This demand information after Phase 3 is then optimized well by Phase 4
(LCM). The result after Phase 4 is:
vsetvli e32mf2
outerloop:
...
inner loop 1:
....
inner loop 2:
....
You can see this is the optimal codegen from the current VSETVL pass (Phase 3:
demand backward fusion and propagation + Phase 4: LCM). The over-aggressive
propagation fixed by this patch has been a known issue since I started
implementing the VSETVL pass.
I left it to be fixed after I finished all the targeted GCC 13 features, and
Kito postponed merging this patch until GCC 14 is open.
[email protected]
From: Jeff Law
Date: 2023-04-03 03:41
To: juzhe.zhong; gcc-patches
CC: kito.cheng; palmer
Subject: Re: [PATCH] RISC-V: Fix PR108279
On 3/27/23 00:59, [email protected] wrote:
> From: Juzhe-Zhong <[email protected]>
>
> PR 108270
>
> Fix bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108270.
>
> Consider the following testcase:
> void f (void * restrict in, void * restrict out, int l, int n, int m)
> {
> for (int i = 0; i < l; i++){
> for (int j = 0; j < m; j++){
> for (int k = 0; k < n; k++)
> {
> vint8mf8_t v = __riscv_vle8_v_i8mf8 (in + i + j, 17);
> __riscv_vse8_v_i8mf8 (out + i + j, v, 17);
> }
> }
> }
> }
>
> Compile option: -O3
>
> Before this patch:
> mv a7,a2
> mv a6,a0
> mv t1,a1
> mv a2,a3
> vsetivli zero,17,e8,mf8,ta,ma
> ...
>
> After this patch:
> mv a7,a2
> mv a6,a0
> mv t1,a1
> mv a2,a3
> ble a7,zero,.L1
> ble a4,zero,.L1
> ble a3,zero,.L1
> add a1,a0,a4
> li a0,0
> vsetivli zero,17,e8,mf8,ta,ma
> ...
>
> The code before this patch can produce a bug when:
>
> int main ()
> {
> vsetivli zero, 100,.....
> f (in, out, 0,0,0)
> asm volatile ("csrr a0,vl":::"memory");
>
> // Before this patch the a0 is 17. (Wrong).
> // After this patch the a0 is 100. (Correct).
> ...
> }
So why was that point selected in the first place? I would have
expected LCM to select the loop entry edge as the desired insertion point.
Essentially if LCM selects the point before those branches, then it's
violating a fundamental principle of LCM, namely that you never put an
evaluation on a path where it didn't have one before.
So not objecting to the patch but it is raising concerns about the LCM
results.
jeff