https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117722
--- Comment #12 from Li Pan <pan2.li at intel dot com> ---
(In reply to Robin Dapp from comment #11)
> (In reply to Li Pan from comment #9)
> > Created attachment 59663 [details]
> > before_vs_after when outer loop is 128
>
> Ok, that's a different loop then.  I'm seeing vmv1rs in the current version,
> is that what you're referring to as problematic?  Do they result from the
> lack of overlap constraints?  I'd prefer a bit more context rather than just
> code dumps :)

Oh, I forgot that. The code and build options for the PNG above are listed
below.

#include <stdint.h>
#include <stdlib.h>

#define T1 uint8_t
#define T2 int32_t

T2
foo (T2 * restrict op_0, T1 * restrict op_1,
     T1 * restrict op_2, T2 op_3, T2 op_4)
{
  T2 sum = 0;
  for (unsigned i = 0; i < 128; i++) // x264_pixel_sad_4x4 is i < 4.
    {
      for (unsigned k = 0; k < 8; k++)
        sum += abs (op_1[k] - op_2[k]);

      op_1 += op_3;
      op_2 += op_4;
    }

  return sum;
}

-O3 -march=rv64gcv -mabi=lp64d -c -S u_sad.c -o after.S -fno-schedule-insns -fno-schedule-insns2

-O3 -march=rv64gcv -mabi=lp64d -c -S u_sad.c -mno-vector-strict-align -o before.S -fno-schedule-insns -fno-schedule-insns2
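
For anyone comparing the two assembly outputs, a small functional check of the reproducer may help confirm that both builds compute the same sum. The sketch below is only an illustration and not part of the original report: `sad_uniform` and `check_sad` are hypothetical helper names, and the row stride of 8 and the fill values are assumptions chosen so the expected result is easy to derive by hand (128 rows * 8 columns * |10 - 3| = 7168).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define T1 uint8_t
#define T2 int32_t

/* Kernel from the comment above, unchanged. */
T2
foo (T2 * restrict op_0, T1 * restrict op_1,
     T1 * restrict op_2, T2 op_3, T2 op_4)
{
  T2 sum = 0;
  for (unsigned i = 0; i < 128; i++)
    {
      for (unsigned k = 0; k < 8; k++)
        sum += abs (op_1[k] - op_2[k]);

      op_1 += op_3;
      op_2 += op_4;
    }

  return sum;
}

/* Hypothetical helper (not in the report): fill both inputs with a
   constant byte each and run the kernel with an assumed stride of 8,
   so that the 128 rows exactly cover each buffer.  */
T2
sad_uniform (T1 va, T1 vb)
{
  enum { ROWS = 128, STRIDE = 8 };
  static T1 a[ROWS * STRIDE];
  static T1 b[ROWS * STRIDE];
  T2 unused = 0; /* op_0 is never read or written by foo.  */

  for (int i = 0; i < ROWS * STRIDE; i++)
    {
      a[i] = va;
      b[i] = vb;
    }

  return foo (&unused, a, b, STRIDE, STRIDE);
}

/* Hypothetical self-check: every element differs by 7, so the sum of
   absolute differences is 128 * 8 * 7.  */
void
check_sad (void)
{
  assert (sad_uniform (10, 3) == 128 * 8 * 7);
  assert (sad_uniform (3, 10) == 128 * 8 * 7);
  assert (sad_uniform (5, 5) == 0);
}
```

Calling `check_sad ()` from a `main` built with each of the two option sets would give a quick sanity check that the codegen difference under discussion does not change the result.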