Hi. I re-run the benchmarks and hopefully got the same profit. I also compared the leela's code and figured out the reason.
Actually, my and Manolis's patches do the same thing. The difference is only execution order. Because of f-m-o held after the register allocation it cannot eliminate redundant move 'sp' to another register. Here is an example. int core_bench_state(int *ptr) { > int final_counts[100] = {0}; while (*ptr) { > int id = foo(); > final_counts[id]++; > ptr++; > } return final_counts[0]; > } For this loop, the f-m-o pass generates the following. .L3: call foo * mv a5,sp* sh2add a0,a0,a5 lw a5,0(a0) lw a4,4(s0) addi s0,s0,4 addiw a5,a5,1 sw a5,0(a0) bne a4,zero,.L3 Here '*mv a5, sp*' instruction is redundant. Leela's FastState::try_move() function has a loop that iterates over 1.3 B times and contains 5 memory folding cases (5 redundant moves). Besides that, I have checked the build failure on x264_r. It is already fixed on the third version. On Sat, Jul 15, 2023 at 10:16 AM Jeff Law <jeffreya...@gmail.com> wrote: > > > On 7/12/23 14:59, Jivan Hakobyan via Gcc-patches wrote: > > Accessing local arrays element turned into load form (fp + (index << > > C1)) + C2 address. In the case when access is in the loop we got loop > > invariant computation. For some reason, moving out that part cannot > > be done in loop-invariant passes. But we can handle that in > > target-specific hook (legitimize_address). That provides an > > opportunity to rewrite memory access more suitable for the target > > architecture. > > > > This patch solves the mentioned case by rewriting mentioned case to > > ((fp + C2) + (index << C1)) I have evaluated it on SPEC2017 and got > > an improvement on leela (over 7b instructions, .39% of the dynamic > > count) and dwarfs the regression for gcc (14m instructions, .0012% of > > the dynamic count). > > > > > > gcc/ChangeLog: * config/riscv/riscv.cc (riscv_legitimize_address): > > Handle folding. (mem_shadd_or_shadd_rtx_p): New predicate. > So I still need to give the new version a review. But a high level > question -- did you re-run the benchmarks with this version to verify > that we still saw the same nice improvement in leela? > > The reason I ask is when I use this on Ventana's internal tree I don't > see any notable differences in the dynamic instruction counts. And > probably the most critical difference between the upstream tree and > Ventana's tree in this space is Ventana's internal tree has an earlier > version of the fold-mem-offsets work from Manolis. > > It may ultimately be the case that this work and Manolis's f-m-o patch > have a lot of overlap in terms of their final effect on code generation. > Manolis's pass runs much later (after register allocation), so it's > not going to address the loop-invariant-code-motion issue that > originally got us looking into this space. But his pass is generic > enough that it helps other targets. So we may ultimately want both. > > Anyway, just wanted to verify if this variant is still showing the nice > improvement on leela that the prior version did. > > Jeff > > ps. I know you're on PTO. No rush on responding -- enjoy the time off. > > -- With the best regards Jivan Hakobyan