Hi Segher, Thanks for the comments!
on 2019/11/5 上午4:21, Segher Boessenkool wrote: > Hi! > > On Mon, Nov 04, 2019 at 03:16:06PM +0800, Kewen.Lin wrote: >> To align with rs6000_insn_cost costing more for load type insns, > > (Which itself has history in rs6000_rtx_costs). > >> this patch is to make load insns cost more in vectorization cost >> function. Considering that the result of load usually is used >> somehow later (true-dep) but store won't, we keep the store as >> before. > > The latency of load insns is about twice that of "simple" instructions; > 2 vs. 1 on older cores, and 4 (or so) vs. 2 on newer cores. > Yes, for latest Power9, general load takes 4, vsx load takes 5. >> The SPEC2017 performance evaluation on Power8 shows 525.x264_r >> +9.56%, 511.povray_r +2.08%, 527.cam4_r 1.16% gains, no >> significant degradation, SPECINT geomean +0.88%, SPECFP geomean >> +0.26%. > > Nice :-) > >> The SPEC2017 performance evaluation on Power9 shows no significant >> improvement or degradation, SPECINT geomean +0.04%, SPECFP geomean >> +0.04%. >> >> The SPEC2006 performance evaluation on Power8 shows 454.calculix >> +4.41% gain but 416.gamess -1.19% and 453.povray -3.83% degradation. >> I looked into the two degradation bmks, the degradation were NOT >> due to hotspot changes by vectorization, were all side effects. >> SPECINT geomean +0.10%, SPECFP geomean no changed considering >> the degradation. > > Also nice. > >> --- a/gcc/config/rs6000/rs6000.c >> +++ b/gcc/config/rs6000/rs6000.c >> @@ -4763,15 +4763,22 @@ rs6000_builtin_vectorization_cost (enum >> vect_cost_for_stmt type_of_cost, >> switch (type_of_cost) >> { >> case scalar_stmt: >> - case scalar_load: >> case scalar_store: >> case vector_stmt: >> - case vector_load: >> case vector_store: >> case vec_to_scalar: >> case scalar_to_vec: >> case cond_branch_not_taken: >> return 1; >> + case scalar_load: >> + case vector_load: >> + /* Like rs6000_insn_cost, make load insns cost a bit more. FIXME: the > > (two spaces after full stop). > Good catch! Will fix it (and the others). >> + benefits were observed on Power8 and up, we can unify it if similar >> + profits are measured on Power6 and Power7. */ >> + if (TARGET_P8_VECTOR) >> + return 2; >> + else >> + return 1; > > Hrm, but you showed benchmark improvements for p9 as well? > No significant gains but no degradation as well, so I thought it's fine to align it together. Does it make sense? > What happens if you enable this for everything as well? > My concern was that if we enable it for everything, it's possible to introduce degradation for some benchmarks on P6 or P7 where we didn't evaluate the performance impact. Although it's reasonable from the point view of load latency, it's possible to get worse result in the actual benchmarks based on my fine grain cost adjustment experiment before. Or do you suggest enabling it everywhere and solve the degradation issue if exposed? I'm also fine with that. :) BR, Kewen