On Wed, Jun 30, 2021 at 5:42 PM Kewen.Lin <li...@linux.ibm.com> wrote:
>
> on 2021/6/30 4:53 PM, Hongtao Liu wrote:
> > On Mon, Jun 28, 2021 at 3:27 PM Kewen.Lin <li...@linux.ibm.com> wrote:
> >>
> >> on 2021/6/28 3:20 PM, Hongtao Liu wrote:
> >>> On Mon, Jun 28, 2021 at 3:12 PM Hongtao Liu <crazy...@gmail.com> wrote:
> >>>>
> >>>> On Mon, Jun 28, 2021 at 2:50 PM Kewen.Lin <li...@linux.ibm.com> wrote:
> >>>>>
> >>>>> Hi!
> >>>>>
> >>>>> on 2021/6/9 1:18 PM, Kewen.Lin via Gcc-patches wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> PR100328 has some details about this issue; I'll try to
> >>>>>> summarize it here. In the hottest function LBM_performStreamCollideTRT
> >>>>>> of SPEC2017 bmk 519.lbm_r, there are many FMA style expressions
> >>>>>> (27 FMA, 19 FMS, 11 FNMA). On rs6000, this kind of FMA style
> >>>>>> insn has two flavors: FLOAT_REG and VSX_REG. The VSX_REG reg
> >>>>>> class has 64 registers, the first 32 of which make up the
> >>>>>> whole FLOAT_REG. There are some differences between these two
> >>>>>> flavors, taking "*fma<mode>4_fpr" as an example:
> >>>>>>
> >>>>>> (define_insn "*fma<mode>4_fpr"
> >>>>>>   [(set (match_operand:SFDF 0 "gpc_reg_operand" "=<Ff>,wa,wa")
> >>>>>>         (fma:SFDF
> >>>>>>           (match_operand:SFDF 1 "gpc_reg_operand" "%<Ff>,wa,wa")
> >>>>>>           (match_operand:SFDF 2 "gpc_reg_operand" "<Ff>,wa,0")
> >>>>>>           (match_operand:SFDF 3 "gpc_reg_operand" "<Ff>,0,wa")))]
> >>>>>>
> >>>>>> // wa => A VSX register (VSR), vs0…vs63, aka. VSX_REG.
> >>>>>> // <Ff> (f/d) => A floating point register, aka. FLOAT_REG.
> >>>>>>
> >>>>>> So for VSX_REG, we only have the destructive form: when a VSX_REG
> >>>>>> alternative is used, operand 2 or operand 3 is required
> >>>>>> to be the same as operand 0. reload has to take care of this
> >>>>>> constraint and create some non-free register copies if required.
> >>>>>>
> >>>>>> Assume one fma insn looks like:
> >>>>>>    op0 = FMA (op1, op2, op3)
> >>>>>>
> >>>>>> The best regclass for all of them is VSX_REG. When op1,op2,op3 are all dead,
> >>>>>> IRA simply creates three shuffle copies for them (here the operand
> >>>>>> order matters, since with the same freq, the one with the smaller number
> >>>>>> takes preference), but IMO both op2 and op3 should take higher priority
> >>>>>> in the copy queue due to the matching constraint.
> >>>>>>
> >>>>>> I noticed that there is one function, ira_get_dup_out_num, which is meant
> >>>>>> to create this kind of constraint copy, but the code below seems to
> >>>>>> refuse to create one if there is an alternative which has a valid regclass
> >>>>>> that is not likely to be spilled.
> >>>>>>
> >>>>>>         default:
> >>>>>>           {
> >>>>>>             enum constraint_num cn = lookup_constraint (str);
> >>>>>>             enum reg_class cl = reg_class_for_constraint (cn);
> >>>>>>             if (cl != NO_REGS
> >>>>>>                 && !targetm.class_likely_spilled_p (cl))
> >>>>>>               goto fail;
> >>>>>>
> >>>>>>          ...
> >>>>>>
> >>>>>> I cooked up the attached patch to make ira respect this kind of matching
> >>>>>> constraint, guarded with one parameter. As I stated in the PR, I was
> >>>>>> not sure this is on the right track. The RFC patch checks the
> >>>>>> matching constraint in all alternatives: if there is one alternative
> >>>>>> with a matching constraint that matches the current preferred regclass
> >>>>>> (or best of allocno?), it records the output operand number and
> >>>>>> further creates one constraint copy for it. Normally the copy gets
> >>>>>> priority over shuffle copies and the matching constraint is more
> >>>>>> likely to be satisfied, so reload need not create extra copies
> >>>>>> to meet the matching constraint or the desirable register class.
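As a rough illustration of the per-alternative check described in the quoted RFC text above, here is a minimal C sketch. It is not GCC code and not what the patch does internally: `matching_alt_mask` is a hypothetical helper name, and it only scans a raw constraint string for a single-digit matching constraint, ignoring register classes, modifiers and multi-digit operand numbers.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of GCC: return a bitmask whose bit I is
   set when alternative I of the comma-separated constraint string
   CONSTRAINTS contains the single digit OP, i.e. that alternative ties
   this input operand to operand OP.
   Simplifications (assumptions of this sketch): only operand numbers
   0..9 and at most 32 alternatives are handled.  */
unsigned
matching_alt_mask (const char *constraints, int op)
{
  assert (constraints != NULL && op >= 0 && op <= 9);

  unsigned mask = 0;
  unsigned alt = 0;   /* index of the alternative being scanned */

  for (const char *p = constraints; *p != '\0'; p++)
    {
      if (*p == ',')
        alt++;        /* a comma starts the next alternative */
      else if (alt < 32
               && isdigit ((unsigned char) *p)
               && *p - '0' == op)
        mask |= 1u << alt;
    }
  return mask;
}
```

For the three input operands of "*fma<mode>4_fpr" checked against operand 0, "%<Ff>,wa,wa" gives 0, "<Ff>,wa,0" gives 0x4 (third alternative) and "<Ff>,0,wa" gives 0x2 (second alternative), which matches the observation that only operands 2 and 3 carry the matching constraint.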
> >>>>>>
> >>>>>> For FMA A,B,C,D, I think ideally copies A/B, A/C, A/D can first stay
> >>>>>> as shuffle copies. Later, any of A,B,C,D may get assigned a
> >>>>>> hardware register which is a VSX register (VSX_REG) but not an FP
> >>>>>> register (FLOAT_REG), which means we have to pay costs once we can NOT
> >>>>>> go with VSX alternatives, so at that point it's important to respect
> >>>>>> the matching constraint, and we can increase the freq for the remaining
> >>>>>> copies related to this (A/B, A/C, A/D). This idea requires some side
> >>>>>> tables to record some information and seems a bit complicated in the
> >>>>>> current framework, so the proposed patch aggressively emphasizes the
> >>>>>> matching constraint at the time of creating copies.
> >>>>>>
> >>>>>
> >>>>> Compared with the original patch (v1), this patch v3 has
> >>>>> considered the following (this should be v2 for this mailing list,
> >>>>> but I bumped it to be consistent with the PR's):
> >>>>>
> >>>>> - Excluding the case where for one preferred register class
> >>>>> there can be two or more alternatives, one of which has the
> >>>>> matching constraint while another doesn't. So for
> >>>>> the given operand, even if it's assigned a hardware reg
> >>>>> which doesn't meet the matching constraint, it can simply
> >>>>> use the alternative which doesn't have the matching constraint,
> >>>>> so no register move is needed. One typical case is
> >>>>> define_insn *mov<mode>_internal2 on rs6000. So we
> >>>>> shouldn't create a constraint copy for it.
> >>>>>
> >>>>> - The possible free register move in the same register class:
> >>>>> disable this if so, since the register move to meet the
> >>>>> constraint is considered free.
> >>>>>
> >>>>> - Making it on by default, as suggested by Segher & Vladimir; we
> >>>>> hope to get rid of the parameter if the benchmarking result
> >>>>> looks good on major targets.
> >>>>>
> >>>>> - Tweaking the cost when either side of the matching constraint
> >>>>> is a hardware register. Before this patch, the constraint
> >>>>> copy is simply taken as a real move insn for pref and
> >>>>> conflict cost with one hardware register. After this patch,
> >>>>> it's allowed that there are several input operands
> >>>>> respecting the same matching constraint (but in different
> >>>>> alternatives), so we should treat it like a shuffle copy
> >>>>> for some cases to avoid over preferring/disparaging.
> >>>>>
> >>>>> Please check the PR comments for more details.
> >>>>>
> >>>>> This patch can be bootstrapped & regtested on
> >>>>> powerpc64le-linux-gnu P9 and x86_64-redhat-linux, but has some
> >>>>> "XFAIL->XPASS" failures on aarch64-linux-gnu. The failure list
> >>>>> is attached in the PR, and I think the new assembly looks
> >>>>> improved (expected).
> >>>>>
> >>>>> With option Ofast unroll, this patch can help to improve SPEC2017
> >>>>> bmk 508.namd_r +2.42% and 519.lbm_r +2.43% on Power8 while
> >>>>> 508.namd_r +3.02% and 519.lbm_r +3.85% on Power9 without any
> >>>>> remarkable degradations.
> >
> > Here's the SPEC2017 rate result tested on AMD milan
> > option is: -march=znver2 -Ofast -funroll-loops -mfpmath=sse -flto
> >
> > fprate:
> > 503.bwaves_r       0.01    (A) shliclel219
> > 507.cactuBSSN_r   -0.19    (A) shliclel219
> > 508.namd_r         0.02    (A) shliclel219
> > 510.parest_r      -0.68    (A) shliclel219
> > 511.povray_r       1.59    (A) shliclel219
> > 521.wrf_r          0.19    (A) shliclel219
> > 526.blender_r      0.68    (A) shliclel219
> > 527.cam4_r        -0.30    (A) shliclel219
> > 538.imagick_r     -3.81 <- (A) shliclel219
> > 544.nab_r          0.02    (A) shliclel219
> > 549.fotonik3d_r    0.02    (A) shliclel219
> > 554.roms_r        -0.43    (A) shliclel219
> > 997.specrand_fr   -3.80 <- (A) shliclel219
> > Geometric mean:   -0.52
> >
> > intrate:
> > 500.perlbench_r   -1.54    (A) shliclel219
> > 502.gcc_r         -0.38    (A) shliclel219
> > 505.mcf_r         -0.10    (A) shliclel219
> > 520.omnetpp_r     -0.24    (A) shliclel219
> > 523.xalancbmk_r   -1.04    (A) shliclel219
> > 525.x264_r         0.31    (A) shliclel219
> > 531.deepsjeng_r   -0.02    (A) shliclel219
> > 541.leela_r        0.95    (A) shliclel219
> > 548.exchange2_r    0.08    (A) shliclel219
> > 557.xz_r          -0.40    (A) shliclel219
> > Geometric mean:   -0.24
> >
> Roger, thanks! The result doesn't look good; I think I'll disable it
> for target x86_64 in the next version. By the way, bmk 519.lbm_r seems
> to be missing. Just curious: is that because it failed to build even
> with baseline?

519.lbm_r 0 ------ ------ BuildSame on milan
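
For anyone who wants to cross-check the "Geometric mean" rows in the milan numbers quoted above, here is a minimal C sketch. It rests on an assumption that is not confirmed anywhere in this thread: that each figure is a percentage change in rate and that the summary is the geometric mean of the ratios (1 + p/100), converted back to a percentage. `nth_root` and `geomean_percent` are hypothetical helper names. The root is computed with a Newton iteration instead of pow() only so the example builds without libm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: N-th root of P (P > 0, N >= 1) by Newton's
   method on f(x) = x^N - P.  The start value is at or above the root,
   so the iteration decreases monotonically towards it.  */
double
nth_root (double p, size_t n)
{
  assert (p > 0.0 && n >= 1);

  double x = p > 1.0 ? p : 1.0;
  for (int iter = 0; iter < 200; iter++)
    {
      double xn1 = 1.0;                 /* x^(n-1) */
      for (size_t i = 1; i < n; i++)
        xn1 *= x;

      double next = ((double) (n - 1) * x + p / xn1) / (double) n;
      double diff = next > x ? next - x : x - next;
      x = next;
      if (diff < 1e-15)
        break;
    }
  return x;
}

/* Hypothetical helper: geometric mean of N percentage deltas, taken
   over the ratios (1 + delta/100) and returned as a percentage.
   ASSUMPTION: this is how the summary row is derived.  */
double
geomean_percent (const double *deltas, size_t n)
{
  assert (deltas != NULL && n > 0);

  double product = 1.0;
  for (size_t i = 0; i < n; i++)
    product *= 1.0 + deltas[i] / 100.0;

  return (nth_root (product, n) - 1.0) * 100.0;
}
```

Fed the 13 fprate deltas from the milan table, this lands close to the reported -0.52; the table values are rounded, so an exact match should not be expected.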
here is fprate on CLX:
503.bwaves_r      -0.12
507.cactuBSSN_r   -0.02
508.namd_r        -0.57
510.parest_r       0.40
511.povray_r      -0.37
519.lbm_r          0.10
521.wrf_r          0.61
526.blender_r     -0.50
527.cam4_r        -0.45
538.imagick_r     -6.61 <-
544.nab_r         -0.11
549.fotonik3d_r    0.16
554.roms_r         0.22
997.specrand_fr   -0.18

And there's something broken on my local cascade lake, so intrate test
result for CLX would be later.
>
> BR,
> Kewen

--
BR,
Hongtao