> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Tuesday, February 28, 2023 11:09 AM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <n...@arm.com>; rguent...@suse.de; j...@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
>
> Tamar Christina <tamar.christ...@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandif...@arm.com>
> >> Sent: Monday, February 27, 2023 9:33 PM
> >> To: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>
> >> Cc: Tamar Christina <tamar.christ...@arm.com>; nd <n...@arm.com>;
> >> rguent...@suse.de; j...@ventanamicro.com
> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> div-bitmask by using new optabs [PR108583]
> >>
> >> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> >> -----Original Message-----
> >> >> From: Richard Sandiford <richard.sandif...@arm.com>
> >> >> Sent: Monday, February 27, 2023 12:12 PM
> >> >> To: Tamar Christina <tamar.christ...@arm.com>
> >> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> >> >> <n...@arm.com>; rguent...@suse.de; j...@ventanamicro.com
> >> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> >> div-bitmask by using new optabs [PR108583]
> >> >>
> >> >> Tamar Christina <tamar.christ...@arm.com> writes:
> >> >> > Hi,
> >> >> >
> >> >> >> > I avoided open coding it with add and shift because that creates
> >> >> >> > a dependency chain of 4 instructions (including shifts, which
> >> >> >> > are typically slow) instead of a load and a multiply.  Unless
> >> >> >> > the target is known to optimize it further, this change is
> >> >> >> > unlikely to be beneficial.  And by the time we get to costing,
> >> >> >> > the only alternative is to undo the existing pattern, so you
> >> >> >> > lose the general shift optimization.
> >> >> >> >
> >> >> >> > So it seemed unwise to open code it as shifts, given that the
> >> >> >> > codegen out of the vectorizer would be degenerate for most
> >> >> >> > targets, or one needs the more complicated route of costing
> >> >> >> > during pattern matching already.
> >> >> >>
> >> >> >> Hmm, OK.  That seems like a cost-model thing though, rather than
> >> >> >> something that should be exposed through optabs.  And I imagine
> >> >> >> the open-coded version would still be better than nothing on
> >> >> >> targets without highpart multiply.
> >> >> >>
> >> >> >> So how about replacing the hook with one that simply asks whether
> >> >> >> division through highpart multiplication is preferred over the
> >> >> >> add/shift sequence?  (Unfortunately it's not going to be possible
> >> >> >> to work that out from existing information.)
> >> >> >
> >> >> > So this doesn't work for SVE.  For SVE the multiplication widening
> >> >> > pass introduces FMAs at gimple level, so in the cases where the
> >> >> > operation is fed from a widening multiplication we end up
> >> >> > generating FMA.  If that were all, I could have matched the FMA.
> >> >> >
> >> >> > But it also pushes the multiplication into the second operand,
> >> >> > because it no longer has a mul to share the results with.
> >> >> >
> >> >> > In any case, the gimple code is transformed into
> >> >> >
> >> >> > vect__3.8_122 = .MASK_LOAD (_29, 8B, loop_mask_121);
> >> >> > vect_patt_57.9_123 = (vector([8,8]) unsigned short) vect__3.8_122;
> >> >> > vect_patt_64.11_127 = .FMA (vect_patt_57.9_123, vect_cst__124, { 257, ... });
> >> >> > vect_patt_65.12_128 = vect_patt_64.11_127 >> 8;
> >> >> > vect_patt_66.13_129 = .FMA (vect_patt_57.9_123, vect_cst__124, vect_patt_65.12_128);
> >> >> > vect_patt_62.14_130 = vect_patt_66.13_129 >> 8;
> >> >> > vect_patt_68.15_131 = (vector([8,8]) unsigned charD.21) vect_patt_62.14_130;
> >> >> >
> >> >> > This transformation is much worse than the original code: it
> >> >> > extends the dependency chain with another expensive instruction.
> >> >> > I can try to correct this in RTL by matching FMA and shift and
> >> >> > splitting into MUL + ADDHNB, and hope CSE takes care of the
> >> >> > extra mul.
> >> >> >
> >> >> > But this seems like a hack, and it's basically undoing the earlier
> >> >> > transformation.  It seems to me that the open coding is a bad idea.
> >> >>
> >> >> Could you post the patch that gives this result?  I'll have a poke
> >> >> around.
> >> >
> >> > Sure, I'll post the new series; it needs all of them.
> >>
> >> Thanks.  Which testcase did you use to get the above?
> >>
> >
> > #include <stdint.h>
> >
> > #define N 16
> > #define TYPE uint8_t
> >
> > void fun3(TYPE* restrict pixel, TYPE level, int n)
> > {
> >   for (int i = 0; i < (n & -16); i+=1)
> >     pixel[i] = (pixel[i] * level) / 0xff;
> > }
>
> Thanks.  In that testcase, isn't the FMA handling an anti-optimisation in
> its own right though?  It's duplicating a multiplication into two points
> on a dependency chain.
Most definitely, that's what I meant above.  The "optimization" doesn't take
into account the effect on the rest of the chain.

>
> E.g. for:
>
> unsigned int
> f1 (unsigned int a, unsigned int b, unsigned int c)
> {
>   unsigned int d = a * b;
>   return d + ((c + d) >> 1);
> }
> unsigned int
> g1 (unsigned int a, unsigned int b, unsigned int c)
> {
>   return a * b + c;
> }
>
> __Uint32x4_t
> f2 (__Uint32x4_t a, __Uint32x4_t b, __Uint32x4_t c)
> {
>   __Uint32x4_t d = a * b;
>   return d + ((c + d) >> 1);
> }
> __Uint32x4_t
> g2 (__Uint32x4_t a, __Uint32x4_t b, __Uint32x4_t c)
> {
>   return a * b + c;
> }
>
> typedef unsigned int vec __attribute__((vector_size(32)));
> vec
> f3 (vec a, vec b, vec c)
> {
>   vec d = a * b;
>   return d + ((c + d) >> 1);
> }
> vec
> g3 (vec a, vec b, vec c)
> {
>   return a * b + c;
> }
>
> compiled with -O2 -msve-vector-bits=256 -march=armv8.2-a+sve, all the g
> functions use multiply-add (as expected), but the f functions are:
>
> f1:
>         mul     w1, w0, w1
>         add     w0, w1, w2
>         add     w0, w1, w0, lsr 1
>         ret
>
> f2:
>         mul     v0.4s, v0.4s, v1.4s
>         add     v2.4s, v0.4s, v2.4s
>         usra    v0.4s, v2.4s, 1
>         ret
>
> f3:
>         ...
>         mla     z0.s, p0/m, z1.s, z2.s
>         lsr     z0.s, z0.s, #1
>         mad     z1.s, p0/m, z2.s, z0.s
>         ...
>
> What we do for f3 doesn't seem like a good idea.

Agreed, I guess this means I have to fix that as well? ☹
I'll go take a look then.

Tamar.

> I can see that duplicating an integer multiplication might make sense if
> the integer FMAs are done in parallel.  But if one is a dependency of the
> other, then at least for integer FMA, I think we should punt, especially
> since we don't know what the target's late-forwarding restrictions are.
> I guess fp-contract comes into play for the FP FMAs though.
>
> >> But since SVE does have highpart multiply, and since the assumption
> >> for SVE is that MULH+shift is better than ADD*3+shift*2, shouldn't
> >> SVE just be one of the targets for which the hook that "asks whether
> >> division through highpart multiplication is preferred over the
> >> add/shift sequence" returns true?
> >>
> >
> > Yes (it's also two adds, not 3), but it's not correct for SVE2, which
> > has ADDHNB, in which case 2x ADDHNB is much faster than MULH+shift.
> > And the problem is that widening_mul will not let the add+shift reach
> > the backend intact, because the add+shift were open coded; they are
> > now subject to further optimization.
> >
> > To summarize:
> >
> > Other targets: false
> > SVE: false
> > SVE2: true
> > NEON: true
>
> Yeah, looks good.
>
> > SVE2 is borked because MUL+ADD+SHIFT -> FMA+SHIFT.
> >
> > If you're saying you don't want the optimization for SVE2, then sure,
> > happy to turn it off.
> >
> > But UMULH+LSR is 6 cycles on Neoverse-N2 with a throughput of 1;
> > 2x ADDHNB is 4 cycles with a throughput of 2.
>
> No, I meant the same as what you said in the summary above.
>
> Richard
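
For reference, a minimal scalar sketch of the add/shift form behind the
gimple quoted above: with t = pixel * level, t / 255 == (t + ((t + 257) >> 8)) >> 8
for all t up to 255 * 255, which is what the two .FMA/shift pairs compute once
the multiply is shared, and each ">> 8" of a sum is the kind of narrowing add
that 2x ADDHNB provides on SVE2.  The helper names and the brute-force
self-check are made up for illustration; the exact sequence GCC emits may
differ.

#include <assert.h>
#include <stdint.h>

/* Reference: what fun3 computes per element.  */
static uint8_t
div255_ref (uint8_t pixel, uint8_t level)
{
  return (pixel * level) / 0xff;
}

/* Open-coded add/shift version: t / 255 == (t + ((t + 257) >> 8)) >> 8,
   exact for t = pixel * level <= 255 * 255.  One mul, two adds, two
   narrowing shifts; no highpart multiply needed.  */
static uint8_t
div255_addshift (uint8_t pixel, uint8_t level)
{
  uint16_t t = (uint16_t) (pixel * level);
  uint16_t a = (uint16_t) ((t + 257u) >> 8);
  return (uint8_t) ((t + a) >> 8);
}

int
main (void)
{
  /* Exhaustively check the identity over all uint8_t inputs.  */
  for (unsigned p = 0; p < 256; p++)
    for (unsigned l = 0; l < 256; l++)
      assert (div255_ref ((uint8_t) p, (uint8_t) l)
              == div255_addshift ((uint8_t) p, (uint8_t) l));
  return 0;
}

The point of contention in the thread is that widening_mul folds the single
multiply into each of the two adds, giving the .FMA chain shown earlier
instead of one mul feeding two narrowing adds.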