> -----Original Message-----
> From: Jeff Law <jeffreya...@gmail.com>
> Sent: Thursday, November 7, 2024 8:08 PM
> To: Tamar Christina <tamar.christ...@arm.com>; Li, Pan2 <pan2...@intel.com>;
> Richard Biener <richard.guent...@gmail.com>
> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com;
> rdapp....@gmail.com
> Subject: Re: [PATCH v2 01/10] Match: Simplify branch form 4 of unsigned
> SAT_ADD into branchless
>
>
> On 11/7/24 8:07 AM, Tamar Christina wrote:
> >
> >> -----Original Message-----
> >> From: Li, Pan2 <pan2...@intel.com>
> >> Sent: Thursday, November 7, 2024 12:57 PM
> >> To: Tamar Christina <tamar.christ...@arm.com>; Richard Biener
> >> <richard.guent...@gmail.com>
> >> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com;
> >> jeffreya...@gmail.com; rdapp....@gmail.com
> >> Subject: RE: [PATCH v2 01/10] Match: Simplify branch form 4 of unsigned
> >> SAT_ADD into branchless
> >>
> >> I see your point that the backend can leverage conditional moves to emit
> >> the branch form without a branch.
> >>
> >>> For instance see https://godbolt.org/z/fvrq3aq6K
> >>> On ISAs with conditional operations the branch version gets if-converted.
> >>> On AArch64 we get:
> >>>
> >>> sat_add_u_1(unsigned int, unsigned int):
> >>>         adds    w0, w0, w1
> >>>         csinv   w0, w0, wzr, cc
> >>>         ret
> >>>
> >>> so just 2 instructions, and also branchless.  On x86_64 we get:
> >>>
> >>> sat_add_u_1(unsigned int, unsigned int):
> >>>         add     edi, esi
> >>>         mov     eax, -1
> >>>         cmovnc  eax, edi
> >>>         ret
> >>>
> >>> so 3 instructions but a dependency chain of 2, also branchless.
> >>> This patch would regress both of these.
> >>
> >> But the Godbolt link above may not be good evidence, because both x86_64
> >> and aarch64 already implement usadd, so both functions go through
> >> usadd<QImode>.  For example, with the code below, sat_add_u_1 and
> >> sat_add_u_2 come out almost the same when the backend implements usadd.
> >>
> >> #include <stdint.h>
> >>
> >> #define T uint32_t
> >>
> >> T sat_add_u_1 (T x, T y)
> >> {
> >>   return (T)(x + y) < x ? -1 : (x + y);
> >> }
> >>
> >> T sat_add_u_2 (T x, T y)
> >> {
> >>   return (x + y) | -((x + y) < x);
> >> }
> >>
> >> Things change with gcc 14.2 (which doesn't have the .SAT_ADD GIMPLE IR);
> >> its x86_64 asm dump for -O3 is below.  Still, there is no obvious
> >> difference between the two.
> >>
> >> sat_add_u_1(unsigned int, unsigned int):
> >>         add     edi, esi
> >>         mov     eax, -1
> >>         cmovnc  eax, edi
> >>         ret
> >>
> >> sat_add_u_2(unsigned int, unsigned int):
> >>         add     edi, esi
> >>         sbb     eax, eax
> >>         or      eax, edi
> >>         ret
> >>
> >
> > That's because CE is able to recognize the idiom and turn it back into a
> > conditional move.  Pick a target that doesn't have conditional
> > instructions, like PowerPC: https://godbolt.org/z/4bTv18WMv
> >
> > You'll see that this canonicalization has made codegen worse.
> >
> > After:
> >
> > .L.sat_add_u_1(unsigned int, unsigned int):
> >         add 4,3,4
> >         rldicl 9,4,0,32
> >         subf 3,3,9
> >         sradi 3,3,63
> >         or 3,3,4
> >         rldicl 3,3,0,32
> >         blr
> >
> > and before:
> >
> > .L.sat_add_u_1(unsigned int, unsigned int):
> >         add 4,3,4
> >         cmplw 0,4,3
> >         bge 0,.L2
> >         li 4,-1
> > .L2:
> >         rldicl 3,4,0,32
> >         blr
> >
> > It means it now always has to execute 6 instructions, whereas before it
> > was 4 or 5 depending on the direction of the branch.  So for those
> > architectures it's always slower.
> I'm not sure it's that simple.  It'll depend on the micro-architecture.
> So things like the strength of the branch predictors, how fetch blocks are
> handled (can you have embedded not-taken branches, short-forward-branch
> optimizations, etc.).
It's a pretty safe assumption that most modern branch predictors don't have
an issue with a simple diamond like this.

But since you brought up fetch blocks: at 5 instructions you're far more
likely to fit the entire thing in a single fetch block, especially inside a
loop, since most targets align loops.  The more instructions you add, the
larger the chance that the loop needs multiple fetches.  You then weigh the
cost of a misprediction (which will get better as the loop runs) against the
cost of an underutilized fetch, which will not get better.

The longer sequence also assumes that logical operations are cheap on the
particular core.

I would agree that removing the branch is universally a good idea if it
produces the same number of instructions or fewer, or cheaper instructions,
as I said before.  I disagree that replacing the branch with more
instructions is universally better, and you yourself said "it'll depend".

Tamar

>
> Jeff
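As an illustration of the two forms being debated (a minimal sketch for
readers, not code from the thread): the harness below checks that the branchy
sat_add_u_1 and the branchless sat_add_u_2 quoted above agree around the
unsigned-overflow boundary.  Compiled with -O3 for a target without
conditional instructions (e.g. PowerPC, as in the second Godbolt link), it
also reproduces the instruction-count difference under discussion.

/* Sanity check (illustrative only): the branch and branchless forms of
   unsigned saturating add must agree for all inputs.  */
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

#define T uint32_t

T sat_add_u_1 (T x, T y)
{
  /* Branch form: on unsigned overflow, (x + y) wraps and becomes < x.  */
  return (T)(x + y) < x ? -1 : (x + y);
}

T sat_add_u_2 (T x, T y)
{
  /* Branchless form: -((x + y) < x) is all-ones exactly on overflow,
     so the OR forces the result to UINT32_MAX.  */
  return (x + y) | -((x + y) < x);
}

int main (void)
{
  /* Probe pairs around the wrap-around boundary instead of all 2^64
     input combinations.  */
  static const T probes[] = { 0, 1, 2, 0x7fffffffu, 0x80000000u,
                              0xfffffffeu, 0xffffffffu };
  const size_t n = sizeof probes / sizeof probes[0];

  for (size_t i = 0; i < n; i++)
    for (size_t j = 0; j < n; j++)
      if (sat_add_u_1 (probes[i], probes[j])
          != sat_add_u_2 (probes[i], probes[j]))
        {
          printf ("mismatch at %" PRIu32 " + %" PRIu32 "\n",
                  probes[i], probes[j]);
          return 1;
        }

  puts ("both forms agree on all probe pairs");
  return 0;
}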