Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Monday, May 16, 2022 1:18 PM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; rguent...@suse.de;
>> jeffreya...@gmail.com
>> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the target
>> decide the method of argument promotions.
>>
>> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> > Tamar Christina <tamar.christ...@arm.com> writes:
>> >>> -----Original Message-----
>> >>> From: Richard Sandiford <richard.sandif...@arm.com>
>> >>> Sent: Monday, May 16, 2022 12:36 PM
>> >>> To: Tamar Christina <tamar.christ...@arm.com>
>> >>> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; rguent...@suse.de;
>> >>> jeffreya...@gmail.com
>> >>> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the
>> >>> target decide the method of argument promotions.
>> >>>
>> >>> Tamar Christina <tamar.christ...@arm.com> writes:
>> >>> > Hi All,
>> >>> >
>> >>> > Some targets require function parameters to be promoted to a
>> >>> > different type at expand time because the target may not have
>> >>> > native instructions to work on such types.  As an example the
>> >>> > AArch64 port does not have native instructions working on integer
>> >>> > 8- or 16-bit values.  As such it promotes every parameter of these
>> >>> > types to 32-bits.
>> >>>
>> >>> This doesn't seem specific to parameters though.  It applies to any
>> >>> 8- or 16-bit variable.  E.g.:
>> >>>
>> >>> #include <stdint.h>
>> >>> uint8_t foo(uint32_t x, uint32_t y) {
>> >>>     uint8_t z = x != 0 ? x : y;
>> >>>     return z + 1;
>> >>> }
>> >>>
>> >>> generates:
>> >>>
>> >>> foo:
>> >>>         cmp     w0, 0
>> >>>         and     w1, w1, 255
>> >>>         and     w0, w0, 255
>> >>>         csel    w0, w1, w0, eq
>> >>>         add     w0, w0, 1
>> >>>         ret
>> >>>
>> >>> So I think the new behaviour is really a modification of the
>> >>> PROMOTE_MODE behaviour rather than the PROMOTE_FUNCTION_MODE
>> >>> behaviour.
>> >>>
>> >>> FWIW, I agree with Richard that it would be better not to add a new
>> >>> hook.  I think we're really making PROMOTE_MODE choose between
>> >>> SIGN_EXTEND, ZERO_EXTEND or SUBREG (what LLVM would call “any
>> >>> extend”) rather than the current SIGN_EXTEND vs. ZERO_EXTEND choice.
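
For reference, PROMOTE_MODE is a target macro that rewrites the mode (and
the signedness used for the promotion) of a sub-word value.  A minimal
sketch of the kind of definition being discussed, assuming a target that
widens 8- and 16-bit integers to SImode (illustrative only, not the
verbatim aarch64 definition):

    /* Illustrative sketch only: promote 8- and 16-bit integer values
       to 32 bits, extending them as unsigned.  */
    #define PROMOTE_MODE(MODE, UNSIGNEDP, TYPE)   \
      if (MODE == QImode || MODE == HImode)       \
        {                                         \
          UNSIGNEDP = 1;                          \
          MODE = SImode;                          \
        }

The discussion above is about turning this into a three-way choice
(SIGN_EXTEND, ZERO_EXTEND, or a plain SUBREG "any extend") rather than
the current signed/unsigned one.
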
>> >> Ah, I hadn't realized this also applied to locals.. ok I can modify
>> >> PROMOTE_MODE then, but I also need the actual SSA_NAME and not just
>> >> the type so will have to pass this along.
>> >>
>> >> From a practical point of view.. the actual hook however is
>> >> implemented by 34 targets, would I need to CC maintainers for each of
>> >> them or would global maintainer approval suffice for these mostly
>> >> mechanical changes?
>> >
>> > Yeah, single approval should be enough for mechanical changes.  It
>> > would be good to do the interface change and mechanical target
>> > changes as a separate prepatch if possible though.
>> >
>> > I'm not sure about passing the SSA name to the target though, or the
>> > way that the aarch64 hook uses the info.  It looks like a single cold
>> > comparison could defeat the optimisation for hot code.
>
> I'm not sure I follow why the likelihood of the comparison matters in
> this case..  I'll expand on it below..

I meant the likelihood that the comparison is executed at all, not which
outcome is more likely.  E.g. suppose the only comparison occurs on a
failure path that eventually calls abort, and that there are other paths
(without comparisons of the same value) that would benefit from the
any-extend optimisation.  We'd prioritise the cold comparison over
optimising the other (hot) code.

I'm just suspicious of heuristics along the lines of “don't do X if there
is a single instance of Y”. :-)

>> > If we do try to make the decision based on uses at expand time, it
>> > might be better for the analysis to be in target-independent code,
>> > with help from the target to decide where extensions are cheap.  It
>> > still feels a bit hacky though.
>
> I thought about it but can't see most targets having this problem.  I
> did go with an optimistic heuristic.  There are of course various ways
> to defeat it, but looking through the corpus of code I don't see any but
> the simple cases in practice.  (more below).
>
>> >
>> > What stops us from forming cbz/cbnz when the extension is done close
>> > to the comparison (from the comment in 2/3)?  If we can solve that,
>> > could we simply do an any-extend all the time, and treat removing
>> > redundant extensions as a global availability problem?
>> >
>
> In such cases there's no gain from doing the extension at all, e.g.
>
>     and  w0, w0, 255
>     cmp  w0, 0
>     b.eq .Lfoo
>
> will be optimized to
>
>     tst  w0, 0xff
>     b.ne .Lfoo
>
> already.
>
> In RTL the problem occurs when you have nested control flow like nested
> if and switch statements.  The example in 2/3 was intended to show that
> before, what we'd do is
>
>     and w22, w0, 255
>     ....  <code that clobbers cc and caller saves>
>     <switch1>
>     cbz w22, .Lfoo1
>     ....
>     <switch2>
>     cbz w22, .Lfoo2
>
> If we have a single comparison we already sink the zero_extend today.
>
> Now if we instead any-extend w0 we end up with:
>
>     mov w22, w0
>     ....  <code that clobbers cc and caller saves>
>     <switch1>
>     tst w22, 0xff
>     b.ne .Lfoo1
>     ....
>     <switch2>
>     tst w22, 0xff
>     b.ne .Lfoo2
>
> So you get an additional tst before each branch.  You also can't perform
> the tst higher up since CC is clobbered.  The cbz is nice because the
> zero extend doesn't use CC of course, and so having the value live
> allows us to optimize the branch.

Once the cbz has been formed (in combine), where does the optimisation of
it happen?

> I don't think branch likeliness matters here as you must keep w22 live
> in both cases somehow.  In the SPECCPU 2017 benchmark perlbench (which
> uses a lot of nested switches) this pattern is responsible for an extra
> 0.3% codesize increase, which the approach in 2/3 prevents.
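
For concreteness, a minimal C sketch of the kind of nested control flow
being described (purely illustrative; the function and values are made
up, not taken from perlbench or from the patch):

    extern void use (int);

    void
    bar (unsigned char c, int *p, int n)
    {
      for (int i = 0; i < n; i++)
        {
          use (p[i]);      /* Clobbers CC and the caller-saved registers,
                              so c must survive in a callee-saved reg.  */
          switch (p[i])
            {
            case 1:
              if (c)       /* First re-test of c after the call.  */
                use (i);
              break;
            case 2:
              if (c)       /* Second re-test of the same value.  */
                use (-i);
              break;
            }
        }
    }

With a zero-extended copy of c kept live across the calls, each test can
be a single cbz/cbnz; with an any-extended copy, each test needs a fresh
tst of the low byte first, since CC has already been clobbered by the
time the branch is reached.
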
>> (which would run after combine)
>>
>> >
>> > What kind of code do we emit when we do an extension just before an
>> > operation?  If the DECL_RTL is (subreg:QI (reg:SI R) 0), say, then it
>> > should be safe to do the extension directly into R:
>> >
>> > (set (reg:SI X) (zero_extend:SI (subreg:QI (reg:SI X))))
>>
>> Oops, that should of course be:
>>
>> (set (reg:SI R) (zero_extend:SI (subreg:QI (reg:SI R))))
>>
>> > which avoids the problem of having two values live at once (the
>> > zero-extended value and the any-extended value).
>
> I'm not sure it does, as the any-extended value must remain live.  i.e.
> above you can't get rid of w22, you can only choose between having it
> be zero or any extended.  But I am not sure how you would do that after
> expand.

These per-operation extends are emitted during expand.  The question is
whether we do them into fresh registers:

  (set (reg:SI Rtmp) (zero_extend:SI (subreg:QI (reg:SI R))))

which leaves both R and Rtmp live at the same time, or whether we do them
in-situ:

  (set (reg:SI R) (zero_extend:SI (subreg:QI (reg:SI R))))

Expand should know that the latter is valid, given the DECL_RTL.

Thanks,
Richard
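
A minimal sketch of how the in-situ form could be emitted at expand time,
assuming DECL_RTL really is (subreg:QI (reg:SI R) 0); the helper name and
its surrounding expand context are illustrative only, not an existing
interface:

    /* Illustrative sketch only.  Given DECL_RTL of the form
       (subreg:QI (reg:SI R) 0), emit the extension back into R itself,
       i.e. (set (reg:SI R) (zero_extend:SI (subreg:QI (reg:SI R) 0))),
       so that no second, zero-extended copy of the value stays live.  */
    static void
    emit_insitu_zero_extend (rtx decl_rtl)
    {
      gcc_assert (GET_CODE (decl_rtl) == SUBREG);
      rtx reg = SUBREG_REG (decl_rtl);                  /* (reg:SI R)  */
      rtx ext = gen_rtx_ZERO_EXTEND (GET_MODE (reg), decl_rtl);
      emit_insn (gen_rtx_SET (reg, ext));
    }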