On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote: > On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw > <richard.earns...@foss.arm.com> wrote: >> >> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote: >>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw >>> <richard.earns...@foss.arm.com> wrote: >>>> >>>> On 20/10/2020 12:22, Richard Earnshaw wrote: >>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote: >>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw >>>>>> <richard.earns...@foss.arm.com> wrote: >>>>>>> >>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote: >>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw >>>>>>>> <richard.earns...@foss.arm.com> wrote: >>>>>>>>> >>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw >>>>>>>>>> <richard.earns...@foss.arm.com> wrote: >>>>>>>>>>> >>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load >>>>>>>>>>>> delta >>>>>>>>>>>> from code memory (like we do without -mpure-code). >>>>>>>>>>>> >>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series of >>>>>>>>>>>> movs/adds/lsls. >>>>>>>>>>>> >>>>>>>>>>>> We also do some cleanup by not emitting the function address and >>>>>>>>>>>> delta >>>>>>>>>>>> via .word directives at the end of the thunk since we don't use >>>>>>>>>>>> them >>>>>>>>>>>> with -mpure-code. >>>>>>>>>>>> >>>>>>>>>>>> No need for new testcases, this bug was already identified by >>>>>>>>>>>> eg. pr46287-3.C >>>>>>>>>>>> >>>>>>>>>>>> 2020-09-29 Christophe Lyon <christophe.l...@linaro.org> >>>>>>>>>>>> >>>>>>>>>>>> gcc/ >>>>>>>>>>>> * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in >>>>>>>>>>>> r3 and >>>>>>>>>>>> do not emit function address and delta when -mpure-code is >>>>>>>>>>>> used. >>>>>>>>>>> >>>>>>>>>> Hi Richard, >>>>>>>>>> >>>>>>>>>> Thanks for your comments. >>>>>>>>>> >>>>>>>>>>> There are some optimizations you can make to this code. >>>>>>>>>>> >>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be >>>>>>>>>>> better >>>>>>>>>>> to just expand a mov of 255 followed by an add. >>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed >>>>>>>>>> before, so I can write something similar indeed. >>>>>>>>>> >>>>>>>>>> However, I'm note quite sure to understand the benefit in the split >>>>>>>>>> when -mpure-code is NOT used. >>>>>>>>>> Consider: >>>>>>>>>> int f3_1 (void) { return 510; } >>>>>>>>>> int f3_2 (void) { return 511; } >>>>>>>>>> Compile with -O2 -mcpu=cortex-m0: >>>>>>>>>> f3_1: >>>>>>>>>> movs r0, #255 >>>>>>>>>> lsls r0, r0, #1 >>>>>>>>>> bx lr >>>>>>>>>> f3_2: >>>>>>>>>> ldr r0, .L4 >>>>>>>>>> bx lr >>>>>>>>>> >>>>>>>>>> The splitter makes the code bigger, does it "compensate" for this by >>>>>>>>>> not having to load the constant? >>>>>>>>>> Actually the constant uses 4 more bytes, which should be taken into >>>>>>>>>> account when comparing code size, >>>>>>>>> >>>>>>>>> Yes, the size of the literal pool entry needs to be taken into >>>>>>>>> account. >>>>>>>>> It might happen that the entry could be shared with another use of >>>>>>>>> that >>>>>>>>> literal, but in general that's rare. >>>>>>>>> >>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three >>>>>>>>>> thumb1 instructions would be equivalent in size compared to loading >>>>>>>>>> from the literal pool. Should the 256-510 range be extended? >>>>>>>>> >>>>>>>>> It's a bit borderline at three instructions when literal pools are not >>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite small >>>>>>>>> due >>>>>>>>> to the limited pc offsets we can use. I think on balance we probably >>>>>>>>> want to use the instruction sequence unless optimizing for size. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> This is also true for >>>>>>>>>>> the literal pools alternative as well, so should be handled before >>>>>>>>>>> all >>>>>>>>>>> this. >>>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample is >>>>>>>>>> compiled as: >>>>>>>>>> f3_1: >>>>>>>>>> movs r0, #255 >>>>>>>>>> lsls r0, r0, #1 >>>>>>>>>> bx lr >>>>>>>>>> f3_2: >>>>>>>>>> movs r0, #1 >>>>>>>>>> lsls r0, r0, #8 >>>>>>>>>> adds r0, r0, #255 >>>>>>>>>> bx lr >>>>>>>>>> >>>>>>>>>> so the "return 510" case is already handled as without -mpure-code. >>>>>>>>> >>>>>>>>> I was thinking specifically of the thunk sequence where you seem to be >>>>>>>>> emitting instructions directly rather than generating RTL. The >>>>>>>>> examples >>>>>>>>> you show here are not thunks. >>>>>>>>> >>>>>>>> OK thanks for the clarification. >>>>>>>> >>>>>>>> Here is an updated version, split into 3 patches to hopefully make >>>>>>>> review easier. >>>>>>>> They apply on top of my other mpure-code patches for PR96967 and >>>>>>>> PR96770: >>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html >>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html >>>>>>>> >>>>>>>> I kept it this way to make incremental changes easier to understand. >>>>>>>> >>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance easier, >>>>>>>> I have updated thumb1_gen_const_int() so that it can generate either >>>>>>>> RTL or >>>>>>>> asm. This way, all the code used to build thumb-1 constants is in the >>>>>>>> same place, >>>>>>>> in case we need to improve/fix it later. We now generate shorter >>>>>>>> sequences in >>>>>>>> several cases matching your comments. >>>>>>>> >>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and >>>>>>>> calls thumb1_gen_const_int. >>>>>>>> >>>>>>>> Patch 3: Update of the original patch in this thread, now calls >>>>>>>> thumb1_gen_const_int. >>>>>>> >>>>>>> Yuk! Those changes to thumb1_gen_const_int are horrible. >>>>>>> >>>>>>> I think we should be able to leverage the fact that the compiler can use >>>>>>> C++ now to do much better than that, for example by making that function >>>>>>> a template. For example (and this is just a sketch): >>>>>>> >>>>>> >>>>>> Indeed! I didn't think about it since there is no other use of >>>>>> templates in arm.c yet. >>>>>> I'll send an update soon. >>>>>> >>>>>> Other than that, does the approach look OK to you? >>>>>> >>>>> >>>>> Yes, I think this is heading in the right direction. Bringing the two >>>>> immediate generating operations into a single function can only be a >>>>> good thing. >>>>> >>>>> Looking again at your example constant sequences, I see: >>>>> >>>>> 0x1000010: >>>>> movs r3, #16 >>>>> lsls r3, #16 >>>>> adds r3, #1 >>>>> lsls r3, #4 >>>>> 0x1000011: >>>>> movs r3, #1 >>>>> lsls r3, #24 >>>>> adds r3, #17 >>>>> >>>>> The first of these looks odd, given the second sequence. Why doesn't >>>>> the first expand to >>>>> >>>>> 0x1000010: >>>>> movs r3, #16 >>>>> lsls r3, #16 >>>>> adds r3, #16 >>>>> >>>> Err, I mean to: >>>> >>>> >>>> 0x1000010: >>>> movs r3, #1 >>>> lsls r3, #24 >>>> adds r3, #16 >>>> >>>> ? >>> >>> Because I first try to right-shift the constant, hoping to reduce its >>> range and need less instructions to build the higher part, then >>> left-shift back. >>> >>> In this particular case, we'd need to realize that there are many >>> zeros "inside" the constant. >>> >>> If I remove the part that tries to reduce the range, I do get that >>> sequence, but for 764 I now generate >>> movs r3, #2 >>> lsls r3, #8 >>> adds r3, #252 >>> instead of >>> movs r3, #191 >>> lsls r3, #2 >>> >>> A possibility would be to try both approaches and keep the shortest one. >> >> Lets leave that for now, it's not important to fixing the main issue; >> but we should remember we need to come back to it at some point. >> > Thanks, that's what I was thinking too. > >> There are other tricks as well, such as >> >> 0xffffff >> >> can be done as >> >> 0x1000000 - 1 >> >> and >> >> 0xfffffd >> >> as >> >> 0x1000000 - 3 >> >> but these can wait as well. >> > > Didn't we already need to handle such tricks? I'm surprised this > wasn't needed earlier. >
I don't think we ever worried about them. Most of them need at least 3 instructions so aren't a code size saving over using a literal pool entry. R. > >> >> R. >> >>> >>> >>>>> >>>>> R. >>>>> >>>>>> Thanks, >>>>>> >>>>>> Christophe >>>>>> >>>>>>> class t1_rtl >>>>>>> { >>>>>>> public: >>>>>>> void ashift(int a) { gen_rtx_ASHIFT(a); } >>>>>>> void rshift(int b) { gen_rtx_SHIFTRT(b); } >>>>>>> }; >>>>>>> >>>>>>> class t1_print >>>>>>> { >>>>>>> public: >>>>>>> t1_print (FILE *f) : t_file(f) {} >>>>>>> void ashift (int a) { fprintf (t_file, "a shift %d\n", a); } >>>>>>> void rshift (int b) { fprintf (t_file, "r shift %d\n", b); } >>>>>>> private: >>>>>>> FILE *t_file; >>>>>>> }; >>>>>>> >>>>>>> template <class T> >>>>>>> void thumb1_gen_const_int(T t, int f) >>>>>>> { >>>>>>> // Expansion of thumb1_gen_const_int ... >>>>>>> t.ashift(f); >>>>>>> } >>>>>>> >>>>>>> // Usage... >>>>>>> void f1() >>>>>>> { >>>>>>> // Use the RTL expander >>>>>>> t1_rtl g; >>>>>>> thumb1_gen_const_int (g, 3); >>>>>>> } >>>>>>> >>>>>>> void f2() >>>>>>> { >>>>>>> // Use the printf expander writing to stdout >>>>>>> t1_print g(stdout); >>>>>>> thumb1_gen_const_int (g, 3); >>>>>>> } >>>>>>> >>>>>>> With this you can write thumb1_gen_const_int without having to worry >>>>>>> about which expander is being used in each instance and the template >>>>>>> expansion will use the right version. >>>>>>> >>>>>>> R. >>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>>> I also suspect (but haven't check) that the base adjustment will >>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4). If >>>>>>>>>>> that is >>>>>>>>>>> the case then you could generate n/4 and then shift it left by 2 >>>>>>>>>>> for an >>>>>>>>>>> even greater range of literals. >>>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY >>>>>>>>>> case, >>>>>>>>>> I'll update my patch. >>>>>>>>>> >>>>>>>>>>> More generally, any sequence of up to >>>>>>>>>>> three thumb1 instructions will be no larger, and probably as fast >>>>>>>>>>> as the >>>>>>>>>>> existing literal pool fall back. >>>>>>>>>>> >>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code >>>>>>>>>>> will >>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts >>>>>>>>>>> could >>>>>>>>>>> be merged into a single shift-by-16. >>>>>>>>>> >>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const. >>>>>>>>>> >>>>>>>>>>> Finally, I'd really like to see some executable tests for this, if >>>>>>>>>>> at >>>>>>>>>>> all possible. >>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase >>>>>>>>>> that showed the problem. There are also: >>>>>>>>>> g++.dg/opt/thunk1.C >>>>>>>>>> g++.dg/ipa/pr46984.C >>>>>>>>>> g++.dg/torture/pr46287.C >>>>>>>>>> g++.dg/torture/pr45699.C >>>>>>>>>> >>>>>>>>>> Do you want that I copy one of these in the arm subdir and add >>>>>>>>>> -mpure-code in dg-options? >>>>>>>>> >>>>>>>>> On reflection, probably not - that just makes things more complicated >>>>>>>>> with all the dg-options mess (I'm worried about interactions with >>>>>>>>> other >>>>>>>>> sets of options on the command line and the fall-out from that). If >>>>>>>>> someone cares about pure-code they should be doing full testsuite runs >>>>>>>>> with it enabled and that should be sufficient. >>>>>>>> >>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a >>>>>>>> modified simulator. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> Christophe >>>>>>>> >>>>>>>>> >>>>>>>>> R. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> >>>>>>>>>> Christophe >>>>>>>>>> >>>>>>>>>>> R. >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> k# (use "git pull" to merge the remote branch into yours) >>>>>>>>>>>> --- >>>>>>>>>>>> gcc/config/arm/arm.c | 91 >>>>>>>>>>>> +++++++++++++++++++++++++++++++++++++--------------- >>>>>>>>>>>> 1 file changed, 66 insertions(+), 25 deletions(-) >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c >>>>>>>>>>>> index ceeb91f..62abeb5 100644 >>>>>>>>>>>> --- a/gcc/config/arm/arm.c >>>>>>>>>>>> +++ b/gcc/config/arm/arm.c >>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, >>>>>>>>>>>> HOST_WIDE_INT delta, >>>>>>>>>>>> { >>>>>>>>>>>> if (mi_delta > 255) >>>>>>>>>>>> { >>>>>>>>>>>> - fputs ("\tldr\tr3, ", file); >>>>>>>>>>>> - assemble_name (file, label); >>>>>>>>>>>> - fputs ("+4\n", file); >>>>>>>>>>>> + /* With -mpure-code, we cannot load delta from the constant >>>>>>>>>>>> + pool: we build it explicitly. */ >>>>>>>>>>>> + if (target_pure_code) >>>>>>>>>>>> + { >>>>>>>>>>>> + bool mov_done_p = false; >>>>>>>>>>>> + int i; >>>>>>>>>>>> + >>>>>>>>>>>> + /* Emit upper 3 bytes if needed. */ >>>>>>>>>>>> + for (i = 0; i < 3; i++) >>>>>>>>>>>> + { >>>>>>>>>>>> + int byte = (mi_delta >> (8 * (3 - i))) & 0xff; >>>>>>>>>>>> + >>>>>>>>>>>> + if (byte) >>>>>>>>>>>> + { >>>>>>>>>>>> + if (mov_done_p) >>>>>>>>>>>> + asm_fprintf (file, "\tadds\tr3, #%d\n", >>>>>>>>>>>> byte); >>>>>>>>>>>> + else >>>>>>>>>>>> + asm_fprintf (file, "\tmovs\tr3, #%d\n", >>>>>>>>>>>> byte); >>>>>>>>>>>> + mov_done_p = true; >>>>>>>>>>>> + } >>>>>>>>>>>> + >>>>>>>>>>>> + if (mov_done_p) >>>>>>>>>>>> + asm_fprintf (file, "\tlsls\tr3, #8\n"); >>>>>>>>>>>> + } >>>>>>>>>>>> + >>>>>>>>>>>> + /* Emit lower byte if needed. */ >>>>>>>>>>>> + if (!mov_done_p) >>>>>>>>>>>> + asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & >>>>>>>>>>>> 0xff); >>>>>>>>>>>> + else if (mi_delta & 0xff) >>>>>>>>>>>> + asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & >>>>>>>>>>>> 0xff); >>>>>>>>>>>> + } >>>>>>>>>>>> + else >>>>>>>>>>>> + { >>>>>>>>>>>> + fputs ("\tldr\tr3, ", file); >>>>>>>>>>>> + assemble_name (file, label); >>>>>>>>>>>> + fputs ("+4\n", file); >>>>>>>>>>>> + } >>>>>>>>>>>> asm_fprintf (file, "\t%ss\t%r, %r, r3\n", >>>>>>>>>>>> mi_op, this_regno, this_regno); >>>>>>>>>>>> } >>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, >>>>>>>>>>>> HOST_WIDE_INT delta, >>>>>>>>>>>> fputs ("\tpop\t{r3}\n", file); >>>>>>>>>>>> >>>>>>>>>>>> fprintf (file, "\tbx\tr12\n"); >>>>>>>>>>>> - ASM_OUTPUT_ALIGN (file, 2); >>>>>>>>>>>> - assemble_name (file, label); >>>>>>>>>>>> - fputs (":\n", file); >>>>>>>>>>>> - if (flag_pic) >>>>>>>>>>>> + >>>>>>>>>>>> + /* With -mpure-code, we don't need to emit literals for the >>>>>>>>>>>> + function address and delta since we emitted code to build >>>>>>>>>>>> + them. */ >>>>>>>>>>>> + if (!target_pure_code) >>>>>>>>>>>> { >>>>>>>>>>>> - /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ >>>>>>>>>>>> - rtx tem = XEXP (DECL_RTL (function), 0); >>>>>>>>>>>> - /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so >>>>>>>>>>>> the PC >>>>>>>>>>>> - pipeline offset is four rather than eight. Adjust the >>>>>>>>>>>> offset >>>>>>>>>>>> - accordingly. */ >>>>>>>>>>>> - tem = plus_constant (GET_MODE (tem), tem, >>>>>>>>>>>> - TARGET_THUMB1_ONLY ? -3 : -7); >>>>>>>>>>>> - tem = gen_rtx_MINUS (GET_MODE (tem), >>>>>>>>>>>> - tem, >>>>>>>>>>>> - gen_rtx_SYMBOL_REF (Pmode, >>>>>>>>>>>> - ggc_strdup >>>>>>>>>>>> (labelpc))); >>>>>>>>>>>> - assemble_integer (tem, 4, BITS_PER_WORD, 1); >>>>>>>>>>>> - } >>>>>>>>>>>> - else >>>>>>>>>>>> - /* Output ".word .LTHUNKn". */ >>>>>>>>>>>> - assemble_integer (XEXP (DECL_RTL (function), 0), 4, >>>>>>>>>>>> BITS_PER_WORD, 1); >>>>>>>>>>>> + ASM_OUTPUT_ALIGN (file, 2); >>>>>>>>>>>> + assemble_name (file, label); >>>>>>>>>>>> + fputs (":\n", file); >>>>>>>>>>>> + if (flag_pic) >>>>>>>>>>>> + { >>>>>>>>>>>> + /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ >>>>>>>>>>>> + rtx tem = XEXP (DECL_RTL (function), 0); >>>>>>>>>>>> + /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, >>>>>>>>>>>> so the PC >>>>>>>>>>>> + pipeline offset is four rather than eight. Adjust >>>>>>>>>>>> the offset >>>>>>>>>>>> + accordingly. */ >>>>>>>>>>>> + tem = plus_constant (GET_MODE (tem), tem, >>>>>>>>>>>> + TARGET_THUMB1_ONLY ? -3 : -7); >>>>>>>>>>>> + tem = gen_rtx_MINUS (GET_MODE (tem), >>>>>>>>>>>> + tem, >>>>>>>>>>>> + gen_rtx_SYMBOL_REF (Pmode, >>>>>>>>>>>> + ggc_strdup >>>>>>>>>>>> (labelpc))); >>>>>>>>>>>> + assemble_integer (tem, 4, BITS_PER_WORD, 1); >>>>>>>>>>>> + } >>>>>>>>>>>> + else >>>>>>>>>>>> + /* Output ".word .LTHUNKn". */ >>>>>>>>>>>> + assemble_integer (XEXP (DECL_RTL (function), 0), 4, >>>>>>>>>>>> BITS_PER_WORD, 1); >>>>>>>>>>>> >>>>>>>>>>>> - if (TARGET_THUMB1_ONLY && mi_delta > 255) >>>>>>>>>>>> - assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1); >>>>>>>>>>>> + if (TARGET_THUMB1_ONLY && mi_delta > 255) >>>>>>>>>>>> + assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, >>>>>>>>>>>> 1); >>>>>>>>>>>> + } >>>>>>>>>>>> } >>>>>>>>>>>> else >>>>>>>>>>>> { >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>>>>> >>>>> >>>> >>