On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote: > On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw > <richard.earns...@foss.arm.com> wrote: >> >> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote: >>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw >>> <richard.earns...@foss.arm.com> wrote: >>>> >>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote: >>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw >>>>> <richard.earns...@foss.arm.com> wrote: >>>>>> >>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote: >>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta >>>>>>> from code memory (like we do without -mpure-code). >>>>>>> >>>>>>> This patch builds the value of mi_delta into r3 with a series of >>>>>>> movs/adds/lsls. >>>>>>> >>>>>>> We also do some cleanup by not emitting the function address and delta >>>>>>> via .word directives at the end of the thunk since we don't use them >>>>>>> with -mpure-code. >>>>>>> >>>>>>> No need for new testcases, this bug was already identified by >>>>>>> eg. pr46287-3.C >>>>>>> >>>>>>> 2020-09-29 Christophe Lyon <christophe.l...@linaro.org> >>>>>>> >>>>>>> gcc/ >>>>>>> * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and >>>>>>> do not emit function address and delta when -mpure-code is used. >>>>>> >>>>> Hi Richard, >>>>> >>>>> Thanks for your comments. >>>>> >>>>>> There are some optimizations you can make to this code. >>>>>> >>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better >>>>>> to just expand a mov of 255 followed by an add. >>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed >>>>> before, so I can write something similar indeed. >>>>> >>>>> However, I'm note quite sure to understand the benefit in the split >>>>> when -mpure-code is NOT used. >>>>> Consider: >>>>> int f3_1 (void) { return 510; } >>>>> int f3_2 (void) { return 511; } >>>>> Compile with -O2 -mcpu=cortex-m0: >>>>> f3_1: >>>>> movs r0, #255 >>>>> lsls r0, r0, #1 >>>>> bx lr >>>>> f3_2: >>>>> ldr r0, .L4 >>>>> bx lr >>>>> >>>>> The splitter makes the code bigger, does it "compensate" for this by >>>>> not having to load the constant? >>>>> Actually the constant uses 4 more bytes, which should be taken into >>>>> account when comparing code size, >>>> >>>> Yes, the size of the literal pool entry needs to be taken into account. >>>> It might happen that the entry could be shared with another use of that >>>> literal, but in general that's rare. >>>> >>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three >>>>> thumb1 instructions would be equivalent in size compared to loading >>>>> from the literal pool. Should the 256-510 range be extended? >>>> >>>> It's a bit borderline at three instructions when literal pools are not >>>> expensive to use, but in thumb1 literal pools tend to be quite small due >>>> to the limited pc offsets we can use. I think on balance we probably >>>> want to use the instruction sequence unless optimizing for size. >>>> >>>>> >>>>> >>>>>> This is also true for >>>>>> the literal pools alternative as well, so should be handled before all >>>>>> this. >>>>> I am not sure what you mean: with -mpure-code, the above sample is >>>>> compiled as: >>>>> f3_1: >>>>> movs r0, #255 >>>>> lsls r0, r0, #1 >>>>> bx lr >>>>> f3_2: >>>>> movs r0, #1 >>>>> lsls r0, r0, #8 >>>>> adds r0, r0, #255 >>>>> bx lr >>>>> >>>>> so the "return 510" case is already handled as without -mpure-code. >>>> >>>> I was thinking specifically of the thunk sequence where you seem to be >>>> emitting instructions directly rather than generating RTL. The examples >>>> you show here are not thunks. >>>> >>> OK thanks for the clarification. >>> >>> Here is an updated version, split into 3 patches to hopefully make >>> review easier. >>> They apply on top of my other mpure-code patches for PR96967 and PR96770: >>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html >>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html >>> >>> I kept it this way to make incremental changes easier to understand. >>> >>> Patch 1: With the hope to avoid confusion and make maintenance easier, >>> I have updated thumb1_gen_const_int() so that it can generate either RTL or >>> asm. This way, all the code used to build thumb-1 constants is in the >>> same place, >>> in case we need to improve/fix it later. We now generate shorter sequences >>> in >>> several cases matching your comments. >>> >>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and >>> calls thumb1_gen_const_int. >>> >>> Patch 3: Update of the original patch in this thread, now calls >>> thumb1_gen_const_int. >> >> Yuk! Those changes to thumb1_gen_const_int are horrible. >> >> I think we should be able to leverage the fact that the compiler can use >> C++ now to do much better than that, for example by making that function >> a template. For example (and this is just a sketch): >> > > Indeed! I didn't think about it since there is no other use of > templates in arm.c yet. > I'll send an update soon. > > Other than that, does the approach look OK to you? >
Yes, I think this is heading in the right direction. Bringing the two immediate generating operations into a single function can only be a good thing. Looking again at your example constant sequences, I see: 0x1000010: movs r3, #16 lsls r3, #16 adds r3, #1 lsls r3, #4 0x1000011: movs r3, #1 lsls r3, #24 adds r3, #17 The first of these looks odd, given the second sequence. Why doesn't the first expand to 0x1000010: movs r3, #16 lsls r3, #16 adds r3, #16 ? R. > Thanks, > > Christophe > >> class t1_rtl >> { >> public: >> void ashift(int a) { gen_rtx_ASHIFT(a); } >> void rshift(int b) { gen_rtx_SHIFTRT(b); } >> }; >> >> class t1_print >> { >> public: >> t1_print (FILE *f) : t_file(f) {} >> void ashift (int a) { fprintf (t_file, "a shift %d\n", a); } >> void rshift (int b) { fprintf (t_file, "r shift %d\n", b); } >> private: >> FILE *t_file; >> }; >> >> template <class T> >> void thumb1_gen_const_int(T t, int f) >> { >> // Expansion of thumb1_gen_const_int ... >> t.ashift(f); >> } >> >> // Usage... >> void f1() >> { >> // Use the RTL expander >> t1_rtl g; >> thumb1_gen_const_int (g, 3); >> } >> >> void f2() >> { >> // Use the printf expander writing to stdout >> t1_print g(stdout); >> thumb1_gen_const_int (g, 3); >> } >> >> With this you can write thumb1_gen_const_int without having to worry >> about which expander is being used in each instance and the template >> expansion will use the right version. >> >> R. >> >>> >>>> >>>>> >>>>>> I also suspect (but haven't check) that the base adjustment will >>>>>> most commonly be a multiple of the machine word size (ie 4). If that is >>>>>> the case then you could generate n/4 and then shift it left by 2 for an >>>>>> even greater range of literals. >>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case, >>>>> I'll update my patch. >>>>> >>>>>> More generally, any sequence of up to >>>>>> three thumb1 instructions will be no larger, and probably as fast as the >>>>>> existing literal pool fall back. >>>>>> >>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will >>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could >>>>>> be merged into a single shift-by-16. >>>>> >>>>> Right, I'll try to make use of thumb_shiftable_const. >>>>> >>>>>> Finally, I'd really like to see some executable tests for this, if at >>>>>> all possible. >>>>> I mentioned pr46287-3.C, but that's not the only existing testcase >>>>> that showed the problem. There are also: >>>>> g++.dg/opt/thunk1.C >>>>> g++.dg/ipa/pr46984.C >>>>> g++.dg/torture/pr46287.C >>>>> g++.dg/torture/pr45699.C >>>>> >>>>> Do you want that I copy one of these in the arm subdir and add >>>>> -mpure-code in dg-options? >>>> >>>> On reflection, probably not - that just makes things more complicated >>>> with all the dg-options mess (I'm worried about interactions with other >>>> sets of options on the command line and the fall-out from that). If >>>> someone cares about pure-code they should be doing full testsuite runs >>>> with it enabled and that should be sufficient. >>> >>> Yes, that's what I am doing manually, it's a bit tricky, and I use a >>> modified simulator. >>> >>> Thanks, >>> >>> Christophe >>> >>>> >>>> R. >>>> >>>>> >>>>> Thanks, >>>>> >>>>> Christophe >>>>> >>>>>> R. >>>>>> >>>>>>> >>>>>>> k# (use "git pull" to merge the remote branch into yours) >>>>>>> --- >>>>>>> gcc/config/arm/arm.c | 91 >>>>>>> +++++++++++++++++++++++++++++++++++++--------------- >>>>>>> 1 file changed, 66 insertions(+), 25 deletions(-) >>>>>>> >>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c >>>>>>> index ceeb91f..62abeb5 100644 >>>>>>> --- a/gcc/config/arm/arm.c >>>>>>> +++ b/gcc/config/arm/arm.c >>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, >>>>>>> HOST_WIDE_INT delta, >>>>>>> { >>>>>>> if (mi_delta > 255) >>>>>>> { >>>>>>> - fputs ("\tldr\tr3, ", file); >>>>>>> - assemble_name (file, label); >>>>>>> - fputs ("+4\n", file); >>>>>>> + /* With -mpure-code, we cannot load delta from the constant >>>>>>> + pool: we build it explicitly. */ >>>>>>> + if (target_pure_code) >>>>>>> + { >>>>>>> + bool mov_done_p = false; >>>>>>> + int i; >>>>>>> + >>>>>>> + /* Emit upper 3 bytes if needed. */ >>>>>>> + for (i = 0; i < 3; i++) >>>>>>> + { >>>>>>> + int byte = (mi_delta >> (8 * (3 - i))) & 0xff; >>>>>>> + >>>>>>> + if (byte) >>>>>>> + { >>>>>>> + if (mov_done_p) >>>>>>> + asm_fprintf (file, "\tadds\tr3, #%d\n", byte); >>>>>>> + else >>>>>>> + asm_fprintf (file, "\tmovs\tr3, #%d\n", byte); >>>>>>> + mov_done_p = true; >>>>>>> + } >>>>>>> + >>>>>>> + if (mov_done_p) >>>>>>> + asm_fprintf (file, "\tlsls\tr3, #8\n"); >>>>>>> + } >>>>>>> + >>>>>>> + /* Emit lower byte if needed. */ >>>>>>> + if (!mov_done_p) >>>>>>> + asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff); >>>>>>> + else if (mi_delta & 0xff) >>>>>>> + asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff); >>>>>>> + } >>>>>>> + else >>>>>>> + { >>>>>>> + fputs ("\tldr\tr3, ", file); >>>>>>> + assemble_name (file, label); >>>>>>> + fputs ("+4\n", file); >>>>>>> + } >>>>>>> asm_fprintf (file, "\t%ss\t%r, %r, r3\n", >>>>>>> mi_op, this_regno, this_regno); >>>>>>> } >>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, >>>>>>> HOST_WIDE_INT delta, >>>>>>> fputs ("\tpop\t{r3}\n", file); >>>>>>> >>>>>>> fprintf (file, "\tbx\tr12\n"); >>>>>>> - ASM_OUTPUT_ALIGN (file, 2); >>>>>>> - assemble_name (file, label); >>>>>>> - fputs (":\n", file); >>>>>>> - if (flag_pic) >>>>>>> + >>>>>>> + /* With -mpure-code, we don't need to emit literals for the >>>>>>> + function address and delta since we emitted code to build >>>>>>> + them. */ >>>>>>> + if (!target_pure_code) >>>>>>> { >>>>>>> - /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ >>>>>>> - rtx tem = XEXP (DECL_RTL (function), 0); >>>>>>> - /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC >>>>>>> - pipeline offset is four rather than eight. Adjust the offset >>>>>>> - accordingly. */ >>>>>>> - tem = plus_constant (GET_MODE (tem), tem, >>>>>>> - TARGET_THUMB1_ONLY ? -3 : -7); >>>>>>> - tem = gen_rtx_MINUS (GET_MODE (tem), >>>>>>> - tem, >>>>>>> - gen_rtx_SYMBOL_REF (Pmode, >>>>>>> - ggc_strdup (labelpc))); >>>>>>> - assemble_integer (tem, 4, BITS_PER_WORD, 1); >>>>>>> - } >>>>>>> - else >>>>>>> - /* Output ".word .LTHUNKn". */ >>>>>>> - assemble_integer (XEXP (DECL_RTL (function), 0), 4, >>>>>>> BITS_PER_WORD, 1); >>>>>>> + ASM_OUTPUT_ALIGN (file, 2); >>>>>>> + assemble_name (file, label); >>>>>>> + fputs (":\n", file); >>>>>>> + if (flag_pic) >>>>>>> + { >>>>>>> + /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ >>>>>>> + rtx tem = XEXP (DECL_RTL (function), 0); >>>>>>> + /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so >>>>>>> the PC >>>>>>> + pipeline offset is four rather than eight. Adjust the >>>>>>> offset >>>>>>> + accordingly. */ >>>>>>> + tem = plus_constant (GET_MODE (tem), tem, >>>>>>> + TARGET_THUMB1_ONLY ? -3 : -7); >>>>>>> + tem = gen_rtx_MINUS (GET_MODE (tem), >>>>>>> + tem, >>>>>>> + gen_rtx_SYMBOL_REF (Pmode, >>>>>>> + ggc_strdup >>>>>>> (labelpc))); >>>>>>> + assemble_integer (tem, 4, BITS_PER_WORD, 1); >>>>>>> + } >>>>>>> + else >>>>>>> + /* Output ".word .LTHUNKn". */ >>>>>>> + assemble_integer (XEXP (DECL_RTL (function), 0), 4, >>>>>>> BITS_PER_WORD, 1); >>>>>>> >>>>>>> - if (TARGET_THUMB1_ONLY && mi_delta > 255) >>>>>>> - assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1); >>>>>>> + if (TARGET_THUMB1_ONLY && mi_delta > 255) >>>>>>> + assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1); >>>>>>> + } >>>>>>> } >>>>>>> else >>>>>>> { >>>>>>> >>>>>> >>>> >>