On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw <richard.earns...@foss.arm.com> wrote: > > On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote: > > On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw > > <richard.earns...@foss.arm.com> wrote: > >> > >> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote: > >>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw > >>> <richard.earns...@foss.arm.com> wrote: > >>>> > >>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote: > >>>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw > >>>>> <richard.earns...@foss.arm.com> wrote: > >>>>>> > >>>>>> On 20/10/2020 12:22, Richard Earnshaw wrote: > >>>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote: > >>>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw > >>>>>>>> <richard.earns...@foss.arm.com> wrote: > >>>>>>>>> > >>>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote: > >>>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw > >>>>>>>>>> <richard.earns...@foss.arm.com> wrote: > >>>>>>>>>>> > >>>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote: > >>>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw > >>>>>>>>>>>> <richard.earns...@foss.arm.com> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote: > >>>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load > >>>>>>>>>>>>>> delta > >>>>>>>>>>>>>> from code memory (like we do without -mpure-code). > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series > >>>>>>>>>>>>>> of > >>>>>>>>>>>>>> movs/adds/lsls. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> We also do some cleanup by not emitting the function address > >>>>>>>>>>>>>> and delta > >>>>>>>>>>>>>> via .word directives at the end of the thunk since we don't > >>>>>>>>>>>>>> use them > >>>>>>>>>>>>>> with -mpure-code. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> No need for new testcases, this bug was already identified by > >>>>>>>>>>>>>> eg. pr46287-3.C > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 2020-09-29 Christophe Lyon <christophe.l...@linaro.org> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> gcc/ > >>>>>>>>>>>>>> * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta > >>>>>>>>>>>>>> in r3 and > >>>>>>>>>>>>>> do not emit function address and delta when -mpure-code > >>>>>>>>>>>>>> is used. > >>>>>>>>>>>>> > >>>>>>>>>>>> Hi Richard, > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks for your comments. > >>>>>>>>>>>> > >>>>>>>>>>>>> There are some optimizations you can make to this code. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would > >>>>>>>>>>>>> be better > >>>>>>>>>>>>> to just expand a mov of 255 followed by an add. > >>>>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't > >>>>>>>>>>>> noticed > >>>>>>>>>>>> before, so I can write something similar indeed. > >>>>>>>>>>>> > >>>>>>>>>>>> However, I'm note quite sure to understand the benefit in the > >>>>>>>>>>>> split > >>>>>>>>>>>> when -mpure-code is NOT used. > >>>>>>>>>>>> Consider: > >>>>>>>>>>>> int f3_1 (void) { return 510; } > >>>>>>>>>>>> int f3_2 (void) { return 511; } > >>>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0: > >>>>>>>>>>>> f3_1: > >>>>>>>>>>>> movs r0, #255 > >>>>>>>>>>>> lsls r0, r0, #1 > >>>>>>>>>>>> bx lr > >>>>>>>>>>>> f3_2: > >>>>>>>>>>>> ldr r0, .L4 > >>>>>>>>>>>> bx lr > >>>>>>>>>>>> > >>>>>>>>>>>> The splitter makes the code bigger, does it "compensate" for > >>>>>>>>>>>> this by > >>>>>>>>>>>> not having to load the constant? > >>>>>>>>>>>> Actually the constant uses 4 more bytes, which should be taken > >>>>>>>>>>>> into > >>>>>>>>>>>> account when comparing code size, > >>>>>>>>>>> > >>>>>>>>>>> Yes, the size of the literal pool entry needs to be taken into > >>>>>>>>>>> account. > >>>>>>>>>>> It might happen that the entry could be shared with another use > >>>>>>>>>>> of that > >>>>>>>>>>> literal, but in general that's rare. > >>>>>>>>>>> > >>>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three > >>>>>>>>>>>> thumb1 instructions would be equivalent in size compared to > >>>>>>>>>>>> loading > >>>>>>>>>>>> from the literal pool. Should the 256-510 range be extended? > >>>>>>>>>>> > >>>>>>>>>>> It's a bit borderline at three instructions when literal pools > >>>>>>>>>>> are not > >>>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite > >>>>>>>>>>> small due > >>>>>>>>>>> to the limited pc offsets we can use. I think on balance we > >>>>>>>>>>> probably > >>>>>>>>>>> want to use the instruction sequence unless optimizing for size. > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> This is also true for > >>>>>>>>>>>>> the literal pools alternative as well, so should be handled > >>>>>>>>>>>>> before all > >>>>>>>>>>>>> this. > >>>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample > >>>>>>>>>>>> is compiled as: > >>>>>>>>>>>> f3_1: > >>>>>>>>>>>> movs r0, #255 > >>>>>>>>>>>> lsls r0, r0, #1 > >>>>>>>>>>>> bx lr > >>>>>>>>>>>> f3_2: > >>>>>>>>>>>> movs r0, #1 > >>>>>>>>>>>> lsls r0, r0, #8 > >>>>>>>>>>>> adds r0, r0, #255 > >>>>>>>>>>>> bx lr > >>>>>>>>>>>> > >>>>>>>>>>>> so the "return 510" case is already handled as without > >>>>>>>>>>>> -mpure-code. > >>>>>>>>>>> > >>>>>>>>>>> I was thinking specifically of the thunk sequence where you seem > >>>>>>>>>>> to be > >>>>>>>>>>> emitting instructions directly rather than generating RTL. The > >>>>>>>>>>> examples > >>>>>>>>>>> you show here are not thunks. > >>>>>>>>>>> > >>>>>>>>>> OK thanks for the clarification. > >>>>>>>>>> > >>>>>>>>>> Here is an updated version, split into 3 patches to hopefully make > >>>>>>>>>> review easier. > >>>>>>>>>> They apply on top of my other mpure-code patches for PR96967 and > >>>>>>>>>> PR96770: > >>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html > >>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html > >>>>>>>>>> > >>>>>>>>>> I kept it this way to make incremental changes easier to > >>>>>>>>>> understand. > >>>>>>>>>> > >>>>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance > >>>>>>>>>> easier, > >>>>>>>>>> I have updated thumb1_gen_const_int() so that it can generate > >>>>>>>>>> either RTL or > >>>>>>>>>> asm. This way, all the code used to build thumb-1 constants is in > >>>>>>>>>> the > >>>>>>>>>> same place, > >>>>>>>>>> in case we need to improve/fix it later. We now generate shorter > >>>>>>>>>> sequences in > >>>>>>>>>> several cases matching your comments. > >>>>>>>>>> > >>>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn > >>>>>>>>>> pattern and > >>>>>>>>>> calls thumb1_gen_const_int. > >>>>>>>>>> > >>>>>>>>>> Patch 3: Update of the original patch in this thread, now calls > >>>>>>>>>> thumb1_gen_const_int. > >>>>>>>>> > >>>>>>>>> Yuk! Those changes to thumb1_gen_const_int are horrible. > >>>>>>>>> > >>>>>>>>> I think we should be able to leverage the fact that the compiler > >>>>>>>>> can use > >>>>>>>>> C++ now to do much better than that, for example by making that > >>>>>>>>> function > >>>>>>>>> a template. For example (and this is just a sketch): > >>>>>>>>> > >>>>>>>> > >>>>>>>> Indeed! I didn't think about it since there is no other use of > >>>>>>>> templates in arm.c yet. > >>>>>>>> I'll send an update soon. > >>>>>>>> > >>>>>>>> Other than that, does the approach look OK to you? > >>>>>>>> > >>>>>>> > >>>>>>> Yes, I think this is heading in the right direction. Bringing the two > >>>>>>> immediate generating operations into a single function can only be a > >>>>>>> good thing. > >>>>>>> > >>>>>>> Looking again at your example constant sequences, I see: > >>>>>>> > >>>>>>> 0x1000010: > >>>>>>> movs r3, #16 > >>>>>>> lsls r3, #16 > >>>>>>> adds r3, #1 > >>>>>>> lsls r3, #4 > >>>>>>> 0x1000011: > >>>>>>> movs r3, #1 > >>>>>>> lsls r3, #24 > >>>>>>> adds r3, #17 > >>>>>>> > >>>>>>> The first of these looks odd, given the second sequence. Why doesn't > >>>>>>> the first expand to > >>>>>>> > >>>>>>> 0x1000010: > >>>>>>> movs r3, #16 > >>>>>>> lsls r3, #16 > >>>>>>> adds r3, #16 > >>>>>>> > >>>>>> Err, I mean to: > >>>>>> > >>>>>> > >>>>>> 0x1000010: > >>>>>> movs r3, #1 > >>>>>> lsls r3, #24 > >>>>>> adds r3, #16 > >>>>>> > >>>>>> ? > >>>>> > >>>>> Because I first try to right-shift the constant, hoping to reduce its > >>>>> range and need less instructions to build the higher part, then > >>>>> left-shift back. > >>>>> > >>>>> In this particular case, we'd need to realize that there are many > >>>>> zeros "inside" the constant. > >>>>> > >>>>> If I remove the part that tries to reduce the range, I do get that > >>>>> sequence, but for 764 I now generate > >>>>> movs r3, #2 > >>>>> lsls r3, #8 > >>>>> adds r3, #252 > >>>>> instead of > >>>>> movs r3, #191 > >>>>> lsls r3, #2 > >>>>> > >>>>> A possibility would be to try both approaches and keep the shortest one. > >>>> > >>>> Lets leave that for now, it's not important to fixing the main issue; > >>>> but we should remember we need to come back to it at some point. > >>>> > >>> Thanks, that's what I was thinking too. > >>> > >>>> There are other tricks as well, such as > >>>> > >>>> 0xffffff > >>>> > >>>> can be done as > >>>> > >>>> 0x1000000 - 1 > >>>> > >>>> and > >>>> > >>>> 0xfffffd > >>>> > >>>> as > >>>> > >>>> 0x1000000 - 3 > >>>> > >>>> but these can wait as well. > >>>> > >>> > >>> Didn't we already need to handle such tricks? I'm surprised this > >>> wasn't needed earlier. > >>> > >> > >> I don't think we ever worried about them. Most of them need at least 3 > >> instructions so aren't a code size saving over using a literal pool entry. > >> > > OK, this will also help when using -mslow-flash-data. > > > > Here are updated patches, now using a template as you suggested. > > Looking better, but when I try to apply this to my local tree patch 2 > fails (I'm not exactly sure why, what was your baseline for these > patches?) I have the tree patches in this thread on top of these other two: https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556768.html https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556769.html
They have gradual improvements to thumb1_movsi_insn. > -- that patch looks suspicious anyway, you're replacing code > that prints out assembly with code that generates RTL. Right! I took me a while to understand how I could miss this, sorry. That was caused by improper testing, as this part of the code isn't used when targetting cortex-m0. I have added a testcase for cortex-m23 which crashes with the previous version of patch 2, and succeeds now. > Could you also rename t1_print and t1_rtl to thumb1_const_print and > thumb1_const_rtl. I think the names as they stand are likely to be too > generic. OK, done. How about this new version? I'm not yet sure about the most appropriate naming for: thumb1_gen_const_int thumb1_gen_const_int_asm should they be thumb1_gen_const_int_rtl thumb1_gen_const_int_print to be consistent with the new classes? Thanks, Christophe > R. > > > > > Thanks, > > > > Christophe > > > >> R. > >> > >>> > >>>> > >>>> R. > >>>> > >>>>> > >>>>> > >>>>>>> > >>>>>>> R. > >>>>>>> > >>>>>>>> Thanks, > >>>>>>>> > >>>>>>>> Christophe > >>>>>>>> > >>>>>>>>> class t1_rtl > >>>>>>>>> { > >>>>>>>>> public: > >>>>>>>>> void ashift(int a) { gen_rtx_ASHIFT(a); } > >>>>>>>>> void rshift(int b) { gen_rtx_SHIFTRT(b); } > >>>>>>>>> }; > >>>>>>>>> > >>>>>>>>> class t1_print > >>>>>>>>> { > >>>>>>>>> public: > >>>>>>>>> t1_print (FILE *f) : t_file(f) {} > >>>>>>>>> void ashift (int a) { fprintf (t_file, "a shift %d\n", a); } > >>>>>>>>> void rshift (int b) { fprintf (t_file, "r shift %d\n", b); } > >>>>>>>>> private: > >>>>>>>>> FILE *t_file; > >>>>>>>>> }; > >>>>>>>>> > >>>>>>>>> template <class T> > >>>>>>>>> void thumb1_gen_const_int(T t, int f) > >>>>>>>>> { > >>>>>>>>> // Expansion of thumb1_gen_const_int ... > >>>>>>>>> t.ashift(f); > >>>>>>>>> } > >>>>>>>>> > >>>>>>>>> // Usage... > >>>>>>>>> void f1() > >>>>>>>>> { > >>>>>>>>> // Use the RTL expander > >>>>>>>>> t1_rtl g; > >>>>>>>>> thumb1_gen_const_int (g, 3); > >>>>>>>>> } > >>>>>>>>> > >>>>>>>>> void f2() > >>>>>>>>> { > >>>>>>>>> // Use the printf expander writing to stdout > >>>>>>>>> t1_print g(stdout); > >>>>>>>>> thumb1_gen_const_int (g, 3); > >>>>>>>>> } > >>>>>>>>> > >>>>>>>>> With this you can write thumb1_gen_const_int without having to worry > >>>>>>>>> about which expander is being used in each instance and the template > >>>>>>>>> expansion will use the right version. > >>>>>>>>> > >>>>>>>>> R. > >>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> I also suspect (but haven't check) that the base adjustment > >>>>>>>>>>>>> will > >>>>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4). > >>>>>>>>>>>>> If that is > >>>>>>>>>>>>> the case then you could generate n/4 and then shift it left by > >>>>>>>>>>>>> 2 for an > >>>>>>>>>>>>> even greater range of literals. > >>>>>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY > >>>>>>>>>>>> case, > >>>>>>>>>>>> I'll update my patch. > >>>>>>>>>>>> > >>>>>>>>>>>>> More generally, any sequence of up to > >>>>>>>>>>>>> three thumb1 instructions will be no larger, and probably as > >>>>>>>>>>>>> fast as the > >>>>>>>>>>>>> existing literal pool fall back. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your > >>>>>>>>>>>>> code will > >>>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two > >>>>>>>>>>>>> shifts could > >>>>>>>>>>>>> be merged into a single shift-by-16. > >>>>>>>>>>>> > >>>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const. > >>>>>>>>>>>> > >>>>>>>>>>>>> Finally, I'd really like to see some executable tests for this, > >>>>>>>>>>>>> if at > >>>>>>>>>>>>> all possible. > >>>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing > >>>>>>>>>>>> testcase > >>>>>>>>>>>> that showed the problem. There are also: > >>>>>>>>>>>> g++.dg/opt/thunk1.C > >>>>>>>>>>>> g++.dg/ipa/pr46984.C > >>>>>>>>>>>> g++.dg/torture/pr46287.C > >>>>>>>>>>>> g++.dg/torture/pr45699.C > >>>>>>>>>>>> > >>>>>>>>>>>> Do you want that I copy one of these in the arm subdir and add > >>>>>>>>>>>> -mpure-code in dg-options? > >>>>>>>>>>> > >>>>>>>>>>> On reflection, probably not - that just makes things more > >>>>>>>>>>> complicated > >>>>>>>>>>> with all the dg-options mess (I'm worried about interactions with > >>>>>>>>>>> other > >>>>>>>>>>> sets of options on the command line and the fall-out from that). > >>>>>>>>>>> If > >>>>>>>>>>> someone cares about pure-code they should be doing full testsuite > >>>>>>>>>>> runs > >>>>>>>>>>> with it enabled and that should be sufficient. > >>>>>>>>>> > >>>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use > >>>>>>>>>> a > >>>>>>>>>> modified simulator. > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> > >>>>>>>>>> Christophe > >>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> R. > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks, > >>>>>>>>>>>> > >>>>>>>>>>>> Christophe > >>>>>>>>>>>> > >>>>>>>>>>>>> R. > >>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> k# (use "git pull" to merge the remote branch into yours) > >>>>>>>>>>>>>> --- > >>>>>>>>>>>>>> gcc/config/arm/arm.c | 91 > >>>>>>>>>>>>>> +++++++++++++++++++++++++++++++++++++--------------- > >>>>>>>>>>>>>> 1 file changed, 66 insertions(+), 25 deletions(-) > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c > >>>>>>>>>>>>>> index ceeb91f..62abeb5 100644 > >>>>>>>>>>>>>> --- a/gcc/config/arm/arm.c > >>>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c > >>>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, > >>>>>>>>>>>>>> tree, HOST_WIDE_INT delta, > >>>>>>>>>>>>>> { > >>>>>>>>>>>>>> if (mi_delta > 255) > >>>>>>>>>>>>>> { > >>>>>>>>>>>>>> - fputs ("\tldr\tr3, ", file); > >>>>>>>>>>>>>> - assemble_name (file, label); > >>>>>>>>>>>>>> - fputs ("+4\n", file); > >>>>>>>>>>>>>> + /* With -mpure-code, we cannot load delta from the > >>>>>>>>>>>>>> constant > >>>>>>>>>>>>>> + pool: we build it explicitly. */ > >>>>>>>>>>>>>> + if (target_pure_code) > >>>>>>>>>>>>>> + { > >>>>>>>>>>>>>> + bool mov_done_p = false; > >>>>>>>>>>>>>> + int i; > >>>>>>>>>>>>>> + > >>>>>>>>>>>>>> + /* Emit upper 3 bytes if needed. */ > >>>>>>>>>>>>>> + for (i = 0; i < 3; i++) > >>>>>>>>>>>>>> + { > >>>>>>>>>>>>>> + int byte = (mi_delta >> (8 * (3 - i))) & 0xff; > >>>>>>>>>>>>>> + > >>>>>>>>>>>>>> + if (byte) > >>>>>>>>>>>>>> + { > >>>>>>>>>>>>>> + if (mov_done_p) > >>>>>>>>>>>>>> + asm_fprintf (file, "\tadds\tr3, #%d\n", > >>>>>>>>>>>>>> byte); > >>>>>>>>>>>>>> + else > >>>>>>>>>>>>>> + asm_fprintf (file, "\tmovs\tr3, #%d\n", > >>>>>>>>>>>>>> byte); > >>>>>>>>>>>>>> + mov_done_p = true; > >>>>>>>>>>>>>> + } > >>>>>>>>>>>>>> + > >>>>>>>>>>>>>> + if (mov_done_p) > >>>>>>>>>>>>>> + asm_fprintf (file, "\tlsls\tr3, #8\n"); > >>>>>>>>>>>>>> + } > >>>>>>>>>>>>>> + > >>>>>>>>>>>>>> + /* Emit lower byte if needed. */ > >>>>>>>>>>>>>> + if (!mov_done_p) > >>>>>>>>>>>>>> + asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta > >>>>>>>>>>>>>> & 0xff); > >>>>>>>>>>>>>> + else if (mi_delta & 0xff) > >>>>>>>>>>>>>> + asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta > >>>>>>>>>>>>>> & 0xff); > >>>>>>>>>>>>>> + } > >>>>>>>>>>>>>> + else > >>>>>>>>>>>>>> + { > >>>>>>>>>>>>>> + fputs ("\tldr\tr3, ", file); > >>>>>>>>>>>>>> + assemble_name (file, label); > >>>>>>>>>>>>>> + fputs ("+4\n", file); > >>>>>>>>>>>>>> + } > >>>>>>>>>>>>>> asm_fprintf (file, "\t%ss\t%r, %r, r3\n", > >>>>>>>>>>>>>> mi_op, this_regno, this_regno); > >>>>>>>>>>>>>> } > >>>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, > >>>>>>>>>>>>>> tree, HOST_WIDE_INT delta, > >>>>>>>>>>>>>> fputs ("\tpop\t{r3}\n", file); > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> fprintf (file, "\tbx\tr12\n"); > >>>>>>>>>>>>>> - ASM_OUTPUT_ALIGN (file, 2); > >>>>>>>>>>>>>> - assemble_name (file, label); > >>>>>>>>>>>>>> - fputs (":\n", file); > >>>>>>>>>>>>>> - if (flag_pic) > >>>>>>>>>>>>>> + > >>>>>>>>>>>>>> + /* With -mpure-code, we don't need to emit literals for > >>>>>>>>>>>>>> the > >>>>>>>>>>>>>> + function address and delta since we emitted code to > >>>>>>>>>>>>>> build > >>>>>>>>>>>>>> + them. */ > >>>>>>>>>>>>>> + if (!target_pure_code) > >>>>>>>>>>>>>> { > >>>>>>>>>>>>>> - /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ > >>>>>>>>>>>>>> - rtx tem = XEXP (DECL_RTL (function), 0); > >>>>>>>>>>>>>> - /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, > >>>>>>>>>>>>>> so the PC > >>>>>>>>>>>>>> - pipeline offset is four rather than eight. Adjust > >>>>>>>>>>>>>> the offset > >>>>>>>>>>>>>> - accordingly. */ > >>>>>>>>>>>>>> - tem = plus_constant (GET_MODE (tem), tem, > >>>>>>>>>>>>>> - TARGET_THUMB1_ONLY ? -3 : -7); > >>>>>>>>>>>>>> - tem = gen_rtx_MINUS (GET_MODE (tem), > >>>>>>>>>>>>>> - tem, > >>>>>>>>>>>>>> - gen_rtx_SYMBOL_REF (Pmode, > >>>>>>>>>>>>>> - ggc_strdup > >>>>>>>>>>>>>> (labelpc))); > >>>>>>>>>>>>>> - assemble_integer (tem, 4, BITS_PER_WORD, 1); > >>>>>>>>>>>>>> - } > >>>>>>>>>>>>>> - else > >>>>>>>>>>>>>> - /* Output ".word .LTHUNKn". */ > >>>>>>>>>>>>>> - assemble_integer (XEXP (DECL_RTL (function), 0), 4, > >>>>>>>>>>>>>> BITS_PER_WORD, 1); > >>>>>>>>>>>>>> + ASM_OUTPUT_ALIGN (file, 2); > >>>>>>>>>>>>>> + assemble_name (file, label); > >>>>>>>>>>>>>> + fputs (":\n", file); > >>>>>>>>>>>>>> + if (flag_pic) > >>>>>>>>>>>>>> + { > >>>>>>>>>>>>>> + /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ > >>>>>>>>>>>>>> + rtx tem = XEXP (DECL_RTL (function), 0); > >>>>>>>>>>>>>> + /* For TARGET_THUMB1_ONLY the thunk is in Thumb > >>>>>>>>>>>>>> mode, so the PC > >>>>>>>>>>>>>> + pipeline offset is four rather than eight. > >>>>>>>>>>>>>> Adjust the offset > >>>>>>>>>>>>>> + accordingly. */ > >>>>>>>>>>>>>> + tem = plus_constant (GET_MODE (tem), tem, > >>>>>>>>>>>>>> + TARGET_THUMB1_ONLY ? -3 : -7); > >>>>>>>>>>>>>> + tem = gen_rtx_MINUS (GET_MODE (tem), > >>>>>>>>>>>>>> + tem, > >>>>>>>>>>>>>> + gen_rtx_SYMBOL_REF (Pmode, > >>>>>>>>>>>>>> + > >>>>>>>>>>>>>> ggc_strdup (labelpc))); > >>>>>>>>>>>>>> + assemble_integer (tem, 4, BITS_PER_WORD, 1); > >>>>>>>>>>>>>> + } > >>>>>>>>>>>>>> + else > >>>>>>>>>>>>>> + /* Output ".word .LTHUNKn". */ > >>>>>>>>>>>>>> + assemble_integer (XEXP (DECL_RTL (function), 0), 4, > >>>>>>>>>>>>>> BITS_PER_WORD, 1); > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> - if (TARGET_THUMB1_ONLY && mi_delta > 255) > >>>>>>>>>>>>>> - assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, > >>>>>>>>>>>>>> 1); > >>>>>>>>>>>>>> + if (TARGET_THUMB1_ONLY && mi_delta > 255) > >>>>>>>>>>>>>> + assemble_integer (GEN_INT(mi_delta), 4, > >>>>>>>>>>>>>> BITS_PER_WORD, 1); > >>>>>>>>>>>>>> + } > >>>>>>>>>>>>>> } > >>>>>>>>>>>>>> else > >>>>>>>>>>>>>> { > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>> > >>>>>>> > >>>>>> > >>>> > >> >
From 1ea1d3300f253601afbf3173b705db1186a2e6fa Mon Sep 17 00:00:00 2001 From: Christophe Lyon <christophe.l...@linaro.org> Date: Fri, 9 Oct 2020 15:18:07 +0000 Subject: [PATCH 2/3] arm: Call thumb1_gen_const_int from thumb1_movsi_insn thumb1_movsi_insn used the same algorithm to build a constant in asm than thumb1_gen_const_int_1 does in RTL. Since the previous patch added support for asm generation in thumb1_gen_const_int_1, this patch calls it from thumb1_movsi_insn to avoid duplication. We need to introduce a new proxy function, thumb1_gen_const_int_asm to select the right template. This patch also adds a new testcase as the updated alternative is only used by thumb-1 processors that also support movt/movw. 2020-10-26 Christophe Lyon <christophe.l...@linaro.org> gcc/ * config/arm/thumb1.md (thumb1_movsi_insn): Call thumb1_gen_const_int_asm. * config/arm/arm-protos.h (thumb1_gen_const_int_asm): Add prototype. * config/arm/arm.c (thumb1_gen_const_int_asm): New. gcc/testsuite/ * gcc.target/arm/pure-code/no-literal-pool-m23.c: New. --- gcc/config/arm/arm-protos.h | 1 + gcc/config/arm/arm.c | 9 +++++- gcc/config/arm/thumb1.md | 37 +++------------------- .../gcc.target/arm/pure-code/no-literal-pool-m23.c | 17 ++++++++++ 4 files changed, 30 insertions(+), 34 deletions(-) create mode 100644 gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m23.c diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h index 0cc0ae7..1b1992b 100644 --- a/gcc/config/arm/arm-protos.h +++ b/gcc/config/arm/arm-protos.h @@ -75,6 +75,7 @@ extern int const_ok_for_arm (HOST_WIDE_INT); extern int const_ok_for_op (HOST_WIDE_INT, enum rtx_code); extern int const_ok_for_dimode_op (HOST_WIDE_INT, enum rtx_code); extern void thumb1_gen_const_int (rtx, HOST_WIDE_INT); +extern void thumb1_gen_const_int_asm (rtx, HOST_WIDE_INT); extern int arm_split_constant (RTX_CODE, machine_mode, rtx, HOST_WIDE_INT, rtx, rtx, int); extern int legitimate_pic_operand_p (rtx); diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index 1d80c6d..366b761 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -28366,7 +28366,7 @@ thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1) } } -/* Proxy for thumb1.md, since the thumb1_const_print and +/* Proxies for thumb1.md, since the thumb1_const_print and thumb1_const_rtl classes are not exported. */ void thumb1_gen_const_int (rtx dst, HOST_WIDE_INT op1) @@ -28375,6 +28375,13 @@ thumb1_gen_const_int (rtx dst, HOST_WIDE_INT op1) thumb1_gen_const_int_1 (t, op1); } +void +thumb1_gen_const_int_asm (rtx dst, HOST_WIDE_INT op1) +{ + thumb1_const_print t (asm_out_file, REGNO (dst)); + thumb1_gen_const_int_1 (t, op1); +} + /* Output code to add DELTA to the first argument, and then jump to FUNCTION. Used for C++ multiple inheritance. */ diff --git a/gcc/config/arm/thumb1.md b/gcc/config/arm/thumb1.md index 2258a52..2f8df67 100644 --- a/gcc/config/arm/thumb1.md +++ b/gcc/config/arm/thumb1.md @@ -688,40 +688,11 @@ (define_insn "*thumb1_movsi_insn" } else if (GET_CODE (operands[1]) == CONST_INT) { - int i; - HOST_WIDE_INT op1 = INTVAL (operands[1]); - bool mov_done_p = false; - rtx ops[2]; - ops[0] = operands[0]; - - /* Emit upper 3 bytes if needed. */ - for (i = 0; i < 3; i++) - { - int byte = (op1 >> (8 * (3 - i))) & 0xff; - - if (byte) - { - ops[1] = GEN_INT (byte); - if (mov_done_p) - output_asm_insn ("adds\t%0, %1", ops); - else - output_asm_insn ("movs\t%0, %1", ops); - mov_done_p = true; - } - - if (mov_done_p) - output_asm_insn ("lsls\t%0, #8", ops); - } - - /* Emit lower byte if needed. */ - ops[1] = GEN_INT (op1 & 0xff); - if (!mov_done_p) - output_asm_insn ("movs\t%0, %1", ops); - else if (op1 & 0xff) - output_asm_insn ("adds\t%0, %1", ops); - return ""; + thumb1_gen_const_int_asm (operands[0], INTVAL (operands[1])); + return \"\"; } - gcc_unreachable (); + + gcc_unreachable (); case 8: return "ldr\t%0, %1"; case 9: return "str\t%1, %0"; diff --git a/gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m23.c b/gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m23.c new file mode 100644 index 0000000..08d3164 --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m23.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-options "-mpure-code -mcpu=cortex-m23 -march=armv8-m.base -mthumb" } */ + +int +testi (int *p) +{ + if (*p > 0x12345678) + return *p-8192; + else + return *p+8192; +} + +int +testi2 () +{ + return -8192; +} -- 2.7.4
From 1ab6393e935dce2abae98e8f9e8442f87b22e9ed Mon Sep 17 00:00:00 2001 From: Christophe Lyon <christophe.l...@linaro.org> Date: Fri, 9 Oct 2020 12:42:39 +0000 Subject: [PATCH 1/3] arm: Improve thumb1_gen_const_int Enable thumb1_gen_const_int to generate RTL or asm depending on the context, so that we avoid duplicating code to handle constants in thumb-1 with -mpure-code. Use a template so that the algorithm is effectively shared, and rely on two classes to handle the actual emission as RTL or asm. The generated sequence is improved to handle right-shiftable and small values with less instructions. We now generate: 128: movs r0, r0, #128 264: movs r3, #33 lsls r3, #3 510: movs r3, #255 lsls r3, #1 512: movs r3, #1 lsls r3, #9 764: movs r3, #191 lsls r3, #2 65536: movs r3, #1 lsls r3, #16 0x123456: movs r3, #18 ;0x12 lsls r3, #8 adds r3, #52 ;0x34 lsls r3, #8 adds r3, #86 ;0x56 0x1123456: movs r3, #137 ;0x89 lsls r3, #8 adds r3, #26 ;0x1a lsls r3, #8 adds r3, #43 ;0x2b lsls r3, #1 0x1000010: movs r3, #16 lsls r3, #16 adds r3, #1 lsls r3, #4 0x1000011: movs r3, #1 lsls r3, #24 adds r3, #17 2020-10-26 Christophe Lyon <christophe.l...@linaro.org> gcc/ * config/arm/arm.c (thumb1_const_rtl, thumb1_const_print): New classes. (thumb1_gen_const_int_1): New helper function. (thumb1_gen_const_int): Add capability to emit either RTL or asm, improve generated code by calling thumb1_gen_const_int_1. --- gcc/config/arm/arm.c | 195 ++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 163 insertions(+), 32 deletions(-) diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index ceeb91f..1d80c6d 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -4513,38 +4513,6 @@ const_ok_for_dimode_op (HOST_WIDE_INT i, enum rtx_code code) } } -/* Emit a sequence of movs/adds/shift to produce a 32-bit constant. - Avoid generating useless code when one of the bytes is zero. */ -void -thumb1_gen_const_int (rtx op0, HOST_WIDE_INT op1) -{ - bool mov_done_p = false; - int i; - - /* Emit upper 3 bytes if needed. */ - for (i = 0; i < 3; i++) - { - int byte = (op1 >> (8 * (3 - i))) & 0xff; - - if (byte) - { - emit_set_insn (op0, mov_done_p - ? gen_rtx_PLUS (SImode,op0, GEN_INT (byte)) - : GEN_INT (byte)); - mov_done_p = true; - } - - if (mov_done_p) - emit_set_insn (op0, gen_rtx_ASHIFT (SImode, op0, GEN_INT (8))); - } - - /* Emit lower byte if needed. */ - if (!mov_done_p) - emit_set_insn (op0, GEN_INT (op1 & 0xff)); - else if (op1 & 0xff) - emit_set_insn (op0, gen_rtx_PLUS (SImode, op0, GEN_INT (op1 & 0xff))); -} - /* Emit a sequence of insns to handle a large constant. CODE is the code of the operation required, it can be any of SET, PLUS, IOR, AND, XOR, MINUS; @@ -28244,6 +28212,169 @@ arm_internal_label (FILE *stream, const char *prefix, unsigned long labelno) default_internal_label (stream, prefix, labelno); } +/* Define classes to generate code as RTL or output asm to a file. + Using templates then allows to use the same code to output code + sequences in the two formats. */ +class thumb1_const_rtl +{ + public: + thumb1_const_rtl (rtx dst) : dst (dst) {} + + void mov (int val) + { + emit_set_insn (dst, GEN_INT (val)); + } + + void add (int val) + { + emit_set_insn (dst, gen_rtx_PLUS (SImode, dst, GEN_INT (val))); + } + + void ashift (int shift) + { + emit_set_insn (dst, gen_rtx_ASHIFT (SImode, dst, GEN_INT (shift))); + } + + private: + rtx dst; +}; + +class thumb1_const_print +{ + public: + thumb1_const_print (FILE *f, int regno) : t_file (f), dst_regno (regno) {} + + void mov (int val) + { + asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val); + } + + void add (int val) + { + asm_fprintf (t_file, "\tadds\tr%d, #%d\n", dst_regno, val); + } + + void ashift (int shift) + { + asm_fprintf (t_file, "\tlsls\tr%d, #%d\n", dst_regno, shift); + } + + private: + FILE *t_file; + int dst_regno; +}; + +/* Emit a sequence of movs/adds/shift to produce a 32-bit constant. + Avoid generating useless code when one of the bytes is zero. */ +template <class T> +void +thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1) +{ + bool mov_done_p = false; + int val = op1; + int shift = 0; + int i; + + if (val == 0) + { + dst.mov (val); + return; + } + + /* In the general case, we need 7 instructions to build + a 32 bits constant (1 movs, 3 lsls, 3 adds). We can + do better if VAL is small enough, or + right-shiftable by a suitable amount. If the + right-shift enables to encode at least one less byte, + it's worth it: we save a adds and a lsls at the + expense of a final lsls. */ + int final_shift = number_of_first_bit_set (val); + + int leading_zeroes = clz_hwi (val); + int number_of_bytes_needed + = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes) + / BITS_PER_UNIT) + 1; + int number_of_bytes_needed2 + = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift) + / BITS_PER_UNIT) + 1; + + if (number_of_bytes_needed2 < number_of_bytes_needed) + val >>= final_shift; + else + final_shift = 0; + + /* If we are in a very small range, we can use either a single movs + or movs+adds. */ + if ((val >= 0) && (val <= 510)) + { + if (val > 255) + { + int high = val - 255; + + dst.mov (high); + dst.add (255); + } + else + dst.mov (val); + + if (final_shift > 0) + dst.ashift (final_shift); + } + else + { + /* General case, emit upper 3 bytes as needed. */ + for (i = 0; i < 3; i++) + { + int byte = (val >> (8 * (3 - i))) & 0xff; + + if (byte) + { + /* We are about to emit new bits, stop accumulating a + shift amount, and left-shift only if we have already + emitted some upper bits. */ + if (mov_done_p) + { + dst.ashift (shift); + dst.add (byte); + } + else + dst.mov (byte); + + /* Stop accumulating shift amount since we've just + emitted some bits. */ + shift = 0; + + mov_done_p = true; + } + + if (mov_done_p) + shift += 8; + } + + /* Emit lower byte. */ + if (!mov_done_p) + dst.mov (val & 0xff); + else + { + dst.ashift (shift); + if (val & 0xff) + dst.add (val & 0xff); + } + + if (final_shift > 0) + dst.ashift (final_shift); + } +} + +/* Proxy for thumb1.md, since the thumb1_const_print and + thumb1_const_rtl classes are not exported. */ +void +thumb1_gen_const_int (rtx dst, HOST_WIDE_INT op1) +{ + thumb1_const_rtl t (dst); + thumb1_gen_const_int_1 (t, op1); +} + /* Output code to add DELTA to the first argument, and then jump to FUNCTION. Used for C++ multiple inheritance. */ -- 2.7.4
From 54f0bb788be7cea9bb125dc921b5fc00b0d3aa05 Mon Sep 17 00:00:00 2001 From: Christophe Lyon <christophe.l...@linaro.org> Date: Tue, 29 Sep 2020 19:40:39 +0000 Subject: [PATCH 3/3] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code When -mpure-code is used, we cannot load delta from code memory (like we do without -mpure-code). This patch builds the value of mi_delta into r3 with a series of movs/adds/lsls. We also do some cleanup by not emitting the function address and delta via .word directives at the end of the thunk since we don't use them with -mpure-code. No need for new testcases, this bug was already identified by: g++.dg/ipa/pr46287-3.C g++.dg/ipa/pr46984.C g++.dg/opt/thunk1.C g++.dg/torture/pr46287.C g++.dg/torture/pr45699.C 2020-10-12 Christophe Lyon <christophe.l...@linaro.org> gcc/ * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and do not emit function address and delta when -mpure-code is used. --- gcc/config/arm/arm.c | 67 ++++++++++++++++++++++++++++++++-------------------- 1 file changed, 42 insertions(+), 25 deletions(-) diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index 366b761..aee3a40 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -28480,9 +28480,19 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta, { if (mi_delta > 255) { - fputs ("\tldr\tr3, ", file); - assemble_name (file, label); - fputs ("+4\n", file); + /* With -mpure-code, we cannot load MI_DELTA from the + constant pool: we build it explicitly. */ + if (target_pure_code) + { + thumb1_const_print r3 (file, 3); + thumb1_gen_const_int_1 (r3, mi_delta); + } + else + { + fputs ("\tldr\tr3, ", file); + assemble_name (file, label); + fputs ("+4\n", file); + } asm_fprintf (file, "\t%ss\t%r, %r, r3\n", mi_op, this_regno, this_regno); } @@ -28518,30 +28528,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta, fputs ("\tpop\t{r3}\n", file); fprintf (file, "\tbx\tr12\n"); - ASM_OUTPUT_ALIGN (file, 2); - assemble_name (file, label); - fputs (":\n", file); - if (flag_pic) + + /* With -mpure-code, we don't need to emit literals for the + function address and delta since we emitted code to build + them. */ + if (!target_pure_code) { - /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ - rtx tem = XEXP (DECL_RTL (function), 0); - /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC - pipeline offset is four rather than eight. Adjust the offset - accordingly. */ - tem = plus_constant (GET_MODE (tem), tem, - TARGET_THUMB1_ONLY ? -3 : -7); - tem = gen_rtx_MINUS (GET_MODE (tem), - tem, - gen_rtx_SYMBOL_REF (Pmode, - ggc_strdup (labelpc))); - assemble_integer (tem, 4, BITS_PER_WORD, 1); - } - else - /* Output ".word .LTHUNKn". */ - assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1); + ASM_OUTPUT_ALIGN (file, 2); + assemble_name (file, label); + fputs (":\n", file); + if (flag_pic) + { + /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ + rtx tem = XEXP (DECL_RTL (function), 0); + /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC + pipeline offset is four rather than eight. Adjust the offset + accordingly. */ + tem = plus_constant (GET_MODE (tem), tem, + TARGET_THUMB1_ONLY ? -3 : -7); + tem = gen_rtx_MINUS (GET_MODE (tem), + tem, + gen_rtx_SYMBOL_REF (Pmode, + ggc_strdup (labelpc))); + assemble_integer (tem, 4, BITS_PER_WORD, 1); + } + else + /* Output ".word .LTHUNKn". */ + assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1); - if (TARGET_THUMB1_ONLY && mi_delta > 255) - assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1); + if (TARGET_THUMB1_ONLY && mi_delta > 255) + assemble_integer (GEN_INT (mi_delta), 4, BITS_PER_WORD, 1); + } } else { -- 2.7.4