Hi, On Fri, 30 Oct 2020 at 13:49, Richard Earnshaw <richard.earns...@foss.arm.com> wrote: > > On 29/10/2020 19:18, Richard Earnshaw via Gcc-patches wrote: > > On 28/10/2020 18:10, Christophe Lyon via Gcc-patches wrote: > >> On Wed, 28 Oct 2020 at 18:44, Richard Earnshaw > >> <richard.earns...@foss.arm.com> wrote: > >>> > >>> On 27/10/2020 15:42, Richard Earnshaw via Gcc-patches wrote: > >>>> On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote: > >>>>> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw > >>>>> <richard.earns...@foss.arm.com> wrote: > >>>>>> > >>>>>> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote: > >>>>>>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw > >>>>>>> <richard.earns...@foss.arm.com> wrote: > >>>>>>>> > >>>>>>>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote: > >>>>>>>>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw > >>>>>>>>> <richard.earns...@foss.arm.com> wrote: > >>>>>>>>>> > >>>>>>>>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote: > >>>>>>>>>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw > >>>>>>>>>>> <richard.earns...@foss.arm.com> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> On 20/10/2020 12:22, Richard Earnshaw wrote: > >>>>>>>>>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote: > >>>>>>>>>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw > >>>>>>>>>>>>>> <richard.earns...@foss.arm.com> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote: > >>>>>>>>>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw > >>>>>>>>>>>>>>>> <richard.earns...@foss.arm.com> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote: > >>>>>>>>>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw > >>>>>>>>>>>>>>>>>> <richard.earns...@foss.arm.com> wrote: > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches > >>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we > >>>>>>>>>>>>>>>>>>>> cannot load delta > >>>>>>>>>>>>>>>>>>>> from code memory (like we do without -mpure-code). > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a > >>>>>>>>>>>>>>>>>>>> series of > >>>>>>>>>>>>>>>>>>>> movs/adds/lsls. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> We also do some cleanup by not emitting the function > >>>>>>>>>>>>>>>>>>>> address and delta > >>>>>>>>>>>>>>>>>>>> via .word directives at the end of the thunk since we > >>>>>>>>>>>>>>>>>>>> don't use them > >>>>>>>>>>>>>>>>>>>> with -mpure-code. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> No need for new testcases, this bug was already > >>>>>>>>>>>>>>>>>>>> identified by > >>>>>>>>>>>>>>>>>>>> eg. pr46287-3.C > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> 2020-09-29 Christophe Lyon <christophe.l...@linaro.org> > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> gcc/ > >>>>>>>>>>>>>>>>>>>> * config/arm/arm.c (arm_thumb1_mi_thunk): Build > >>>>>>>>>>>>>>>>>>>> mi_delta in r3 and > >>>>>>>>>>>>>>>>>>>> do not emit function address and delta when > >>>>>>>>>>>>>>>>>>>> -mpure-code is used. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Hi Richard, > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Thanks for your comments. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> There are some optimizations you can make to this code. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it > >>>>>>>>>>>>>>>>>>> would be better > >>>>>>>>>>>>>>>>>>> to just expand a mov of 255 followed by an add. > >>>>>>>>>>>>>>>>>> I now see the splitted for the "Pe" constraint which I > >>>>>>>>>>>>>>>>>> hadn't noticed > >>>>>>>>>>>>>>>>>> before, so I can write something similar indeed. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> However, I'm note quite sure to understand the benefit in > >>>>>>>>>>>>>>>>>> the split > >>>>>>>>>>>>>>>>>> when -mpure-code is NOT used. > >>>>>>>>>>>>>>>>>> Consider: > >>>>>>>>>>>>>>>>>> int f3_1 (void) { return 510; } > >>>>>>>>>>>>>>>>>> int f3_2 (void) { return 511; } > >>>>>>>>>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0: > >>>>>>>>>>>>>>>>>> f3_1: > >>>>>>>>>>>>>>>>>> movs r0, #255 > >>>>>>>>>>>>>>>>>> lsls r0, r0, #1 > >>>>>>>>>>>>>>>>>> bx lr > >>>>>>>>>>>>>>>>>> f3_2: > >>>>>>>>>>>>>>>>>> ldr r0, .L4 > >>>>>>>>>>>>>>>>>> bx lr > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> The splitter makes the code bigger, does it "compensate" > >>>>>>>>>>>>>>>>>> for this by > >>>>>>>>>>>>>>>>>> not having to load the constant? > >>>>>>>>>>>>>>>>>> Actually the constant uses 4 more bytes, which should be > >>>>>>>>>>>>>>>>>> taken into > >>>>>>>>>>>>>>>>>> account when comparing code size, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Yes, the size of the literal pool entry needs to be taken > >>>>>>>>>>>>>>>>> into account. > >>>>>>>>>>>>>>>>> It might happen that the entry could be shared with > >>>>>>>>>>>>>>>>> another use of that > >>>>>>>>>>>>>>>>> literal, but in general that's rare. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below > >>>>>>>>>>>>>>>>>> three > >>>>>>>>>>>>>>>>>> thumb1 instructions would be equivalent in size compared > >>>>>>>>>>>>>>>>>> to loading > >>>>>>>>>>>>>>>>>> from the literal pool. Should the 256-510 range be > >>>>>>>>>>>>>>>>>> extended? > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> It's a bit borderline at three instructions when literal > >>>>>>>>>>>>>>>>> pools are not > >>>>>>>>>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be > >>>>>>>>>>>>>>>>> quite small due > >>>>>>>>>>>>>>>>> to the limited pc offsets we can use. I think on balance > >>>>>>>>>>>>>>>>> we probably > >>>>>>>>>>>>>>>>> want to use the instruction sequence unless optimizing for > >>>>>>>>>>>>>>>>> size. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> This is also true for > >>>>>>>>>>>>>>>>>>> the literal pools alternative as well, so should be > >>>>>>>>>>>>>>>>>>> handled before all > >>>>>>>>>>>>>>>>>>> this. > >>>>>>>>>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above > >>>>>>>>>>>>>>>>>> sample is compiled as: > >>>>>>>>>>>>>>>>>> f3_1: > >>>>>>>>>>>>>>>>>> movs r0, #255 > >>>>>>>>>>>>>>>>>> lsls r0, r0, #1 > >>>>>>>>>>>>>>>>>> bx lr > >>>>>>>>>>>>>>>>>> f3_2: > >>>>>>>>>>>>>>>>>> movs r0, #1 > >>>>>>>>>>>>>>>>>> lsls r0, r0, #8 > >>>>>>>>>>>>>>>>>> adds r0, r0, #255 > >>>>>>>>>>>>>>>>>> bx lr > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> so the "return 510" case is already handled as without > >>>>>>>>>>>>>>>>>> -mpure-code. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> I was thinking specifically of the thunk sequence where you > >>>>>>>>>>>>>>>>> seem to be > >>>>>>>>>>>>>>>>> emitting instructions directly rather than generating RTL. > >>>>>>>>>>>>>>>>> The examples > >>>>>>>>>>>>>>>>> you show here are not thunks. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> OK thanks for the clarification. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Here is an updated version, split into 3 patches to > >>>>>>>>>>>>>>>> hopefully make > >>>>>>>>>>>>>>>> review easier. > >>>>>>>>>>>>>>>> They apply on top of my other mpure-code patches for PR96967 > >>>>>>>>>>>>>>>> and PR96770: > >>>>>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html > >>>>>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> I kept it this way to make incremental changes easier to > >>>>>>>>>>>>>>>> understand. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Patch 1: With the hope to avoid confusion and make > >>>>>>>>>>>>>>>> maintenance easier, > >>>>>>>>>>>>>>>> I have updated thumb1_gen_const_int() so that it can > >>>>>>>>>>>>>>>> generate either RTL or > >>>>>>>>>>>>>>>> asm. This way, all the code used to build thumb-1 constants > >>>>>>>>>>>>>>>> is in the > >>>>>>>>>>>>>>>> same place, > >>>>>>>>>>>>>>>> in case we need to improve/fix it later. We now generate > >>>>>>>>>>>>>>>> shorter sequences in > >>>>>>>>>>>>>>>> several cases matching your comments. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn > >>>>>>>>>>>>>>>> pattern and > >>>>>>>>>>>>>>>> calls thumb1_gen_const_int. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Patch 3: Update of the original patch in this thread, now > >>>>>>>>>>>>>>>> calls > >>>>>>>>>>>>>>>> thumb1_gen_const_int. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Yuk! Those changes to thumb1_gen_const_int are horrible. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I think we should be able to leverage the fact that the > >>>>>>>>>>>>>>> compiler can use > >>>>>>>>>>>>>>> C++ now to do much better than that, for example by making > >>>>>>>>>>>>>>> that function > >>>>>>>>>>>>>>> a template. For example (and this is just a sketch): > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Indeed! I didn't think about it since there is no other use of > >>>>>>>>>>>>>> templates in arm.c yet. > >>>>>>>>>>>>>> I'll send an update soon. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Other than that, does the approach look OK to you? > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> Yes, I think this is heading in the right direction. Bringing > >>>>>>>>>>>>> the two > >>>>>>>>>>>>> immediate generating operations into a single function can only > >>>>>>>>>>>>> be a > >>>>>>>>>>>>> good thing. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Looking again at your example constant sequences, I see: > >>>>>>>>>>>>> > >>>>>>>>>>>>> 0x1000010: > >>>>>>>>>>>>> movs r3, #16 > >>>>>>>>>>>>> lsls r3, #16 > >>>>>>>>>>>>> adds r3, #1 > >>>>>>>>>>>>> lsls r3, #4 > >>>>>>>>>>>>> 0x1000011: > >>>>>>>>>>>>> movs r3, #1 > >>>>>>>>>>>>> lsls r3, #24 > >>>>>>>>>>>>> adds r3, #17 > >>>>>>>>>>>>> > >>>>>>>>>>>>> The first of these looks odd, given the second sequence. Why > >>>>>>>>>>>>> doesn't > >>>>>>>>>>>>> the first expand to > >>>>>>>>>>>>> > >>>>>>>>>>>>> 0x1000010: > >>>>>>>>>>>>> movs r3, #16 > >>>>>>>>>>>>> lsls r3, #16 > >>>>>>>>>>>>> adds r3, #16 > >>>>>>>>>>>>> > >>>>>>>>>>>> Err, I mean to: > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> 0x1000010: > >>>>>>>>>>>> movs r3, #1 > >>>>>>>>>>>> lsls r3, #24 > >>>>>>>>>>>> adds r3, #16 > >>>>>>>>>>>> > >>>>>>>>>>>> ? > >>>>>>>>>>> > >>>>>>>>>>> Because I first try to right-shift the constant, hoping to reduce > >>>>>>>>>>> its > >>>>>>>>>>> range and need less instructions to build the higher part, then > >>>>>>>>>>> left-shift back. > >>>>>>>>>>> > >>>>>>>>>>> In this particular case, we'd need to realize that there are many > >>>>>>>>>>> zeros "inside" the constant. > >>>>>>>>>>> > >>>>>>>>>>> If I remove the part that tries to reduce the range, I do get that > >>>>>>>>>>> sequence, but for 764 I now generate > >>>>>>>>>>> movs r3, #2 > >>>>>>>>>>> lsls r3, #8 > >>>>>>>>>>> adds r3, #252 > >>>>>>>>>>> instead of > >>>>>>>>>>> movs r3, #191 > >>>>>>>>>>> lsls r3, #2 > >>>>>>>>>>> > >>>>>>>>>>> A possibility would be to try both approaches and keep the > >>>>>>>>>>> shortest one. > >>>>>>>>>> > >>>>>>>>>> Lets leave that for now, it's not important to fixing the main > >>>>>>>>>> issue; > >>>>>>>>>> but we should remember we need to come back to it at some point. > >>>>>>>>>> > >>>>>>>>> Thanks, that's what I was thinking too. > >>>>>>>>> > >>>>>>>>>> There are other tricks as well, such as > >>>>>>>>>> > >>>>>>>>>> 0xffffff > >>>>>>>>>> > >>>>>>>>>> can be done as > >>>>>>>>>> > >>>>>>>>>> 0x1000000 - 1 > >>>>>>>>>> > >>>>>>>>>> and > >>>>>>>>>> > >>>>>>>>>> 0xfffffd > >>>>>>>>>> > >>>>>>>>>> as > >>>>>>>>>> > >>>>>>>>>> 0x1000000 - 3 > >>>>>>>>>> > >>>>>>>>>> but these can wait as well. > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> Didn't we already need to handle such tricks? I'm surprised this > >>>>>>>>> wasn't needed earlier. > >>>>>>>>> > >>>>>>>> > >>>>>>>> I don't think we ever worried about them. Most of them need at > >>>>>>>> least 3 > >>>>>>>> instructions so aren't a code size saving over using a literal pool > >>>>>>>> entry. > >>>>>>>> > >>>>>>> OK, this will also help when using -mslow-flash-data. > >>>>>>> > >>>>>>> Here are updated patches, now using a template as you suggested. > >>>>>> > >>>>>> Looking better, but when I try to apply this to my local tree patch 2 > >>>>>> fails (I'm not exactly sure why, what was your baseline for these > >>>>>> patches?) > >>>>> I have the tree patches in this thread on top of these other two: > >>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556768.html > >>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556769.html > >>>>> > >>>>> They have gradual improvements to thumb1_movsi_insn. > >>>>> > >>>>>> -- that patch looks suspicious anyway, you're replacing code > >>>>>> that prints out assembly with code that generates RTL. > >>>>> Right! I took me a while to understand how I could miss this, sorry. > >>>>> That was caused by improper testing, as this part of the code > >>>>> isn't used when targetting cortex-m0. I have added a testcase > >>>>> for cortex-m23 which crashes with the previous version of patch 2, > >>>>> and succeeds now. > >>>>> > >>>>>> Could you also rename t1_print and t1_rtl to thumb1_const_print and > >>>>>> thumb1_const_rtl. I think the names as they stand are likely to be too > >>>>>> generic. > >>>>> OK, done. > >>>>> > >>>>> How about this new version? > >>>>> I'm not yet sure about the most appropriate naming for: > >>>>> thumb1_gen_const_int > >>>>> thumb1_gen_const_int_asm > >>>>> should they be > >>>>> thumb1_gen_const_int_rtl > >>>>> thumb1_gen_const_int_print > >>>>> to be consistent with the new classes? > >>>> > >>>> It would probably be better, yes. > >>>> > >>>> More detailed comments below. > >>>> > >>>> R. > >>>> > >>>>> > >>>>> Thanks, > >>>>> > >>>>> Christophe > >>>>> > >>>>>> R. > >>>>>> > >>>>>>> > >>>>>>> Thanks, > >>>>>>> > >>>>>>> Christophe > >>>>>>> > >>>>>>>> R. > >>>>>>>> > >>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> R. > >>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> R. > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Christophe > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> class t1_rtl > >>>>>>>>>>>>>>> { > >>>>>>>>>>>>>>> public: > >>>>>>>>>>>>>>> void ashift(int a) { gen_rtx_ASHIFT(a); } > >>>>>>>>>>>>>>> void rshift(int b) { gen_rtx_SHIFTRT(b); } > >>>>>>>>>>>>>>> }; > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> class t1_print > >>>>>>>>>>>>>>> { > >>>>>>>>>>>>>>> public: > >>>>>>>>>>>>>>> t1_print (FILE *f) : t_file(f) {} > >>>>>>>>>>>>>>> void ashift (int a) { fprintf (t_file, "a shift %d\n", a); } > >>>>>>>>>>>>>>> void rshift (int b) { fprintf (t_file, "r shift %d\n", b); } > >>>>>>>>>>>>>>> private: > >>>>>>>>>>>>>>> FILE *t_file; > >>>>>>>>>>>>>>> }; > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> template <class T> > >>>>>>>>>>>>>>> void thumb1_gen_const_int(T t, int f) > >>>>>>>>>>>>>>> { > >>>>>>>>>>>>>>> // Expansion of thumb1_gen_const_int ... > >>>>>>>>>>>>>>> t.ashift(f); > >>>>>>>>>>>>>>> } > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> // Usage... > >>>>>>>>>>>>>>> void f1() > >>>>>>>>>>>>>>> { > >>>>>>>>>>>>>>> // Use the RTL expander > >>>>>>>>>>>>>>> t1_rtl g; > >>>>>>>>>>>>>>> thumb1_gen_const_int (g, 3); > >>>>>>>>>>>>>>> } > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> void f2() > >>>>>>>>>>>>>>> { > >>>>>>>>>>>>>>> // Use the printf expander writing to stdout > >>>>>>>>>>>>>>> t1_print g(stdout); > >>>>>>>>>>>>>>> thumb1_gen_const_int (g, 3); > >>>>>>>>>>>>>>> } > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> With this you can write thumb1_gen_const_int without having > >>>>>>>>>>>>>>> to worry > >>>>>>>>>>>>>>> about which expander is being used in each instance and the > >>>>>>>>>>>>>>> template > >>>>>>>>>>>>>>> expansion will use the right version. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> R. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> I also suspect (but haven't check) that the base > >>>>>>>>>>>>>>>>>>> adjustment will > >>>>>>>>>>>>>>>>>>> most commonly be a multiple of the machine word size (ie > >>>>>>>>>>>>>>>>>>> 4). If that is > >>>>>>>>>>>>>>>>>>> the case then you could generate n/4 and then shift it > >>>>>>>>>>>>>>>>>>> left by 2 for an > >>>>>>>>>>>>>>>>>>> even greater range of literals. > >>>>>>>>>>>>>>>>>> I can see there is provision for this in the > >>>>>>>>>>>>>>>>>> !TARGET_THUMB1_ONLY case, > >>>>>>>>>>>>>>>>>> I'll update my patch. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> More generally, any sequence of up to > >>>>>>>>>>>>>>>>>>> three thumb1 instructions will be no larger, and probably > >>>>>>>>>>>>>>>>>>> as fast as the > >>>>>>>>>>>>>>>>>>> existing literal pool fall back. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), > >>>>>>>>>>>>>>>>>>> your code will > >>>>>>>>>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the > >>>>>>>>>>>>>>>>>>> two shifts could > >>>>>>>>>>>>>>>>>>> be merged into a single shift-by-16. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Finally, I'd really like to see some executable tests for > >>>>>>>>>>>>>>>>>>> this, if at > >>>>>>>>>>>>>>>>>>> all possible. > >>>>>>>>>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing > >>>>>>>>>>>>>>>>>> testcase > >>>>>>>>>>>>>>>>>> that showed the problem. There are also: > >>>>>>>>>>>>>>>>>> g++.dg/opt/thunk1.C > >>>>>>>>>>>>>>>>>> g++.dg/ipa/pr46984.C > >>>>>>>>>>>>>>>>>> g++.dg/torture/pr46287.C > >>>>>>>>>>>>>>>>>> g++.dg/torture/pr45699.C > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Do you want that I copy one of these in the arm subdir and > >>>>>>>>>>>>>>>>>> add > >>>>>>>>>>>>>>>>>> -mpure-code in dg-options? > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On reflection, probably not - that just makes things more > >>>>>>>>>>>>>>>>> complicated > >>>>>>>>>>>>>>>>> with all the dg-options mess (I'm worried about > >>>>>>>>>>>>>>>>> interactions with other > >>>>>>>>>>>>>>>>> sets of options on the command line and the fall-out from > >>>>>>>>>>>>>>>>> that). If > >>>>>>>>>>>>>>>>> someone cares about pure-code they should be doing full > >>>>>>>>>>>>>>>>> testsuite runs > >>>>>>>>>>>>>>>>> with it enabled and that should be sufficient. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and > >>>>>>>>>>>>>>>> I use a > >>>>>>>>>>>>>>>> modified simulator. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Christophe > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> R. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Christophe > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> R. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> k# (use "git pull" to merge the remote branch into > >>>>>>>>>>>>>>>>>>>> yours) > >>>>>>>>>>>>>>>>>>>> --- > >>>>>>>>>>>>>>>>>>>> gcc/config/arm/arm.c | 91 > >>>>>>>>>>>>>>>>>>>> +++++++++++++++++++++++++++++++++++++--------------- > >>>>>>>>>>>>>>>>>>>> 1 file changed, 66 insertions(+), 25 deletions(-) > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c > >>>>>>>>>>>>>>>>>>>> index ceeb91f..62abeb5 100644 > >>>>>>>>>>>>>>>>>>>> --- a/gcc/config/arm/arm.c > >>>>>>>>>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c > >>>>>>>>>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE > >>>>>>>>>>>>>>>>>>>> *file, tree, HOST_WIDE_INT delta, > >>>>>>>>>>>>>>>>>>>> { > >>>>>>>>>>>>>>>>>>>> if (mi_delta > 255) > >>>>>>>>>>>>>>>>>>>> { > >>>>>>>>>>>>>>>>>>>> - fputs ("\tldr\tr3, ", file); > >>>>>>>>>>>>>>>>>>>> - assemble_name (file, label); > >>>>>>>>>>>>>>>>>>>> - fputs ("+4\n", file); > >>>>>>>>>>>>>>>>>>>> + /* With -mpure-code, we cannot load delta from > >>>>>>>>>>>>>>>>>>>> the constant > >>>>>>>>>>>>>>>>>>>> + pool: we build it explicitly. */ > >>>>>>>>>>>>>>>>>>>> + if (target_pure_code) > >>>>>>>>>>>>>>>>>>>> + { > >>>>>>>>>>>>>>>>>>>> + bool mov_done_p = false; > >>>>>>>>>>>>>>>>>>>> + int i; > >>>>>>>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>>>>>>> + /* Emit upper 3 bytes if needed. */ > >>>>>>>>>>>>>>>>>>>> + for (i = 0; i < 3; i++) > >>>>>>>>>>>>>>>>>>>> + { > >>>>>>>>>>>>>>>>>>>> + int byte = (mi_delta >> (8 * (3 - i))) & > >>>>>>>>>>>>>>>>>>>> 0xff; > >>>>>>>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>>>>>>> + if (byte) > >>>>>>>>>>>>>>>>>>>> + { > >>>>>>>>>>>>>>>>>>>> + if (mov_done_p) > >>>>>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tadds\tr3, > >>>>>>>>>>>>>>>>>>>> #%d\n", byte); > >>>>>>>>>>>>>>>>>>>> + else > >>>>>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tmovs\tr3, > >>>>>>>>>>>>>>>>>>>> #%d\n", byte); > >>>>>>>>>>>>>>>>>>>> + mov_done_p = true; > >>>>>>>>>>>>>>>>>>>> + } > >>>>>>>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>>>>>>> + if (mov_done_p) > >>>>>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tlsls\tr3, #8\n"); > >>>>>>>>>>>>>>>>>>>> + } > >>>>>>>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>>>>>>> + /* Emit lower byte if needed. */ > >>>>>>>>>>>>>>>>>>>> + if (!mov_done_p) > >>>>>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tmovs\tr3, #%d\n", > >>>>>>>>>>>>>>>>>>>> mi_delta & 0xff); > >>>>>>>>>>>>>>>>>>>> + else if (mi_delta & 0xff) > >>>>>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tadds\tr3, #%d\n", > >>>>>>>>>>>>>>>>>>>> mi_delta & 0xff); > >>>>>>>>>>>>>>>>>>>> + } > >>>>>>>>>>>>>>>>>>>> + else > >>>>>>>>>>>>>>>>>>>> + { > >>>>>>>>>>>>>>>>>>>> + fputs ("\tldr\tr3, ", file); > >>>>>>>>>>>>>>>>>>>> + assemble_name (file, label); > >>>>>>>>>>>>>>>>>>>> + fputs ("+4\n", file); > >>>>>>>>>>>>>>>>>>>> + } > >>>>>>>>>>>>>>>>>>>> asm_fprintf (file, "\t%ss\t%r, %r, r3\n", > >>>>>>>>>>>>>>>>>>>> mi_op, this_regno, this_regno); > >>>>>>>>>>>>>>>>>>>> } > >>>>>>>>>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE > >>>>>>>>>>>>>>>>>>>> *file, tree, HOST_WIDE_INT delta, > >>>>>>>>>>>>>>>>>>>> fputs ("\tpop\t{r3}\n", file); > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> fprintf (file, "\tbx\tr12\n"); > >>>>>>>>>>>>>>>>>>>> - ASM_OUTPUT_ALIGN (file, 2); > >>>>>>>>>>>>>>>>>>>> - assemble_name (file, label); > >>>>>>>>>>>>>>>>>>>> - fputs (":\n", file); > >>>>>>>>>>>>>>>>>>>> - if (flag_pic) > >>>>>>>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>>>>>>> + /* With -mpure-code, we don't need to emit > >>>>>>>>>>>>>>>>>>>> literals for the > >>>>>>>>>>>>>>>>>>>> + function address and delta since we emitted code > >>>>>>>>>>>>>>>>>>>> to build > >>>>>>>>>>>>>>>>>>>> + them. */ > >>>>>>>>>>>>>>>>>>>> + if (!target_pure_code) > >>>>>>>>>>>>>>>>>>>> { > >>>>>>>>>>>>>>>>>>>> - /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ > >>>>>>>>>>>>>>>>>>>> - rtx tem = XEXP (DECL_RTL (function), 0); > >>>>>>>>>>>>>>>>>>>> - /* For TARGET_THUMB1_ONLY the thunk is in Thumb > >>>>>>>>>>>>>>>>>>>> mode, so the PC > >>>>>>>>>>>>>>>>>>>> - pipeline offset is four rather than eight. > >>>>>>>>>>>>>>>>>>>> Adjust the offset > >>>>>>>>>>>>>>>>>>>> - accordingly. */ > >>>>>>>>>>>>>>>>>>>> - tem = plus_constant (GET_MODE (tem), tem, > >>>>>>>>>>>>>>>>>>>> - TARGET_THUMB1_ONLY ? -3 : > >>>>>>>>>>>>>>>>>>>> -7); > >>>>>>>>>>>>>>>>>>>> - tem = gen_rtx_MINUS (GET_MODE (tem), > >>>>>>>>>>>>>>>>>>>> - tem, > >>>>>>>>>>>>>>>>>>>> - gen_rtx_SYMBOL_REF (Pmode, > >>>>>>>>>>>>>>>>>>>> - > >>>>>>>>>>>>>>>>>>>> ggc_strdup (labelpc))); > >>>>>>>>>>>>>>>>>>>> - assemble_integer (tem, 4, BITS_PER_WORD, 1); > >>>>>>>>>>>>>>>>>>>> - } > >>>>>>>>>>>>>>>>>>>> - else > >>>>>>>>>>>>>>>>>>>> - /* Output ".word .LTHUNKn". */ > >>>>>>>>>>>>>>>>>>>> - assemble_integer (XEXP (DECL_RTL (function), 0), > >>>>>>>>>>>>>>>>>>>> 4, BITS_PER_WORD, 1); > >>>>>>>>>>>>>>>>>>>> + ASM_OUTPUT_ALIGN (file, 2); > >>>>>>>>>>>>>>>>>>>> + assemble_name (file, label); > >>>>>>>>>>>>>>>>>>>> + fputs (":\n", file); > >>>>>>>>>>>>>>>>>>>> + if (flag_pic) > >>>>>>>>>>>>>>>>>>>> + { > >>>>>>>>>>>>>>>>>>>> + /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". > >>>>>>>>>>>>>>>>>>>> */ > >>>>>>>>>>>>>>>>>>>> + rtx tem = XEXP (DECL_RTL (function), 0); > >>>>>>>>>>>>>>>>>>>> + /* For TARGET_THUMB1_ONLY the thunk is in > >>>>>>>>>>>>>>>>>>>> Thumb mode, so the PC > >>>>>>>>>>>>>>>>>>>> + pipeline offset is four rather than > >>>>>>>>>>>>>>>>>>>> eight. Adjust the offset > >>>>>>>>>>>>>>>>>>>> + accordingly. */ > >>>>>>>>>>>>>>>>>>>> + tem = plus_constant (GET_MODE (tem), tem, > >>>>>>>>>>>>>>>>>>>> + TARGET_THUMB1_ONLY ? -3 > >>>>>>>>>>>>>>>>>>>> : -7); > >>>>>>>>>>>>>>>>>>>> + tem = gen_rtx_MINUS (GET_MODE (tem), > >>>>>>>>>>>>>>>>>>>> + tem, > >>>>>>>>>>>>>>>>>>>> + gen_rtx_SYMBOL_REF > >>>>>>>>>>>>>>>>>>>> (Pmode, > >>>>>>>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>>>>>>> ggc_strdup (labelpc))); > >>>>>>>>>>>>>>>>>>>> + assemble_integer (tem, 4, BITS_PER_WORD, 1); > >>>>>>>>>>>>>>>>>>>> + } > >>>>>>>>>>>>>>>>>>>> + else > >>>>>>>>>>>>>>>>>>>> + /* Output ".word .LTHUNKn". */ > >>>>>>>>>>>>>>>>>>>> + assemble_integer (XEXP (DECL_RTL (function), > >>>>>>>>>>>>>>>>>>>> 0), 4, BITS_PER_WORD, 1); > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> - if (TARGET_THUMB1_ONLY && mi_delta > 255) > >>>>>>>>>>>>>>>>>>>> - assemble_integer (GEN_INT(mi_delta), 4, > >>>>>>>>>>>>>>>>>>>> BITS_PER_WORD, 1); > >>>>>>>>>>>>>>>>>>>> + if (TARGET_THUMB1_ONLY && mi_delta > 255) > >>>>>>>>>>>>>>>>>>>> + assemble_integer (GEN_INT(mi_delta), 4, > >>>>>>>>>>>>>>>>>>>> BITS_PER_WORD, 1); > >>>>>>>>>>>>>>>>>>>> + } > >>>>>>>>>>>>>>>>>>>> } > >>>>>>>>>>>>>>>>>>>> else > >>>>>>>>>>>>>>>>>>>> { > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>>> > >>>> > >>>> +class thumb1_const_rtl > >>>> ... > >>>> + void mov (int val) > >>>> > >>>> This should take a HOST_WIDE_INT. Similarly for add and shift. The > >>>> same applies to the asm version as well. > >>>> > >>>> + asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val); > >>>> > >>>> Should be using reg_names[dst_regno] in all cases. In fact, you might > >>>> want to move that lookup to the constructor and just save a pointer to > >>>> the string there. You'll need to use HOST_WIDE_INT_PRINT_UNSIGNED > >>> > >>> Correction, for a (signed) HOST_WIDE_INT, this should be > >>> HOST_WIDE_INT_PRINT_DEC. > >>> > >> > >> Right, but if "val" is unsigned HOST_WIDE_INT, all the methods in the > >> two classes > >> can have unsigned HOST_WIDE_INT parameters, and thus use > >> HOST_WIDE_INT_PRINT_UNSIGNED? > >> > > > > It's generally safer to print it as a signed value. We probably won't > > have cases where the top bit of a 32-bit word are set here; but if you > > do, and the value is unsigned, you end up with 16 digit hex numbers > > rather than the 8 you'd expect on a 32-bit target because HOST_WIDE_INT > > is at least 64 bits in size. > > > > I don't mean hex numbers, of course, just very large decimal numbers. > But the point is still the same. >
Here is an updated version, which hopefully addresses your comments, and adds testcases for cortex-m0 and cortex-m23. Christophe > R. > > > R. > > > >> > >>>> rather than "%d" for the immediate. > >>>> > >>>> +template <class T> > >>>> +void > >>>> +thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1) > >>>> +{ > >>>> + bool mov_done_p = false; > >>>> + int val = op1; > >>>> > >>>> This potentially silently loses precision. In fact, I think you really > >>>> want to use "unsigned HOST_WIDE_INT" throughout the following code, so > >>>> that the right shifts aren't undefined if dealing with negative numbers. > >>>> > >>>> For safety, you should also have an assertion in here that > >>>> > >>>> op1 == trunc_int_for_mode (op1, SImode) > >>>> > >>>> + int shift = 0; > >>>> + int i; > >>>> + > >>>> + if (val == 0) > >>>> > >>>> You can short-circuit 0..255 here for a quick exit. > >>>> > >>>> + { > >>>> + dst.mov (val); > >>>> + return; > >>>> + } > >>>> > >>>> Another trick: if the top nine bits of the 32-bit value are all set, > >>>> you're probably going to be better off (and certainly not worse off) by > >>>> generating -op1 and then negating the result in a final step - you can > >>>> do that via recursion. > >>>> > >>>> + > >>>> + /* In the general case, we need 7 instructions to build > >>>> + a 32 bits constant (1 movs, 3 lsls, 3 adds). We can > >>>> + do better if VAL is small enough, or > >>>> + right-shiftable by a suitable amount. If the > >>>> + right-shift enables to encode at least one less byte, > >>>> + it's worth it: we save a adds and a lsls at the > >>>> + expense of a final lsls. */ > >>>> + int final_shift = number_of_first_bit_set (val); > >>>> + > >>>> + int leading_zeroes = clz_hwi (val); > >>>> + int number_of_bytes_needed > >>>> + = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes) > >>>> + / BITS_PER_UNIT) + 1; > >>>> + int number_of_bytes_needed2 > >>>> + = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift) > >>>> + / BITS_PER_UNIT) + 1; > >>>> + > >>>> + if (number_of_bytes_needed2 < number_of_bytes_needed) > >>>> + val >>= final_shift; > >>>> + else > >>>> + final_shift = 0; > >>>> + > >>>> + /* If we are in a very small range, we can use either a single movs > >>>> + or movs+adds. */ > >>>> + if ((val >= 0) && (val <= 510)) > >>>> > >>>> if val is made unsigned HWI as I suggest, the lower bounds test is not > >>>> needed. > >>>> > >>>> + { > >>>> + if (val > 255) > >>>> + { > >>>> + int high = val - 255; > >>>> > >>>> Again, watch your types. > >>>> > >>>> + > >>>> + dst.mov (high); > >>>> + dst.add (255); > >>>> + } > >>>> + else > >>>> + dst.mov (val); > >>>> + > >>>> + if (final_shift > 0) > >>>> + dst.ashift (final_shift); > >>>> + } > >>>> + else > >>>> + { > >>>> + /* General case, emit upper 3 bytes as needed. */ > >>>> + for (i = 0; i < 3; i++) > >>>> + { > >>>> + int byte = (val >> (8 * (3 - i))) & 0xff; > >>>> > >>>> and here. > >>>> > >>>> + > >>>> + if (byte) > >>>> + { > >>>> + /* We are about to emit new bits, stop accumulating a > >>>> + shift amount, and left-shift only if we have already > >>>> + emitted some upper bits. */ > >>>> + if (mov_done_p) > >>>> + { > >>>> + dst.ashift (shift); > >>>> + dst.add (byte); > >>>> + } > >>>> + else > >>>> + dst.mov (byte); > >>>> + > >>>> + /* Stop accumulating shift amount since we've just > >>>> + emitted some bits. */ > >>>> + shift = 0; > >>>> + > >>>> + mov_done_p = true; > >>>> + } > >>>> + > >>>> + if (mov_done_p) > >>>> + shift += 8; > >>>> + } > >>>> + > >>>> + /* Emit lower byte. */ > >>>> + if (!mov_done_p) > >>>> + dst.mov (val & 0xff); > >>>> + else > >>>> + { > >>>> + dst.ashift (shift); > >>>> + if (val & 0xff) > >>>> + dst.add (val & 0xff); > >>>> + } > >>>> + > >>>> + if (final_shift > 0) > >>>> + dst.ashift (final_shift); > >>>> + } > >>>> +} > >>>> + > >>>> > >>> > > >
From 8e4298b4d46b6d2555c70aa6ce95d9afda4dfa69 Mon Sep 17 00:00:00 2001 From: Christophe Lyon <christophe.l...@linaro.org> Date: Fri, 9 Oct 2020 12:42:39 +0000 Subject: [PATCH 1/3] arm: Improve thumb1_gen_const_int Enable thumb1_gen_const_int to generate RTL or asm depending on the context, so that we avoid duplicating code to handle constants in Thumb-1 with -mpure-code. Use a template so that the algorithm is effectively shared, and rely on two classes to handle the actual emission as RTL or asm. The generated sequence is improved to handle right-shiftable and small values with less instructions. We now generate: 128: movs r0, r0, #128 264: movs r3, #33 lsls r3, #3 510: movs r3, #255 lsls r3, #1 512: movs r3, #1 lsls r3, #9 764: movs r3, #191 lsls r3, #2 65536: movs r3, #1 lsls r3, #16 0x123456: movs r3, #18 ;0x12 lsls r3, #8 adds r3, #52 ;0x34 lsls r3, #8 adds r3, #86 ;0x56 0x1123456: movs r3, #137 ;0x89 lsls r3, #8 adds r3, #26 ;0x1a lsls r3, #8 adds r3, #43 ;0x2b lsls r3, #1 0x1000010: movs r3, #16 lsls r3, #16 adds r3, #1 lsls r3, #4 0x1000011: movs r3, #1 lsls r3, #24 adds r3, #17 -8192: movs r3, #1 lsls r3, #13 rsbs r3, #0 The patch adds a testcase which does not fully exercise thumb1_gen_const_int, as other existing patterns already catch small constants. These parts of thumb1_gen_const_int are used by arm_thumb1_mi_thunk. 2020-11-02 Christophe Lyon <christophe.l...@linaro.org> gcc/ * config/arm/arm.c (thumb1_const_rtl, thumb1_const_print): New classes. (thumb1_gen_const_int): Rename to ... (thumb1_gen_const_int_1): ... New helper function. Add capability to emit either RTL or asm, improve generated code. (thumb1_gen_const_int_rtl): New function. * gcc/config/arm/arm-protos.h (thumb1_gen_const_int): Rename to thumb1_gen_const_int_rtl. * gcc/config/arm/thumb1.md: Call thumb1_gen_const_int_rtl instead of thumb1_gen_const_int. gcc/testsuite/ * gcc.target/arm/pure-code/no-literal-pool-m0.c: New. --- gcc/config/arm/arm-protos.h | 2 +- gcc/config/arm/arm.c | 224 ++++++++++++++++++--- gcc/config/arm/thumb1.md | 2 +- .../gcc.target/arm/pure-code/no-literal-pool-m0.c | 175 ++++++++++++++++ 4 files changed, 369 insertions(+), 34 deletions(-) create mode 100644 gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m0.c diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h index 0cc0ae7..23b329d 100644 --- a/gcc/config/arm/arm-protos.h +++ b/gcc/config/arm/arm-protos.h @@ -74,7 +74,7 @@ extern bool arm_small_register_classes_for_mode_p (machine_mode); extern int const_ok_for_arm (HOST_WIDE_INT); extern int const_ok_for_op (HOST_WIDE_INT, enum rtx_code); extern int const_ok_for_dimode_op (HOST_WIDE_INT, enum rtx_code); -extern void thumb1_gen_const_int (rtx, HOST_WIDE_INT); +extern void thumb1_gen_const_int_rtl (rtx, HOST_WIDE_INT); extern int arm_split_constant (RTX_CODE, machine_mode, rtx, HOST_WIDE_INT, rtx, rtx, int); extern int legitimate_pic_operand_p (rtx); diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index ceeb91f..29fa536 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -4513,38 +4513,6 @@ const_ok_for_dimode_op (HOST_WIDE_INT i, enum rtx_code code) } } -/* Emit a sequence of movs/adds/shift to produce a 32-bit constant. - Avoid generating useless code when one of the bytes is zero. */ -void -thumb1_gen_const_int (rtx op0, HOST_WIDE_INT op1) -{ - bool mov_done_p = false; - int i; - - /* Emit upper 3 bytes if needed. */ - for (i = 0; i < 3; i++) - { - int byte = (op1 >> (8 * (3 - i))) & 0xff; - - if (byte) - { - emit_set_insn (op0, mov_done_p - ? gen_rtx_PLUS (SImode,op0, GEN_INT (byte)) - : GEN_INT (byte)); - mov_done_p = true; - } - - if (mov_done_p) - emit_set_insn (op0, gen_rtx_ASHIFT (SImode, op0, GEN_INT (8))); - } - - /* Emit lower byte if needed. */ - if (!mov_done_p) - emit_set_insn (op0, GEN_INT (op1 & 0xff)); - else if (op1 & 0xff) - emit_set_insn (op0, gen_rtx_PLUS (SImode, op0, GEN_INT (op1 & 0xff))); -} - /* Emit a sequence of insns to handle a large constant. CODE is the code of the operation required, it can be any of SET, PLUS, IOR, AND, XOR, MINUS; @@ -28244,6 +28212,198 @@ arm_internal_label (FILE *stream, const char *prefix, unsigned long labelno) default_internal_label (stream, prefix, labelno); } +/* Define classes to generate code as RTL or output asm to a file. + Using templates then allows to use the same code to output code + sequences in the two formats. */ +class thumb1_const_rtl +{ + public: + thumb1_const_rtl (rtx dst) : dst (dst) {} + + void mov (HOST_WIDE_INT val) + { + emit_set_insn (dst, GEN_INT (val)); + } + + void add (HOST_WIDE_INT val) + { + emit_set_insn (dst, gen_rtx_PLUS (SImode, dst, GEN_INT (val))); + } + + void ashift (HOST_WIDE_INT shift) + { + emit_set_insn (dst, gen_rtx_ASHIFT (SImode, dst, GEN_INT (shift))); + } + + void neg () + { + emit_set_insn (dst, gen_rtx_NEG (SImode, dst)); + } + + private: + rtx dst; +}; + +class thumb1_const_print +{ + public: + thumb1_const_print (FILE *f, int regno) + { + t_file = f; + dst_regname = reg_names[regno]; + } + + void mov (HOST_WIDE_INT val) + { + asm_fprintf (t_file, "\tmovs\t%s, #" HOST_WIDE_INT_PRINT_DEC "\n", + dst_regname, val); + } + + void add (HOST_WIDE_INT val) + { + asm_fprintf (t_file, "\tadds\t%s, #" HOST_WIDE_INT_PRINT_DEC "\n", + dst_regname, val); + } + + void ashift (HOST_WIDE_INT shift) + { + asm_fprintf (t_file, "\tlsls\t%s, #" HOST_WIDE_INT_PRINT_DEC "\n", + dst_regname, shift); + } + + void neg () + { + asm_fprintf (t_file, "\trsbs\t%s, #0\n", dst_regname); + } + + private: + FILE *t_file; + const char *dst_regname; +}; + +/* Emit a sequence of movs/adds/shift to produce a 32-bit constant. + Avoid generating useless code when one of the bytes is zero. */ +template <class T> +void +thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1) +{ + bool mov_done_p = false; + unsigned HOST_WIDE_INT val = op1; + int shift = 0; + int i; + + gcc_assert (op1 == trunc_int_for_mode (op1, SImode)); + + if (val <= 255) + { + dst.mov (val); + return; + } + + /* For negative numbers with the first nine bits set, build the + opposite of OP1, then negate it, it's generally shorter and not + longer. */ + if ((val & 0xFF800000) == 0xFF800000) + { + thumb1_gen_const_int_1 (dst, -op1); + dst.neg (); + return; + } + + /* In the general case, we need 7 instructions to build + a 32 bits constant (1 movs, 3 lsls, 3 adds). We can + do better if VAL is small enough, or + right-shiftable by a suitable amount. If the + right-shift enables to encode at least one less byte, + it's worth it: we save a adds and a lsls at the + expense of a final lsls. */ + int final_shift = number_of_first_bit_set (val); + + int leading_zeroes = clz_hwi (val); + int number_of_bytes_needed + = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes) + / BITS_PER_UNIT) + 1; + int number_of_bytes_needed2 + = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift) + / BITS_PER_UNIT) + 1; + + if (number_of_bytes_needed2 < number_of_bytes_needed) + val >>= final_shift; + else + final_shift = 0; + + /* If we are in a very small range, we can use either a single movs + or movs+adds. */ + if (val <= 510) + { + if (val > 255) + { + unsigned HOST_WIDE_INT high = val - 255; + + dst.mov (high); + dst.add (255); + } + else + dst.mov (val); + + if (final_shift > 0) + dst.ashift (final_shift); + } + else + { + /* General case, emit upper 3 bytes as needed. */ + for (i = 0; i < 3; i++) + { + unsigned HOST_WIDE_INT byte = (val >> (8 * (3 - i))) & 0xff; + + if (byte) + { + /* We are about to emit new bits, stop accumulating a + shift amount, and left-shift only if we have already + emitted some upper bits. */ + if (mov_done_p) + { + dst.ashift (shift); + dst.add (byte); + } + else + dst.mov (byte); + + /* Stop accumulating shift amount since we've just + emitted some bits. */ + shift = 0; + + mov_done_p = true; + } + + if (mov_done_p) + shift += 8; + } + + /* Emit lower byte. */ + if (!mov_done_p) + dst.mov (val & 0xff); + else + { + dst.ashift (shift); + if (val & 0xff) + dst.add (val & 0xff); + } + + if (final_shift > 0) + dst.ashift (final_shift); + } +} + +/* Proxy for thumb1.md, since the thumb1_const_print and + thumb1_const_rtl classes are not exported. */ +void +thumb1_gen_const_int_rtl (rtx dst, HOST_WIDE_INT op1) +{ + thumb1_const_rtl t (dst); + thumb1_gen_const_int_1 (t, op1); +} + /* Output code to add DELTA to the first argument, and then jump to FUNCTION. Used for C++ multiple inheritance. */ diff --git a/gcc/config/arm/thumb1.md b/gcc/config/arm/thumb1.md index 2258a52..b32dd4f 100644 --- a/gcc/config/arm/thumb1.md +++ b/gcc/config/arm/thumb1.md @@ -820,7 +820,7 @@ (define_split && !satisfies_constraint_K (operands[1])" [(clobber (const_int 0))] " - thumb1_gen_const_int (operands[0], INTVAL (operands[1])); + thumb1_gen_const_int_rtl (operands[0], INTVAL (operands[1])); DONE; " ) diff --git a/gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m0.c b/gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m0.c new file mode 100644 index 0000000..787a61a --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m0.c @@ -0,0 +1,175 @@ +/* { dg-do compile } */ +/* { dg-options "-mpure-code -mcpu=cortex-m0 -march=armv6s-m -mthumb" } */ +/* { dg-final { check-function-bodies "**" "" } } */ + +/* Does not use thumb1_gen_const_int. +** test_0: +** ... +** movs r[0-3], #0 +** ... +*/ +int +test_0 () +{ + return 0; +} + +/* Does not use thumb1_gen_const_int. +** test_128: +** ... +** movs r[0-3], #128 +** ... +*/ +int +test_128 () +{ + return 128; +} + +/* Does not use thumb1_gen_const_int. +** test_264: +** ... +** movs r[0-3], #132 +** lsls r[0-3], r[0-3], #1 +** ... +*/ +int +test_264 () +{ + return 264; +} + +/* Does not use thumb1_gen_const_int. +** test_510: +** ... +** movs r[0-3], #255 +** lsls r[0-3], r[0-3], #1 +** ... +*/ +int +test_510 () +{ + return 510; +} + +/* Does not use thumb1_gen_const_int. +** test_512: +** ... +** movs r[0-3], #128 +** lsls r[0-3], r[0-3], #2 +** ... +*/ +int +test_512 () +{ + return 512; +} + +/* Does not use thumb1_gen_const_int. +** test_764: +** ... +** movs r[0-3], #191 +** lsls r[0-3], r[0-3], #2 +** ... +*/ +int +test_764 () +{ + return 764; +} + +/* Does not use thumb1_gen_const_int. +** test_65536: +** ... +** movs r[0-3], #128 +** lsls r[0-3], r[0-3], #9 +** ... +*/ +int +test_65536 () +{ + return 65536; +} + +/* +** test_0x123456: +** ... +** movs r[0-3], #18 +** lsls r[0-3], r[0-3], #8 +** adds r[0-3], r[0-3], #52 +** lsls r[0-3], r[0-3], #8 +** adds r[0-3], r[0-3], #86 +** ... +*/ +int +test_0x123456 () +{ + return 0x123456; +} + +/* +** test_0x1123456: +** ... +** movs r[0-3], #137 +** lsls r[0-3], r[0-3], #8 +** adds r[0-3], r[0-3], #26 +** lsls r[0-3], r[0-3], #8 +** adds r[0-3], r[0-3], #43 +** lsls r[0-3], r[0-3], #1 +** ... +*/ +int +test_0x1123456 () +{ + return 0x1123456; +} + +/* With -Os, we generate: + movs r0, #16 + lsls r0, r0, r0 + With the other optimization levels, we generate: + movs r0, #16 + lsls r0, r0, #16 + hence the two alternatives. */ +/* +** test_0x1000010: +** ... +** movs r[0-3], #16 +** lsls r[0-3], r[0-3], (#16|r[0-3]) +** adds r[0-3], r[0-3], #1 +** lsls r[0-3], r[0-3], #4 +** ... +*/ +int +test_0x1000010 () +{ + return 0x1000010; +} + +/* +** test_0x1000011: +** ... +** movs r[0-3], #1 +** lsls r[0-3], r[0-3], #24 +** adds r[0-3], r[0-3], #17 +** ... +*/ +int +test_0x1000011 () +{ + return 0x1000011; +} + +/* +** test_m8192: +** ... +** movs r[0-3], #1 +** lsls r[0-3], r[0-3], #13 +** rsbs r[0-3], r[0-3], #0 +** ... +*/ +int +test_m8192 () +{ + return -8192; +} -- 2.7.4
From 4e91a0986014446071a031e29ea0efbc90dda407 Mon Sep 17 00:00:00 2001 From: Christophe Lyon <christophe.l...@linaro.org> Date: Tue, 27 Oct 2020 16:16:08 +0000 Subject: [PATCH 2/3] arm: Call thumb1_gen_const_int from thumb1_movsi_insn thumb1_movsi_insn used the same algorithm to build a constant in asm than thumb1_gen_const_int_1 does in RTL. Since the previous patch added support for asm generation in thumb1_gen_const_int_1, this patch calls it from thumb1_movsi_insn to avoid duplication. We need to introduce a new proxy function, thumb1_gen_const_int_print to select the right template. This patch also adds a new testcase as the updated alternative is only used by thumb-1 processors that also support movt/movw. 2020-11-02 Christophe Lyon <christophe.l...@linaro.org> gcc/ * config/arm/thumb1.md (thumb1_movsi_insn): Call thumb1_gen_const_int_print. * config/arm/arm-protos.h (thumb1_gen_const_int_print): Add prototype. * config/arm/arm.c (thumb1_gen_const_int_print): New. gcc/testsuite/ * gcc.target/arm/pure-code/no-literal-pool-m23.c: New. --- gcc/config/arm/arm-protos.h | 1 + gcc/config/arm/arm.c | 9 +- gcc/config/arm/thumb1.md | 37 +---- .../gcc.target/arm/pure-code/no-literal-pool-m23.c | 171 +++++++++++++++++++++ 4 files changed, 184 insertions(+), 34 deletions(-) create mode 100644 gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m23.c diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h index 23b329d..d09e6a2 100644 --- a/gcc/config/arm/arm-protos.h +++ b/gcc/config/arm/arm-protos.h @@ -75,6 +75,7 @@ extern int const_ok_for_arm (HOST_WIDE_INT); extern int const_ok_for_op (HOST_WIDE_INT, enum rtx_code); extern int const_ok_for_dimode_op (HOST_WIDE_INT, enum rtx_code); extern void thumb1_gen_const_int_rtl (rtx, HOST_WIDE_INT); +extern void thumb1_gen_const_int_print (rtx, HOST_WIDE_INT); extern int arm_split_constant (RTX_CODE, machine_mode, rtx, HOST_WIDE_INT, rtx, rtx, int); extern int legitimate_pic_operand_p (rtx); diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index 29fa536..584714b 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -28395,7 +28395,7 @@ thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1) } } -/* Proxy for thumb1.md, since the thumb1_const_print and +/* Proxies for thumb1.md, since the thumb1_const_print and thumb1_const_rtl classes are not exported. */ void thumb1_gen_const_int_rtl (rtx dst, HOST_WIDE_INT op1) @@ -28404,6 +28404,13 @@ thumb1_gen_const_int_rtl (rtx dst, HOST_WIDE_INT op1) thumb1_gen_const_int_1 (t, op1); } +void +thumb1_gen_const_int_print (rtx dst, HOST_WIDE_INT op1) +{ + thumb1_const_print t (asm_out_file, REGNO (dst)); + thumb1_gen_const_int_1 (t, op1); +} + /* Output code to add DELTA to the first argument, and then jump to FUNCTION. Used for C++ multiple inheritance. */ diff --git a/gcc/config/arm/thumb1.md b/gcc/config/arm/thumb1.md index b32dd4f..52468da 100644 --- a/gcc/config/arm/thumb1.md +++ b/gcc/config/arm/thumb1.md @@ -688,40 +688,11 @@ (define_insn "*thumb1_movsi_insn" } else if (GET_CODE (operands[1]) == CONST_INT) { - int i; - HOST_WIDE_INT op1 = INTVAL (operands[1]); - bool mov_done_p = false; - rtx ops[2]; - ops[0] = operands[0]; - - /* Emit upper 3 bytes if needed. */ - for (i = 0; i < 3; i++) - { - int byte = (op1 >> (8 * (3 - i))) & 0xff; - - if (byte) - { - ops[1] = GEN_INT (byte); - if (mov_done_p) - output_asm_insn ("adds\t%0, %1", ops); - else - output_asm_insn ("movs\t%0, %1", ops); - mov_done_p = true; - } - - if (mov_done_p) - output_asm_insn ("lsls\t%0, #8", ops); - } - - /* Emit lower byte if needed. */ - ops[1] = GEN_INT (op1 & 0xff); - if (!mov_done_p) - output_asm_insn ("movs\t%0, %1", ops); - else if (op1 & 0xff) - output_asm_insn ("adds\t%0, %1", ops); - return ""; + thumb1_gen_const_int_print (operands[0], INTVAL (operands[1])); + return \"\"; } - gcc_unreachable (); + + gcc_unreachable (); case 8: return "ldr\t%0, %1"; case 9: return "str\t%1, %0"; diff --git a/gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m23.c b/gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m23.c new file mode 100644 index 0000000..67d63d2 --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m23.c @@ -0,0 +1,171 @@ +/* { dg-do compile } */ +/* { dg-options "-mpure-code -mcpu=cortex-m23 -march=armv8-m.base -mthumb" } */ +/* { dg-final { check-function-bodies "**" "" } } */ + +/* +** testi: +** ... +** movs r[0-3], #1 +** lsls r[0-3], #13 +** rsbs r[0-3], #0 +** ... +*/ +int +testi (int *p) +{ + if (*p > 0x12345678) + return *p-8192; + else + return *p+8192; +} + +/* Does not use thumb1_gen_const_int. +** test_0: +** ... +** movs r[0-3], #0 +** ... +*/ +int +test_0 () +{ + return 0; +} + +/* Does not use thumb1_gen_const_int. +** test_128: +** ... +** movs r[0-3], #128 +** ... +*/ +int +test_128 () +{ + return 128; +} + +/* Does not use thumb1_gen_const_int. +** test_264: +** ... +** movw r[0-3], #264 +** ... +*/ +int +test_264 () +{ + return 264; +} + +/* Does not use thumb1_gen_const_int. +** test_510: +** ... +** movw r[0-3], #510 +** ... +*/ +int +test_510 () +{ + return 510; +} + +/* Does not use thumb1_gen_const_int. +** test_512: +** ... +** movw r[0-3], #512 +** ... +*/ +int +test_512 () +{ + return 512; +} + +/* Does not use thumb1_gen_const_int. +** test_764: +** ... +** movw r[0-3], #764 +** ... +*/ +int +test_764 () +{ + return 764; +} + +/* Does not use thumb1_gen_const_int. +** test_65536: +** ... +** movs r[0-3], #128 +** lsls r[0-3], r[0-3], #9 +** ... +*/ +int +test_65536 () +{ + return 65536; +} + +/* Does not use thumb1_gen_const_int. +** test_0x123456: +** ... +** movw r[0-3], #13398 +** movt r[0-3], 18 +** ... +*/ +int +test_0x123456 () +{ + return 0x123456; +} + +/* Does not use thumb1_gen_const_int. +** test_0x1123456: +** ... +** movw r[0-3], #13398 +** movt r[0-3], 274 +** ... +*/ +int +test_0x1123456 () +{ + return 0x1123456; +} + +/* Does not use thumb1_gen_const_int. +** test_0x1000010: +** ... +** movs r[0-3], #16 +** movt r[0-3], 256 +** ... +*/ +int +test_0x1000010 () +{ + return 0x1000010; +} + +/* Does not use thumb1_gen_const_int. +** test_0x1000011: +** ... +** movs r[0-3], #17 +** movt r[0-3], 256 +** ... +*/ +int +test_0x1000011 () +{ + return 0x1000011; +} + +/* +** test_m8192: +** ... +** movs r[0-3], #1 +** lsls r[0-3], #13 +** rsbs r[0-3], #0 +** ... +*/ +int +test_m8192 () +{ + return -8192; +} -- 2.7.4
From 696d56ed48b6b7ac3293212118d3f157e8f25477 Mon Sep 17 00:00:00 2001 From: Christophe Lyon <christophe.l...@linaro.org> Date: Tue, 29 Sep 2020 19:40:39 +0000 Subject: [PATCH 3/3] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code When -mpure-code is used, we cannot load delta from code memory (like we do without -mpure-code). This patch builds the value of mi_delta into r3 with a series of movs/adds/lsls. We also do some cleanup by not emitting the function address and delta via .word directives at the end of the thunk since we don't use them with -mpure-code. No need for new testcases, this bug was already identified by: g++.dg/ipa/pr46287-3.C g++.dg/ipa/pr46984.C g++.dg/opt/thunk1.C g++.dg/torture/pr46287.C g++.dg/torture/pr45699.C 2020-11-02 Christophe Lyon <christophe.l...@linaro.org> gcc/ * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and do not emit function address and delta when -mpure-code is used. --- gcc/config/arm/arm.c | 67 ++++++++++++++++++++++++++++++++-------------------- 1 file changed, 42 insertions(+), 25 deletions(-) diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index 584714b..fc88071 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -28509,9 +28509,19 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta, { if (mi_delta > 255) { - fputs ("\tldr\tr3, ", file); - assemble_name (file, label); - fputs ("+4\n", file); + /* With -mpure-code, we cannot load MI_DELTA from the + constant pool: we build it explicitly. */ + if (target_pure_code) + { + thumb1_const_print r3 (file, 3); + thumb1_gen_const_int_1 (r3, mi_delta); + } + else + { + fputs ("\tldr\tr3, ", file); + assemble_name (file, label); + fputs ("+4\n", file); + } asm_fprintf (file, "\t%ss\t%r, %r, r3\n", mi_op, this_regno, this_regno); } @@ -28547,30 +28557,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta, fputs ("\tpop\t{r3}\n", file); fprintf (file, "\tbx\tr12\n"); - ASM_OUTPUT_ALIGN (file, 2); - assemble_name (file, label); - fputs (":\n", file); - if (flag_pic) + + /* With -mpure-code, we don't need to emit literals for the + function address and delta since we emitted code to build + them. */ + if (!target_pure_code) { - /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ - rtx tem = XEXP (DECL_RTL (function), 0); - /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC - pipeline offset is four rather than eight. Adjust the offset - accordingly. */ - tem = plus_constant (GET_MODE (tem), tem, - TARGET_THUMB1_ONLY ? -3 : -7); - tem = gen_rtx_MINUS (GET_MODE (tem), - tem, - gen_rtx_SYMBOL_REF (Pmode, - ggc_strdup (labelpc))); - assemble_integer (tem, 4, BITS_PER_WORD, 1); - } - else - /* Output ".word .LTHUNKn". */ - assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1); + ASM_OUTPUT_ALIGN (file, 2); + assemble_name (file, label); + fputs (":\n", file); + if (flag_pic) + { + /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ + rtx tem = XEXP (DECL_RTL (function), 0); + /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC + pipeline offset is four rather than eight. Adjust the offset + accordingly. */ + tem = plus_constant (GET_MODE (tem), tem, + TARGET_THUMB1_ONLY ? -3 : -7); + tem = gen_rtx_MINUS (GET_MODE (tem), + tem, + gen_rtx_SYMBOL_REF (Pmode, + ggc_strdup (labelpc))); + assemble_integer (tem, 4, BITS_PER_WORD, 1); + } + else + /* Output ".word .LTHUNKn". */ + assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1); - if (TARGET_THUMB1_ONLY && mi_delta > 255) - assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1); + if (TARGET_THUMB1_ONLY && mi_delta > 255) + assemble_integer (GEN_INT (mi_delta), 4, BITS_PER_WORD, 1); + } } else { -- 2.7.4