On 29/10/2020 19:18, Richard Earnshaw via Gcc-patches wrote: > On 28/10/2020 18:10, Christophe Lyon via Gcc-patches wrote: >> On Wed, 28 Oct 2020 at 18:44, Richard Earnshaw >> <richard.earns...@foss.arm.com> wrote: >>> >>> On 27/10/2020 15:42, Richard Earnshaw via Gcc-patches wrote: >>>> On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote: >>>>> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw >>>>> <richard.earns...@foss.arm.com> wrote: >>>>>> >>>>>> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote: >>>>>>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw >>>>>>> <richard.earns...@foss.arm.com> wrote: >>>>>>>> >>>>>>>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote: >>>>>>>>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw >>>>>>>>> <richard.earns...@foss.arm.com> wrote: >>>>>>>>>> >>>>>>>>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw >>>>>>>>>>> <richard.earns...@foss.arm.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>> On 20/10/2020 12:22, Richard Earnshaw wrote: >>>>>>>>>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw >>>>>>>>>>>>>> <richard.earns...@foss.arm.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw >>>>>>>>>>>>>>>> <richard.earns...@foss.arm.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw >>>>>>>>>>>>>>>>>> <richard.earns...@foss.arm.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot >>>>>>>>>>>>>>>>>>>> load delta >>>>>>>>>>>>>>>>>>>> from code memory (like we do without -mpure-code). >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a >>>>>>>>>>>>>>>>>>>> series of >>>>>>>>>>>>>>>>>>>> movs/adds/lsls. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> We also do some cleanup by not emitting the function >>>>>>>>>>>>>>>>>>>> address and delta >>>>>>>>>>>>>>>>>>>> via .word directives at the end of the thunk since we >>>>>>>>>>>>>>>>>>>> don't use them >>>>>>>>>>>>>>>>>>>> with -mpure-code. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> No need for new testcases, this bug was already identified >>>>>>>>>>>>>>>>>>>> by >>>>>>>>>>>>>>>>>>>> eg. pr46287-3.C >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> 2020-09-29 Christophe Lyon <christophe.l...@linaro.org> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> gcc/ >>>>>>>>>>>>>>>>>>>> * config/arm/arm.c (arm_thumb1_mi_thunk): Build >>>>>>>>>>>>>>>>>>>> mi_delta in r3 and >>>>>>>>>>>>>>>>>>>> do not emit function address and delta when >>>>>>>>>>>>>>>>>>>> -mpure-code is used. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi Richard, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks for your comments. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> There are some optimizations you can make to this code. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it >>>>>>>>>>>>>>>>>>> would be better >>>>>>>>>>>>>>>>>>> to just expand a mov of 255 followed by an add. >>>>>>>>>>>>>>>>>> I now see the splitted for the "Pe" constraint which I >>>>>>>>>>>>>>>>>> hadn't noticed >>>>>>>>>>>>>>>>>> before, so I can write something similar indeed. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> However, I'm note quite sure to understand the benefit in >>>>>>>>>>>>>>>>>> the split >>>>>>>>>>>>>>>>>> when -mpure-code is NOT used. >>>>>>>>>>>>>>>>>> Consider: >>>>>>>>>>>>>>>>>> int f3_1 (void) { return 510; } >>>>>>>>>>>>>>>>>> int f3_2 (void) { return 511; } >>>>>>>>>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0: >>>>>>>>>>>>>>>>>> f3_1: >>>>>>>>>>>>>>>>>> movs r0, #255 >>>>>>>>>>>>>>>>>> lsls r0, r0, #1 >>>>>>>>>>>>>>>>>> bx lr >>>>>>>>>>>>>>>>>> f3_2: >>>>>>>>>>>>>>>>>> ldr r0, .L4 >>>>>>>>>>>>>>>>>> bx lr >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The splitter makes the code bigger, does it "compensate" for >>>>>>>>>>>>>>>>>> this by >>>>>>>>>>>>>>>>>> not having to load the constant? >>>>>>>>>>>>>>>>>> Actually the constant uses 4 more bytes, which should be >>>>>>>>>>>>>>>>>> taken into >>>>>>>>>>>>>>>>>> account when comparing code size, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Yes, the size of the literal pool entry needs to be taken >>>>>>>>>>>>>>>>> into account. >>>>>>>>>>>>>>>>> It might happen that the entry could be shared with another >>>>>>>>>>>>>>>>> use of that >>>>>>>>>>>>>>>>> literal, but in general that's rare. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below >>>>>>>>>>>>>>>>>> three >>>>>>>>>>>>>>>>>> thumb1 instructions would be equivalent in size compared to >>>>>>>>>>>>>>>>>> loading >>>>>>>>>>>>>>>>>> from the literal pool. Should the 256-510 range be extended? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> It's a bit borderline at three instructions when literal >>>>>>>>>>>>>>>>> pools are not >>>>>>>>>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be >>>>>>>>>>>>>>>>> quite small due >>>>>>>>>>>>>>>>> to the limited pc offsets we can use. I think on balance we >>>>>>>>>>>>>>>>> probably >>>>>>>>>>>>>>>>> want to use the instruction sequence unless optimizing for >>>>>>>>>>>>>>>>> size. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> This is also true for >>>>>>>>>>>>>>>>>>> the literal pools alternative as well, so should be handled >>>>>>>>>>>>>>>>>>> before all >>>>>>>>>>>>>>>>>>> this. >>>>>>>>>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above >>>>>>>>>>>>>>>>>> sample is compiled as: >>>>>>>>>>>>>>>>>> f3_1: >>>>>>>>>>>>>>>>>> movs r0, #255 >>>>>>>>>>>>>>>>>> lsls r0, r0, #1 >>>>>>>>>>>>>>>>>> bx lr >>>>>>>>>>>>>>>>>> f3_2: >>>>>>>>>>>>>>>>>> movs r0, #1 >>>>>>>>>>>>>>>>>> lsls r0, r0, #8 >>>>>>>>>>>>>>>>>> adds r0, r0, #255 >>>>>>>>>>>>>>>>>> bx lr >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> so the "return 510" case is already handled as without >>>>>>>>>>>>>>>>>> -mpure-code. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I was thinking specifically of the thunk sequence where you >>>>>>>>>>>>>>>>> seem to be >>>>>>>>>>>>>>>>> emitting instructions directly rather than generating RTL. >>>>>>>>>>>>>>>>> The examples >>>>>>>>>>>>>>>>> you show here are not thunks. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> OK thanks for the clarification. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Here is an updated version, split into 3 patches to hopefully >>>>>>>>>>>>>>>> make >>>>>>>>>>>>>>>> review easier. >>>>>>>>>>>>>>>> They apply on top of my other mpure-code patches for PR96967 >>>>>>>>>>>>>>>> and PR96770: >>>>>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html >>>>>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I kept it this way to make incremental changes easier to >>>>>>>>>>>>>>>> understand. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance >>>>>>>>>>>>>>>> easier, >>>>>>>>>>>>>>>> I have updated thumb1_gen_const_int() so that it can generate >>>>>>>>>>>>>>>> either RTL or >>>>>>>>>>>>>>>> asm. This way, all the code used to build thumb-1 constants is >>>>>>>>>>>>>>>> in the >>>>>>>>>>>>>>>> same place, >>>>>>>>>>>>>>>> in case we need to improve/fix it later. We now generate >>>>>>>>>>>>>>>> shorter sequences in >>>>>>>>>>>>>>>> several cases matching your comments. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn >>>>>>>>>>>>>>>> pattern and >>>>>>>>>>>>>>>> calls thumb1_gen_const_int. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Patch 3: Update of the original patch in this thread, now calls >>>>>>>>>>>>>>>> thumb1_gen_const_int. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Yuk! Those changes to thumb1_gen_const_int are horrible. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I think we should be able to leverage the fact that the >>>>>>>>>>>>>>> compiler can use >>>>>>>>>>>>>>> C++ now to do much better than that, for example by making that >>>>>>>>>>>>>>> function >>>>>>>>>>>>>>> a template. For example (and this is just a sketch): >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Indeed! I didn't think about it since there is no other use of >>>>>>>>>>>>>> templates in arm.c yet. >>>>>>>>>>>>>> I'll send an update soon. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Other than that, does the approach look OK to you? >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Yes, I think this is heading in the right direction. Bringing >>>>>>>>>>>>> the two >>>>>>>>>>>>> immediate generating operations into a single function can only >>>>>>>>>>>>> be a >>>>>>>>>>>>> good thing. >>>>>>>>>>>>> >>>>>>>>>>>>> Looking again at your example constant sequences, I see: >>>>>>>>>>>>> >>>>>>>>>>>>> 0x1000010: >>>>>>>>>>>>> movs r3, #16 >>>>>>>>>>>>> lsls r3, #16 >>>>>>>>>>>>> adds r3, #1 >>>>>>>>>>>>> lsls r3, #4 >>>>>>>>>>>>> 0x1000011: >>>>>>>>>>>>> movs r3, #1 >>>>>>>>>>>>> lsls r3, #24 >>>>>>>>>>>>> adds r3, #17 >>>>>>>>>>>>> >>>>>>>>>>>>> The first of these looks odd, given the second sequence. Why >>>>>>>>>>>>> doesn't >>>>>>>>>>>>> the first expand to >>>>>>>>>>>>> >>>>>>>>>>>>> 0x1000010: >>>>>>>>>>>>> movs r3, #16 >>>>>>>>>>>>> lsls r3, #16 >>>>>>>>>>>>> adds r3, #16 >>>>>>>>>>>>> >>>>>>>>>>>> Err, I mean to: >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> 0x1000010: >>>>>>>>>>>> movs r3, #1 >>>>>>>>>>>> lsls r3, #24 >>>>>>>>>>>> adds r3, #16 >>>>>>>>>>>> >>>>>>>>>>>> ? >>>>>>>>>>> >>>>>>>>>>> Because I first try to right-shift the constant, hoping to reduce >>>>>>>>>>> its >>>>>>>>>>> range and need less instructions to build the higher part, then >>>>>>>>>>> left-shift back. >>>>>>>>>>> >>>>>>>>>>> In this particular case, we'd need to realize that there are many >>>>>>>>>>> zeros "inside" the constant. >>>>>>>>>>> >>>>>>>>>>> If I remove the part that tries to reduce the range, I do get that >>>>>>>>>>> sequence, but for 764 I now generate >>>>>>>>>>> movs r3, #2 >>>>>>>>>>> lsls r3, #8 >>>>>>>>>>> adds r3, #252 >>>>>>>>>>> instead of >>>>>>>>>>> movs r3, #191 >>>>>>>>>>> lsls r3, #2 >>>>>>>>>>> >>>>>>>>>>> A possibility would be to try both approaches and keep the shortest >>>>>>>>>>> one. >>>>>>>>>> >>>>>>>>>> Lets leave that for now, it's not important to fixing the main issue; >>>>>>>>>> but we should remember we need to come back to it at some point. >>>>>>>>>> >>>>>>>>> Thanks, that's what I was thinking too. >>>>>>>>> >>>>>>>>>> There are other tricks as well, such as >>>>>>>>>> >>>>>>>>>> 0xffffff >>>>>>>>>> >>>>>>>>>> can be done as >>>>>>>>>> >>>>>>>>>> 0x1000000 - 1 >>>>>>>>>> >>>>>>>>>> and >>>>>>>>>> >>>>>>>>>> 0xfffffd >>>>>>>>>> >>>>>>>>>> as >>>>>>>>>> >>>>>>>>>> 0x1000000 - 3 >>>>>>>>>> >>>>>>>>>> but these can wait as well. >>>>>>>>>> >>>>>>>>> >>>>>>>>> Didn't we already need to handle such tricks? I'm surprised this >>>>>>>>> wasn't needed earlier. >>>>>>>>> >>>>>>>> >>>>>>>> I don't think we ever worried about them. Most of them need at least 3 >>>>>>>> instructions so aren't a code size saving over using a literal pool >>>>>>>> entry. >>>>>>>> >>>>>>> OK, this will also help when using -mslow-flash-data. >>>>>>> >>>>>>> Here are updated patches, now using a template as you suggested. >>>>>> >>>>>> Looking better, but when I try to apply this to my local tree patch 2 >>>>>> fails (I'm not exactly sure why, what was your baseline for these >>>>>> patches?) >>>>> I have the tree patches in this thread on top of these other two: >>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556768.html >>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556769.html >>>>> >>>>> They have gradual improvements to thumb1_movsi_insn. >>>>> >>>>>> -- that patch looks suspicious anyway, you're replacing code >>>>>> that prints out assembly with code that generates RTL. >>>>> Right! I took me a while to understand how I could miss this, sorry. >>>>> That was caused by improper testing, as this part of the code >>>>> isn't used when targetting cortex-m0. I have added a testcase >>>>> for cortex-m23 which crashes with the previous version of patch 2, >>>>> and succeeds now. >>>>> >>>>>> Could you also rename t1_print and t1_rtl to thumb1_const_print and >>>>>> thumb1_const_rtl. I think the names as they stand are likely to be too >>>>>> generic. >>>>> OK, done. >>>>> >>>>> How about this new version? >>>>> I'm not yet sure about the most appropriate naming for: >>>>> thumb1_gen_const_int >>>>> thumb1_gen_const_int_asm >>>>> should they be >>>>> thumb1_gen_const_int_rtl >>>>> thumb1_gen_const_int_print >>>>> to be consistent with the new classes? >>>> >>>> It would probably be better, yes. >>>> >>>> More detailed comments below. >>>> >>>> R. >>>> >>>>> >>>>> Thanks, >>>>> >>>>> Christophe >>>>> >>>>>> R. >>>>>> >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> Christophe >>>>>>> >>>>>>>> R. >>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>> R. >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> R. >>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Christophe >>>>>>>>>>>>>> >>>>>>>>>>>>>>> class t1_rtl >>>>>>>>>>>>>>> { >>>>>>>>>>>>>>> public: >>>>>>>>>>>>>>> void ashift(int a) { gen_rtx_ASHIFT(a); } >>>>>>>>>>>>>>> void rshift(int b) { gen_rtx_SHIFTRT(b); } >>>>>>>>>>>>>>> }; >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> class t1_print >>>>>>>>>>>>>>> { >>>>>>>>>>>>>>> public: >>>>>>>>>>>>>>> t1_print (FILE *f) : t_file(f) {} >>>>>>>>>>>>>>> void ashift (int a) { fprintf (t_file, "a shift %d\n", a); } >>>>>>>>>>>>>>> void rshift (int b) { fprintf (t_file, "r shift %d\n", b); } >>>>>>>>>>>>>>> private: >>>>>>>>>>>>>>> FILE *t_file; >>>>>>>>>>>>>>> }; >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> template <class T> >>>>>>>>>>>>>>> void thumb1_gen_const_int(T t, int f) >>>>>>>>>>>>>>> { >>>>>>>>>>>>>>> // Expansion of thumb1_gen_const_int ... >>>>>>>>>>>>>>> t.ashift(f); >>>>>>>>>>>>>>> } >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> // Usage... >>>>>>>>>>>>>>> void f1() >>>>>>>>>>>>>>> { >>>>>>>>>>>>>>> // Use the RTL expander >>>>>>>>>>>>>>> t1_rtl g; >>>>>>>>>>>>>>> thumb1_gen_const_int (g, 3); >>>>>>>>>>>>>>> } >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> void f2() >>>>>>>>>>>>>>> { >>>>>>>>>>>>>>> // Use the printf expander writing to stdout >>>>>>>>>>>>>>> t1_print g(stdout); >>>>>>>>>>>>>>> thumb1_gen_const_int (g, 3); >>>>>>>>>>>>>>> } >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> With this you can write thumb1_gen_const_int without having to >>>>>>>>>>>>>>> worry >>>>>>>>>>>>>>> about which expander is being used in each instance and the >>>>>>>>>>>>>>> template >>>>>>>>>>>>>>> expansion will use the right version. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> R. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I also suspect (but haven't check) that the base >>>>>>>>>>>>>>>>>>> adjustment will >>>>>>>>>>>>>>>>>>> most commonly be a multiple of the machine word size (ie >>>>>>>>>>>>>>>>>>> 4). If that is >>>>>>>>>>>>>>>>>>> the case then you could generate n/4 and then shift it left >>>>>>>>>>>>>>>>>>> by 2 for an >>>>>>>>>>>>>>>>>>> even greater range of literals. >>>>>>>>>>>>>>>>>> I can see there is provision for this in the >>>>>>>>>>>>>>>>>> !TARGET_THUMB1_ONLY case, >>>>>>>>>>>>>>>>>> I'll update my patch. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> More generally, any sequence of up to >>>>>>>>>>>>>>>>>>> three thumb1 instructions will be no larger, and probably >>>>>>>>>>>>>>>>>>> as fast as the >>>>>>>>>>>>>>>>>>> existing literal pool fall back. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), >>>>>>>>>>>>>>>>>>> your code will >>>>>>>>>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two >>>>>>>>>>>>>>>>>>> shifts could >>>>>>>>>>>>>>>>>>> be merged into a single shift-by-16. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Finally, I'd really like to see some executable tests for >>>>>>>>>>>>>>>>>>> this, if at >>>>>>>>>>>>>>>>>>> all possible. >>>>>>>>>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing >>>>>>>>>>>>>>>>>> testcase >>>>>>>>>>>>>>>>>> that showed the problem. There are also: >>>>>>>>>>>>>>>>>> g++.dg/opt/thunk1.C >>>>>>>>>>>>>>>>>> g++.dg/ipa/pr46984.C >>>>>>>>>>>>>>>>>> g++.dg/torture/pr46287.C >>>>>>>>>>>>>>>>>> g++.dg/torture/pr45699.C >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Do you want that I copy one of these in the arm subdir and >>>>>>>>>>>>>>>>>> add >>>>>>>>>>>>>>>>>> -mpure-code in dg-options? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On reflection, probably not - that just makes things more >>>>>>>>>>>>>>>>> complicated >>>>>>>>>>>>>>>>> with all the dg-options mess (I'm worried about interactions >>>>>>>>>>>>>>>>> with other >>>>>>>>>>>>>>>>> sets of options on the command line and the fall-out from >>>>>>>>>>>>>>>>> that). If >>>>>>>>>>>>>>>>> someone cares about pure-code they should be doing full >>>>>>>>>>>>>>>>> testsuite runs >>>>>>>>>>>>>>>>> with it enabled and that should be sufficient. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I >>>>>>>>>>>>>>>> use a >>>>>>>>>>>>>>>> modified simulator. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Christophe >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> R. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Christophe >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> R. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> k# (use "git pull" to merge the remote branch into yours) >>>>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>>>> gcc/config/arm/arm.c | 91 >>>>>>>>>>>>>>>>>>>> +++++++++++++++++++++++++++++++++++++--------------- >>>>>>>>>>>>>>>>>>>> 1 file changed, 66 insertions(+), 25 deletions(-) >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c >>>>>>>>>>>>>>>>>>>> index ceeb91f..62abeb5 100644 >>>>>>>>>>>>>>>>>>>> --- a/gcc/config/arm/arm.c >>>>>>>>>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c >>>>>>>>>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, >>>>>>>>>>>>>>>>>>>> tree, HOST_WIDE_INT delta, >>>>>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>>>>> if (mi_delta > 255) >>>>>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>>>>> - fputs ("\tldr\tr3, ", file); >>>>>>>>>>>>>>>>>>>> - assemble_name (file, label); >>>>>>>>>>>>>>>>>>>> - fputs ("+4\n", file); >>>>>>>>>>>>>>>>>>>> + /* With -mpure-code, we cannot load delta from the >>>>>>>>>>>>>>>>>>>> constant >>>>>>>>>>>>>>>>>>>> + pool: we build it explicitly. */ >>>>>>>>>>>>>>>>>>>> + if (target_pure_code) >>>>>>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>>>>>> + bool mov_done_p = false; >>>>>>>>>>>>>>>>>>>> + int i; >>>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>>> + /* Emit upper 3 bytes if needed. */ >>>>>>>>>>>>>>>>>>>> + for (i = 0; i < 3; i++) >>>>>>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>>>>>> + int byte = (mi_delta >> (8 * (3 - i))) & >>>>>>>>>>>>>>>>>>>> 0xff; >>>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>>> + if (byte) >>>>>>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>>>>>> + if (mov_done_p) >>>>>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tadds\tr3, >>>>>>>>>>>>>>>>>>>> #%d\n", byte); >>>>>>>>>>>>>>>>>>>> + else >>>>>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tmovs\tr3, >>>>>>>>>>>>>>>>>>>> #%d\n", byte); >>>>>>>>>>>>>>>>>>>> + mov_done_p = true; >>>>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>>> + if (mov_done_p) >>>>>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tlsls\tr3, #8\n"); >>>>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>>> + /* Emit lower byte if needed. */ >>>>>>>>>>>>>>>>>>>> + if (!mov_done_p) >>>>>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tmovs\tr3, #%d\n", >>>>>>>>>>>>>>>>>>>> mi_delta & 0xff); >>>>>>>>>>>>>>>>>>>> + else if (mi_delta & 0xff) >>>>>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tadds\tr3, #%d\n", >>>>>>>>>>>>>>>>>>>> mi_delta & 0xff); >>>>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>>>> + else >>>>>>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>>>>>> + fputs ("\tldr\tr3, ", file); >>>>>>>>>>>>>>>>>>>> + assemble_name (file, label); >>>>>>>>>>>>>>>>>>>> + fputs ("+4\n", file); >>>>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>>>> asm_fprintf (file, "\t%ss\t%r, %r, r3\n", >>>>>>>>>>>>>>>>>>>> mi_op, this_regno, this_regno); >>>>>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, >>>>>>>>>>>>>>>>>>>> tree, HOST_WIDE_INT delta, >>>>>>>>>>>>>>>>>>>> fputs ("\tpop\t{r3}\n", file); >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> fprintf (file, "\tbx\tr12\n"); >>>>>>>>>>>>>>>>>>>> - ASM_OUTPUT_ALIGN (file, 2); >>>>>>>>>>>>>>>>>>>> - assemble_name (file, label); >>>>>>>>>>>>>>>>>>>> - fputs (":\n", file); >>>>>>>>>>>>>>>>>>>> - if (flag_pic) >>>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>>> + /* With -mpure-code, we don't need to emit literals >>>>>>>>>>>>>>>>>>>> for the >>>>>>>>>>>>>>>>>>>> + function address and delta since we emitted code to >>>>>>>>>>>>>>>>>>>> build >>>>>>>>>>>>>>>>>>>> + them. */ >>>>>>>>>>>>>>>>>>>> + if (!target_pure_code) >>>>>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>>>>> - /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ >>>>>>>>>>>>>>>>>>>> - rtx tem = XEXP (DECL_RTL (function), 0); >>>>>>>>>>>>>>>>>>>> - /* For TARGET_THUMB1_ONLY the thunk is in Thumb >>>>>>>>>>>>>>>>>>>> mode, so the PC >>>>>>>>>>>>>>>>>>>> - pipeline offset is four rather than eight. >>>>>>>>>>>>>>>>>>>> Adjust the offset >>>>>>>>>>>>>>>>>>>> - accordingly. */ >>>>>>>>>>>>>>>>>>>> - tem = plus_constant (GET_MODE (tem), tem, >>>>>>>>>>>>>>>>>>>> - TARGET_THUMB1_ONLY ? -3 : -7); >>>>>>>>>>>>>>>>>>>> - tem = gen_rtx_MINUS (GET_MODE (tem), >>>>>>>>>>>>>>>>>>>> - tem, >>>>>>>>>>>>>>>>>>>> - gen_rtx_SYMBOL_REF (Pmode, >>>>>>>>>>>>>>>>>>>> - >>>>>>>>>>>>>>>>>>>> ggc_strdup (labelpc))); >>>>>>>>>>>>>>>>>>>> - assemble_integer (tem, 4, BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>>>>> - } >>>>>>>>>>>>>>>>>>>> - else >>>>>>>>>>>>>>>>>>>> - /* Output ".word .LTHUNKn". */ >>>>>>>>>>>>>>>>>>>> - assemble_integer (XEXP (DECL_RTL (function), 0), 4, >>>>>>>>>>>>>>>>>>>> BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>>>>> + ASM_OUTPUT_ALIGN (file, 2); >>>>>>>>>>>>>>>>>>>> + assemble_name (file, label); >>>>>>>>>>>>>>>>>>>> + fputs (":\n", file); >>>>>>>>>>>>>>>>>>>> + if (flag_pic) >>>>>>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>>>>>> + /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". >>>>>>>>>>>>>>>>>>>> */ >>>>>>>>>>>>>>>>>>>> + rtx tem = XEXP (DECL_RTL (function), 0); >>>>>>>>>>>>>>>>>>>> + /* For TARGET_THUMB1_ONLY the thunk is in >>>>>>>>>>>>>>>>>>>> Thumb mode, so the PC >>>>>>>>>>>>>>>>>>>> + pipeline offset is four rather than eight. >>>>>>>>>>>>>>>>>>>> Adjust the offset >>>>>>>>>>>>>>>>>>>> + accordingly. */ >>>>>>>>>>>>>>>>>>>> + tem = plus_constant (GET_MODE (tem), tem, >>>>>>>>>>>>>>>>>>>> + TARGET_THUMB1_ONLY ? -3 : >>>>>>>>>>>>>>>>>>>> -7); >>>>>>>>>>>>>>>>>>>> + tem = gen_rtx_MINUS (GET_MODE (tem), >>>>>>>>>>>>>>>>>>>> + tem, >>>>>>>>>>>>>>>>>>>> + gen_rtx_SYMBOL_REF (Pmode, >>>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>>> ggc_strdup (labelpc))); >>>>>>>>>>>>>>>>>>>> + assemble_integer (tem, 4, BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>>>> + else >>>>>>>>>>>>>>>>>>>> + /* Output ".word .LTHUNKn". */ >>>>>>>>>>>>>>>>>>>> + assemble_integer (XEXP (DECL_RTL (function), 0), >>>>>>>>>>>>>>>>>>>> 4, BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> - if (TARGET_THUMB1_ONLY && mi_delta > 255) >>>>>>>>>>>>>>>>>>>> - assemble_integer (GEN_INT(mi_delta), 4, >>>>>>>>>>>>>>>>>>>> BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>>>>> + if (TARGET_THUMB1_ONLY && mi_delta > 255) >>>>>>>>>>>>>>>>>>>> + assemble_integer (GEN_INT(mi_delta), 4, >>>>>>>>>>>>>>>>>>>> BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>>>>> else >>>>>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>> >>>> >>>> +class thumb1_const_rtl >>>> ... >>>> + void mov (int val) >>>> >>>> This should take a HOST_WIDE_INT. Similarly for add and shift. The >>>> same applies to the asm version as well. >>>> >>>> + asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val); >>>> >>>> Should be using reg_names[dst_regno] in all cases. In fact, you might >>>> want to move that lookup to the constructor and just save a pointer to >>>> the string there. You'll need to use HOST_WIDE_INT_PRINT_UNSIGNED >>> >>> Correction, for a (signed) HOST_WIDE_INT, this should be >>> HOST_WIDE_INT_PRINT_DEC. >>> >> >> Right, but if "val" is unsigned HOST_WIDE_INT, all the methods in the >> two classes >> can have unsigned HOST_WIDE_INT parameters, and thus use >> HOST_WIDE_INT_PRINT_UNSIGNED? >> > > It's generally safer to print it as a signed value. We probably won't > have cases where the top bit of a 32-bit word are set here; but if you > do, and the value is unsigned, you end up with 16 digit hex numbers > rather than the 8 you'd expect on a 32-bit target because HOST_WIDE_INT > is at least 64 bits in size. >
I don't mean hex numbers, of course, just very large decimal numbers. But the point is still the same. R. > R. > >> >>>> rather than "%d" for the immediate. >>>> >>>> +template <class T> >>>> +void >>>> +thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1) >>>> +{ >>>> + bool mov_done_p = false; >>>> + int val = op1; >>>> >>>> This potentially silently loses precision. In fact, I think you really >>>> want to use "unsigned HOST_WIDE_INT" throughout the following code, so >>>> that the right shifts aren't undefined if dealing with negative numbers. >>>> >>>> For safety, you should also have an assertion in here that >>>> >>>> op1 == trunc_int_for_mode (op1, SImode) >>>> >>>> + int shift = 0; >>>> + int i; >>>> + >>>> + if (val == 0) >>>> >>>> You can short-circuit 0..255 here for a quick exit. >>>> >>>> + { >>>> + dst.mov (val); >>>> + return; >>>> + } >>>> >>>> Another trick: if the top nine bits of the 32-bit value are all set, >>>> you're probably going to be better off (and certainly not worse off) by >>>> generating -op1 and then negating the result in a final step - you can >>>> do that via recursion. >>>> >>>> + >>>> + /* In the general case, we need 7 instructions to build >>>> + a 32 bits constant (1 movs, 3 lsls, 3 adds). We can >>>> + do better if VAL is small enough, or >>>> + right-shiftable by a suitable amount. If the >>>> + right-shift enables to encode at least one less byte, >>>> + it's worth it: we save a adds and a lsls at the >>>> + expense of a final lsls. */ >>>> + int final_shift = number_of_first_bit_set (val); >>>> + >>>> + int leading_zeroes = clz_hwi (val); >>>> + int number_of_bytes_needed >>>> + = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes) >>>> + / BITS_PER_UNIT) + 1; >>>> + int number_of_bytes_needed2 >>>> + = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift) >>>> + / BITS_PER_UNIT) + 1; >>>> + >>>> + if (number_of_bytes_needed2 < number_of_bytes_needed) >>>> + val >>= final_shift; >>>> + else >>>> + final_shift = 0; >>>> + >>>> + /* If we are in a very small range, we can use either a single movs >>>> + or movs+adds. */ >>>> + if ((val >= 0) && (val <= 510)) >>>> >>>> if val is made unsigned HWI as I suggest, the lower bounds test is not >>>> needed. >>>> >>>> + { >>>> + if (val > 255) >>>> + { >>>> + int high = val - 255; >>>> >>>> Again, watch your types. >>>> >>>> + >>>> + dst.mov (high); >>>> + dst.add (255); >>>> + } >>>> + else >>>> + dst.mov (val); >>>> + >>>> + if (final_shift > 0) >>>> + dst.ashift (final_shift); >>>> + } >>>> + else >>>> + { >>>> + /* General case, emit upper 3 bytes as needed. */ >>>> + for (i = 0; i < 3; i++) >>>> + { >>>> + int byte = (val >> (8 * (3 - i))) & 0xff; >>>> >>>> and here. >>>> >>>> + >>>> + if (byte) >>>> + { >>>> + /* We are about to emit new bits, stop accumulating a >>>> + shift amount, and left-shift only if we have already >>>> + emitted some upper bits. */ >>>> + if (mov_done_p) >>>> + { >>>> + dst.ashift (shift); >>>> + dst.add (byte); >>>> + } >>>> + else >>>> + dst.mov (byte); >>>> + >>>> + /* Stop accumulating shift amount since we've just >>>> + emitted some bits. */ >>>> + shift = 0; >>>> + >>>> + mov_done_p = true; >>>> + } >>>> + >>>> + if (mov_done_p) >>>> + shift += 8; >>>> + } >>>> + >>>> + /* Emit lower byte. */ >>>> + if (!mov_done_p) >>>> + dst.mov (val & 0xff); >>>> + else >>>> + { >>>> + dst.ashift (shift); >>>> + if (val & 0xff) >>>> + dst.add (val & 0xff); >>>> + } >>>> + >>>> + if (final_shift > 0) >>>> + dst.ashift (final_shift); >>>> + } >>>> +} >>>> + >>>> >>> >