On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote: > On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw > <richard.earns...@foss.arm.com> wrote: >> >> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote: >>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw >>> <richard.earns...@foss.arm.com> wrote: >>>> >>>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote: >>>>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw >>>>> <richard.earns...@foss.arm.com> wrote: >>>>>> >>>>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote: >>>>>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw >>>>>>> <richard.earns...@foss.arm.com> wrote: >>>>>>>> >>>>>>>> On 20/10/2020 12:22, Richard Earnshaw wrote: >>>>>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw >>>>>>>>>> <richard.earns...@foss.arm.com> wrote: >>>>>>>>>>> >>>>>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw >>>>>>>>>>>> <richard.earns...@foss.arm.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw >>>>>>>>>>>>>> <richard.earns...@foss.arm.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load >>>>>>>>>>>>>>>> delta >>>>>>>>>>>>>>>> from code memory (like we do without -mpure-code). >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series >>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>> movs/adds/lsls. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We also do some cleanup by not emitting the function address >>>>>>>>>>>>>>>> and delta >>>>>>>>>>>>>>>> via .word directives at the end of the thunk since we don't >>>>>>>>>>>>>>>> use them >>>>>>>>>>>>>>>> with -mpure-code. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> No need for new testcases, this bug was already identified by >>>>>>>>>>>>>>>> eg. pr46287-3.C >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 2020-09-29 Christophe Lyon <christophe.l...@linaro.org> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> gcc/ >>>>>>>>>>>>>>>> * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta >>>>>>>>>>>>>>>> in r3 and >>>>>>>>>>>>>>>> do not emit function address and delta when -mpure-code >>>>>>>>>>>>>>>> is used. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Richard, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks for your comments. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> There are some optimizations you can make to this code. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would >>>>>>>>>>>>>>> be better >>>>>>>>>>>>>>> to just expand a mov of 255 followed by an add. >>>>>>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't >>>>>>>>>>>>>> noticed >>>>>>>>>>>>>> before, so I can write something similar indeed. >>>>>>>>>>>>>> >>>>>>>>>>>>>> However, I'm note quite sure to understand the benefit in the >>>>>>>>>>>>>> split >>>>>>>>>>>>>> when -mpure-code is NOT used. >>>>>>>>>>>>>> Consider: >>>>>>>>>>>>>> int f3_1 (void) { return 510; } >>>>>>>>>>>>>> int f3_2 (void) { return 511; } >>>>>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0: >>>>>>>>>>>>>> f3_1: >>>>>>>>>>>>>> movs r0, #255 >>>>>>>>>>>>>> lsls r0, r0, #1 >>>>>>>>>>>>>> bx lr >>>>>>>>>>>>>> f3_2: >>>>>>>>>>>>>> ldr r0, .L4 >>>>>>>>>>>>>> bx lr >>>>>>>>>>>>>> >>>>>>>>>>>>>> The splitter makes the code bigger, does it "compensate" for >>>>>>>>>>>>>> this by >>>>>>>>>>>>>> not having to load the constant? >>>>>>>>>>>>>> Actually the constant uses 4 more bytes, which should be taken >>>>>>>>>>>>>> into >>>>>>>>>>>>>> account when comparing code size, >>>>>>>>>>>>> >>>>>>>>>>>>> Yes, the size of the literal pool entry needs to be taken into >>>>>>>>>>>>> account. >>>>>>>>>>>>> It might happen that the entry could be shared with another use >>>>>>>>>>>>> of that >>>>>>>>>>>>> literal, but in general that's rare. >>>>>>>>>>>>> >>>>>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three >>>>>>>>>>>>>> thumb1 instructions would be equivalent in size compared to >>>>>>>>>>>>>> loading >>>>>>>>>>>>>> from the literal pool. Should the 256-510 range be extended? >>>>>>>>>>>>> >>>>>>>>>>>>> It's a bit borderline at three instructions when literal pools >>>>>>>>>>>>> are not >>>>>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite >>>>>>>>>>>>> small due >>>>>>>>>>>>> to the limited pc offsets we can use. I think on balance we >>>>>>>>>>>>> probably >>>>>>>>>>>>> want to use the instruction sequence unless optimizing for size. >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> This is also true for >>>>>>>>>>>>>>> the literal pools alternative as well, so should be handled >>>>>>>>>>>>>>> before all >>>>>>>>>>>>>>> this. >>>>>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample >>>>>>>>>>>>>> is compiled as: >>>>>>>>>>>>>> f3_1: >>>>>>>>>>>>>> movs r0, #255 >>>>>>>>>>>>>> lsls r0, r0, #1 >>>>>>>>>>>>>> bx lr >>>>>>>>>>>>>> f3_2: >>>>>>>>>>>>>> movs r0, #1 >>>>>>>>>>>>>> lsls r0, r0, #8 >>>>>>>>>>>>>> adds r0, r0, #255 >>>>>>>>>>>>>> bx lr >>>>>>>>>>>>>> >>>>>>>>>>>>>> so the "return 510" case is already handled as without >>>>>>>>>>>>>> -mpure-code. >>>>>>>>>>>>> >>>>>>>>>>>>> I was thinking specifically of the thunk sequence where you seem >>>>>>>>>>>>> to be >>>>>>>>>>>>> emitting instructions directly rather than generating RTL. The >>>>>>>>>>>>> examples >>>>>>>>>>>>> you show here are not thunks. >>>>>>>>>>>>> >>>>>>>>>>>> OK thanks for the clarification. >>>>>>>>>>>> >>>>>>>>>>>> Here is an updated version, split into 3 patches to hopefully make >>>>>>>>>>>> review easier. >>>>>>>>>>>> They apply on top of my other mpure-code patches for PR96967 and >>>>>>>>>>>> PR96770: >>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html >>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html >>>>>>>>>>>> >>>>>>>>>>>> I kept it this way to make incremental changes easier to >>>>>>>>>>>> understand. >>>>>>>>>>>> >>>>>>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance >>>>>>>>>>>> easier, >>>>>>>>>>>> I have updated thumb1_gen_const_int() so that it can generate >>>>>>>>>>>> either RTL or >>>>>>>>>>>> asm. This way, all the code used to build thumb-1 constants is in >>>>>>>>>>>> the >>>>>>>>>>>> same place, >>>>>>>>>>>> in case we need to improve/fix it later. We now generate shorter >>>>>>>>>>>> sequences in >>>>>>>>>>>> several cases matching your comments. >>>>>>>>>>>> >>>>>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn >>>>>>>>>>>> pattern and >>>>>>>>>>>> calls thumb1_gen_const_int. >>>>>>>>>>>> >>>>>>>>>>>> Patch 3: Update of the original patch in this thread, now calls >>>>>>>>>>>> thumb1_gen_const_int. >>>>>>>>>>> >>>>>>>>>>> Yuk! Those changes to thumb1_gen_const_int are horrible. >>>>>>>>>>> >>>>>>>>>>> I think we should be able to leverage the fact that the compiler >>>>>>>>>>> can use >>>>>>>>>>> C++ now to do much better than that, for example by making that >>>>>>>>>>> function >>>>>>>>>>> a template. For example (and this is just a sketch): >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Indeed! I didn't think about it since there is no other use of >>>>>>>>>> templates in arm.c yet. >>>>>>>>>> I'll send an update soon. >>>>>>>>>> >>>>>>>>>> Other than that, does the approach look OK to you? >>>>>>>>>> >>>>>>>>> >>>>>>>>> Yes, I think this is heading in the right direction. Bringing the two >>>>>>>>> immediate generating operations into a single function can only be a >>>>>>>>> good thing. >>>>>>>>> >>>>>>>>> Looking again at your example constant sequences, I see: >>>>>>>>> >>>>>>>>> 0x1000010: >>>>>>>>> movs r3, #16 >>>>>>>>> lsls r3, #16 >>>>>>>>> adds r3, #1 >>>>>>>>> lsls r3, #4 >>>>>>>>> 0x1000011: >>>>>>>>> movs r3, #1 >>>>>>>>> lsls r3, #24 >>>>>>>>> adds r3, #17 >>>>>>>>> >>>>>>>>> The first of these looks odd, given the second sequence. Why doesn't >>>>>>>>> the first expand to >>>>>>>>> >>>>>>>>> 0x1000010: >>>>>>>>> movs r3, #16 >>>>>>>>> lsls r3, #16 >>>>>>>>> adds r3, #16 >>>>>>>>> >>>>>>>> Err, I mean to: >>>>>>>> >>>>>>>> >>>>>>>> 0x1000010: >>>>>>>> movs r3, #1 >>>>>>>> lsls r3, #24 >>>>>>>> adds r3, #16 >>>>>>>> >>>>>>>> ? >>>>>>> >>>>>>> Because I first try to right-shift the constant, hoping to reduce its >>>>>>> range and need less instructions to build the higher part, then >>>>>>> left-shift back. >>>>>>> >>>>>>> In this particular case, we'd need to realize that there are many >>>>>>> zeros "inside" the constant. >>>>>>> >>>>>>> If I remove the part that tries to reduce the range, I do get that >>>>>>> sequence, but for 764 I now generate >>>>>>> movs r3, #2 >>>>>>> lsls r3, #8 >>>>>>> adds r3, #252 >>>>>>> instead of >>>>>>> movs r3, #191 >>>>>>> lsls r3, #2 >>>>>>> >>>>>>> A possibility would be to try both approaches and keep the shortest one. >>>>>> >>>>>> Lets leave that for now, it's not important to fixing the main issue; >>>>>> but we should remember we need to come back to it at some point. >>>>>> >>>>> Thanks, that's what I was thinking too. >>>>> >>>>>> There are other tricks as well, such as >>>>>> >>>>>> 0xffffff >>>>>> >>>>>> can be done as >>>>>> >>>>>> 0x1000000 - 1 >>>>>> >>>>>> and >>>>>> >>>>>> 0xfffffd >>>>>> >>>>>> as >>>>>> >>>>>> 0x1000000 - 3 >>>>>> >>>>>> but these can wait as well. >>>>>> >>>>> >>>>> Didn't we already need to handle such tricks? I'm surprised this >>>>> wasn't needed earlier. >>>>> >>>> >>>> I don't think we ever worried about them. Most of them need at least 3 >>>> instructions so aren't a code size saving over using a literal pool entry. >>>> >>> OK, this will also help when using -mslow-flash-data. >>> >>> Here are updated patches, now using a template as you suggested. >> >> Looking better, but when I try to apply this to my local tree patch 2 >> fails (I'm not exactly sure why, what was your baseline for these >> patches?) > I have the tree patches in this thread on top of these other two: > https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556768.html > https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556769.html > > They have gradual improvements to thumb1_movsi_insn. > >> -- that patch looks suspicious anyway, you're replacing code >> that prints out assembly with code that generates RTL. > Right! I took me a while to understand how I could miss this, sorry. > That was caused by improper testing, as this part of the code > isn't used when targetting cortex-m0. I have added a testcase > for cortex-m23 which crashes with the previous version of patch 2, > and succeeds now. > >> Could you also rename t1_print and t1_rtl to thumb1_const_print and >> thumb1_const_rtl. I think the names as they stand are likely to be too >> generic. > OK, done. > > How about this new version? > I'm not yet sure about the most appropriate naming for: > thumb1_gen_const_int > thumb1_gen_const_int_asm > should they be > thumb1_gen_const_int_rtl > thumb1_gen_const_int_print > to be consistent with the new classes?
It would probably be better, yes. More detailed comments below. R. > > Thanks, > > Christophe > >> R. >> >>> >>> Thanks, >>> >>> Christophe >>> >>>> R. >>>> >>>>> >>>>>> >>>>>> R. >>>>>> >>>>>>> >>>>>>> >>>>>>>>> >>>>>>>>> R. >>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> >>>>>>>>>> Christophe >>>>>>>>>> >>>>>>>>>>> class t1_rtl >>>>>>>>>>> { >>>>>>>>>>> public: >>>>>>>>>>> void ashift(int a) { gen_rtx_ASHIFT(a); } >>>>>>>>>>> void rshift(int b) { gen_rtx_SHIFTRT(b); } >>>>>>>>>>> }; >>>>>>>>>>> >>>>>>>>>>> class t1_print >>>>>>>>>>> { >>>>>>>>>>> public: >>>>>>>>>>> t1_print (FILE *f) : t_file(f) {} >>>>>>>>>>> void ashift (int a) { fprintf (t_file, "a shift %d\n", a); } >>>>>>>>>>> void rshift (int b) { fprintf (t_file, "r shift %d\n", b); } >>>>>>>>>>> private: >>>>>>>>>>> FILE *t_file; >>>>>>>>>>> }; >>>>>>>>>>> >>>>>>>>>>> template <class T> >>>>>>>>>>> void thumb1_gen_const_int(T t, int f) >>>>>>>>>>> { >>>>>>>>>>> // Expansion of thumb1_gen_const_int ... >>>>>>>>>>> t.ashift(f); >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> // Usage... >>>>>>>>>>> void f1() >>>>>>>>>>> { >>>>>>>>>>> // Use the RTL expander >>>>>>>>>>> t1_rtl g; >>>>>>>>>>> thumb1_gen_const_int (g, 3); >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> void f2() >>>>>>>>>>> { >>>>>>>>>>> // Use the printf expander writing to stdout >>>>>>>>>>> t1_print g(stdout); >>>>>>>>>>> thumb1_gen_const_int (g, 3); >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> With this you can write thumb1_gen_const_int without having to worry >>>>>>>>>>> about which expander is being used in each instance and the template >>>>>>>>>>> expansion will use the right version. >>>>>>>>>>> >>>>>>>>>>> R. >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> I also suspect (but haven't check) that the base adjustment >>>>>>>>>>>>>>> will >>>>>>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4). >>>>>>>>>>>>>>> If that is >>>>>>>>>>>>>>> the case then you could generate n/4 and then shift it left by >>>>>>>>>>>>>>> 2 for an >>>>>>>>>>>>>>> even greater range of literals. >>>>>>>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY >>>>>>>>>>>>>> case, >>>>>>>>>>>>>> I'll update my patch. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> More generally, any sequence of up to >>>>>>>>>>>>>>> three thumb1 instructions will be no larger, and probably as >>>>>>>>>>>>>>> fast as the >>>>>>>>>>>>>>> existing literal pool fall back. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your >>>>>>>>>>>>>>> code will >>>>>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two >>>>>>>>>>>>>>> shifts could >>>>>>>>>>>>>>> be merged into a single shift-by-16. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Finally, I'd really like to see some executable tests for this, >>>>>>>>>>>>>>> if at >>>>>>>>>>>>>>> all possible. >>>>>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing >>>>>>>>>>>>>> testcase >>>>>>>>>>>>>> that showed the problem. There are also: >>>>>>>>>>>>>> g++.dg/opt/thunk1.C >>>>>>>>>>>>>> g++.dg/ipa/pr46984.C >>>>>>>>>>>>>> g++.dg/torture/pr46287.C >>>>>>>>>>>>>> g++.dg/torture/pr45699.C >>>>>>>>>>>>>> >>>>>>>>>>>>>> Do you want that I copy one of these in the arm subdir and add >>>>>>>>>>>>>> -mpure-code in dg-options? >>>>>>>>>>>>> >>>>>>>>>>>>> On reflection, probably not - that just makes things more >>>>>>>>>>>>> complicated >>>>>>>>>>>>> with all the dg-options mess (I'm worried about interactions with >>>>>>>>>>>>> other >>>>>>>>>>>>> sets of options on the command line and the fall-out from that). >>>>>>>>>>>>> If >>>>>>>>>>>>> someone cares about pure-code they should be doing full testsuite >>>>>>>>>>>>> runs >>>>>>>>>>>>> with it enabled and that should be sufficient. >>>>>>>>>>>> >>>>>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use >>>>>>>>>>>> a >>>>>>>>>>>> modified simulator. >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> >>>>>>>>>>>> Christophe >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> R. >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Christophe >>>>>>>>>>>>>> >>>>>>>>>>>>>>> R. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> k# (use "git pull" to merge the remote branch into yours) >>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>> gcc/config/arm/arm.c | 91 >>>>>>>>>>>>>>>> +++++++++++++++++++++++++++++++++++++--------------- >>>>>>>>>>>>>>>> 1 file changed, 66 insertions(+), 25 deletions(-) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c >>>>>>>>>>>>>>>> index ceeb91f..62abeb5 100644 >>>>>>>>>>>>>>>> --- a/gcc/config/arm/arm.c >>>>>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c >>>>>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, >>>>>>>>>>>>>>>> tree, HOST_WIDE_INT delta, >>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>> if (mi_delta > 255) >>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>> - fputs ("\tldr\tr3, ", file); >>>>>>>>>>>>>>>> - assemble_name (file, label); >>>>>>>>>>>>>>>> - fputs ("+4\n", file); >>>>>>>>>>>>>>>> + /* With -mpure-code, we cannot load delta from the >>>>>>>>>>>>>>>> constant >>>>>>>>>>>>>>>> + pool: we build it explicitly. */ >>>>>>>>>>>>>>>> + if (target_pure_code) >>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>> + bool mov_done_p = false; >>>>>>>>>>>>>>>> + int i; >>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>> + /* Emit upper 3 bytes if needed. */ >>>>>>>>>>>>>>>> + for (i = 0; i < 3; i++) >>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>> + int byte = (mi_delta >> (8 * (3 - i))) & 0xff; >>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>> + if (byte) >>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>> + if (mov_done_p) >>>>>>>>>>>>>>>> + asm_fprintf (file, "\tadds\tr3, #%d\n", >>>>>>>>>>>>>>>> byte); >>>>>>>>>>>>>>>> + else >>>>>>>>>>>>>>>> + asm_fprintf (file, "\tmovs\tr3, #%d\n", >>>>>>>>>>>>>>>> byte); >>>>>>>>>>>>>>>> + mov_done_p = true; >>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>> + if (mov_done_p) >>>>>>>>>>>>>>>> + asm_fprintf (file, "\tlsls\tr3, #8\n"); >>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>> + /* Emit lower byte if needed. */ >>>>>>>>>>>>>>>> + if (!mov_done_p) >>>>>>>>>>>>>>>> + asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta >>>>>>>>>>>>>>>> & 0xff); >>>>>>>>>>>>>>>> + else if (mi_delta & 0xff) >>>>>>>>>>>>>>>> + asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta >>>>>>>>>>>>>>>> & 0xff); >>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>> + else >>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>> + fputs ("\tldr\tr3, ", file); >>>>>>>>>>>>>>>> + assemble_name (file, label); >>>>>>>>>>>>>>>> + fputs ("+4\n", file); >>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>> asm_fprintf (file, "\t%ss\t%r, %r, r3\n", >>>>>>>>>>>>>>>> mi_op, this_regno, this_regno); >>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, >>>>>>>>>>>>>>>> tree, HOST_WIDE_INT delta, >>>>>>>>>>>>>>>> fputs ("\tpop\t{r3}\n", file); >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> fprintf (file, "\tbx\tr12\n"); >>>>>>>>>>>>>>>> - ASM_OUTPUT_ALIGN (file, 2); >>>>>>>>>>>>>>>> - assemble_name (file, label); >>>>>>>>>>>>>>>> - fputs (":\n", file); >>>>>>>>>>>>>>>> - if (flag_pic) >>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>> + /* With -mpure-code, we don't need to emit literals for >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> + function address and delta since we emitted code to >>>>>>>>>>>>>>>> build >>>>>>>>>>>>>>>> + them. */ >>>>>>>>>>>>>>>> + if (!target_pure_code) >>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>> - /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ >>>>>>>>>>>>>>>> - rtx tem = XEXP (DECL_RTL (function), 0); >>>>>>>>>>>>>>>> - /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, >>>>>>>>>>>>>>>> so the PC >>>>>>>>>>>>>>>> - pipeline offset is four rather than eight. Adjust >>>>>>>>>>>>>>>> the offset >>>>>>>>>>>>>>>> - accordingly. */ >>>>>>>>>>>>>>>> - tem = plus_constant (GET_MODE (tem), tem, >>>>>>>>>>>>>>>> - TARGET_THUMB1_ONLY ? -3 : -7); >>>>>>>>>>>>>>>> - tem = gen_rtx_MINUS (GET_MODE (tem), >>>>>>>>>>>>>>>> - tem, >>>>>>>>>>>>>>>> - gen_rtx_SYMBOL_REF (Pmode, >>>>>>>>>>>>>>>> - ggc_strdup >>>>>>>>>>>>>>>> (labelpc))); >>>>>>>>>>>>>>>> - assemble_integer (tem, 4, BITS_PER_WORD, 1); >>>>>>>>>>>>>>>> - } >>>>>>>>>>>>>>>> - else >>>>>>>>>>>>>>>> - /* Output ".word .LTHUNKn". */ >>>>>>>>>>>>>>>> - assemble_integer (XEXP (DECL_RTL (function), 0), 4, >>>>>>>>>>>>>>>> BITS_PER_WORD, 1); >>>>>>>>>>>>>>>> + ASM_OUTPUT_ALIGN (file, 2); >>>>>>>>>>>>>>>> + assemble_name (file, label); >>>>>>>>>>>>>>>> + fputs (":\n", file); >>>>>>>>>>>>>>>> + if (flag_pic) >>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>> + /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ >>>>>>>>>>>>>>>> + rtx tem = XEXP (DECL_RTL (function), 0); >>>>>>>>>>>>>>>> + /* For TARGET_THUMB1_ONLY the thunk is in Thumb >>>>>>>>>>>>>>>> mode, so the PC >>>>>>>>>>>>>>>> + pipeline offset is four rather than eight. >>>>>>>>>>>>>>>> Adjust the offset >>>>>>>>>>>>>>>> + accordingly. */ >>>>>>>>>>>>>>>> + tem = plus_constant (GET_MODE (tem), tem, >>>>>>>>>>>>>>>> + TARGET_THUMB1_ONLY ? -3 : -7); >>>>>>>>>>>>>>>> + tem = gen_rtx_MINUS (GET_MODE (tem), >>>>>>>>>>>>>>>> + tem, >>>>>>>>>>>>>>>> + gen_rtx_SYMBOL_REF (Pmode, >>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>> ggc_strdup (labelpc))); >>>>>>>>>>>>>>>> + assemble_integer (tem, 4, BITS_PER_WORD, 1); >>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>> + else >>>>>>>>>>>>>>>> + /* Output ".word .LTHUNKn". */ >>>>>>>>>>>>>>>> + assemble_integer (XEXP (DECL_RTL (function), 0), 4, >>>>>>>>>>>>>>>> BITS_PER_WORD, 1); >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> - if (TARGET_THUMB1_ONLY && mi_delta > 255) >>>>>>>>>>>>>>>> - assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, >>>>>>>>>>>>>>>> 1); >>>>>>>>>>>>>>>> + if (TARGET_THUMB1_ONLY && mi_delta > 255) >>>>>>>>>>>>>>>> + assemble_integer (GEN_INT(mi_delta), 4, >>>>>>>>>>>>>>>> BITS_PER_WORD, 1); >>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>> else >>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>> >>>> >> +class thumb1_const_rtl ... + void mov (int val) This should take a HOST_WIDE_INT. Similarly for add and shift. The same applies to the asm version as well. + asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val); Should be using reg_names[dst_regno] in all cases. In fact, you might want to move that lookup to the constructor and just save a pointer to the string there. You'll need to use HOST_WIDE_INT_PRINT_UNSIGNED rather than "%d" for the immediate. +template <class T> +void +thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1) +{ + bool mov_done_p = false; + int val = op1; This potentially silently loses precision. In fact, I think you really want to use "unsigned HOST_WIDE_INT" throughout the following code, so that the right shifts aren't undefined if dealing with negative numbers. For safety, you should also have an assertion in here that op1 == trunc_int_for_mode (op1, SImode) + int shift = 0; + int i; + + if (val == 0) You can short-circuit 0..255 here for a quick exit. + { + dst.mov (val); + return; + } Another trick: if the top nine bits of the 32-bit value are all set, you're probably going to be better off (and certainly not worse off) by generating -op1 and then negating the result in a final step - you can do that via recursion. + + /* In the general case, we need 7 instructions to build + a 32 bits constant (1 movs, 3 lsls, 3 adds). We can + do better if VAL is small enough, or + right-shiftable by a suitable amount. If the + right-shift enables to encode at least one less byte, + it's worth it: we save a adds and a lsls at the + expense of a final lsls. */ + int final_shift = number_of_first_bit_set (val); + + int leading_zeroes = clz_hwi (val); + int number_of_bytes_needed + = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes) + / BITS_PER_UNIT) + 1; + int number_of_bytes_needed2 + = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift) + / BITS_PER_UNIT) + 1; + + if (number_of_bytes_needed2 < number_of_bytes_needed) + val >>= final_shift; + else + final_shift = 0; + + /* If we are in a very small range, we can use either a single movs + or movs+adds. */ + if ((val >= 0) && (val <= 510)) if val is made unsigned HWI as I suggest, the lower bounds test is not needed. + { + if (val > 255) + { + int high = val - 255; Again, watch your types. + + dst.mov (high); + dst.add (255); + } + else + dst.mov (val); + + if (final_shift > 0) + dst.ashift (final_shift); + } + else + { + /* General case, emit upper 3 bytes as needed. */ + for (i = 0; i < 3; i++) + { + int byte = (val >> (8 * (3 - i))) & 0xff; and here. + + if (byte) + { + /* We are about to emit new bits, stop accumulating a + shift amount, and left-shift only if we have already + emitted some upper bits. */ + if (mov_done_p) + { + dst.ashift (shift); + dst.add (byte); + } + else + dst.mov (byte); + + /* Stop accumulating shift amount since we've just + emitted some bits. */ + shift = 0; + + mov_done_p = true; + } + + if (mov_done_p) + shift += 8; + } + + /* Emit lower byte. */ + if (!mov_done_p) + dst.mov (val & 0xff); + else + { + dst.ashift (shift); + if (val & 0xff) + dst.add (val & 0xff); + } + + if (final_shift > 0) + dst.ashift (final_shift); + } +} +