Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

Richard Earnshaw via Gcc-patches Tue, 27 Oct 2020 08:43:03 -0700

On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote:
> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw
> <richard.earns...@foss.arm.com> wrote:
>>
>> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
>>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
>>> <richard.earns...@foss.arm.com> wrote:
>>>>
>>>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
>>>>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
>>>>> <richard.earns...@foss.arm.com> wrote:
>>>>>>
>>>>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
>>>>>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
>>>>>>> <richard.earns...@foss.arm.com> wrote:
>>>>>>>>
>>>>>>>> On 20/10/2020 12:22, Richard Earnshaw wrote:
>>>>>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>>>>>>>>>> <richard.earns...@foss.arm.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>>>>>>>>>>>> <richard.earns...@foss.arm.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>>>>>>>>>>>>>> <richard.earns...@foss.arm.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load 
>>>>>>>>>>>>>>>> delta
>>>>>>>>>>>>>>>> from code memory (like we do without -mpure-code).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series 
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>> movs/adds/lsls.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We also do some cleanup by not emitting the function address 
>>>>>>>>>>>>>>>> and delta
>>>>>>>>>>>>>>>> via .word directives at the end of the thunk since we don't 
>>>>>>>>>>>>>>>> use them
>>>>>>>>>>>>>>>> with -mpure-code.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> No need for new testcases, this bug was already identified by
>>>>>>>>>>>>>>>> eg. pr46287-3.C
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2020-09-29  Christophe Lyon  <christophe.l...@linaro.org>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>       gcc/
>>>>>>>>>>>>>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta 
>>>>>>>>>>>>>>>> in r3 and
>>>>>>>>>>>>>>>>       do not emit function address and delta when -mpure-code 
>>>>>>>>>>>>>>>> is used.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Richard,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for your comments.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There are some optimizations you can make to this code.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would 
>>>>>>>>>>>>>>> be better
>>>>>>>>>>>>>>> to just expand a mov of 255 followed by an add.
>>>>>>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't 
>>>>>>>>>>>>>> noticed
>>>>>>>>>>>>>> before, so I can write something similar indeed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, I'm note quite sure to understand the benefit in the 
>>>>>>>>>>>>>> split
>>>>>>>>>>>>>> when -mpure-code is NOT used.
>>>>>>>>>>>>>> Consider:
>>>>>>>>>>>>>> int f3_1 (void) { return 510; }
>>>>>>>>>>>>>> int f3_2 (void) { return 511; }
>>>>>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0:
>>>>>>>>>>>>>> f3_1:
>>>>>>>>>>>>>>         movs    r0, #255
>>>>>>>>>>>>>>         lsls    r0, r0, #1
>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>> f3_2:
>>>>>>>>>>>>>>         ldr     r0, .L4
>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The splitter makes the code bigger, does it "compensate" for 
>>>>>>>>>>>>>> this by
>>>>>>>>>>>>>> not having to load the constant?
>>>>>>>>>>>>>> Actually the constant uses 4 more bytes, which should be taken 
>>>>>>>>>>>>>> into
>>>>>>>>>>>>>> account when comparing code size,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, the size of the literal pool entry needs to be taken into 
>>>>>>>>>>>>> account.
>>>>>>>>>>>>>  It might happen that the entry could be shared with another use 
>>>>>>>>>>>>> of that
>>>>>>>>>>>>> literal, but in general that's rare.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>>>>>>>>>>>>>> thumb1 instructions would be equivalent in size compared to 
>>>>>>>>>>>>>> loading
>>>>>>>>>>>>>> from the literal pool. Should the 256-510 range be extended?
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's a bit borderline at three instructions when literal pools 
>>>>>>>>>>>>> are not
>>>>>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite 
>>>>>>>>>>>>> small due
>>>>>>>>>>>>> to the limited pc offsets we can use.  I think on balance we 
>>>>>>>>>>>>> probably
>>>>>>>>>>>>> want to use the instruction sequence unless optimizing for size.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is also true for
>>>>>>>>>>>>>>> the literal pools alternative as well, so should be handled 
>>>>>>>>>>>>>>> before all
>>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample 
>>>>>>>>>>>>>> is compiled as:
>>>>>>>>>>>>>> f3_1:
>>>>>>>>>>>>>>         movs    r0, #255
>>>>>>>>>>>>>>         lsls    r0, r0, #1
>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>> f3_2:
>>>>>>>>>>>>>>         movs    r0, #1
>>>>>>>>>>>>>>         lsls    r0, r0, #8
>>>>>>>>>>>>>>         adds    r0, r0, #255
>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> so the "return 510" case is already handled as without 
>>>>>>>>>>>>>> -mpure-code.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was thinking specifically of the thunk sequence where you seem 
>>>>>>>>>>>>> to be
>>>>>>>>>>>>> emitting instructions directly rather than generating RTL.  The 
>>>>>>>>>>>>> examples
>>>>>>>>>>>>> you show here are not thunks.
>>>>>>>>>>>>>
>>>>>>>>>>>> OK thanks for the clarification.
>>>>>>>>>>>>
>>>>>>>>>>>> Here is an updated version, split into 3 patches to hopefully make
>>>>>>>>>>>> review easier.
>>>>>>>>>>>> They apply on top of my other mpure-code patches for PR96967 and 
>>>>>>>>>>>> PR96770:
>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
>>>>>>>>>>>>
>>>>>>>>>>>> I kept it this way to make incremental changes easier to 
>>>>>>>>>>>> understand.
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance 
>>>>>>>>>>>> easier,
>>>>>>>>>>>> I have updated thumb1_gen_const_int() so that it can generate 
>>>>>>>>>>>> either RTL or
>>>>>>>>>>>> asm. This way, all the code used to build thumb-1 constants is in 
>>>>>>>>>>>> the
>>>>>>>>>>>> same place,
>>>>>>>>>>>>  in case we need to improve/fix it later. We now generate shorter 
>>>>>>>>>>>> sequences in
>>>>>>>>>>>> several cases matching your comments.
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn 
>>>>>>>>>>>> pattern and
>>>>>>>>>>>> calls thumb1_gen_const_int.
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 3: Update of the original patch in this thread, now calls
>>>>>>>>>>>> thumb1_gen_const_int.
>>>>>>>>>>>
>>>>>>>>>>> Yuk!  Those changes to thumb1_gen_const_int are horrible.
>>>>>>>>>>>
>>>>>>>>>>> I think we should be able to leverage the fact that the compiler 
>>>>>>>>>>> can use
>>>>>>>>>>> C++ now to do much better than that, for example by making that 
>>>>>>>>>>> function
>>>>>>>>>>> a template.  For example (and this is just a sketch):
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Indeed! I didn't think about it since there is no other use of
>>>>>>>>>> templates in arm.c yet.
>>>>>>>>>> I'll send an update soon.
>>>>>>>>>>
>>>>>>>>>> Other than that, does the approach look OK to you?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, I think this is heading in the right direction.  Bringing the two
>>>>>>>>> immediate generating operations into a single function can only be a
>>>>>>>>> good thing.
>>>>>>>>>
>>>>>>>>> Looking again at your example constant sequences, I see:
>>>>>>>>>
>>>>>>>>> 0x1000010:
>>>>>>>>>         movs    r3, #16
>>>>>>>>>         lsls    r3, #16
>>>>>>>>>         adds    r3, #1
>>>>>>>>>         lsls    r3, #4
>>>>>>>>> 0x1000011:
>>>>>>>>>         movs    r3, #1
>>>>>>>>>         lsls    r3, #24
>>>>>>>>>         adds    r3, #17
>>>>>>>>>
>>>>>>>>> The first of these looks odd, given the second sequence.  Why doesn't
>>>>>>>>> the first expand to
>>>>>>>>>
>>>>>>>>> 0x1000010:
>>>>>>>>>         movs    r3, #16
>>>>>>>>>         lsls    r3, #16
>>>>>>>>>         adds    r3, #16
>>>>>>>>>
>>>>>>>> Err, I mean to:
>>>>>>>>
>>>>>>>>
>>>>>>>> 0x1000010:
>>>>>>>>         movs    r3, #1
>>>>>>>>         lsls    r3, #24
>>>>>>>>         adds    r3, #16
>>>>>>>>
>>>>>>>> ?
>>>>>>>
>>>>>>> Because I first try to right-shift the constant, hoping to reduce its
>>>>>>> range and need less instructions to build the higher part, then
>>>>>>> left-shift back.
>>>>>>>
>>>>>>> In this particular case, we'd need to realize that there are many
>>>>>>> zeros "inside" the constant.
>>>>>>>
>>>>>>> If I remove the part that tries to reduce the range, I do get that
>>>>>>> sequence, but for 764 I now generate
>>>>>>> movs    r3, #2
>>>>>>> lsls    r3, #8
>>>>>>> adds    r3, #252
>>>>>>> instead of
>>>>>>> movs    r3, #191
>>>>>>> lsls    r3, #2
>>>>>>>
>>>>>>> A possibility would be to try both approaches and keep the shortest one.
>>>>>>
>>>>>> Lets leave that for now, it's not important to fixing the main issue;
>>>>>> but we should remember we need to come back to it at some point.
>>>>>>
>>>>> Thanks, that's what I was thinking too.
>>>>>
>>>>>> There are other tricks as well, such as
>>>>>>
>>>>>>   0xffffff
>>>>>>
>>>>>> can be done as
>>>>>>
>>>>>>   0x1000000 - 1
>>>>>>
>>>>>> and
>>>>>>
>>>>>>   0xfffffd
>>>>>>
>>>>>> as
>>>>>>
>>>>>>   0x1000000 - 3
>>>>>>
>>>>>> but these can wait as well.
>>>>>>
>>>>>
>>>>> Didn't we already need to handle such tricks? I'm surprised this
>>>>> wasn't needed earlier.
>>>>>
>>>>
>>>> I don't think we ever worried about them.  Most of them need at least 3
>>>> instructions so aren't a code size saving over using a literal pool entry.
>>>>
>>> OK, this will also help when using -mslow-flash-data.
>>>
>>> Here are updated patches, now using a template as you suggested.
>>
>> Looking better, but when I try to apply this to my local tree patch 2
>> fails (I'm not exactly sure why, what was your baseline for these
>> patches?)
> I have the tree patches in this thread on top of these other two:
> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556768.html
> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556769.html
> 
> They have gradual improvements to thumb1_movsi_insn.
> 
>> -- that patch looks suspicious anyway, you're replacing code
>> that prints out assembly with code that generates RTL.
> Right! I took me a while to understand how I could miss this, sorry.
> That was caused by improper testing, as this part of the code
> isn't used when targetting cortex-m0. I have added a testcase
> for cortex-m23 which crashes with the previous version of patch 2,
> and succeeds now.
> 
>> Could you also rename t1_print and t1_rtl to thumb1_const_print and
>> thumb1_const_rtl.  I think the names as they stand are likely to be too
>> generic.
> OK, done.
> 
> How about this new version?
> I'm not yet sure about the most appropriate naming for:
> thumb1_gen_const_int
> thumb1_gen_const_int_asm
> should they be
> thumb1_gen_const_int_rtl
> thumb1_gen_const_int_print
> to be consistent with the new classes?


It would probably be better, yes.

More detailed comments below.

R.

> 
> Thanks,
> 
> Christophe
> 
>> R.
>>
>>>
>>> Thanks,
>>>
>>> Christophe
>>>
>>>> R.
>>>>
>>>>>
>>>>>>
>>>>>> R.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> R.
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Christophe
>>>>>>>>>>
>>>>>>>>>>> class t1_rtl
>>>>>>>>>>> {
>>>>>>>>>>>  public:
>>>>>>>>>>>   void ashift(int a) { gen_rtx_ASHIFT(a); }
>>>>>>>>>>>   void rshift(int b) { gen_rtx_SHIFTRT(b); }
>>>>>>>>>>> };
>>>>>>>>>>>
>>>>>>>>>>> class t1_print
>>>>>>>>>>> {
>>>>>>>>>>>  public:
>>>>>>>>>>>   t1_print (FILE *f) : t_file(f) {}
>>>>>>>>>>>   void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
>>>>>>>>>>>   void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
>>>>>>>>>>>  private:
>>>>>>>>>>>   FILE *t_file;
>>>>>>>>>>> };
>>>>>>>>>>>
>>>>>>>>>>> template <class T>
>>>>>>>>>>> void thumb1_gen_const_int(T t, int f)
>>>>>>>>>>> {
>>>>>>>>>>>   // Expansion of thumb1_gen_const_int ...
>>>>>>>>>>>   t.ashift(f);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> // Usage...
>>>>>>>>>>> void f1()
>>>>>>>>>>> {
>>>>>>>>>>>   // Use the RTL expander
>>>>>>>>>>>   t1_rtl g;
>>>>>>>>>>>   thumb1_gen_const_int (g, 3);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> void f2()
>>>>>>>>>>> {
>>>>>>>>>>>   // Use the printf expander writing to stdout
>>>>>>>>>>>   t1_print g(stdout);
>>>>>>>>>>>   thumb1_gen_const_int (g, 3);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> With this you can write thumb1_gen_const_int without having to worry
>>>>>>>>>>> about which expander is being used in each instance and the template
>>>>>>>>>>> expansion will use the right version.
>>>>>>>>>>>
>>>>>>>>>>> R.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  I also suspect (but haven't check) that the base adjustment 
>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4).  
>>>>>>>>>>>>>>> If that is
>>>>>>>>>>>>>>> the case then you could generate n/4 and then shift it left by 
>>>>>>>>>>>>>>> 2 for an
>>>>>>>>>>>>>>> even greater range of literals.
>>>>>>>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY 
>>>>>>>>>>>>>> case,
>>>>>>>>>>>>>> I'll update my patch.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  More generally, any sequence of up to
>>>>>>>>>>>>>>> three thumb1 instructions will be no larger, and probably as 
>>>>>>>>>>>>>>> fast as the
>>>>>>>>>>>>>>> existing literal pool fall back.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your 
>>>>>>>>>>>>>>> code will
>>>>>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two 
>>>>>>>>>>>>>>> shifts could
>>>>>>>>>>>>>>> be merged into a single shift-by-16.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Finally, I'd really like to see some executable tests for this, 
>>>>>>>>>>>>>>> if at
>>>>>>>>>>>>>>> all possible.
>>>>>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing 
>>>>>>>>>>>>>> testcase
>>>>>>>>>>>>>> that showed the problem. There are also:
>>>>>>>>>>>>>> g++.dg/opt/thunk1.C
>>>>>>>>>>>>>> g++.dg/ipa/pr46984.C
>>>>>>>>>>>>>> g++.dg/torture/pr46287.C
>>>>>>>>>>>>>> g++.dg/torture/pr45699.C
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you want that I copy one of these in the arm subdir and add
>>>>>>>>>>>>>> -mpure-code in dg-options?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On reflection, probably not - that just makes things more 
>>>>>>>>>>>>> complicated
>>>>>>>>>>>>> with all the dg-options mess (I'm worried about interactions with 
>>>>>>>>>>>>> other
>>>>>>>>>>>>> sets of options on the command line and the fall-out from that).  
>>>>>>>>>>>>> If
>>>>>>>>>>>>> someone cares about pure-code they should be doing full testsuite 
>>>>>>>>>>>>> runs
>>>>>>>>>>>>> with it enabled and that should be sufficient.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use 
>>>>>>>>>>>> a
>>>>>>>>>>>> modified simulator.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Christophe
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> R.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Christophe
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> R.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> k#   (use "git pull" to merge the remote branch into yours)
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>  gcc/config/arm/arm.c | 91 
>>>>>>>>>>>>>>>> +++++++++++++++++++++++++++++++++++++---------------
>>>>>>>>>>>>>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>> index ceeb91f..62abeb5 100644
>>>>>>>>>>>>>>>> --- a/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, 
>>>>>>>>>>>>>>>> tree, HOST_WIDE_INT delta,
>>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>>>        if (mi_delta > 255)
>>>>>>>>>>>>>>>>       {
>>>>>>>>>>>>>>>> -       fputs ("\tldr\tr3, ", file);
>>>>>>>>>>>>>>>> -       assemble_name (file, label);
>>>>>>>>>>>>>>>> -       fputs ("+4\n", file);
>>>>>>>>>>>>>>>> +       /* With -mpure-code, we cannot load delta from the 
>>>>>>>>>>>>>>>> constant
>>>>>>>>>>>>>>>> +          pool: we build it explicitly.  */
>>>>>>>>>>>>>>>> +       if (target_pure_code)
>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>> +           bool mov_done_p = false;
>>>>>>>>>>>>>>>> +           int i;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +           /* Emit upper 3 bytes if needed.  */
>>>>>>>>>>>>>>>> +           for (i = 0; i < 3; i++)
>>>>>>>>>>>>>>>> +             {
>>>>>>>>>>>>>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +               if (byte)
>>>>>>>>>>>>>>>> +                 {
>>>>>>>>>>>>>>>> +                   if (mov_done_p)
>>>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", 
>>>>>>>>>>>>>>>> byte);
>>>>>>>>>>>>>>>> +                   else
>>>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", 
>>>>>>>>>>>>>>>> byte);
>>>>>>>>>>>>>>>> +                   mov_done_p = true;
>>>>>>>>>>>>>>>> +                 }
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +               if (mov_done_p)
>>>>>>>>>>>>>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
>>>>>>>>>>>>>>>> +             }
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +           /* Emit lower byte if needed.  */
>>>>>>>>>>>>>>>> +           if (!mov_done_p)
>>>>>>>>>>>>>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta 
>>>>>>>>>>>>>>>> & 0xff);
>>>>>>>>>>>>>>>> +           else if (mi_delta & 0xff)
>>>>>>>>>>>>>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta 
>>>>>>>>>>>>>>>> & 0xff);
>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>> +       else
>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>> +           fputs ("\tldr\tr3, ", file);
>>>>>>>>>>>>>>>> +           assemble_name (file, label);
>>>>>>>>>>>>>>>> +           fputs ("+4\n", file);
>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
>>>>>>>>>>>>>>>>                      mi_op, this_regno, this_regno);
>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, 
>>>>>>>>>>>>>>>> tree, HOST_WIDE_INT delta,
>>>>>>>>>>>>>>>>       fputs ("\tpop\t{r3}\n", file);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>        fprintf (file, "\tbx\tr12\n");
>>>>>>>>>>>>>>>> -      ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>>>>>>>>>> -      assemble_name (file, label);
>>>>>>>>>>>>>>>> -      fputs (":\n", file);
>>>>>>>>>>>>>>>> -      if (flag_pic)
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +      /* With -mpure-code, we don't need to emit literals for 
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> +      function address and delta since we emitted code to 
>>>>>>>>>>>>>>>> build
>>>>>>>>>>>>>>>> +      them.  */
>>>>>>>>>>>>>>>> +      if (!target_pure_code)
>>>>>>>>>>>>>>>>       {
>>>>>>>>>>>>>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>>>>>>>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>>>>>>>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, 
>>>>>>>>>>>>>>>> so the PC
>>>>>>>>>>>>>>>> -          pipeline offset is four rather than eight.  Adjust 
>>>>>>>>>>>>>>>> the offset
>>>>>>>>>>>>>>>> -          accordingly.  */
>>>>>>>>>>>>>>>> -       tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>>>>>>>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>>>>>>>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>>>>>>>>>> -                            tem,
>>>>>>>>>>>>>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>>>>>>>>>> -                                                ggc_strdup 
>>>>>>>>>>>>>>>> (labelpc)));
>>>>>>>>>>>>>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>> -     }
>>>>>>>>>>>>>>>> -      else
>>>>>>>>>>>>>>>> -     /* Output ".word .LTHUNKn".  */
>>>>>>>>>>>>>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, 
>>>>>>>>>>>>>>>> BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>> +       ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>>>>>>>>>> +       assemble_name (file, label);
>>>>>>>>>>>>>>>> +       fputs (":\n", file);
>>>>>>>>>>>>>>>> +       if (flag_pic)
>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>>>>>>>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>>>>>>>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb 
>>>>>>>>>>>>>>>> mode, so the PC
>>>>>>>>>>>>>>>> +              pipeline offset is four rather than eight.  
>>>>>>>>>>>>>>>> Adjust the offset
>>>>>>>>>>>>>>>> +              accordingly.  */
>>>>>>>>>>>>>>>> +           tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>>>>>>>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>>>>>>>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>>>>>>>>>> +                                tem,
>>>>>>>>>>>>>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>>>>>>>>>> +                                                    
>>>>>>>>>>>>>>>> ggc_strdup (labelpc)));
>>>>>>>>>>>>>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>> +       else
>>>>>>>>>>>>>>>> +         /* Output ".word .LTHUNKn".  */
>>>>>>>>>>>>>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, 
>>>>>>>>>>>>>>>> BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>>>>>>>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 
>>>>>>>>>>>>>>>> 1);
>>>>>>>>>>>>>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>>>>>>>>>> +         assemble_integer (GEN_INT(mi_delta), 4, 
>>>>>>>>>>>>>>>> BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>> +     }
>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>    else
>>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>

+class thumb1_const_rtl
...
+  void mov (int val)

This should take a HOST_WIDE_INT.  Similarly for add and shift.  The
same applies to the asm version as well.

+    asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val);

Should be using reg_names[dst_regno] in all cases.  In fact, you might
want to move that lookup to the constructor and just save a pointer to
the string there.  You'll need to use HOST_WIDE_INT_PRINT_UNSIGNED
rather than "%d" for the immediate.

+template <class T>
+void
+thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1)
+{
+  bool mov_done_p = false;
+  int val = op1;

This potentially silently loses precision.  In fact, I think you really
want to use "unsigned HOST_WIDE_INT" throughout the following code, so
that the right shifts aren't undefined if dealing with negative numbers.

For safety, you should also have an assertion in here that

  op1 == trunc_int_for_mode (op1, SImode)

+  int shift = 0;
+  int i;
+
+  if (val == 0)

You can short-circuit 0..255 here for a quick exit.

+    {
+      dst.mov (val);
+      return;
+    }

Another trick: if the top nine bits of the 32-bit value are all set,
you're probably going to be better off (and certainly not worse off) by
generating -op1 and then negating the result in a final step - you can
do that via recursion.

+
+  /* In the general case, we need 7 instructions to build
+     a 32 bits constant (1 movs, 3 lsls, 3 adds). We can
+     do better if VAL is small enough, or
+     right-shiftable by a suitable amount.  If the
+     right-shift enables to encode at least one less byte,
+     it's worth it: we save a adds and a lsls at the
+     expense of a final lsls.  */
+  int final_shift = number_of_first_bit_set (val);
+
+  int leading_zeroes = clz_hwi (val);
+  int number_of_bytes_needed
+    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes)
+       / BITS_PER_UNIT) + 1;
+  int number_of_bytes_needed2
+    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift)
+       / BITS_PER_UNIT) + 1;
+
+  if (number_of_bytes_needed2 < number_of_bytes_needed)
+    val >>= final_shift;
+  else
+    final_shift = 0;
+
+  /* If we are in a very small range, we can use either a single movs
+     or movs+adds.  */
+  if ((val >= 0) && (val <= 510))

if val is made unsigned HWI as I suggest, the lower bounds test is not
needed.

+    {
+      if (val > 255)
+       {
+         int high = val - 255;

Again, watch your types.

+
+         dst.mov (high);
+         dst.add (255);
+       }
+      else
+       dst.mov (val);
+
+      if (final_shift > 0)
+       dst.ashift (final_shift);
+    }
+  else
+    {
+      /* General case, emit upper 3 bytes as needed.  */
+      for (i = 0; i < 3; i++)
+       {
+         int byte = (val >> (8 * (3 - i))) & 0xff;

and here.

+
+         if (byte)
+           {
+             /* We are about to emit new bits, stop accumulating a
+                shift amount, and left-shift only if we have already
+                emitted some upper bits.  */
+             if (mov_done_p)
+               {
+                 dst.ashift (shift);
+                 dst.add (byte);
+               }
+             else
+               dst.mov (byte);
+
+             /* Stop accumulating shift amount since we've just
+                emitted some bits.  */
+             shift = 0;
+
+             mov_done_p = true;
+           }
+
+         if (mov_done_p)
+           shift += 8;
+       }
+
+      /* Emit lower byte.  */
+      if (!mov_done_p)
+       dst.mov (val & 0xff);
+      else
+       {
+         dst.ashift (shift);
+         if (val & 0xff)
+           dst.add (val & 0xff);
+       }
+
+      if (final_shift > 0)
+       dst.ashift (final_shift);
+    }
+}
+

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

Reply via email to