On Wed, Oct 15, 2014 at 7:10 PM, Yangfei (Felix) <felix.y...@huawei.com> wrote:
> Hi Sterling,
>
>     Since the patch is delayed for a long time, I'm kind of pushing it. Sorry 
> for that.
>     Yeah, you are right. We have some performance issue here as GCC may use 
> one more general register in some cases with this patch.
>     Take the following arraysum testcase for example. In doloop optimization, 
> GCC figures out that the number of iterations is 1024 and creates a new 
> pseudo 79 as the new trip count register.
>     The pseudo 79 is live throughout the loop, this makes the register 
> pressure in the loop higher. And it's possible that this new pseudo is 
> spilled by reload when the register pressure is very high.
>     I know that the xtensa loop instruction copies the trip count register 
> into the LCOUNT special register. And we need describe this hardware feature 
> in GCC in order to free the trip count register.
>     But I find it difficult to do. Do you have any good suggestions on this?

There are two issues related to the trip count, one I would like you
to solve now, one later.

1. Later: The trip count doesn't need to be updated at all inside
these loops, once the loop instruction executes. The code below
relates to this case.

2. Now: You should be able to use a loop instruction regardless of
whether the trip count is spilled. If you have an example where that
wouldn't work, I would love to see it.

>
> arraysum.c:
> int g[1024];
> int g_sum;
>
> void test_entry ()
> {
>         int i, Sum = 0;
>
>         for (i = 0; i < 1024; i++)
>           Sum = Sum + g[i];
>
>         g_sum = Sum;
> }
>
>
> 1. RTL before the doloop optimization pass(arraysum.c.193r.loop2_invariant):
> (note 34 0 32 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
> (note 32 34 36 2 NOTE_INSN_FUNCTION_BEG)
> (insn 36 32 37 2 (set (reg:SI 72 [ ivtmp$8 ])
>         (mem/u/c:SI (symbol_ref/u:SI ("*.LC2") [flags 0x2]) [2  S4 A32])) 29 
> {movsi_internal}
>      (expr_list:REG_EQUAL (symbol_ref:SI ("g")  <var_decl 0x7f6eef5d62d0 g>)
>         (nil)))
> (insn 37 36 33 2 (set (reg/f:SI 76 [ D.1393 ])
>         (mem/u/c:SI (symbol_ref/u:SI ("*.LC3") [flags 0x2]) [2  S4 A32])) 29 
> {movsi_internal}
>      (expr_list:REG_EQUAL (const:SI (plus:SI (symbol_ref:SI ("g")  <var_decl 
> 0x7f6eef5d62d0 g>)
>                 (const_int 4096 [0x1000])))
>         (nil)))
> (insn 33 37 42 2 (set (reg/v:SI 74 [ Sum ])
>         (const_int 0 [0])) arraysum.c:6 29 {movsi_internal}
>      (nil))
> (code_label 42 33 38 3 2 "" [0 uses])
> (note 38 42 39 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
> (insn 39 38 40 3 (set (reg:SI 77 [ MEM[base: _14, offset: 0B] ])
>         (mem:SI (reg:SI 72 [ ivtmp$8 ]) [2 MEM[base: _14, offset: 0B]+0 S4 
> A32])) arraysum.c:9 29 {movsi_internal}
>      (nil))
> (insn 40 39 41 3 (set (reg/v:SI 74 [ Sum ])
>         (plus:SI (reg/v:SI 74 [ Sum ])
>             (reg:SI 77 [ MEM[base: _14, offset: 0B] ]))) arraysum.c:9 1 
> {addsi3}
>      (expr_list:REG_DEAD (reg:SI 77 [ MEM[base: _14, offset: 0B] ])
>         (nil)))
> (insn 41 40 43 3 (set (reg:SI 72 [ ivtmp$8 ])
>         (plus:SI (reg:SI 72 [ ivtmp$8 ])
>             (const_int 4 [0x4]))) 1 {addsi3}
>      (nil))
> (jump_insn 43 41 52 3 (set (pc)
>         (if_then_else (ne (reg:SI 72 [ ivtmp$8 ])
>                 (reg/f:SI 76 [ D.1393 ]))
>             (label_ref:SI 52)
>             (pc))) arraysum.c:8 39 {*btrue}
>      (int_list:REG_BR_PROB 9899 (nil))
>  -> 52)
> (code_label 52 43 51 5 3 "" [1 uses])
> (note 51 52 44 5 [bb 5] NOTE_INSN_BASIC_BLOCK)
> (note 44 51 45 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
> (insn 45 44 46 4 (set (reg/f:SI 78)
>         (mem/u/c:SI (symbol_ref/u:SI ("*.LC4") [flags 0x2]) [2  S4 A32])) 
> arraysum.c:11 29 {movsi_internal}
>      (expr_list:REG_EQUAL (symbol_ref:SI ("g_sum")  <var_decl 0x7f6eef5d6360 
> g_sum>)
>         (nil)))
> (insn 46 45 0 4 (set (mem/c:SI (reg/f:SI 78) [2 g_sum+0 S4 A32])
>         (reg/v:SI 74 [ Sum ])) arraysum.c:11 29 {movsi_internal}
>      (expr_list:REG_DEAD (reg/f:SI 78)
>         (expr_list:REG_DEAD (reg/v:SI 74 [ Sum ])
>             (nil))))
>
>
> 2. RTL after the doloop optimization pass(arraysum.c.195r.loop2_doloop):
> (note 34 0 32 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
> (note 32 34 36 2 NOTE_INSN_FUNCTION_BEG)
> (insn 36 32 37 2 (set (reg:SI 72 [ ivtmp$8 ])
>         (mem/u/c:SI (symbol_ref/u:SI ("*.LC2") [flags 0x2]) [2  S4 A32])) 29 
> {movsi_internal}
>      (expr_list:REG_EQUAL (symbol_ref:SI ("g")  <var_decl 0x7f6eef5d62d0 g>)
>         (nil)))
> (insn 37 36 33 2 (set (reg/f:SI 76 [ D.1393 ])
>         (mem/u/c:SI (symbol_ref/u:SI ("*.LC3") [flags 0x2]) [2  S4 A32])) 29 
> {movsi_internal}
>      (expr_list:REG_EQUAL (const:SI (plus:SI (symbol_ref:SI ("g")  <var_decl 
> 0x7f6eef5d62d0 g>)
>                 (const_int 4096 [0x1000])))
>         (nil)))
> (insn 33 37 54 2 (set (reg/v:SI 74 [ Sum ])
>         (const_int 0 [0])) arraysum.c:6 29 {movsi_internal}
>      (nil))
> (insn 54 33 42 2 (set (reg:SI 79)
>         (const_int 1024 [0x400])) arraysum.c:6 -1
>      (nil))
> (code_label 42 54 38 3 2 "" [0 uses])
> (note 38 42 39 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
> (insn 39 38 40 3 (set (reg:SI 77 [ MEM[base: _14, offset: 0B] ])
>         (mem:SI (reg:SI 72 [ ivtmp$8 ]) [2 MEM[base: _14, offset: 0B]+0 S4 
> A32])) arraysum.c:9 29 {movsi_internal}
>      (nil))
> (insn 40 39 41 3 (set (reg/v:SI 74 [ Sum ])
>         (plus:SI (reg/v:SI 74 [ Sum ])
>             (reg:SI 77 [ MEM[base: _14, offset: 0B] ]))) arraysum.c:9 1 
> {addsi3}
>      (expr_list:REG_DEAD (reg:SI 77 [ MEM[base: _14, offset: 0B] ])
>         (nil)))
> (insn 41 40 53 3 (set (reg:SI 72 [ ivtmp$8 ])
>         (plus:SI (reg:SI 72 [ ivtmp$8 ])
>             (const_int 4 [0x4]))) 1 {addsi3}
>      (nil))
> (jump_insn 53 41 52 3 (parallel [
>             (set (pc)
>                 (if_then_else (ne (reg:SI 79)
>                         (const_int 1 [0x1]))
>                     (label_ref 52)
>                     (pc)))
>             (set (reg:SI 79)
>                 (plus:SI (reg:SI 79)
>                     (const_int -1 [0xffffffffffffffff])))
>             (unspec [
>                     (const_int 0 [0])
>                 ] 13)
>             (clobber (scratch:SI))
>         ]) -1
>      (int_list:REG_BR_PROB 9899 (nil))
>  -> 52)
> (code_label 52 53 51 5 3 "" [1 uses])
> (note 51 52 44 5 [bb 5] NOTE_INSN_BASIC_BLOCK)
> (note 44 51 45 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
> (insn 45 44 46 4 (set (reg/f:SI 78)
>         (mem/u/c:SI (symbol_ref/u:SI ("*.LC4") [flags 0x2]) [2  S4 A32])) 
> arraysum.c:11 29 {movsi_internal}
>      (expr_list:REG_EQUAL (symbol_ref:SI ("g_sum")  <var_decl 0x7f6eef5d6360 
> g_sum>)
>         (nil)))
> (insn 46 45 0 4 (set (mem/c:SI (reg/f:SI 78) [2 g_sum+0 S4 A32])
>         (reg/v:SI 74 [ Sum ])) arraysum.c:11 29 {movsi_internal}
>      (expr_list:REG_DEAD (reg/f:SI 78)
>         (expr_list:REG_DEAD (reg/v:SI 74 [ Sum ])
>             (nil))))
>
>
>>
>> On Tue, Oct 14, 2014 at 8:39 AM, Felix Yang <fei.yang0...@gmail.com> wrote:
>> > PING?
>> > Cheers,
>> > Felix
>>
>> Felix,
>>
>> This isn't my day job, 24-hour pings are unproductive.
>>
>> You shouldn't need to worry about the trip count register getting spilled. It
>> makes no difference whatsoever to how the loop operates--the trip count is
>> dead with regards to the loop once the instruction executes. You don't need 
>> to
>> describe LCOUNT to gcc in order for this not to matter. It should be enough 
>> to
>> describe the zcl as consuming the value in the same way a branch instruction
>> consumes a value.
>>
>> If you have a case where spilling it is causing a problem, then there is a 
>> bug in
>> your code, papered over by dropping case when it is spilled. Similarly with
>> iter_reg_used_outside--it shouldn't affect whether or not a zcl is valid 
>> here. If
>> you have a case where it does, there is likely a bug in your code.
>>
>> If the code is easier to write by maintaining trip_count up, then fine (for 
>> now);
>> you give up some performance (in fact, a lot of performance), but that 
>> doesn't
>> matter as to the correctness.
>>
>>
>> >
>> >
>> > On Tue, Oct 14, 2014 at 12:30 AM, Felix Yang <fei.yang0...@gmail.com>
>> wrote:
>> >> Thanks for the comments.
>> >>
>> >> The patch checked the usage of teh trip count register, making sure
>> >> that it is not used in the loop body other than the doloop_end or
>> >> lives past the doloop_end instruction, as the following code snippet
>> >> shows:
>> >>
>> >> +  /* Scan all the blocks to make sure they don't use iter_reg.  */
>> >> + if (loop->iter_reg_used || loop->iter_reg_used_outside)
>> >> +    {
>> >> +      if (dump_file)
>> >> +        fprintf (dump_file, ";; loop %d uses iterator\n",
>> >> +                 loop->loop_no);
>> >> +      return false;
>> >> +    }
>> >>
>> >>     For the spill issue, I think we need to handle it. The reason is
>> >> that currently we are not telling GCC about the existence of the
>> >> LCOUNT register. Instead, we keep the trip count in a general
>> >> register and it's possible that this register can be spilled when
>> >> register pressure is high.
>> >>     It's a good idea to post another patch to describe the LCOUNT
>> >> register in GCC in order to free this general register. But I want
>> >> this patch applied as a first step, OK?
>> >>
>> >> Cheers,
>> >> Felix
>> >>
>> >>
>> >> On Tue, Oct 14, 2014 at 12:09 AM, augustine.sterl...@gmail.com
>> >> <augustine.sterl...@gmail.com> wrote:
>> >>> On Fri, Oct 10, 2014 at 6:59 AM, Felix Yang <fei.yang0...@gmail.com>
>> wrote:
>> >>>> Hi Sterling,
>> >>>>
>> >>>>     I made some improvement to the patch. Two changes:
>> >>>>     1. TARGET_LOOPS is now used as a condition of the doloop
>> >>>> related patterns, which is more elegant.
>> >>>
>> >>> Fine.
>> >>>
>> >>>>     2. As the trip count register of the zero-cost loop maybe
>> >>>> potentially spilled, we need to change the patterns in order to
>> >>>> handle this issue.
>> >>>
>> >>> Actually, for xtensa you don't. The trip count is copied into LCOUNT
>> >>> at the execution of the loop instruction, and therefore a spill or
>> >>> whatever doesn't matter--it won't affect the result. So as long as
>> >>> you have the trip count at the start of the loop, you are fine.
>> >>>
>> >>> This does bring up an issue of whether or not the trip count can be
>> >>> modified during the loop. (note that this is different than early
>> >>> exit.) If it can, you can't use a zero-overhead loop. Does your
>> >>> patch address this case.
>> >>>
>> >>> The solution is similar to that adapted by c6x backend.
>> >>>> Just turn the zero-cost loop into a regular loop when that happens
>> >>>> when reload is completed.
>> >>>>     Attached please find version 4 of the patch. Make check
>> >>>> regression tested with xtensa-elf-gcc/simulator.
>> >>>>     OK for trunk?

Reply via email to