On Thu, 2010-12-30 at 18:56 +0200, Revital1 Eres wrote:
> Hello,
> 
> The attached patch is my latest attempt to model doloop for arm.
> I followed Chung-Lin Tang suggestion and used subs+jump similar to your
> patch.
> On crotex-A8 I see gain of 29% on autocor benchmark (telecom suite) with
> SMS using the following flags: -fmodulo-sched-allow-regmoves
> -funsafe-loop-optimizations -fmodulo-sched   -fno-auto-inc-dec
> -fdump-rtl-sms -mthumb  -mcpu=cortex-a8 -O3. (compare to using only
> -mthumb  -mcpu=cortex-a8 -O3)
> 
> I have not fully tested the patch and it's not in the proper format of
> submission yet.
> 
> Thanks,
> Revital
> 
> (See attached file: patch_arm_doloop.txt)
> 
> 
> 
> From: Roman Zhuykov <zhr...@ispras.ru>
> To:   gcc@gcc.gnu.org
> Cc:   d...@ispras.ru
> Date: 30/12/2010 04:04 PM
> Subject:      [ARM] Implementing doloop pattern
> Sent by:      gcc-ow...@gcc.gnu.org
> 
> 
> 
> Hello!
> 
> The main idea of the work described below was to estimate speedup we can
> gain from SMS on ARM.  SMS depends on doloop_end pattern and there is no
> appropriate instruction on ARM.  We decided to create a "fake"
> doloop_end pattern on ARM using a pair of "subs" and "bne" assembler
> instructions.  In implementation we used ideas from machine description
> files of other architectures, e. g. spu, which expands doloop_end
> pattern only when SMS is enabled.  The patch is attached.
> 
> This patch allows to use any possible register for the doloop pattern.
> It was tested on trunk snapshot from 30 Aug 2010.  It works fine on
> several small examples, but gives an ICE on sqlite-amalgamation-3.6.1
> source:
> sqlite3.c: In function 'sqlite3WhereBegin':
> sqlite3.c:76683:1: internal compiler error: in patch_jump_insn, at
> cfgrtl.c:1020
> 
> ICE happens in ira pass, when cleanup_cfg is called at the end or ira.
> 
> The "bad" instruction looks like
> (jump_insn 3601 628 4065 76 (parallel [
>              (set (pc)
>                  (if_then_else (ne (mem/c:SI (plus:SI (reg/f:SI 13 sp)
>                                  (const_int 36 [0x24])) [105 %sfp+-916
> S4 A32])
>                          (const_int 1 [0x1]))
>                      (label_ref 3600)
>                      (pc)))
>              (set (mem/c:SI (plus:SI (reg/f:SI 13 sp)
>                          (const_int 36 [0x24])) [105 %sfp+-916 S4 A32])
>                  (plus:SI (mem/c:SI (plus:SI (reg/f:SI 13 sp)
>                              (const_int 36 [0x24])) [105 %sfp+-916 S4 A32])
>                      (const_int -1 [0xffffffffffffffff])))
>          ]) sqlite3.c:75235 328 {doloop_end_internal}
>       (expr_list:REG_BR_PROB (const_int 9100 [0x238c])
>          (nil))
>   -> 3600)
> 
> So, the problem seems to be with ira.  Memory is used instead of a
> register to store doloop counter.  We tried to fix this by explicitly
> specifying hard register (r5) for doloop pattern.  The fixed version
> seems to work, but this doesn't look like a proper fix.  On trunk
> snapshot from 17 Dec 2010 the ICE described above have disappeared, but
> probably it's just a coincidence, and it will shop up anyway on some
> other test case.
> 
> The r5-fix shows the following results (compare "-O2 -fno-auto-inc-dec
> -fmodulo-sched" vs "-O2 -fno-auto-inc-dec").
> Aburto benchmarks: heapsort and matmult - 3% speedup. nsieve - 7% slowdown.
> Other aburto tests, sqlite tests and libevas rasterization library
> (expedite testsuite) show around zero results.
> 
> A motivating example shows about 23% speedup:
> 
> char scal (int n, char *a, char *b)
> {
>    int i;
>    char s = 0;
>    for (i = 0; i < n; i++)
>      s += a[i] * b[i];
>    return s;
> }
> 
> We have analyzed SMS results, and can conclude that if SMS has
> successfully built a schedule for the loop we usually gain a speedup,
> and when SMS fails, we often have some slowdown, which have appeared
> because of do-loop conversion.
> 
> The questions are:
> How to properly fix the ICE described?
> Do you think this approach (after the fixes) can make its way into trunk?
> 
> Happy holidays!
> --
> Roman Zhuykov
> 
> [attachment "sms-doloop-any-reg.diff" deleted by Revital1 Eres/Haifa/IBM]


@@ -162,6 +175,7 @@ doloop_condition_get (rtx doloop_pat)
     return 0;
 
   if ((XEXP (condition, 0) == reg)
+      || (REGNO (XEXP (condition, 0)) == CC_REGNUM)
       || (GET_CODE (XEXP (condition, 0)) == PLUS
                   && XEXP (XEXP (condition, 0), 0) == reg))

You can't depend on CC_REGNUM in generic code.  That's part of the
private machine description for ARM.  Other cores have different ways of
representing condition codes.

R.


Reply via email to