On Thu, 2010-12-30 at 18:56 +0200, Revital1 Eres wrote: > Hello, > > The attached patch is my latest attempt to model doloop for arm. > I followed Chung-Lin Tang suggestion and used subs+jump similar to your > patch. > On crotex-A8 I see gain of 29% on autocor benchmark (telecom suite) with > SMS using the following flags: -fmodulo-sched-allow-regmoves > -funsafe-loop-optimizations -fmodulo-sched -fno-auto-inc-dec > -fdump-rtl-sms -mthumb -mcpu=cortex-a8 -O3. (compare to using only > -mthumb -mcpu=cortex-a8 -O3) > > I have not fully tested the patch and it's not in the proper format of > submission yet. > > Thanks, > Revital > > (See attached file: patch_arm_doloop.txt) > > > > From: Roman Zhuykov <zhr...@ispras.ru> > To: gcc@gcc.gnu.org > Cc: d...@ispras.ru > Date: 30/12/2010 04:04 PM > Subject: [ARM] Implementing doloop pattern > Sent by: gcc-ow...@gcc.gnu.org > > > > Hello! > > The main idea of the work described below was to estimate speedup we can > gain from SMS on ARM. SMS depends on doloop_end pattern and there is no > appropriate instruction on ARM. We decided to create a "fake" > doloop_end pattern on ARM using a pair of "subs" and "bne" assembler > instructions. In implementation we used ideas from machine description > files of other architectures, e. g. spu, which expands doloop_end > pattern only when SMS is enabled. The patch is attached. > > This patch allows to use any possible register for the doloop pattern. > It was tested on trunk snapshot from 30 Aug 2010. It works fine on > several small examples, but gives an ICE on sqlite-amalgamation-3.6.1 > source: > sqlite3.c: In function 'sqlite3WhereBegin': > sqlite3.c:76683:1: internal compiler error: in patch_jump_insn, at > cfgrtl.c:1020 > > ICE happens in ira pass, when cleanup_cfg is called at the end or ira. > > The "bad" instruction looks like > (jump_insn 3601 628 4065 76 (parallel [ > (set (pc) > (if_then_else (ne (mem/c:SI (plus:SI (reg/f:SI 13 sp) > (const_int 36 [0x24])) [105 %sfp+-916 > S4 A32]) > (const_int 1 [0x1])) > (label_ref 3600) > (pc))) > (set (mem/c:SI (plus:SI (reg/f:SI 13 sp) > (const_int 36 [0x24])) [105 %sfp+-916 S4 A32]) > (plus:SI (mem/c:SI (plus:SI (reg/f:SI 13 sp) > (const_int 36 [0x24])) [105 %sfp+-916 S4 A32]) > (const_int -1 [0xffffffffffffffff]))) > ]) sqlite3.c:75235 328 {doloop_end_internal} > (expr_list:REG_BR_PROB (const_int 9100 [0x238c]) > (nil)) > -> 3600) > > So, the problem seems to be with ira. Memory is used instead of a > register to store doloop counter. We tried to fix this by explicitly > specifying hard register (r5) for doloop pattern. The fixed version > seems to work, but this doesn't look like a proper fix. On trunk > snapshot from 17 Dec 2010 the ICE described above have disappeared, but > probably it's just a coincidence, and it will shop up anyway on some > other test case. > > The r5-fix shows the following results (compare "-O2 -fno-auto-inc-dec > -fmodulo-sched" vs "-O2 -fno-auto-inc-dec"). > Aburto benchmarks: heapsort and matmult - 3% speedup. nsieve - 7% slowdown. > Other aburto tests, sqlite tests and libevas rasterization library > (expedite testsuite) show around zero results. > > A motivating example shows about 23% speedup: > > char scal (int n, char *a, char *b) > { > int i; > char s = 0; > for (i = 0; i < n; i++) > s += a[i] * b[i]; > return s; > } > > We have analyzed SMS results, and can conclude that if SMS has > successfully built a schedule for the loop we usually gain a speedup, > and when SMS fails, we often have some slowdown, which have appeared > because of do-loop conversion. > > The questions are: > How to properly fix the ICE described? > Do you think this approach (after the fixes) can make its way into trunk? > > Happy holidays! > -- > Roman Zhuykov > > [attachment "sms-doloop-any-reg.diff" deleted by Revital1 Eres/Haifa/IBM]
@@ -162,6 +175,7 @@ doloop_condition_get (rtx doloop_pat) return 0; if ((XEXP (condition, 0) == reg) + || (REGNO (XEXP (condition, 0)) == CC_REGNUM) || (GET_CODE (XEXP (condition, 0)) == PLUS && XEXP (XEXP (condition, 0), 0) == reg)) You can't depend on CC_REGNUM in generic code. That's part of the private machine description for ARM. Other cores have different ways of representing condition codes. R.