[fpc-pascal] -OoASMCSE

Bernd Sat, 31 Dec 2011 06:13:17 -0800

Hi,

[actual question is in the last paragraph of this post]


I recently stumbled over an old (closed) bug in the bug tracker where
the solution was to remove ASMCSE from O3. This bug report was later
referred to as "the final drop in the bucket" for no longer having this
optimization enabled by default.

I have googled the mailing lists and searched the bug tracker but I
could not find much discussion about this topic. The few postings I
found suggest that the consensus is that this particular optimizer
has become totally unmaintainable, and after reading this I fear that
it will eventually be removed completely. IMHO this would be a big
loss.

I have some code that runs 20..30% faster when -OoASMCSE is enabled.
I tried to reproduce it with a smaller demo program (attached
project1.lpr); the effect only becomes visible when using many more
local variables than there are registers. The attached program does
not do anything useful; its only purpose is to provoke a lot of
redundant register reloading. The loop looks like this:

  for I := 0 to ITER do begin
    // use many more local variables than there are registers
    J := I and $0F;
    K := (I+1) and $0F;
    L := (I+2) and $0F;
    M := J xor $0F;
    N := K xor $0F;
    O := L xor $0F;

    // and now provoke a lot of redundant register reloads
    U := A[J] or A[K] or A[L] or A[M];
    V := B[K] or B[L] or B[M] or B[N];
    W := C[L] or C[M] or C[N] or C[O];
  end;

U, V, W and the arrays are QWords (and I am compiling for i386). This
is because my real world application also does a lot of bitwise
operations on QWords that are all looked up in different arrays. Some
of the functions are quite long and have expressions even more
complicated than the ones above (the array indices themselves are
looked up in other arrays).

The above code (just one line of it) without ASMCSE would compile like this:

    U := A[J] or A[K] or A[L] or A[M];
0x80480f8 mov    eax,DWORD PTR [ebp-0x3c]   // <------- could have used edi instead...
0x80480fb mov    esi,DWORD PTR [eax*8+0x80e2ca0]
0x8048102 mov    edi,DWORD PTR [ebp-0x3c]   // <------- ...then this would be redundant!
0x8048105 mov    eax,DWORD PTR [edi*8+0x80e2ca4]
0x804810c mov    edi,DWORD PTR [ebp-0x38]
0x804810f or     esi,DWORD PTR [edi*8+0x80e2ca0]
0x8048116 mov    edi,DWORD PTR [ebp-0x38]   // <------- redundant!
0x8048119 or     eax,DWORD PTR [edi*8+0x80e2ca4]
0x8048120 mov    edi,DWORD PTR [ebp-0x34]
0x8048123 or     esi,DWORD PTR [edi*8+0x80e2ca0]
0x804812a mov    edi,DWORD PTR [ebp-0x34]   // <------- redundant!
0x804812d or     eax,DWORD PTR [edi*8+0x80e2ca4]
0x8048134 mov    edi,DWORD PTR [ebp-0x30]
0x8048137 or     esi,DWORD PTR [edi*8+0x80e2ca0]
0x804813e mov    edi,DWORD PTR [ebp-0x30]   // <------- redundant!
0x8048141 or     eax,DWORD PTR [edi*8+0x80e2ca4]
0x8048148 mov    DWORD PTR [ebp-0x24],esi
0x804814b mov    DWORD PTR [ebp-0x20],eax

The same code with ASMCSE:

    U := A[J] or A[K] or A[L] or A[M];
0x80482c8 mov    eax,DWORD PTR [ebp-0x3c]
0x80482cb mov    esi,DWORD PTR [eax*8+0x80e2ca0]
0x80482d2 mov    edi,DWORD PTR [eax*8+0x80e2ca4]
0x80482d9 mov    eax,DWORD PTR [ebp-0x38]
0x80482dc or     esi,DWORD PTR [eax*8+0x80e2ca0]
0x80482e3 or     edi,DWORD PTR [eax*8+0x80e2ca4]
0x80482ea mov    eax,DWORD PTR [ebp-0x34]
0x80482ed or     esi,DWORD PTR [eax*8+0x80e2ca0]
0x80482f4 or     edi,DWORD PTR [eax*8+0x80e2ca4]
0x80482fb mov    eax,DWORD PTR [ebp-0x30]
0x80482fe or     esi,DWORD PTR [eax*8+0x80e2ca0]
0x8048305 or     edi,DWORD PTR [eax*8+0x80e2ca4]
0x804830c mov    DWORD PTR [ebp-0x24],esi
0x804830f mov    DWORD PTR [ebp-0x20],edi

The difference in execution time is extreme (see the attached program):
bernd@t40:~/Desktop/asmcse\ $ ./test
3682
2813

These are the times in milliseconds for running the above code 100
million times; the first is without, the second with ASMCSE. Without
ASMCSE the attached program takes about 30% more time! The CPU is one
of those old 1300 MHz Pentium-M chips that were built into the
legendary ThinkPad T40.
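
For readers who do not have the attachment at hand, a benchmark along
these lines might look roughly as follows. This is only a sketch and
not the attached project1.lpr: the program name, the table contents,
the FillTables and RunLoop helpers, the checksum, and the timing via
DateUtils are all assumptions made for illustration. Only the loop
body and the iteration count are taken from the text above.

```pascal
program asmcse_sketch;  // hypothetical name, not the attached project1.lpr
{$mode objfpc}

uses
  SysUtils, DateUtils;

const
  ITER = 100000000;  // 100 million iterations, as in the quoted timings

var
  // Lookup tables of QWords; the real contents are not shown in the post,
  // so the values filled in below are an assumption.
  A, B, C: array[0..15] of QWord;

procedure FillTables;
var
  X: Integer;
begin
  for X := 0 to 15 do begin
    A[X] := QWord(1) shl X;
    B[X] := QWord(1) shl (X + 16);
    C[X] := QWord(1) shl (X + 32);
  end;
end;

function RunLoop: QWord;
var
  I, J, K, L, M, N, O: Integer;
  U, V, W: QWord;
begin
  U := 0;
  V := 0;
  W := 0;
  for I := 0 to ITER do begin
    // use many more local variables than there are registers
    J := I and $0F;
    K := (I+1) and $0F;
    L := (I+2) and $0F;
    M := J xor $0F;
    N := K xor $0F;
    O := L xor $0F;

    // each QWord lookup needs two 32-bit loads using the same index
    U := A[J] or A[K] or A[L] or A[M];
    V := B[K] or B[L] or B[M] or B[N];
    W := C[L] or C[M] or C[N] or C[O];
  end;
  // return something derived from the results so the work is observable
  Result := U xor V xor W;
end;

var
  Start: TDateTime;
  Check: QWord;
begin
  FillTables;
  Start := Now;
  Check := RunLoop;
  WriteLn(MilliSecondsBetween(Now, Start), ' ms, checksum ', Check);
end.
```

Building this once with and once without -OoASMCSE on an i386 target
and comparing the printed times should show whether the effect
reproduces on a given compiler version and CPU; the numbers will not
necessarily match the ones quoted above.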

My question is now: Will ASMCSE eventually be completely removed? And
if this happens, will there be a replacement? IMHO this optimization
is really needed. Is it really so difficult to make an optimizer that
simply (is it simple?) eliminates these redundant register reloads
without being overly complicated and unmaintainable? The last question
is only because I don't yet know how the code generation and
optimization of FPC even works, I wouldn't even know where to start
looking for it in the compiler sources.

Bernd

project1.lpr
Description: application/wine-extension-lpr

_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-pascal
