Hi, [actual question is in the last paragraph of this post]
I recently stumbled over an old (closed) bug in the bug tracker where the solution was to remove ASMCSE from O3, this bug report was later referred to as "the final drop in the bucket" to not longer have this optimization enabled by default. I have googled the mailing lists and searched the bug tracker but I could not find much discussion about this topic, the few postings I found suggested that the consensus seems to be that this particular optimizer has become totally unmaintainable and after reading this I fear that it will eventually be removed completely. IMHO this would be a big loss. I have some code that when enabling -OoASMCSE will run 20..30% faster. I tried to reproduce it with a smaller demo program (attached proect1.lpr), the effect only becomes visible when using much more local variables than there are registers. The attached program does not do anything useful, its only purpose is to provoke a lot of redundant register reloading. The loop looks like this: for I := 0 to ITER do begin // use much more local variables than there are registers J := I and $0F; K := (I+1) and $0F; L := (I+2) and $0F; M := J xor $0F; N := K xor $0F; O := L xor $0F; // and now provoke a lot of redundant register reloads U := A[J] or A[K] or A[L] or A[M]; V := B[K] or B[L] or B[M] or B[N]; W := C[L] or C[M] or C[N] or C[O]; end; U,V,W and the arrays are QWords, (and I am using it on i386) this is because my real world application also does a lot of bitwise operations on QWords that are all looked up in different arrays and some of the functions are quite long and have expressions even more complicated than above (where the array indices themselves are looked up in other arrays). The above code (just one line of it) without ASMCSE would compile like this: U := A[J] or A[K] or A[L] or A[M]; 0x80480f8 mov eax,DWORD PTR [ebp-0x3c] // <------- could have used edi instead... 0x80480fb mov esi,DWORD PTR [eax*8+0x80e2ca0] 0x8048102 mov edi,DWORD PTR [ebp-0x3c] // <------- ...then this would be redundant! 0x8048105 mov eax,DWORD PTR [edi*8+0x80e2ca4] 0x804810c mov edi,DWORD PTR [ebp-0x38] 0x804810f or esi,DWORD PTR [edi*8+0x80e2ca0] 0x8048116 mov edi,DWORD PTR [ebp-0x38] // <------- redundant! 0x8048119 or eax,DWORD PTR [edi*8+0x80e2ca4] 0x8048120 mov edi,DWORD PTR [ebp-0x34] 0x8048123 or esi,DWORD PTR [edi*8+0x80e2ca0] 0x804812a mov edi,DWORD PTR [ebp-0x34] // <------- redundant! 0x804812d or eax,DWORD PTR [edi*8+0x80e2ca4] 0x8048134 mov edi,DWORD PTR [ebp-0x30] 0x8048137 or esi,DWORD PTR [edi*8+0x80e2ca0] 0x804813e mov edi,DWORD PTR [ebp-0x30] // <------- redundant! 0x8048141 or eax,DWORD PTR [edi*8+0x80e2ca4] 0x8048148 mov DWORD PTR [ebp-0x24],esi 0x804814b mov DWORD PTR [ebp-0x20],eax same code with ASMCSE: U := A[J] or A[K] or A[L] or A[M]; 0x80482c8 mov eax,DWORD PTR [ebp-0x3c] 0x80482cb mov esi,DWORD PTR [eax*8+0x80e2ca0] 0x80482d2 mov edi,DWORD PTR [eax*8+0x80e2ca4] 0x80482d9 mov eax,DWORD PTR [ebp-0x38] 0x80482dc or esi,DWORD PTR [eax*8+0x80e2ca0] 0x80482e3 or edi,DWORD PTR [eax*8+0x80e2ca4] 0x80482ea mov eax,DWORD PTR [ebp-0x34] 0x80482ed or esi,DWORD PTR [eax*8+0x80e2ca0] 0x80482f4 or edi,DWORD PTR [eax*8+0x80e2ca4] 0x80482fb mov eax,DWORD PTR [ebp-0x30] 0x80482fe or esi,DWORD PTR [eax*8+0x80e2ca0] 0x8048305 or edi,DWORD PTR [eax*8+0x80e2ca4] 0x804830c mov DWORD PTR [ebp-0x24],esi 0x804830f mov DWORD PTR [ebp-0x20],edi The difference in execution time is extreme (see the attached program): bernd@t40:~/Desktop/asmcse\ $ ./test 3682 2813 This is milliseconds for the above code 100 millions of times, first is without, second with asmcse. Without ASMCSE the attached program takes 30% more time! CPU is one of these old Pentium-M, 1300MHz that were built into the legendary ThinkPad T40. My question is now: Will ASMCSE eventually be completely removed? And if this happens, will there be a replacement? IMHO this optimization is really needed. Is it really so difficult to make an optimizer that simply (is it simple?) eliminates these redundant register reloads without being overly complicated and unmaintainable? The last question is only because I don't yet know how the code generation and optimization of FPC even works, I wouldn't even know where to start looking for it in the compiler sources. Bernd
project1.lpr
Description: application/wine-extension-lpr
_______________________________________________ fpc-pascal maillist - fpc-pascal@lists.freepascal.org http://lists.freepascal.org/mailman/listinfo/fpc-pascal