On Tue, Jan 20, 2015 at 4:57 AM, Ricardo Telichevesky <rica...@teli.org> wrote:
> Hi,
>
> I have a strange problem with extremely large procedures when generating
> 64-bit code.
> I am using gcc 4.9.2 on RHEL6.3 on a 64-thread 4-socket Xeon E7 4820 with
> 256GB of memory. No avx extensions, using sse option when building the
> compiler. This particular code is serial. I made measurements with 32- and
> 64-bit, both debug -g and optimize -O3, for two different examples (this is
> a circuit simulator and each example is a different circuit that uses
> different transistors).
>
> Example A is the one where the effect is more acute. I listed at the bottom
> of the e-mail the 3 procedures that consume 90% of the execution time:
>
> a) As a counter-example, the factor code listed is heavily optimized
> hand-written C++ code, 300 lines long, that behaves as expected: 64-bit
> optimize is way faster than any other, up to 15x faster than 32-bit debug
> (btw great job in the compiler, it is really shining here).
>
> b) evalTran has 18000 lines of auto-generated code and behaves very
> counter-intuitively: 64-bit optimize code is 3x slower than 32-bit optimize
> code.
>
> c) evalTranRhs has 5000 lines and is even worse: 64-bit is 4x slower than
> 32-bit.
>
> Notice that all the data structures in 32-bit code and 64-bit code are
> identical and most variables are identical - in fact all integers used are
> 64-bit, and most operations are floating-point ops. Initially I thought the
> 64-bit code was a lot bigger than 32-bit code and the cache was overwhelmed.
> In fact the difference in code sizes is not even 10% (at least in debug -
> notice I calculated the size of each procedure in bytes), so my
> thrash-the-I-cache conjecture seems to be wrong. The overall execution time
> is causing us a lot of problems - 64-bit optimize takes 16 seconds, even more
> than 32-bit debug at 10 seconds and 32-bit optimize at 4.8 seconds.
> Considering we only care about 64-bit optimize, we have a big problem here.
>
> Example B is not so bad, and in fact 64-bit code is slightly faster than
> 32-bit code. It would be nice if it went even faster, but if I got A to
> behave like that I'd be pretty happy already.
>
> I tried to look at the wide array of optimization options for the code; it
> is a dizzying task and I could not get any kind of intuition beyond
> -O3 ... Would you have any suggestions for the proper flags for those
> ridiculously large auto-generated codes that might be able to alleviate this
> 32-bit vs 64-bit problem? Do you think the fact that this code is in a
> dynamically linked library (-fPIC) plays a role?

It's hard to tell without a testcase, but GCC has various limits on the
code size its passes will deal with, so you might be tripping one of these,
which would effectively disable optimizations. For example, loop dependence
analysis has a limit on the number of memory references it considers
(--param loop-max-datarefs-for-datadeps, default 1000). Note that not all
such limits are controlled by --params.

We have -Wdisabled-optimization, which should warn if you run into any such
case (but unfortunately not all passes that have such limits implement the
warning correctly).

Thanks,
Richard.

> Thanks very much for your help,
> Ricardo
>
>
> All times are wall clock in micro-seconds - the main was checked against the
> reported UNIX time and is exact.
>
> example A
> ==========
> evalTran has 18000 lines of C code (two for loops around 99% of the code)
> evalTranRhs has 5000 lines of C code (two for loops around 99% of the code)
>
> 32 bit debug -g -m32 -fPIC -Wall -Winvalid-pch -msse2
> %time  elapsed(us)  #calls  per call(us)  timer name @DN@
> -----  -----------  ------  ------------  --------------
> 2.503       254536    8335            30  numerical TRAN factor
> 56.01      5695065    8335           683  evalTran bytes=231791
> 35.41      3600646   13924           258  evalTranRhs bytes=57501
> 100       10168242       1      10168242  main @DT@
>
> 32 bit optimize -O3 -m32 -fPIC -Wall -Winvalid-pch -msse2
> %time  elapsed(us)  #calls  per call(us)  timer name @DN@
> -----  -----------  ------  ------------  --------------
> 0.710        34442    8335             4  numerical TRAN factor
> 43.06      2087757    8335           250  evalTran
> 43.49      2108786   13925           151  evalTranRhs
> 100        4848520       1       4848520  main @DT@
>
>
> 64 bit debug -g -fPIC -Wall -Winvalid-pch -msse2
> %time  elapsed(us)  #calls  per call(us)  timer name @DN@
> -----  -----------  ------  ------------  --------------
> 0.973       205144    8335            24  numerical TRAN factor
> 46.43      9785920    8335          1174  evalTran bytes=252741
> 49.72     10478888   13924           752  evalTranRhs bytes=58442
> 100       21077659       1      21077659  main @DT@
>
>
> 64 bit optimize -O3 -fPIC -Wall -Winvalid-pch -msse2
> %time  elapsed(us)  #calls  per call(us)  timer name @DN@
> -----  -----------  ------  ------------  --------------
> 0.147        23819    8335             2  numerical TRAN factor
> 39.26      6360254    8335           763  evalTran
> 57.28      9279087   13924           666  evalTranRhs
> 100       16198762       1      16198762  main @DT@
>
>
> example B
> =========
> evalTran has 10000 lines of C code (two for loops around 99% of the code)
> evalTranRhs has 2500 lines of C code (two for loops around 99% of the code)
>
> 32-bit debug -g -fPIC -Wall -Winvalid-pch -msse2
> %time  elapsed(us)  #calls  per call(us)  timer name @DN@
> -----  -----------  ------  ------------  --------------
> 6.55        989826   46612            21  numerical TRAN factor
> 63.17      9546694   46612           204  evalTran bytes=141478
> 22.36      3379311   47626            70  evalTranRhs bytes=35871
> 100       15112540       1      15112540  main @DT@
>
> 32-bit optimize -O3 -fPIC -Wall -Winvalid-pch -msse2
> %time  elapsed(us)  #calls  per call(us)  timer name @DN@
> -----  -----------  ------  ------------  --------------
> 3.012       157060   46612             3  numerical TRAN factor
> 50.42      2629251   46612            56  evalTran
> 34.18      1782641   47626            37  evalTranRhs
> 100        5214827       1       5214827  main @DT@
>
>
> 64-bit debug -g -fPIC -Wall -Winvalid-pch -msse2
> %time  elapsed(us)  #calls  per call(us)  timer name @DN@
> -----  -----------  ------  ------------  --------------
> 6.439       837743   46612            17  numerical TRAN factor
> 63.02      8199007   46612           175  evalTran bytes=154542
> 22.21      2889893   47626            60  evalTranRhs bytes=36487
> 100       13011058       1      13011058  main @DT@
>
>
> 64-bit optimize -O3 -fPIC -Wall -Winvalid-pch -msse2
> %time  elapsed(us)  #calls  per call(us)  timer name @DN@
> -----  -----------  ------  ------------  --------------
> 2.389       103855   46612             2  numerical TRAN factor
> 53.52      2326715   46612            49  evalTran
> 33.1       1438995   47626            30  evalTranRhs
> 100        4347691       1       4347691  main @DT@
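
As a sanity check on the example A figures quoted above, the 64-bit vs 32-bit slowdown and the debug code-size growth can be recomputed from the per-call times and byte counts in the tables. The following is only a minimal sketch in Python: the numbers are copied from the tables, while the dictionary and function names are illustrative and do not come from the original post.

```python
# Per-call times (microseconds) for the -O3 builds of example A,
# copied from the timing tables in the quoted message.
per_call_us = {
    "numerical TRAN factor": {"32-bit": 4, "64-bit": 2},
    "evalTran": {"32-bit": 250, "64-bit": 763},
    "evalTranRhs": {"32-bit": 151, "64-bit": 666},
}

# Procedure sizes in bytes, reported only for the debug (-g) builds.
debug_bytes = {
    "evalTran": {"32-bit": 231791, "64-bit": 252741},
    "evalTranRhs": {"32-bit": 57501, "64-bit": 58442},
}


def slowdown(name):
    """Return 64-bit per-call time divided by 32-bit per-call time (-O3)."""
    times = per_call_us[name]
    return times["64-bit"] / times["32-bit"]


def size_growth_percent(name):
    """Return how much larger the 64-bit debug code is, in percent."""
    sizes = debug_bytes[name]
    return 100.0 * (sizes["64-bit"] - sizes["32-bit"]) / sizes["32-bit"]


if __name__ == "__main__":
    for name in per_call_us:
        print(f"{name}: 64-bit/32-bit per-call ratio = {slowdown(name):.2f}")
    for name in debug_bytes:
        print(f"{name}: 64-bit debug code is {size_growth_percent(name):.2f}% larger")
```

With the table values, the ratios come out at roughly 3x for evalTran and a little over 4x for evalTranRhs, while the debug code grows by less than 10% in both procedures, which matches the claims in the quoted message.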