Hi,
I have a strange problem with extremely large procedures when generating
64-bit code.
I am using gcc 4.9.2 on RHEL 6.3 on a 64-thread, 4-socket Xeon E7 4820 with
256GB of memory. There are no AVX extensions, and I used the sse option when
building the compiler. This particular code is serial. I took measurements in
32-bit and 64-bit mode, both debug (-g) and optimized (-O3), for two different
examples (this is a circuit simulator, and each example is a different circuit
that uses different transistors).
Example A is the one where the effect is most acute. At the bottom of
the e-mail I list the 3 procedures that consume 90% of the execution time:
a) As a counter-example, the factor code listed is about 300 lines of heavily
optimized, hand-written C++ that behaves as expected: 64-bit optimize is
way faster than any other build, up to 15x faster than 32-bit debug (btw great
job on the compiler, it is really shining here).
b) evalTran has 18000 lines of auto-generated code and behaves very
counter-intuitively: 64-bit optimize code is 3x slower than 32-bit optimize code.
c) evalTranRhs has 5000 lines and is even worse: 64-bit is 4x slower than 32-bit.
Notice that all the data structures in the 32-bit code and the 64-bit code are
identical, and most variables are identical - in fact all integers used are
64-bit, and most operations are floating-point ops. Initially I thought the
64-bit code was a lot bigger than the 32-bit code and the cache was
overwhelmed. In fact the difference in code sizes is not even 10% (at least in
the debug builds - note that I calculated the size of each procedure in
bytes), so my thrash-the-I-cache conjecture seems to be wrong. The overall
execution time is causing us a lot of problems - 64-bit optimize takes
16 seconds, even more than 32-bit debug (10 seconds) and 32-bit optimize
(4.8 seconds). Considering we only care about 64-bit optimize, we have a big
problem here.
Example B is not so bad; in fact the 64-bit code is slightly faster than the
32-bit code. It would be nice if it went even faster, but if I got A to behave
like that I'd be pretty happy already.
I tried to look at the wide array of optimization options for the code. It is
a dizzying task, and I could not get any kind of intuition beyond -O3.
Would you have any suggestions for the proper flags for these ridiculously
large auto-generated procedures that might alleviate this 32-bit vs 64-bit
problem? Do you think the fact that this code is in a dynamically linked
library (-fPIC) plays a role?
Thanks very much for your help,
Ricardo
All times are wall-clock times in microseconds - the total for main was checked
against the time reported by UNIX and is exact.
example A
==========
evalTran has 18000 lines of C code (two for loops around 99% of the code)
evalTranRhs has 5000 lines of C code (two for loops around 99% of the code)
32 bit debug -g -m32 -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
2.503       254536    8335            30  numerical TRAN factor
56.01      5695065    8335           683  evalTran bytes=231791
35.41      3600646   13924           258  evalTranRhs bytes=57501
100       10168242       1      10168242  main @DT@
32 bit optimize -O3 -m32 -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
0.710        34442    8335             4  numerical TRAN factor
43.06      2087757    8335           250  evalTran
43.49      2108786   13925           151  evalTranRhs
100        4848520       1       4848520  main @DT@
64 bit debug -g -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
0.973       205144    8335            24  numerical TRAN factor
46.43      9785920    8335          1174  evalTran bytes=252741
49.72     10478888   13924           752  evalTranRhs bytes=58442
100       21077659       1      21077659  main @DT@
64 bit optimize -O3 -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
0.147        23819    8335             2  numerical TRAN factor
39.26      6360254    8335           763  evalTran
57.28      9279087   13924           666  evalTranRhs
100       16198762       1      16198762  main @DT@
example B
=========
evalTran has 10000 lines of C code (two for loops around 99% of the code)
evalTranRhs has 2500 lines of C code (two for loops around 99% of the code)
32-bit debug -g -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
6.55        989826   46612            21  numerical TRAN factor
63.17      9546694   46612           204  evalTran bytes=141478
22.36      3379311   47626            70  evalTranRhs bytes=35871
100       15112540       1      15112540  main @DT@
32-bit optimize -O3 -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
3.012       157060   46612             3  numerical TRAN factor
50.42      2629251   46612            56  evalTran
34.18      1782641   47626            37  evalTranRhs
100        5214827       1       5214827  main @DT@
64-bit debug -g -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
6.439       837743   46612            17  numerical TRAN factor
63.02      8199007   46612           175  evalTran bytes=154542
22.21      2889893   47626            60  evalTranRhs bytes=36487
100       13011058       1      13011058  main @DT@
64-bit optimize -O3 -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
2.389       103855   46612             2  numerical TRAN factor
53.52      2326715   46612            49  evalTran
33.1       1438995   47626            30  evalTranRhs
100        4347691       1       4347691  main @DT@