Hi,

    I have a strange problem with extremely large procedures when generating 
64-bit code.
I am using gcc 4.9.2 on RHEL 6.3 on a 64-thread, 4-socket Xeon E7 4820 with 256 GB of memory. There are no AVX extensions; I used the sse option when building the compiler. This particular code is serial. I made measurements for 32-bit and 64-bit builds, both debug (-g) and optimized (-O3), for two different examples (this is a circuit simulator and each example is a different circuit that uses different transistors).

    Example A is the one where the effect is more acute. At the bottom of 
the e-mail I listed the 3 procedures that consume 90% of the execution time:

a) As a counter-example, the factor code listed is 300 lines of heavily 
optimized, hand-written C++ that behaves as expected: 64-bit optimize is 
way faster than any other build, up to 15x faster than 32-bit debug (btw great 
job in the compiler, it is really shining here).

b) evalTran has 18000 lines of auto-generated code and behaves very 
counter-intuitively: 64-bit optimize code is 3x slower than 32-bit optimize code.

c) evalTranRhs has 5000 lines of auto-generated code and is even worse: 64-bit optimize is 4x slower than 32-bit optimize.

Notice that all the data structures in the 32-bit and 64-bit code are identical and most variables are identical; in fact all integers used are 64-bit, and most operations are floating-point ops. Initially I thought the 64-bit code was a lot bigger than the 32-bit code and the cache was overwhelmed. In fact the difference in code size is not even 10% (at least for the debug builds, where I calculated the size of each procedure in bytes), so my thrash-the-I-cache conjecture seems to be wrong. The overall execution time is causing us a lot of problems: 64-bit optimize takes 16 seconds, even more than 32-bit debug (10 seconds) and 32-bit optimize (4.8 seconds). Considering we only care about 64-bit optimize, we have a big problem here.

    Example B is not so bad; in fact the 64-bit code is slightly faster than 
the 32-bit code. It would be nice if it went even faster, but if I got A to 
behave like that I'd be pretty happy already.

I tried to look at the wide array of optimization options for the code, but it is a dizzying task and I could not get any kind of intuition beyond -O3. Would you have any suggestions for the proper flags for these ridiculously large auto-generated procedures that might alleviate this 32-bit vs 64-bit problem? Do you think the fact that this code is in a dynamically linked library (-fPIC) plays a role?

    Thanks very much for your help,
    Ricardo


All times are wall clock in microseconds; the time for main was checked against 
the reported UNIX time and is exact.

example A
=========
evalTran has 18000 lines of C code   (two for loops around 99% of the code)
evalTranRhs has 5000 lines of C code (two for loops around 99% of the code)

32 bit debug -g -m32 -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
2.503       254536    8335            30  numerical TRAN factor
56.01      5695065    8335           683  evalTran    bytes=231791
35.41      3600646   13924           258  evalTranRhs bytes=57501
100       10168242       1      10168242  main @DT@

32 bit optimize -O3 -m32 -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
0.710        34442    8335             4  numerical TRAN factor
43.06      2087757    8335           250  evalTran
43.49      2108786   13925           151  evalTranRhs
100        4848520       1       4848520  main @DT@


64 bit debug -g -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
0.973       205144    8335            24  numerical TRAN factor
46.43      9785920    8335          1174  evalTran    bytes=252741
49.72     10478888   13924           752  evalTranRhs bytes=58442
100       21077659       1      21077659  main @DT@


64 bit optimize -O3 -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
0.147        23819    8335             2  numerical TRAN factor
39.26      6360254    8335           763  evalTran
57.28      9279087   13924           666  evalTranRhs
100       16198762       1      16198762  main @DT@





example B
=========
evalTran has 10000 lines of C code   (two for loops around 99% of the code)
evalTranRhs has 2500 lines of C code (two for loops around 99% of the code)

32-bit debug -g -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
6.55        989826   46612            21  numerical TRAN factor
63.17      9546694   46612           204  evalTran    bytes=141478
22.36      3379311   47626            70  evalTranRhs bytes=35871
100       15112540       1      15112540  main @DT@

32-bit optimize -O3 -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
3.012       157060   46612             3  numerical TRAN factor
50.42      2629251   46612            56  evalTran
34.18      1782641   47626            37  evalTranRhs
100        5214827       1       5214827  main @DT@


64-bit debug -g -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
6.439       837743   46612            17  numerical TRAN factor
63.02      8199007   46612           175  evalTran    bytes=154542
22.21      2889893   47626            60  evalTranRhs bytes=36487
100       13011058       1      13011058  main @DT@


64-bit optimize -O3 -fPIC -Wall -Winvalid-pch -msse2
%time  elapsed(us)  #calls  per call(us)  timer name @DN@
-----  -----------  ------  ------------  --------------
2.389       103855   46612             2  numerical TRAN factor
53.52      2326715   46612            49  evalTran
33.1       1438995   47626            30  evalTranRhs
100        4347691       1       4347691  main @DT@
