On Tue, Jan 20, 2015 at 4:57 AM, Ricardo Telichevesky <rica...@teli.org> wrote:
> Hi,
>
>     I have a strange problem with extremely large procedures when generating
> 64-bit code.
> I am using gcc 4.9.2 on RHEL6.3 on a 64-thread, 4-socket Xeon E7 4820 with
> 256GB of memory. No AVX extensions, using the SSE option when building the
> compiler. This particular code is serial. I made measurements with 32- and
> 64-bit builds, both debug (-g) and optimize (-O3), for two different examples
> (this is a circuit simulator and each example is a different circuit that
> uses different transistors).
>
>     Example A is the one where the effect is most acute. At the bottom of
> this e-mail I list the 3 procedures that consume 90% of the execution time:
>
> a) As a counter-example, the factor code listed is 300 lines of heavily
> optimized, hand-written C++ that behaves as expected: 64-bit optimize is way
> faster than any other build, up to 15x faster than 32-bit debug (btw great
> job on the compiler, it is really shining here).
>
> b) evalTran has 18000 lines of auto-generated code and behaves very
> counter-intuitively: 64-bit optimize code is 3x slower than 32-bit optimize
> code.
>
> c) evalTranRhs has 5000 lines and is even worse: 64-bit is 4x slower than
> 32-bit. Notice that all the data structures in 32-bit code and 64-bit code
> are identical and most variables are identical - in fact all integers used
> are 64-bit, and most operations are floating-point ops. Initially I thought
> the 64-bit code was a lot bigger than the 32-bit code and the cache was
> overwhelmed. In fact the difference in code sizes is not even 10% (at least
> in debug - notice I calculated the size of each procedure in bytes), so my
> thrash-the-I-cache conjecture seems to be wrong. The overall execution time
> is causing us a lot of problems - 64-bit optimize takes 16 seconds, even
> more than 32-bit debug (10 seconds) and 32-bit optimize (4.8 seconds).
> Considering we only care about 64-bit optimize, we have a big problem here.
>
>     Example B is not so bad, and in fact the 64-bit code is slightly faster
> than the 32-bit code. It would be nice if it went even faster, but if I got A
> to behave like that I'd be pretty happy already.
>
>     I tried to look at the wide array of optimization options for the code;
> it is a dizzying task and I could not get any kind of intuition beyond
> -O3 ... Would you have any suggestions for the proper flags for those
> ridiculously large auto-generated codes that might alleviate this
> 32-bit vs 64-bit problem? Do you think the fact that this code is in a
> dynamically linked library (-fPIC) plays a role?

It's hard to tell without a testcase, but GCC has various limits on the
code sizes its passes will deal with, so you might trip one of these, which
would effectively disable optimizations.  For example, loop dependence
analysis has a limit on the number of memory references it considers
(--param loop-max-datarefs-for-datadeps, default 1000).  Note that not
all such limits are controlled by --params.  We have
-Wdisabled-optimization, which should warn if you run into any such
case (but the warning is unfortunately not correctly implemented by
all passes having such limits).

Thanks,
Richard.
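
The two knobs named in the reply can be combined on one command line. The sketch below is not from the thread: it only assembles and prints such a command so it can be inspected before use. The source file name `evalTran.cpp`, the `g++` driver, and the raised limit of 20000 are illustrative assumptions; the remaining flags are the ones listed in the measurements further down.

```python
import shlex


def build_compile_command(source, max_datarefs=1000, compiler="g++"):
    """Assemble a GCC command line that enables -Wdisabled-optimization
    and sets the loop dependence analysis data-reference limit.

    `source`, `compiler` and any non-default `max_datarefs` are
    illustrative assumptions, not values taken from the thread.
    """
    return [
        compiler,
        "-O3", "-fPIC", "-msse2", "-Wall",
        # Ask GCC to report passes that gave up because of a size limit.
        "-Wdisabled-optimization",
        # Limit on memory references considered by loop dependence analysis.
        "--param", f"loop-max-datarefs-for-datadeps={max_datarefs}",
        "-c", source,
    ]


if __name__ == "__main__":
    # Hypothetical file name and limit, chosen only for demonstration.
    command = build_compile_command("evalTran.cpp", max_datarefs=20000)
    print(shlex.join(command))
```

Whether raising this particular limit helps is something only a rebuild and a new timing run can show; as noted above, not every limit is exposed as a --param.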

>     Thanks very much for your help,
>     Ricardo
>
>
> All times are wall clock in microseconds - the main was checked against the
> reported UNIX time and is exact.
>
> example  A
> ==========
> evalTran has 18000 lines of C code   (two for loops around 99% of the code)
> evalTranRhs has 5000 lines of C code (two for loops around 99% of the code)
>
> 32 bit debug -g -m32 -fPIC -Wall -Winvalid-pch -msse2
> %time  elapsed(us)  #calls  per call(us)  timer name @DN@
> -----  -----------  ------  ------------  --------------
> 2.503  254536       8335    30            numerical TRAN factor
> 56.01  5695065      8335    683           evalTran    bytes=231791
> 35.41  3600646      13924   258           evalTranRhs bytes=57501
> 100    10168242     1       10168242      main @DT@
>
> 32 bit optimize -O3 -m32 -fPIC -Wall -Winvalid-pch -msse2
> %time  elapsed(us)  #calls  per call(us)  timer name @DN@
> -----  -----------  ------  ------------  --------------
> 0.710  34442        8335    4             numerical TRAN factor
> 43.06  2087757      8335    250           evalTran
> 43.49  2108786      13925   151           evalTranRhs
> 100    4848520      1       4848520       main @DT@
>
>
> 64 bit debug -g -fPIC -Wall -Winvalid-pch -msse2
> %time  elapsed(us)  #calls  per call(us)  timer name @DN@
> -----  -----------  ------  ------------  --------------
> 0.973  205144       8335    24            numerical TRAN factor
> 46.43  9785920      8335    1174          evalTran    bytes=252741
> 49.72  10478888     13924   752           evalTranRhs bytes=58442
> 100    21077659     1       21077659      main @DT@
>
>
> 64 bit optimize -O3 -fPIC -Wall -Winvalid-pch -msse2
> %time  elapsed(us)  #calls  per call(us)  timer name @DN@
> -----  -----------  ------  ------------  --------------
> 0.147  23819        8335    2             numerical TRAN factor
> 39.26  6360254      8335    763           evalTran
> 57.28  9279087      13924   666           evalTranRhs
> 100    16198762     1       16198762      main @DT@
>
>
>
>
>
> example B
> =========
> evalTran has 10000 lines of C code   (two for loops around 99% of the code)
> evalTranRhs has 2500 lines of C code (two for loops around 99% of the code)
>
> 32-bit debug -g -fPIC -Wall -Winvalid-pch -msse2
> %time  elapsed(us)  #calls  per call(us)  timer name @DN@
> -----  -----------  ------  ------------  --------------
> 6.55   989826       46612   21            numerical TRAN factor
> 63.17  9546694      46612   204           evalTran    bytes=141478
> 22.36  3379311      47626   70            evalTranRhs bytes=35871
> 100    15112540     1       15112540      main @DT@
>
> 32-bit optimize -O3 -fPIC -Wall -Winvalid-pch -msse2
> %time  elapsed(us)  #calls  per call(us)  timer name @DN@
> -----  -----------  ------  ------------  --------------
> 3.012  157060       46612   3             numerical TRAN factor
> 50.42  2629251      46612   56            evalTran
> 34.18  1782641      47626   37            evalTranRhs
> 100    5214827      1       5214827       main @DT@
>
>
> 64-bit debug -g -fPIC -Wall -Winvalid-pch -msse2
> %time  elapsed(us)  #calls  per call(us)  timer name @DN@
> -----  -----------  ------  ------------  --------------
> 6.439  837743       46612   17            numerical TRAN factor
> 63.02  8199007      46612   175           evalTran    bytes=154542
> 22.21  2889893      47626   60            evalTranRhs bytes=36487
> 100    13011058     1       13011058      main @DT@
>
>
> 64-bit optimize -O3 -fPIC -Wall -Winvalid-pch -msse2
> %time  elapsed(us)  #calls  per call(us)  timer name @DN@
> -----  -----------  ------  ------------  --------------
> 2.389  103855       46612   2             numerical TRAN factor
> 53.52  2326715      46612   49            evalTran
> 33.1   1438995      47626   30            evalTranRhs
> 100    4347691      1       4347691       main @DT@
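
The "3x" and "4x" slowdowns and the "not even 10%" code-size difference quoted in the message can be checked directly against the tables. The following is a minimal sketch that was not part of the thread: the numbers are copied from the -O3 elapsed-time columns and the debug `bytes=` annotations above, while the dictionary layout and function names are illustrative choices.

```python
# Elapsed wall-clock times in microseconds, copied from the -O3 tables above.
ELAPSED_US = {
    "A": {
        "evalTran":    {"32": 2087757, "64": 6360254},
        "evalTranRhs": {"32": 2108786, "64": 9279087},
    },
    "B": {
        "evalTran":    {"32": 2629251, "64": 2326715},
        "evalTranRhs": {"32": 1782641, "64": 1438995},
    },
}

# Procedure sizes in bytes for example A, copied from the debug tables above.
CODE_BYTES_A = {
    "evalTran":    {"32": 231791, "64": 252741},
    "evalTranRhs": {"32": 57501,  "64": 58442},
}


def ratio_64_over_32(pair):
    """Return the 64-bit value divided by the 32-bit value."""
    return pair["64"] / pair["32"]


def ratios(table):
    """Map each procedure name to its 64-bit / 32-bit ratio."""
    return {name: ratio_64_over_32(pair) for name, pair in table.items()}


if __name__ == "__main__":
    for example, table in ELAPSED_US.items():
        for name, r in ratios(table).items():
            # r > 1 means the 64-bit build is slower than the 32-bit build.
            print(f"example {example} {name}: 64-bit/32-bit time = {r:.2f}")
    for name, r in ratios(CODE_BYTES_A).items():
        print(f"example A {name}: code size growth = {(r - 1) * 100:.1f}%")
```

A time ratio above 1 means the 64-bit build is slower; for example B both ratios come out below 1, in line with the remark that 64-bit is slightly faster there.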
