------- Comment #24 from lucier at math dot purdue dot edu 2008-01-21 22:43 -------
Subject: Re: [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code

On Jan 21, 2008, at 2:21 PM, ubizjak at gmail dot com wrote:

> It is not possible to create an executable from direct.i.

That's correct, sorry.

> Could you attach the source that can be used to create the executable?

Here are instructions on how to build and test a modified version of Gambit, from which I derived direct.i.

Download the file

http://www.math.purdue.edu/~lucier/gcc/test-files/bugzilla/33928/gambc-v4_1_2.tgz

Build it with the following commands:

> tar zxf gambc-v4_1_2.tgz
> cd gambc-v4_1_2
> ./configure CC='/pkgs/gcc-mainline/bin/gcc -save-temps'
> make -j

If you want to recompile the source after reconfiguring, do

> make mostlyclean

not 'make clean', unfortunately.

Then test it with

> gsi/gsi -e '(define a (time (expt 3 10000000)))(define b (time (* a a)))'

The output ends with something like

> (time (##bignum.make (##fixnum.quotient result-length (##fixnum.quotient ##bignum.adigit-width ##bignum.fdigit-width)) #f #f))
>     4 ms real time
>     5 ms cpu time (3 user, 2 system)
>     no collections
>     3962448 bytes allocated
>     968 minor faults
>     no major faults
> (time (##make-f64vector (##fixnum.* two^n 2)))
>     5 ms real time
>     5 ms cpu time (1 user, 4 system)
>     1 collection accounting for 5 ms real time (1 user, 4 system)
>     33554464 bytes allocated
>     59 minor faults
>     no major faults
> (time (make-w (##fixnum.- log-two^n 1)))
>     30 ms real time
>     31 ms cpu time (17 user, 14 system)
>     no collections
>     16810144 bytes allocated
>     4097 minor faults
>     no major faults
> (time (make-w-rac log-two^n))
>     28 ms real time
>     28 ms cpu time (16 user, 12 system)
>     no collections
>     16826272 bytes allocated
>     4097 minor faults
>     no major faults
> (time (bignum->f64vector-rac x a))
>     45 ms real time
>     45 ms cpu time (20 user, 25 system)
>     no collections
>     -16 bytes allocated
>     8192 minor faults
>     no major faults
> (time (componentwise-rac-multiply a rac-table))
>     26 ms real time
>     26 ms cpu time (26 user, 0 system)
>     no collections
>     -16 bytes allocated
>     no minor faults
>     no major faults
> (time (direct-fft-recursive-4 a table))
>     445 ms real time
>     445 ms cpu time (445 user, 0 system)
>     no collections
>     64 bytes allocated
>     no minor faults
>     no major faults
> (time (componentwise-complex-multiply a a))
>     24 ms real time
>     24 ms cpu time (24 user, 0 system)
>     no collections
>     -16 bytes allocated
>     no minor faults
>     no major faults
> (time (inverse-fft-recursive-4 a table))
>     418 ms real time
>     418 ms cpu time (418 user, 0 system)
>     no collections
>     64 bytes allocated
>     no minor faults
>     no major faults
> (time (componentwise-rac-multiply-conjugate a rac-table))
>     26 ms real time
>     26 ms cpu time (26 user, 0 system)
>     no collections
>     -16 bytes allocated
>     no minor faults
>     no major faults
> (time (bignum<-f64vector-rac a result result-length))
>     108 ms real time
>     108 ms cpu time (108 user, 0 system)
>     no collections
>     112 bytes allocated
>     no minor faults
>     no major faults
> (time (* a a))
>     1170 ms real time
>     1170 ms cpu time (1105 user, 65 system)
>     1 collection accounting for 5 ms real time (1 user, 4 system)
>     71266896 bytes allocated
>     17413 minor faults
>     no major faults

The time for the routine in direct.i is the time reported for direct-fft-recursive-4:

> (time (direct-fft-recursive-4 a table))
>     445 ms real time
>     445 ms cpu time (445 user, 0 system)
>     no collections
>     64 bytes allocated
>     no minor faults
>     no major faults

The name of the routine in the .i and .s files is ___H_direct_2d_fft_2d_recursive_2d_4.

By the way, ___H_inverse_2d_fft_2d_recursive_2d_4 is a similar routine implementing the inverse fft, which, for some reason, goes faster than the direct (forward) fft.

Brad

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
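
When comparing builds made with different compilers, it may help to reduce the timing reports quoted in this comment to one CPU time per routine. The following Python sketch is not part of Gambit or of the original report: the function name `cpu_times` is hypothetical, and the parser assumes only the report layout shown in the output here (a `(time (...))` header followed by a line containing `ms cpu time`).

```python
import re

# Header of one report, e.g. "(time (direct-fft-recursive-4 a table))".
HEADER = re.compile(r"\(time \((\S+)")
# CPU-time line, e.g. "445 ms cpu time (445 user, 0 system)".
CPU = re.compile(r"(\d+) ms cpu time")


def cpu_times(output):
    """Return {routine_name: cpu_ms} for each (time ...) report in output.

    Hypothetical helper: assumes the report layout quoted in this comment.
    """
    times = {}
    current = None
    for raw in output.splitlines():
        # Drop any leading "> " quoting and indentation.
        line = raw.lstrip("> ").strip()
        header = HEADER.match(line)
        if header:
            # The first symbol after "(time (" names the timed routine.
            current = header.group(1).rstrip(")")
            continue
        cpu = CPU.search(line)
        if cpu and current is not None:
            times[current] = int(cpu.group(1))
            current = None
    return times


# Figures copied from the output quoted in this comment.
sample = """\
> (time (direct-fft-recursive-4 a table))
>     445 ms real time
>     445 ms cpu time (445 user, 0 system)
> (time (inverse-fft-recursive-4 a table))
>     418 ms real time
>     418 ms cpu time (418 user, 0 system)
"""

times = cpu_times(sample)
print(times["direct-fft-recursive-4"], times["inverse-fft-recursive-4"])
```

Running the same function on the output of two builds (for example one compiled with 4.2.2 and one with 4.3.0) gives two dictionaries that can be compared routine by routine.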