:
:Thanks Matt for picking up on the linker problem.  Patching the kernel
:would, to me, be masking the real problem.
:
:What other "improvements" does gcc333 have over gcc295 that might
:explain why its linked products run in a half-fast mode (take twice+
:as long)?
:
:JT

    I do not see a 50% loss in performance in my tests, but the GCC3 on
    DragonFly is a later snapshot (gcc-3.3-20040126).  Generally speaking,
    GCC3 does a better job at -O2 than GCC2 when I optimize for my Athlon64
    (-O2 and -O3 give the same results on GCC3 in my tests).

These tests were run on an Athlon 64 3200+, on a DragonFly system of course
(which has both gcc2 and gcc3 in the base system):

                GCC2    GCC2    GCC2    GCC3    GCC3    GCC3    GCC3
                -O      -O2     -O2/k6  -O      -O2     -O2     -O2
                                                        athlon  athlon
                                                                stackbndry=5

MFLOPS(1)       1111    1071    1068     794     926     862     861
MFLOPS(2)        832     818     810     789     825     855     857
MFLOPS(3)       1131    1121    1105    1021    1134    1208    1208
MFLOPS(4)       1306    1356    1350    1156    1346    1460    1456

    Tuned for the Athlon, GCC3 only loses in MFLOPS(1).
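
    To put a number on the gap, here is a minimal sketch of the arithmetic
    in C.  The function names are made up for illustration and are not part
    of the benchmark; the figures passed in would come from the table above.

```c
#include <assert.h>
#include <stdio.h>

/* Percentage change from a baseline score to a new score.
 * Positive means the new score is higher (faster). */
double pct_change(double base, double score)
{
    assert(base > 0.0);
    return (score - base) / base * 100.0;
}

/* Print one comparison row: baseline, new score, signed change. */
void report(const char *name, double gcc2, double gcc3)
{
    printf("%-10s %6.0f -> %6.0f  (%+.1f%%)\n",
           name, gcc2, gcc3, pct_change(gcc2, gcc3));
}
```

    Comparing the GCC2 -O2 column with the GCC3 -O2 athlon column,
    report("MFLOPS(1)", 1071, 862) works out to roughly a 19.5% loss and
    report("MFLOPS(4)", 1356, 1460) to roughly a 7.7% gain, which is a long
    way from a 50% slowdown.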

    When I looked at the assembly generated for MFLOPS(1) between GCC2 and
    GCC3 two things stand out:

        * GCC2 does a few extra stack-relative memory ops and they are
          spread out more.  GCC3 does fewer memory ops and they are 
          concentrated at the beginning and the end of the loop code.

        * GCC2 uses fld %st(x) to shift the FP stack around, while 
          GCC3 uses fxch %st(x) to shift the FP stack around.

    Since we know FP operations are stack-alignment-sensitive, I can see
    how a stack misalignment can result in terrible performance.  What is
    less certain is whether (FP aligned) accesses to *different* data-cache
    lines affect performance, and that is something that GCC does not seem
    to optimize.
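
    For anyone who wants to check alignment directly, a rough sketch
    follows.  It assumes the stackbndry=5 column refers to GCC's
    -mpreferred-stack-boundary=5, where the value is a power-of-two
    exponent (2^5 = 32 bytes); the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes in a stack boundary given as a power-of-two exponent,
 * e.g. exponent 5 -> 32 bytes. */
size_t boundary_bytes(unsigned exponent)
{
    assert(exponent < sizeof(size_t) * 8);
    return (size_t)1 << exponent;
}

/* Nonzero if p is a multiple of align (align must be a power of two). */
int is_aligned(const volatile void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Nonzero if a local double in this frame sits on a sizeof(double)
 * boundary.  The answer depends on the compiler, flags and caller. */
int local_double_aligned(void)
{
    volatile double d = 0.0;
    return is_aligned(&d, sizeof(double));
}
```

    Calling local_double_aligned() from a few different call depths, with
    and without the stack-boundary option, would show whether the compiler
    is actually keeping FP locals aligned.  It says nothing about the
    cache-line question.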

    My guess, at least with regard to MFLOPS(1), for which GCC3 generates
    consistently worse results on my machine, is that FXCH (exchange FP
    reg with top of FP stack) is not as well optimized in hardware as
    FLD (load to top of FP stack), at least on my Athlon 64.
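
    One way to test that guess is to feed both compilers the same small FP
    loop and compare the assembly.  The kernel below is only a stand-in
    that keeps several FP values live at once; it is NOT the MFLOPS(1)
    code, and the function names are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) by Horner's rule. */
double horner(const double *c, size_t n, double x)
{
    double acc = 0.0;
    size_t i;

    assert(c != NULL && n > 0);
    for (i = n; i-- > 0; )
        acc = acc * x + c[i];
    return acc;
}

/* Midpoint-rule integral of the polynomial over [0,1]. */
double integrate01(const double *c, size_t n, long steps)
{
    double h, sum = 0.0;
    long i;

    assert(steps > 0);
    h = 1.0 / (double)steps;
    for (i = 0; i < steps; i++)
        sum += horner(c, n, ((double)i + 0.5) * h);
    return sum * h;
}
```

    Compiling this with each compiler using -O2 -S and counting the fld
    and fxch instructions in the loop body should show whether the same
    pattern appears.  On a 64-bit target GCC uses SSE for FP by default,
    so a 32-bit build or -mfpmath=387 would likely be needed to see x87
    code at all.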

    This also points to the fact that both Intel and AMD have done major
    reoptimizations of their floating point instruction set in nearly
    every processor release they've ever done.  The performance loss you are
    seeing on your machine could very well turn into a performance gain on
    a different CPU.  On a DELL-2550 I get this:

                DELL2550 2xPentiumIII @ 1.1GHz  

                GCC2    GCC3    GCC3    GCC3
                -O3     -O3     -O3     -O3
-march=         (nil)   (nil)   p3      ppro

MFLOPS(1)       380     290     283     283
MFLOPS(2)       302     293     291     291
MFLOPS(3)       454     459     462     463
MFLOPS(4)       563     581     593     593

    My guess is that GCC3 introduced a bit of pessimization when they
    started over-using FXCH and that the MFLOPS(1) code just happens to
    hit that case due to the huge number of FXCHs it uses.  It's probably
    stalling the instruction pipeline in a few more places.

                                                -Matt


_______________________________________________
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers