https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120120

            Bug ID: 120120
           Summary: gcc-16: performance regression with -O3 compared to
                    gcc-15
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: manuel.lauss at googlemail dot com
  Target Milestone: ---

Created attachment 61325
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=61325&action=edit
example code taking the perf hit at O3

On some code I use, I noticed a large performance regression in gcc-16,
starting at around 21.04.2025.  I've attached sample C code which, according
to perf, accounts for almost all of the processing time.

This happens with "-O3 -march=znver5 -mtune=znver5 -pipe"; at -O2, gcc-15 and
gcc-16 are equally slow.

Perf stats:
gcc-15:
Performance counter stats for './sanplay -2 RAM.SAN':

              6,33 msec task-clock:u                     #    0,949 CPUs utilized
               808      page-faults:u                    #  127,589 K/sec
        85.738.923      instructions:u                   #    3,10  insn per cycle
                                                         #    0,06  stalled cycles per insn
        27.659.116      cycles:u                         #    4,368 GHz
         4.788.925      stalled-cycles-frontend:u        #   17,31% frontend cycles idle
         8.000.727      branches:u                       #    1,263 G/sec
           275.954      branch-misses:u                  #    3,45% of all branches

gcc-16:
 Performance counter stats for './sanplay -2 /home/mano/games/Outlaws/RAM.SAN':

             13,02 msec task-clock:u                     #    0,974 CPUs utilized

       314.392.362      instructions:u                   #    4,97  insn per cycle
                                                         #    0,02  stalled cycles per insn
        63.277.723      cycles:u                         #    4,861 GHz
         5.510.316      stalled-cycles-frontend:u        #    8,71% frontend cycles idle
        53.730.810      branches:u                       #    4,127 G/sec
           305.375      branch-misses:u                  #    0,57% of all branches

The number of instructions executed is about 3.7x higher (314.392.362 vs.
85.738.923); on a larger example file it is up to 4.5x higher.  This is not
Zen 5 specific: it happens on a Haswell as well.  At -O2, gcc-15 and gcc-16
perform identically.
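
As a quick cross-check of the counters above, here is a minimal Python sketch
that recomputes the gcc-16/gcc-15 ratios.  The values are copied from the two
perf runs; the names gcc15, gcc16 and ratios are only illustrative and are not
part of the attached code.

```python
# Counter values copied from the two perf runs above (user-space only).
gcc15 = {"instructions": 85_738_923, "cycles": 27_659_116, "branches": 8_000_727}
gcc16 = {"instructions": 314_392_362, "cycles": 63_277_723, "branches": 53_730_810}


def ratios(old, new):
    """Return new/old for every counter present in both runs."""
    return {name: new[name] / old[name] for name in old if name in new}


r = ratios(gcc15, gcc16)
for name, value in r.items():
    print(f"{name}: {value:.2f}x")
```

On this input the instruction count grows by roughly 3.7x and the branch count
by roughly 6.7x, while cycles grow by only about 2.3x.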

Full source is at https://github.com/mlauss2/sandec
Demo file can be grabbed from
https://samples.mplayerhq.hu/game-formats/la-san/outlaws/ram.san

I'll do a bisection next.

Thanks!
      Manuel
