The question: how should I interpret the scheduling info reported by the compiler listed below?

Specifically, a tight loop that the sched2 dump reports as scheduled in 23
cycles (as I understand the "total time = 23" line) actually executes in a
little over 2 cycles per iteration, judging from two separate experiments.

Am I misinterpreting something here?

Thanks.

Brad

The compiler:

[MacBook-Pro:~/programs/gambit/gambit-feeley] lucier% gcc-13 -v
Using built-in specs.
COLLECT_GCC=gcc-13
COLLECT_LTO_WRAPPER=/opt/homebrew/Cellar/gcc/13.2.0/bin/../libexec/gcc/aarch64-apple-darwin23/13/lto-wrapper
Target: aarch64-apple-darwin23
Configured with: ../configure --prefix=/opt/homebrew/opt/gcc 
--libdir=/opt/homebrew/opt/gcc/lib/gcc/current --disable-nls 
--enable-checking=release --with-gcc-major-version-only 
--enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-13 
--with-gmp=/opt/homebrew/opt/gmp --with-mpfr=/opt/homebrew/opt/mpfr 
--with-mpc=/opt/homebrew/opt/libmpc --with-isl=/opt/homebrew/opt/isl 
--with-zstd=/opt/homebrew/opt/zstd --with-pkgversion='Homebrew GCC 13.2.0' 
--with-bugurl=https://github.com/Homebrew/homebrew-core/issues 
--with-system-zlib --build=aarch64-apple-darwin23 
--with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk 
--with-ld=/Library/Developer/CommandLineTools/usr/bin/ld-classic
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 13.2.0 (Homebrew GCC 13.2.0) 

(so perhaps not the standard gcc).  

The command line (cut down a bit) is

gcc-13 -save-temps -fverbose-asm -fdump-rtl-sched2 -O1 
-fexpensive-optimizations -fno-gcse -Wno-unused -Wno-write-strings 
-Wdisabled-optimization -fwrapv -fno-strict-aliasing -fno-trapping-math 
-fno-math-errno -fschedule-insns2 -foptimize-sibling-calls -fomit-frame-pointer 
-fipa-ra -fmove-loop-invariants -march=native -fPIC -fno-common   
-I"../include" -c -o _num.o -I. _num.c -D___LIBRARY

The scheduling report for the loop is

;;   ======================================================
;;   -- basic block 10 from 39 to 70 -- after reload
;;   ======================================================

;;        0--> b  0: i  39 x4=x2+x7                                :cortex_a53_slot_any
;;        0--> b  0: i  46 x1=zxn([sxn(x2)*0x4+x8])                :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
;;        3--> b  0: i  45 x9=zxn([sxn(x4)*0x4+x3])                :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
;;        7--> b  0: i  47 x1=zxn(x6)*zxn(x1)+x9                   :(cortex_a53_slot_any+cortex_a53_imul)
;;        9--> b  0: i  48 x1=x1+x5                                :cortex_a53_slot_any
;;        9--> b  0: i  53 x5=x12+x2                               :cortex_a53_slot_any
;;       10--> b  0: i  50 [sxn(x4)*0x4+x3]=x1                     :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store
;;       10--> b  0: i  57 x4=x2+0x1                               :cortex_a53_slot_any
;;       11--> b  0: i  67 x2=x2+0x2                               :cortex_a53_slot_any
;;       12--> b  0: i  60 x9=zxn([sxn(x5)*0x4+x3])                :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
;;       13--> b  0: i  61 x4=zxn([sxn(x4)*0x4+x8])                :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
;;       17--> b  0: i  62 x4=zxn(x6)*zxn(x4)+x9                   :(cortex_a53_slot_any+cortex_a53_imul)
;;       20--> b  0: i  63 x1=x1 0>>0x20+x4                        :cortex_a53_slot_any
;;       20--> b  0: i  65 [sxn(x5)*0x4+x3]=x1                     :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store
;;       22--> b  0: i  66 x5=x1 0>>0x20                           :cortex_a53_slot_any
;;       22--> b  0: i  69 cc=cmp(x11,x2)                          :cortex_a53_slot_any
;;       23--> b  0: i  70 pc={(cc>0)?L68:pc}                      :(cortex_a53_slot_any+cortex_a53_branch)
;;      Ready list (final):  
;;   total time = 23
;;   new head = 39
;;   new tail = 70
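
To make my reading of the report explicit, here is a minimal Python sketch that tabulates it. It assumes (my interpretation, not something the dump states) that the number before "-->" is the cycle at which the scheduler's pipeline model issues the insn, and that "total time" is the modelled length of the block. The names parse_sched_dump, ISSUE, TOTAL and DUMP are made up for this illustration, and DUMP holds only an excerpt of the report above.

```python
import re

# Excerpt of the sched2 report from this post (first and last insns only).
# One entry is left wrapped, as mail clients tend to do, to show that the
# continuation line carrying the reservation is simply skipped.
DUMP = """\
;;        0--> b  0: i  39 x4=x2+x7                                :cortex_a53_slot_any
;;        0--> b  0: i  46 x1=zxn([sxn(x2)*0x4+x8])
:(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
;;        7--> b  0: i  47 x1=zxn(x6)*zxn(x1)+x9                   :(cortex_a53_slot_any+cortex_a53_imul)
;;       22--> b  0: i  69 cc=cmp(x11,x2)                          :cortex_a53_slot_any
;;       23--> b  0: i  70 pc={(cc>0)?L68:pc}                      :(cortex_a53_slot_any+cortex_a53_branch)
;;   total time = 23
"""

# ";; <cycle>--> b <block>: i <insn uid> <insn text>", optionally followed by
# ":<reservation>" on the same line.
ISSUE = re.compile(r"^;;\s+(\d+)-->\s+b\s+\d+:\s+i\s+(\d+)\s+(.*)$")
TOTAL = re.compile(r"^;;\s+total time = (\d+)")


def parse_sched_dump(text):
    """Return ([(cycle, insn_uid, insn_text), ...], total_time or None)."""
    issues, total = [], None
    for line in text.splitlines():
        m = ISSUE.match(line)
        if m:
            # Drop the reservation column when it sits on the same line.
            insn = re.split(r"\s+:", m.group(3), maxsplit=1)[0].strip()
            issues.append((int(m.group(1)), int(m.group(2)), insn))
            continue
        m = TOTAL.match(line)
        if m:
            total = int(m.group(1))
    return issues, total


if __name__ == "__main__":
    issues, total = parse_sched_dump(DUMP)
    for cycle, uid, insn in issues:
        print(f"cycle {cycle:2d}: insn {uid:3d}  {insn}")
    print("last issue cycle:", issues[-1][0], "reported total:", total)
```

On the report above, the last insn (the branch, insn 70) issues at cycle 23, which matches the "total time = 23" line. That is the figure I am comparing against the measured time.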
