The question: How to interpret scheduling info with the compiler listed below.
Specifically, a tight loop that was reported to be scheduled in 23 cycles (as I understand it) actually executes in a little over 2 cycles per loop, as I interpret two separate experiments.

Am I misinterpreting something here?

Thanks.

Brad

The compiler:

[MacBook-Pro:~/programs/gambit/gambit-feeley] lucier% gcc-13 -v
Using built-in specs.
COLLECT_GCC=gcc-13
COLLECT_LTO_WRAPPER=/opt/homebrew/Cellar/gcc/13.2.0/bin/../libexec/gcc/aarch64-apple-darwin23/13/lto-wrapper
Target: aarch64-apple-darwin23
Configured with: ../configure --prefix=/opt/homebrew/opt/gcc --libdir=/opt/homebrew/opt/gcc/lib/gcc/current --disable-nls --enable-checking=release --with-gcc-major-version-only --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-13 --with-gmp=/opt/homebrew/opt/gmp --with-mpfr=/opt/homebrew/opt/mpfr --with-mpc=/opt/homebrew/opt/libmpc --with-isl=/opt/homebrew/opt/isl --with-zstd=/opt/homebrew/opt/zstd --with-pkgversion='Homebrew GCC 13.2.0' --with-bugurl=https://github.com/Homebrew/homebrew-core/issues --with-system-zlib --build=aarch64-apple-darwin23 --with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk --with-ld=/Library/Developer/CommandLineTools/usr/bin/ld-classic
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 13.2.0 (Homebrew GCC 13.2.0)

(so perhaps not the standard gcc).

The command line (cut down a bit) is

gcc-13 -save-temps -fverbose-asm -fdump-rtl-sched2 -O1 \
    -fexpensive-optimizations -fno-gcse -Wno-unused -Wno-write-strings \
    -Wdisabled-optimization -fwrapv -fno-strict-aliasing \
    -fno-trapping-math -fno-math-errno -fschedule-insns2 \
    -foptimize-sibling-calls -fomit-frame-pointer -fipa-ra \
    -fmove-loop-invariants -march=native -fPIC -fno-common \
    -I"../include" -c -o _num.o -I. _num.c -D___LIBRARY

The scheduling report for the loop is

;;   ======================================================
;;   -- basic block 10 from 39 to 70 -- after reload
;;   ======================================================

;;   0--> b  0: i  39 x4=x2+x7                       :cortex_a53_slot_any
;;   0--> b  0: i  46 x1=zxn([sxn(x2)*0x4+x8])       :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
;;   3--> b  0: i  45 x9=zxn([sxn(x4)*0x4+x3])       :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
;;   7--> b  0: i  47 x1=zxn(x6)*zxn(x1)+x9          :(cortex_a53_slot_any+cortex_a53_imul)
;;   9--> b  0: i  48 x1=x1+x5                       :cortex_a53_slot_any
;;   9--> b  0: i  53 x5=x12+x2                      :cortex_a53_slot_any
;;  10--> b  0: i  50 [sxn(x4)*0x4+x3]=x1            :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store
;;  10--> b  0: i  57 x4=x2+0x1                      :cortex_a53_slot_any
;;  11--> b  0: i  67 x2=x2+0x2                      :cortex_a53_slot_any
;;  12--> b  0: i  60 x9=zxn([sxn(x5)*0x4+x3])       :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
;;  13--> b  0: i  61 x4=zxn([sxn(x4)*0x4+x8])       :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
;;  17--> b  0: i  62 x4=zxn(x6)*zxn(x4)+x9          :(cortex_a53_slot_any+cortex_a53_imul)
;;  20--> b  0: i  63 x1=x1 0>>0x20+x4               :cortex_a53_slot_any
;;  20--> b  0: i  65 [sxn(x5)*0x4+x3]=x1            :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store
;;  22--> b  0: i  66 x5=x1 0>>0x20                  :cortex_a53_slot_any
;;  22--> b  0: i  69 cc=cmp(x11,x2)                 :cortex_a53_slot_any
;;  23--> b  0: i  70 pc={(cc>0)?L68:pc}             :(cortex_a53_slot_any+cortex_a53_branch)
;;      Ready list (final):
;;   total time = 23
;;   new head = 39
;;   new tail = 70
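
For anyone who wants to tabulate a report like the one above, here is a minimal sketch in Python. It is not part of the build. It assumes each instruction line has the layout ";; <cycle>--> b <n>: i <uid> <pattern> :<reservation>", as in this dump; the helper name parse_sched_dump is hypothetical, and only a few lines of the report are copied in as sample data. It extracts the cycle at which the scheduler's model issues each instruction, which is how I read the "23".

```python
import re
from collections import Counter

# A few lines copied from the sched2 report (sample data only).
DUMP = """\
;;   0--> b  0: i  39 x4=x2+x7 :cortex_a53_slot_any
;;   0--> b  0: i  46 x1=zxn([sxn(x2)*0x4+x8]) :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
;;   3--> b  0: i  45 x9=zxn([sxn(x4)*0x4+x3]) :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
;;   7--> b  0: i  47 x1=zxn(x6)*zxn(x1)+x9 :(cortex_a53_slot_any+cortex_a53_imul)
;;  23--> b  0: i  70 pc={(cc>0)?L68:pc} :(cortex_a53_slot_any+cortex_a53_branch)
"""

# Assumed layout: cycle, block index, insn uid, pattern, then the
# reservation string introduced by whitespace followed by a colon.
LINE = re.compile(r"^;;\s*(\d+)--> b\s*(\d+): i\s*(\d+)\s+(.*?)\s+:(\S+)\s*$")


def parse_sched_dump(text):
    """Return (cycle, uid, pattern, reservation) for each insn line."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            cycle, _block, uid, pattern, units = m.groups()
            rows.append((int(cycle), int(uid), pattern, units))
    return rows


rows = parse_sched_dump(DUMP)
last_cycle = max(cycle for cycle, *_ in rows)
per_cycle = Counter(cycle for cycle, *_ in rows)

print("insns parsed:", len(rows))
print("last issue cycle in the model:", last_cycle)
print("insns issued per model cycle:", dict(sorted(per_cycle.items())))
```

These numbers describe only the pipeline model named in the reservation column (cortex_a53_* here); they are not a measurement on the machine the code runs on.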