Re: Question about information from -fdump-rtl-sched2 on M1 Max

Iain Sandoe Tue, 30 Apr 2024 00:34:21 -0700

Hi,

> On 30 Apr 2024, at 00:39, Andrew Pinski via Gcc <gcc@gcc.gnu.org> wrote:
> 
> On Mon, Apr 29, 2024 at 4:26 PM Lucier, Bradley J via Gcc
> <gcc@gcc.gnu.org> wrote:
>> 
>> The question: How to interpret scheduling info with the compiler listed 
>> below.
>> 
>> Specifically, a tight loop that was reported to be scheduled in 23 cycles 
>> (as I understand it) actually executes in a little over 2 cycles per loop, 
>> as I interpret two separate experiments.
>> 
>> Am I misinterpreting something here?
> 
> Yes, the scheduling model in use here is the cortex-a53 one ...
> as evidenced by "cortex_a53_slot_" in the dump.
> Most aarch64 cores don't have a scheduling model associated with them,
> especially cores whose support has not been upstreamed
> directly by the company that produces them.


Indeed, the branches in use are not yet upstreamed ... but ...

> The default scheduling model is cortex-a53 anyway. And you used neither
> -mtune= nor -mcpu=, only -march=native, which just changes the arch
> features and not the tuning or scheduler model.

… 
1) 14.1 and 13.3 will have support for -mcpu=apple-m1, apple-m2 and apple-m3

2) Those branches will also, hopefully, have better choices for tuning and 
scheduling within the available models (I got some advice from Tamar, thanks!).

3) Andrew is correct: we have not really had much information from the vendor 
about the scheduling, although the latest data now does include some.  
Unfortunately, this is a topic I’ve not yet got into, so it’s going to take me a 
while, and probably lots of advice, to do something specific for Mx cores.

4) I did not think -mcpu=native was working in 13.2 ... but ICBW (anyway, it is 
present in 14.1 and backported to (darwin) 13.3, 12.4 and 11.5).  You should 
not have long to wait for 14.1 ...

thanks
Iain

> 
> Thanks,
> Andrew Pinski
> 
>> 
>> Thanks.
>> 
>> Brad
>> 
>> The compiler:
>> 
>> [MacBook-Pro:~/programs/gambit/gambit-feeley] lucier% gcc-13 -v
>> Using built-in specs.
>> COLLECT_GCC=gcc-13
>> COLLECT_LTO_WRAPPER=/opt/homebrew/Cellar/gcc/13.2.0/bin/../libexec/gcc/aarch64-apple-darwin23/13/lto-wrapper
>> Target: aarch64-apple-darwin23
>> Configured with: ../configure --prefix=/opt/homebrew/opt/gcc 
>> --libdir=/opt/homebrew/opt/gcc/lib/gcc/current --disable-nls 
>> --enable-checking=release --with-gcc-major-version-only 
>> --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-13 
>> --with-gmp=/opt/homebrew/opt/gmp --with-mpfr=/opt/homebrew/opt/mpfr 
>> --with-mpc=/opt/homebrew/opt/libmpc --with-isl=/opt/homebrew/opt/isl 
>> --with-zstd=/opt/homebrew/opt/zstd --with-pkgversion='Homebrew GCC 13.2.0' 
>> --with-bugurl=https://github.com/Homebrew/homebrew-core/issues 
>> --with-system-zlib --build=aarch64-apple-darwin23 
>> --with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk 
>> --with-ld=/Library/Developer/CommandLineTools/usr/bin/ld-classic
>> Thread model: posix
>> Supported LTO compression algorithms: zlib zstd
>> gcc version 13.2.0 (Homebrew GCC 13.2.0)
>> 
>> (so perhaps not the standard gcc).
>> 
>> The command line (cut down a bit) is
>> 
>> gcc-13 -save-temps -fverbose-asm -fdump-rtl-sched2 -O1 
>> -fexpensive-optimizations -fno-gcse -Wno-unused -Wno-write-strings 
>> -Wdisabled-optimization -fwrapv -fno-strict-aliasing -fno-trapping-math 
>> -fno-math-errno -fschedule-insns2 -foptimize-sibling-calls 
>> -fomit-frame-pointer -fipa-ra -fmove-loop-invariants -march=native -fPIC 
>> -fno-common   -I"../include" -c -o _num.o -I. _num.c -D___LIBRARY
>> 
>> The scheduling report for the loop is
>> 
>> ;;   ======================================================
>> ;;   -- basic block 10 from 39 to 70 -- after reload
>> ;;   ======================================================
>> 
>> ;;        0--> b  0: i  39 x4=x2+x7                                :cortex_a53_slot_any
>> ;;        0--> b  0: i  46 x1=zxn([sxn(x2)*0x4+x8])                :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
>> ;;        3--> b  0: i  45 x9=zxn([sxn(x4)*0x4+x3])                :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
>> ;;        7--> b  0: i  47 x1=zxn(x6)*zxn(x1)+x9                   :(cortex_a53_slot_any+cortex_a53_imul)
>> ;;        9--> b  0: i  48 x1=x1+x5                                :cortex_a53_slot_any
>> ;;        9--> b  0: i  53 x5=x12+x2                               :cortex_a53_slot_any
>> ;;       10--> b  0: i  50 [sxn(x4)*0x4+x3]=x1                     :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store
>> ;;       10--> b  0: i  57 x4=x2+0x1                               :cortex_a53_slot_any
>> ;;       11--> b  0: i  67 x2=x2+0x2                               :cortex_a53_slot_any
>> ;;       12--> b  0: i  60 x9=zxn([sxn(x5)*0x4+x3])                :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
>> ;;       13--> b  0: i  61 x4=zxn([sxn(x4)*0x4+x8])                :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
>> ;;       17--> b  0: i  62 x4=zxn(x6)*zxn(x4)+x9                   :(cortex_a53_slot_any+cortex_a53_imul)
>> ;;       20--> b  0: i  63 x1=x1 0>>0x20+x4                        :cortex_a53_slot_any
>> ;;       20--> b  0: i  65 [sxn(x5)*0x4+x3]=x1                     :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store
>> ;;       22--> b  0: i  66 x5=x1 0>>0x20                           :cortex_a53_slot_any
>> ;;       22--> b  0: i  69 cc=cmp(x11,x2)                          :cortex_a53_slot_any
>> ;;       23--> b  0: i  70 pc={(cc>0)?L68:pc}                      :(cortex_a53_slot_any+cortex_a53_branch)
>> ;;      Ready list (final):
>> ;;   total time = 23
>> ;;   new head = 39
>> ;;   new tail = 70
>>
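
The check Andrew describes (spotting "cortex_a53_slot_" in the dump) can be done mechanically. What follows is only a rough sketch, not part of GCC: the function name summarize_sched_dump and the SAMPLE text are made up for illustration, and it assumes that the text before "_slot_" in a reservation name identifies the scheduling model, which holds for the cortex-a53 names above but may not hold for other models.

```python
import re
import sys
from collections import Counter

# A few lines in the style of the dump quoted above (shortened for illustration).
SAMPLE = """\
;;        0--> b  0: i  39 x4=x2+x7    :cortex_a53_slot_any
;;        0--> b  0: i  46 x1=zxn([sxn(x2)*0x4+x8])    :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
;;       23--> b  0: i  70 pc={(cc>0)?L68:pc}    :(cortex_a53_slot_any+cortex_a53_branch)
;;   total time = 23
"""


def summarize_sched_dump(text):
    """Return (Counter of model-name prefixes, list of 'total time' values)."""
    # Reservation names look like "cortex_a53_slot_any"; the part before
    # "_slot_" is taken (by assumption) as the scheduling model's name.
    models = Counter(re.findall(r"\b(\w+?)_slot_", text))
    # Each scheduled block is followed by a line like ";;   total time = 23".
    totals = [int(n) for n in re.findall(r"total time = (\d+)", text)]
    return models, totals


if __name__ == "__main__":
    # Read a dump file if one is given on the command line, else use SAMPLE.
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as f:
            text = f.read()
    else:
        text = SAMPLE
    models, totals = summarize_sched_dump(text)
    print("models seen:", dict(models))
    print("block total times:", totals)
```

Run against a sched2 dump file (passed as the first argument), this lists the model prefixes it finds; if cortex_a53 is the only one on an M1, the "total time" figures are estimates for a Cortex-A53 pipeline and not for the machine the code actually runs on.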
