Re: Question about information from -fdump-rtl-sched2 on M1 Max

Iain Sandoe Tue, 30 Apr 2024 00:37:00 -0700

> On 30 Apr 2024, at 08:33, Iain Sandoe <i...@sandoe.co.uk> wrote:
> 
> Hi,
> 
>> On 30 Apr 2024, at 00:39, Andrew Pinski via Gcc <gcc@gcc.gnu.org> wrote:
>> 
>> On Mon, Apr 29, 2024 at 4:26 PM Lucier, Bradley J via Gcc
>> <gcc@gcc.gnu.org> wrote:
>>> 
>>> The question: How to interpret scheduling info with the compiler listed 
>>> below.
>>> 
>>> Specifically, a tight loop that was reported to be scheduled in 23 cycles 
>>> (as I understand it) actually executes in a little over 2 cycles per loop, 
>>> as I interpret two separate experiments.
>>> 
>>> Am I misinterpreting something here?
>> 
>> Yes, the scheduling model in use here is the cortex-a53 one,
>> as evidenced by "cortex_a53_slot_" in the dump.
>> Most aarch64 cores don't have a scheduling model associated with them,
>> especially cores whose support has not been upstreamed
>> directly by the company that produces them.
> 
> indeed the branches in use are not yet upstreamed .. but ...
> 
>> The default scheduling model is cortex-a53 anyways. And you didn't use
>> -mtune= nor -mcpu=; only -march=native which just changes the arch
>> features and not the tuning or scheduler model.
> 
> … 
> 1) 14.1 and 13.3 will have support for -mcpu=apple-m1,m2,m3

I should have been clearer: the 14.1 and 13.3 Darwin development branches 
will have this support (it is not yet upstream).

> 
> 2) Those branches will also have hopefully better choices for the tuning and 
> scheduling within the available models (I got some advice from Tamar, 
> thanks!).
> 
> 3) Andrew is correct, we have not really had much information from the vendor 
> about the scheduling - although the latest data now does include some.  
> Unfortunately this is a topic I’ve not yet got into so it’s going to take me 
> a while and probably lots of advice to do something specific for Mx cores.
> 
> 4) I did not think -mcpu=native was working in 13.2 … but ICBW (anyway, it 
> is present in 14.1 and backported to (darwin) 13.3, 12.4 and 11.5).  You 
> should not have long to wait for 14.1 ...
> 
> thanks
> Iain
> 
>> 
>> Thanks,
>> Andrew Pinski
>> 
>>> 
>>> Thanks.
>>> 
>>> Brad
>>> 
>>> The compiler:
>>> 
>>> [MacBook-Pro:~/programs/gambit/gambit-feeley] lucier% gcc-13 -v
>>> Using built-in specs.
>>> COLLECT_GCC=gcc-13
>>> COLLECT_LTO_WRAPPER=/opt/homebrew/Cellar/gcc/13.2.0/bin/../libexec/gcc/aarch64-apple-darwin23/13/lto-wrapper
>>> Target: aarch64-apple-darwin23
>>> Configured with: ../configure --prefix=/opt/homebrew/opt/gcc 
>>> --libdir=/opt/homebrew/opt/gcc/lib/gcc/current --disable-nls 
>>> --enable-checking=release --with-gcc-major-version-only 
>>> --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-13 
>>> --with-gmp=/opt/homebrew/opt/gmp --with-mpfr=/opt/homebrew/opt/mpfr 
>>> --with-mpc=/opt/homebrew/opt/libmpc --with-isl=/opt/homebrew/opt/isl 
>>> --with-zstd=/opt/homebrew/opt/zstd --with-pkgversion='Homebrew GCC 13.2.0' 
>>> --with-bugurl=https://github.com/Homebrew/homebrew-core/issues 
>>> --with-system-zlib --build=aarch64-apple-darwin23 
>>> --with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk 
>>> --with-ld=/Library/Developer/CommandLineTools/usr/bin/ld-classic
>>> Thread model: posix
>>> Supported LTO compression algorithms: zlib zstd
>>> gcc version 13.2.0 (Homebrew GCC 13.2.0)
>>> 
>>> (so perhaps not the standard gcc).
>>> 
>>> The command line (cut down a bit) is
>>> 
>>> gcc-13 -save-temps -fverbose-asm -fdump-rtl-sched2 -O1 
>>> -fexpensive-optimizations -fno-gcse -Wno-unused -Wno-write-strings 
>>> -Wdisabled-optimization -fwrapv -fno-strict-aliasing -fno-trapping-math 
>>> -fno-math-errno -fschedule-insns2 -foptimize-sibling-calls 
>>> -fomit-frame-pointer -fipa-ra -fmove-loop-invariants -march=native -fPIC 
>>> -fno-common   -I"../include" -c -o _num.o -I. _num.c -D___LIBRARY
>>> 
>>> The scheduling report for the loop is
>>> 
>>> ;;   ======================================================
>>> ;;   -- basic block 10 from 39 to 70 -- after reload
>>> ;;   ======================================================
>>> 
>>> ;;        0--> b  0: i  39 x4=x2+x7                     :cortex_a53_slot_any
>>> ;;        0--> b  0: i  46 x1=zxn([sxn(x2)*0x4+x8])     :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
>>> ;;        3--> b  0: i  45 x9=zxn([sxn(x4)*0x4+x3])     :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
>>> ;;        7--> b  0: i  47 x1=zxn(x6)*zxn(x1)+x9        :(cortex_a53_slot_any+cortex_a53_imul)
>>> ;;        9--> b  0: i  48 x1=x1+x5                     :cortex_a53_slot_any
>>> ;;        9--> b  0: i  53 x5=x12+x2                    :cortex_a53_slot_any
>>> ;;       10--> b  0: i  50 [sxn(x4)*0x4+x3]=x1          :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store
>>> ;;       10--> b  0: i  57 x4=x2+0x1                    :cortex_a53_slot_any
>>> ;;       11--> b  0: i  67 x2=x2+0x2                    :cortex_a53_slot_any
>>> ;;       12--> b  0: i  60 x9=zxn([sxn(x5)*0x4+x3])     :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
>>> ;;       13--> b  0: i  61 x4=zxn([sxn(x4)*0x4+x8])     :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
>>> ;;       17--> b  0: i  62 x4=zxn(x6)*zxn(x4)+x9        :(cortex_a53_slot_any+cortex_a53_imul)
>>> ;;       20--> b  0: i  63 x1=x1 0>>0x20+x4             :cortex_a53_slot_any
>>> ;;       20--> b  0: i  65 [sxn(x5)*0x4+x3]=x1          :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store
>>> ;;       22--> b  0: i  66 x5=x1 0>>0x20                :cortex_a53_slot_any
>>> ;;       22--> b  0: i  69 cc=cmp(x11,x2)               :cortex_a53_slot_any
>>> ;;       23--> b  0: i  70 pc={(cc>0)?L68:pc}           :(cortex_a53_slot_any+cortex_a53_branch)
>>> ;;      Ready list (final):
>>> ;;   total time = 23
>>> ;;   new head = 39
>>> ;;   new tail = 70
>>> 
>
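
As Andrew notes above, the reservation names in a sched2 dump reveal which scheduling model produced it. The following is a minimal Python sketch for checking that mechanically. It is not part of GCC: the function name summarize_sched_dump, the regular expressions, and the "common prefix" heuristic for guessing the model are all assumptions based only on the dump layout quoted in this thread. It accepts the dump with or without mail quote markers, and with the reservation either on the same line or wrapped onto the next one.

```python
import os
import re
from collections import Counter

# All patterns below are inferred from the dump quoted in this thread;
# they are assumptions, not a documented GCC format.
QUOTE_RE = re.compile(r"^(?:>\s?)+")            # leading mail quote markers
INSN_RE = re.compile(r"^;;\s+(\d+)-->\s+b\s+\d+:\s+i\s+(\d+)\s+(.*)$")
TOTAL_RE = re.compile(r"^;;\s+total time = (\d+)")
RESV_RE = re.compile(r"\s:(\S+)$")              # reservation at end of line


def summarize_sched_dump(text):
    """Return issue cycles, reservation-unit counts, total time and a
    guessed scheduling-model prefix for one basic block of a sched2 dump."""
    insns = []          # (issue cycle, insn uid, pattern text)
    units = Counter()   # reservation unit name -> number of uses
    total = None
    for raw in text.splitlines():
        line = QUOTE_RE.sub("", raw).strip()
        m = INSN_RE.match(line)
        if m:
            rest = m.group(3)
            r = RESV_RE.search(rest)
            if r:  # reservation printed on the same line
                units.update(re.findall(r"\w+", r.group(1)))
                rest = rest[:r.start()]
            insns.append((int(m.group(1)), int(m.group(2)), rest.strip()))
        elif line.startswith(":"):  # reservation wrapped onto its own line
            units.update(re.findall(r"\w+", line[1:]))
        else:
            t = TOTAL_RE.match(line)
            if t:
                total = int(t.group(1))
    # Heuristic: the text shared by every unit name, cut at the last
    # underscore, is taken as the model name (e.g. "cortex_a53").
    prefix = os.path.commonprefix(list(units))
    model = prefix[:prefix.rfind("_")] if "_" in prefix else prefix
    return {"insns": insns, "units": units, "total": total, "model": model}


if __name__ == "__main__":
    import sys
    info = summarize_sched_dump(sys.stdin.read())
    print("guessed model prefix:", info["model"])
    print("total time reported: ", info["total"])
    for cycle, uid, pattern in info["insns"]:
        print(f"  cycle {cycle:3d}  insn {uid:4d}  {pattern}")
```

Feeding it the block above on standard input should report a model prefix derived from the cortex_a53_* unit names. The "total time" it echoes is only the scheduler's estimate under that model, so it says nothing about how the loop behaves on an M1 core.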