> On 30 Apr 2024, at 08:33, Iain Sandoe <i...@sandoe.co.uk> wrote:
>
> Hi,
>
>> On 30 Apr 2024, at 00:39, Andrew Pinski via Gcc <gcc@gcc.gnu.org> wrote:
>>
>> On Mon, Apr 29, 2024 at 4:26 PM Lucier, Bradley J via Gcc
>> <gcc@gcc.gnu.org> wrote:
>>>
>>> The question: How to interpret scheduling info with the compiler listed
>>> below.
>>>
>>> Specifically, a tight loop that was reported to be scheduled in 23 cycles
>>> (as I understand it) actually executes in a little over 2 cycles per loop,
>>> as I interpret two separate experiments.
>>>
>>> Am I misinterpreting something here?
>>
>> Yes, the scheduling model in use here is the cortex-a53 one ...
>> as evidenced by "cortex_a53_slot_" in the dump.
>> Most aarch64 cores don't have a scheduling model associated with them,
>> especially cores that have not been upstreamed
>> directly by the company that produces them.
>
> indeed the branches in use are not yet upstreamed .. but ...
>
>> The default scheduling model is cortex-a53 anyways. And you didn't use
>> -mtune= nor -mcpu=; only -march=native which just changes the arch
>> features and not the tuning or scheduler model.
>
> …
> 1) 14.1 and 13.3 will have support for -mcpu=apple-m1,m2,m3
I should have been clearer: the 14.1 and 13.3 darwin development branches
will have this support (not yet upstream).
>
> 2) Those branches will also have hopefully better choices for the tuning and
> scheduling within the available models (I got some advice from Tamar,
> thanks!).
>
> 3) Andrew is correct, we have not really had much information from the vendor
> about the scheduling - although the latest data now does include some.
> Unfortunately this is a topic I’ve not yet got into so it’s going to take me
> a while and probably lots of advice to do something specific for Mx cores.
>
> 4) I did not think -mcpu=native was working in 13.2 … but ICBW (anyway, it
> is present in 14.1 and backported to (darwin) 13.3, 12.4 and 11.5). You
> should not have long to wait for 14.1 ...
>
> thanks
> Iain
>
>>
>> Thanks,
>> Andrew Pinski
>>
>>>
>>> Thanks.
>>>
>>> Brad
>>>
>>> The compiler:
>>>
>>> [MacBook-Pro:~/programs/gambit/gambit-feeley] lucier% gcc-13 -v
>>> Using built-in specs.
>>> COLLECT_GCC=gcc-13
>>> COLLECT_LTO_WRAPPER=/opt/homebrew/Cellar/gcc/13.2.0/bin/../libexec/gcc/aarch64-apple-darwin23/13/lto-wrapper
>>> Target: aarch64-apple-darwin23
>>> Configured with: ../configure --prefix=/opt/homebrew/opt/gcc
>>> --libdir=/opt/homebrew/opt/gcc/lib/gcc/current --disable-nls
>>> --enable-checking=release --with-gcc-major-version-only
>>> --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-13
>>> --with-gmp=/opt/homebrew/opt/gmp --with-mpfr=/opt/homebrew/opt/mpfr
>>> --with-mpc=/opt/homebrew/opt/libmpc --with-isl=/opt/homebrew/opt/isl
>>> --with-zstd=/opt/homebrew/opt/zstd --with-pkgversion='Homebrew GCC 13.2.0'
>>> --with-bugurl=https://github.com/Homebrew/homebrew-core/issues
>>> --with-system-zlib --build=aarch64-apple-darwin23
>>> --with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk
>>> --with-ld=/Library/Developer/CommandLineTools/usr/bin/ld-classic
>>> Thread model: posix
>>> Supported LTO compression algorithms: zlib zstd
>>> gcc version 13.2.0 (Homebrew GCC 13.2.0)
>>>
>>> (so perhaps not the standard gcc).
>>>
>>> The command line (cut down a bit) is
>>>
>>> gcc-13 -save-temps -fverbose-asm -fdump-rtl-sched2 -O1
>>> -fexpensive-optimizations -fno-gcse -Wno-unused -Wno-write-strings
>>> -Wdisabled-optimization -fwrapv -fno-strict-aliasing -fno-trapping-math
>>> -fno-math-errno -fschedule-insns2 -foptimize-sibling-calls
>>> -fomit-frame-pointer -fipa-ra -fmove-loop-invariants -march=native -fPIC
>>> -fno-common -I"../include" -c -o _num.o -I. _num.c -D___LIBRARY
>>>
>>> The scheduling report for the loop is
>>>
>>> ;; ======================================================
>>> ;; -- basic block 10 from 39 to 70 -- after reload
>>> ;; ======================================================
>>>
>>> ;; 0--> b 0: i 39 x4=x2+x7
>>> :cortex_a53_slot_any
>>> ;; 0--> b 0: i 46 x1=zxn([sxn(x2)*0x4+x8])
>>> :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
>>> ;; 3--> b 0: i 45 x9=zxn([sxn(x4)*0x4+x3])
>>> :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
>>> ;; 7--> b 0: i 47 x1=zxn(x6)*zxn(x1)+x9
>>> :(cortex_a53_slot_any+cortex_a53_imul)
>>> ;; 9--> b 0: i 48 x1=x1+x5
>>> :cortex_a53_slot_any
>>> ;; 9--> b 0: i 53 x5=x12+x2
>>> :cortex_a53_slot_any
>>> ;; 10--> b 0: i 50 [sxn(x4)*0x4+x3]=x1
>>> :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store
>>> ;; 10--> b 0: i 57 x4=x2+0x1
>>> :cortex_a53_slot_any
>>> ;; 11--> b 0: i 67 x2=x2+0x2
>>> :cortex_a53_slot_any
>>> ;; 12--> b 0: i 60 x9=zxn([sxn(x5)*0x4+x3])
>>> :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
>>> ;; 13--> b 0: i 61 x4=zxn([sxn(x4)*0x4+x8])
>>> :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
>>> ;; 17--> b 0: i 62 x4=zxn(x6)*zxn(x4)+x9
>>> :(cortex_a53_slot_any+cortex_a53_imul)
>>> ;; 20--> b 0: i 63 x1=x1 0>>0x20+x4
>>> :cortex_a53_slot_any
>>> ;; 20--> b 0: i 65 [sxn(x5)*0x4+x3]=x1
>>> :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store
>>> ;; 22--> b 0: i 66 x5=x1 0>>0x20
>>> :cortex_a53_slot_any
>>> ;; 22--> b 0: i 69 cc=cmp(x11,x2)
>>> :cortex_a53_slot_any
>>> ;; 23--> b 0: i 70 pc={(cc>0)?L68:pc}
>>> :(cortex_a53_slot_any+cortex_a53_branch)
>>> ;; Ready list (final):
>>> ;; total time = 23
>>> ;; new head = 39
>>> ;; new tail = 70
>>>
>
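
For reading dumps like the one quoted above, here is a minimal sketch (not
part of GCC; the names parse_sched_dump and summarize are made up for
illustration) that pulls the issue cycle of each insn out of a
-fdump-rtl-sched2 block and lists the places where the modelled pipeline
waits more than one cycle. It assumes the line shape seen in this thread
(";; <cycle>--> b <n>: i <uid> <pattern>", with the reservation either on
the same line or wrapped onto the next one by the mailer). Keep in mind that
the cycle numbers describe the cortex-a53 model, an in-order dual-issue
design, so they are not a prediction for an out-of-order core.

```python
import re

# Issue line of a sched2 dump, e.g. ";; 7--> b 0: i 47 x1=zxn(x6)*zxn(x1)+x9"
ISSUE_RE = re.compile(r"^;;\s*(\d+)-->\s*b\s*(\d+):\s*i\s*(\d+)\s+(.*)$")


def parse_sched_dump(text):
    """Return a list of (cycle, insn_uid, pattern, reservation) tuples."""
    insns = []
    for raw in text.splitlines():
        # Drop mail quoting (">>> ") and surrounding whitespace, if any.
        line = raw.lstrip("> \t").rstrip()
        m = ISSUE_RE.match(line)
        if m:
            cycle, uid, rest = int(m.group(1)), int(m.group(3)), m.group(4)
            # Unwrapped dumps carry ":reservation" at the end of the line.
            same_line = re.search(r"\s+:(\S+)$", rest)
            if same_line:
                pattern, resv = rest[:same_line.start()], same_line.group(1)
            else:
                pattern, resv = rest, ""
            insns.append([cycle, uid, pattern, resv])
        elif line.startswith(":") and insns and not insns[-1][3]:
            # Reservation wrapped onto its own line (as in this thread).
            insns[-1][3] = line[1:]
    return [tuple(i) for i in insns]


def summarize(insns):
    """Return (last issue cycle, [(prev_uid, uid, gap)] for gaps > 1 cycle)."""
    last_cycle = max(cycle for cycle, *_ in insns)
    stalls = [
        (prev[1], cur[1], cur[0] - prev[0])
        for prev, cur in zip(insns, insns[1:])
        if cur[0] - prev[0] > 1
    ]
    return last_cycle, stalls


# First four insns of the block quoted above, mail quoting left in place.
SAMPLE = """\
>>> ;; 0--> b 0: i 39 x4=x2+x7
>>> :cortex_a53_slot_any
>>> ;; 0--> b 0: i 46 x1=zxn([sxn(x2)*0x4+x8])
>>> :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
>>> ;; 3--> b 0: i 45 x9=zxn([sxn(x4)*0x4+x3])
>>> :(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_load
>>> ;; 7--> b 0: i 47 x1=zxn(x6)*zxn(x1)+x9
>>> :(cortex_a53_slot_any+cortex_a53_imul)
"""

if __name__ == "__main__":
    last, stalls = summarize(parse_sched_dump(SAMPLE))
    print("last issue cycle in the model:", last)
    for prev_uid, uid, gap in stalls:
        print(f"insn {uid} issues {gap} cycles after insn {prev_uid}")
```

The gaps mostly reflect the load and multiply latencies assumed by the
model; a core that overlaps successive iterations can hide much of that, so
a large "total time" and a small measured cycles-per-iteration figure are
not necessarily in conflict.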