Re: GCN RDNA2+ vs. GCC SLP vectorizer

Thomas Schwinge Fri, 16 Feb 2024 05:53:20 -0800

Hi!

On 2024-02-16T12:41:06+0000, Andrew Stubbs <a...@baylibre.com> wrote:
> On 16/02/2024 12:26, Richard Biener wrote:
>> On Fri, 16 Feb 2024, Andrew Stubbs wrote:
>>> On 16/02/2024 10:17, Richard Biener wrote:
>>>> On Fri, 16 Feb 2024, Thomas Schwinge wrote:
>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <a...@codesourcery.com> wrote:
>>>>>> I've committed this patch
>>>>>
>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL", which the later RDNA3/gfx1100
>>>>> support builds on top of, and that's what I'm currently working on
>>>>> getting proper GCC/GCN target (not offloading) results for.
>>>>>
>>>>> Now looking at 'gcc.dg/vect/bb-slp-cond-1.c', which is reasonably simple,
>>>>> and hopefully representative for other SLP execution test FAILs
>>>>> (regressions compared to my earlier non-gfx1100 testing).
>>>>>
>>>>>       $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/
>>>>>       source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
>>>>>       --sysroot=install/amdgcn-amdhsa -ftree-vectorize
>>>>>       -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common
>>>>>       -O2 -fdump-tree-slp-details -fdump-tree-vect-details -isystem
>>>>>       build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem
>>>>>       source-gcc/newlib/libc/include
>>>>>       -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
>>>>>       -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -wrapper
>>>>>       setarch,--addr-no-randomize -fdump-tree-all-all -fdump-ipa-all-all
>>>>>       -fdump-rtl-all-all -save-temps -march=gfx1100
>>>>>
>>>>> The '-march=gfx1030' 'a-bb-slp-cond-1.s' is identical (apart from
>>>>> 'TARGET_PACKED_WORK_ITEMS' in 'gcn_target_asm_function_prologue'), so I
>>>>> suppose will also exhibit the same failure mode, once again?
>>>>>
>>>>> Compared to '-march=gfx90a', the differences begin in
>>>>> 'a-bb-slp-cond-1.c.266r.expand' (only!), down to 'a-bb-slp-cond-1.s'.
>>>>>
>>>>> Changed like:
>>>>>
>>>>>       @@ -38,10 +38,10 @@ int main ()
>>>>>        #pragma GCC novector
>>>>>          for (i = 1; i < N; i++)
>>>>>            if (a[i] != i%4 + 1)
>>>>>       -      abort ();
>>>>>       +      __builtin_printf("%d %d != %d\n", i, a[i], i%4 + 1);
>>>>>        
>>>>>          if (a[0] != 5)
>>>>>       -    abort ();
>>>>>       +    __builtin_printf("%d %d != %d\n", 0, a[0], 5);
>>>>>
>>>>> ..., we see:
>>>>>
>>>>>       $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>>>>>       40 5 != 1
>>>>>       41 6 != 2
>>>>>       42 7 != 3
>>>>>       43 8 != 4
>>>>>       44 5 != 1
>>>>>       45 6 != 2
>>>>>       46 7 != 3
>>>>>       47 8 != 4
>>>>>
>>>>> '40..47' are the 'i = 10..11' in 'foo', and the expectation is
>>>>> 'a[i * stride + 0..3] != 0'.  So, either some earlier iteration has
>>>>> scribbled zero values over these (vector lane masking issue, perhaps?),
>>>>> or some other code generation issue?


>>> [...], I must be doing something different because vect/bb-slp-cond-1.c
>>> passes for me, on gfx1100.

That's strange.  I've looked at your log file (looks good), and used your
toolchain to compile, and your 'gcn-run' to invoke, and still do get:

    $ flock /tmp/gcn.lock ~/gcn-run ~/bb-slp-cond-1.exe
    GCN Kernel Aborted
    Kernel aborted

Andrew, later on, please try what happens when you put an unconditional
'abort' call into a test case?

>> I didn't try to run it - when doing make check-gcc fails to using
>> gcn-run for test invocation

Note, that for such individual test cases, invoking the compiler and then
'gcn-run' manually would seem easiest?

>> what's the trick to make it do that?

I can tell you've probably not done much "embedded" or simulator testing of
GCC targets?  ;-P

> There's a config file for nvptx here: 
> https://github.com/SourceryTools/nvptx-tools/blob/master/nvptx-none-run.exp

Yes, and I have pending some updates to that one, to be finished once
I've generally got my testing set up again, to a sufficient degree...

> You can probably make the obvious adjustments. I think Thomas has a GCN 
> version with a few more features.

Right.  I'm attaching my current 'amdgcn-amdhsa-run.exp'.

I'm aware that the 'set_board_info gcc,[...] [...]' may be obsolete/wrong
(as Andrew also noted privately) -- likewise, at least in part, for
GCC/nvptx, which is where I copied all that from.  (Will revise later;
not relevant for this discussion, here.)

Similar to what I've recently added to libgomp, there is 'flock'ing here,
so that you may use 'make -j[...] check' for (partial) parallelism, but
still all execution testing runs serialized.  I found this to greatly
help denoise the test results.  (Not ideal, of course, but improving that
is for later, too.)
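
For manual runs outside of DejaGnu, the same serialization can be had with
a small shell wrapper.  This is only a sketch: 'run_serialized' is a name
made up here, and the fallback '/tmp/gcn.lock' is just the board file's
current default.

```shell
# Hypothetical helper (not part of the attached board file): run a command
# while holding an exclusive lock, so that concurrent GPU runs are serialized.
run_serialized() {
    # Same environment variable and default as the board file uses.
    lock_file=${AMDGCN_AMDHSA_LOCK_FILE:-/tmp/gcn.lock}
    # 'flock FILE COMMAND...' creates FILE if necessary, waits for the lock,
    # runs COMMAND, and exits with COMMAND's exit status.
    flock "$lock_file" "$@"
}
```

Usage would then be, for example, 'run_serialized build-gcc/gcc/gcn-run
a.out'.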

You may want to disable the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' thing if
that doesn't work like that in your case.  (I've no idea what
'amdgpu_gpu_recover' would do if the GPU is also used for display.)  But
this, again, greatly helps denoise test results, at least for the one
system I'm currently testing on.
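
The retry logic is simple enough to sketch in shell, for anybody who
wants it outside of DejaGnu.  The names 'run_with_recovery' and
'GPU_RECOVER_CMD' are made up for this illustration; the default recovery
command is the one the board file uses, and whether it is appropriate for
your system is an assumption you need to check yourself.

```shell
# Hypothetical helper: run a launcher command, and if it fails with
# 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' in its output, reset the GPU and
# re-execute exactly once.  stderr is merged into stdout for simplicity.
run_with_recovery() {
    rc=0
    output=$("$@" 2>&1) || rc=$?
    case $output in
        *HSA_STATUS_ERROR_OUT_OF_RESOURCES*)
            if [ "$rc" -ne 0 ]; then
                # Assumption: same recovery command as in the board file.
                recover_cmd=${GPU_RECOVER_CMD:-sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover}
                # Intentionally unquoted, so that it is split into words.
                $recover_cmd >/dev/null
                rc=0
                output=$("$@" 2>&1) || rc=$?
            fi
            ;;
    esac
    printf '%s\n' "$output"
    return "$rc"
}
```

Like the board file, this doesn't tell launcher failure from program
failure; any other error is passed through unchanged.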

I intend to publish proper documentation of all this, later on -- happy
to answer any questions in the meantime.

If you don't already have a common directory for DejaGnu board files, put
'amdgcn-amdhsa-run.exp' into '~/tmp/amdgcn-amdhsa/', for example, and add
a 'dejagnu.exp' file next to it:

    lappend boards_dir ~/tmp/amdgcn-amdhsa

Prepare:

    $ DEJAGNU=$HOME/tmp/amdgcn-amdhsa/dejagnu.exp
    $ export DEJAGNU
    $ AMDGCN_AMDHSA_RUN=[...]/build-gcc/gcc/gcn-run
    $ export AMDGCN_AMDHSA_RUN
    $ # If necessary:
    $ AMDGCN_AMDHSA_LD_LIBRARY_PATH=/opt/rocm/lib
    $ LD_LIBRARY_PATH=$AMDGCN_AMDHSA_LD_LIBRARY_PATH${LD_LIBRARY_PATH+:$LD_LIBRARY_PATH}
    $ export LD_LIBRARY_PATH

..., and then run:

    $ make -j8 check-gcc-c RUNTESTFLAGS='--target_board=amdgcn-amdhsa-run/-march=gfx1030 vect.exp'
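
Before starting a long 'make check' run, it may be worth verifying the
environment set up in the "Prepare" step.  The following is a
hypothetical pre-flight check, not something that DejaGnu or the board
file provides; 'check_gcn_test_env' is a made-up name.  It mirrors the
board file in falling back to 'gcn-run' on 'PATH' if 'AMDGCN_AMDHSA_RUN'
isn't set.

```shell
# Hypothetical pre-flight check for the environment described above.
check_gcn_test_env() {
    rc=0
    # 'DEJAGNU' has to name the 'dejagnu.exp' file that sets 'boards_dir'.
    if [ -z "${DEJAGNU:-}" ] || [ ! -r "$DEJAGNU" ]; then
        echo "DEJAGNU is unset or does not name a readable file" >&2
        rc=1
    fi
    # The board file uses 'AMDGCN_AMDHSA_RUN' if set, otherwise 'gcn-run'.
    launcher=${AMDGCN_AMDHSA_RUN:-gcn-run}
    if ! command -v "$launcher" >/dev/null 2>&1; then
        echo "launcher '$launcher' not found or not executable" >&2
        rc=1
    fi
    return "$rc"
}
```

For example: 'check_gcn_test_env && make -j8 check-gcc-c [...]'.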

Oh, and I saw that on <https://gcc.gnu.org/wiki/Offloading>, Tobias has
recently put into a new "Using the GPU as stand-alone system" section
some similar information.  (..., but this should, in my opinion, be on a
different page, as it's explicitly *not* about what we understand as
offloading.)

> I usually use the CodeSourcery magic stack of scripts for testing 
> installed toolchains on remote devices, so I'm not too familiar with 
> using Dejagnu directly.

Tsk...  ;'-|


Grüße
 Thomas

# DejaGnu board file for amdgcn-amdhsa.

set_board_info target_install {amdgcn-amdhsa}

load_generic_config "sim"

if { [info exists env(AMDGCN_AMDHSA_LOCK_FILE)] } then {
    set_board_info sim,lock_file "$env(AMDGCN_AMDHSA_LOCK_FILE)"
} else {
    #TODO What's a good default filename?
    set_board_info sim,lock_file "/tmp/gcn.lock"
}

if { [info exists env(AMDGCN_AMDHSA_RUN)] } then {
    set_board_info sim "$env(AMDGCN_AMDHSA_RUN)"
} else {
    set_board_info sim "gcn-run"
}

# This isn't a simulator, but rather a "launcher".
unset_board_info is_simulator
unset_board_info slow_simulator

process_multilib_options ""

set_board_info gcc,stack_size 8192
set_board_info gcc,no_trampolines 1
set_board_info gcc,no_label_values 1
set_board_info gcc,signal_suppress 1

set_board_info compiler "[find_gcc]"
set_board_info cflags "[newlib_include_flags]"
set_board_info ldflags "[newlib_link_flags]"
set_board_info ldscript ""

#TODO Work around <http://mid.mail-archive.com/B457CE4A2BB446B7930A9BA1E38DBCCC@pleaset> 'ERROR: (DejaGnu) proc "::tcl::tm::UnknownHandler {::tcl::MacOSXPkgUnknown ::tclPkgUnknown} msgcat 1.4" does not exist.'...
# Otherwise, our use of 'clock format' may cause spurious errors such as:
#     ERROR: gcc.c-torture/compile/pr44686.c   -O0 : unknown dg option: ::tcl::tm::UnknownHandler ::tclPkgUnknown msgcat 1.4 for " dg-require-profiling 1 "-fprofile-generate" "
# ..., and all testing thus breaking apart.
set dummy [clock format [clock seconds]]
unset dummy

proc sim__open_lock_file { lock_file } {
    # Try to open the lock file for reading, so that this also works if
    # somebody else created the file.
    if [catch {open $lock_file r} result] {
        verbose -log "Couldn't open '$lock_file' for reading: $result"
        # Try to create the lock file.
        if [catch {open $lock_file a+} result] {
            verbose -log "Couldn't create '$lock_file': $result"
            # If this again failed, somebody else created it, concurrently.  If
            # in the following we're now not able to open it for reading, we've
            # got a fundamental problem, and let it fail.
            set result [open $lock_file r]
        }
    }
    return $result
}

# The default 'sim_load' would eventually call into 'sim_spawn', 'sim_wait',
# but it's easier here to just override the former one, and put safeguards
# into the latter two.

proc sim_spawn { dest cmdline args } {
    perror "TODO 'sim_spawn'"
    verbose -log "TODO 'sim_spawn'"
    return -1
}

proc sim_wait { dest timeout } {
    perror "TODO 'sim_wait'"
    verbose -log "TODO 'sim_wait'"
    return -1
}

proc sim_load { dest prog args } {
    set inpfile ""
    if { [llength $args] > 1 } {
        if { [lindex $args 1] != "" } {
            set inpfile "[lindex $args 1]"
        }
    }

    # The launcher arguments are the program followed by the program arguments.
    set pargs [lindex $args 0]
    set largs [concat $prog $pargs]
    set args [lreplace $args 0 0 $largs]

    set launcher [board_info $dest sim]

    # To support parallel testing ('make -j[...] check') in light of flaky test
    # results for concurrent GPU usage, we'd like to serialize execution tests.
    set lock_file [board_info $dest sim,lock_file]
    if { $lock_file != "" } {
        set lock_fd [sim__open_lock_file $lock_file]
        set lock_clock_begin [clock seconds]
        exec flock 0 <@ $lock_fd
        set lock_clock_end [clock seconds]
        verbose -log "Got flock('$lock_file') at [clock format $lock_clock_end] after [expr $lock_clock_end - $lock_clock_begin] s" 2
    }

    # Note, not using 'remote_exec $dest' here.
    set result [eval [list remote_exec host $launcher] $args $inpfile]
    #TODO If we ran into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'...
    if { [lindex $result 0] != 0
         && [string match "*HSA_STATUS_ERROR_OUT_OF_RESOURCES*" [lindex $result 1]] } {
        verbose -log "Trying to recover from 'HSA_STATUS_ERROR_OUT_OF_RESOURCES', and then re-execute."
        #TODO ..., reset the GPU....
        exec sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
        #TODO ..., and try again.
        set result [eval [list remote_exec host $launcher] $args $inpfile]
    }
    # We don't tell 'launcher' execution failure from 'prog' execution failure.
    # Maybe we should, or maybe it doesn't matter.  (When there's an error,
    # there's an error.)

    if { $lock_file != "" } {
        # Unlock (implicit with 'close').
        close $lock_fd
    }

    if { [lindex $result 0] == 0 } {
        return [list "pass" [lindex $result 1]]
    } else {
        return [list "fail" [lindex $result 1]]
    }
}

# <https://inbox.sourceware.org/1392398663.17835.120.camel@ubuntu-sellcey>
proc sim_exec { dest srcfile args } {
    perror "TODO 'sim_exec'"
    verbose -log "TODO 'sim_exec'"
    return -1
}
